Text Classification Inference Benchmark: What Actually Happens on CPU vs GPU
A practical inference benchmark comparing DistilBERT performance on CPU vs GPU—measuring latency, throughput, and memory across different batch sizes to understand what actually happens in production.
What This Project Is About
I wanted to understand how inference actually behaves for transformer models. Not how to train them or optimize them—just what really happens when you run inference on CPU vs GPU.
The benchmark runs DistilBERT on a sentiment classification task with the IMDB movie reviews dataset. Binary sentiment: positive or negative. Nothing fancy.
Important clarifications:
- I did not fine-tune the model
- The goal was not accuracy improvement
- Only inference behavior
I wanted to look at:
- Latency
- Throughput
- Memory usage
And see how they change:
- On CPU vs GPU
- With different batch sizes
Setup
Model: distilbert-base-uncased-finetuned-sst-2-english (used as-is, no changes)
Dataset: IMDB movie reviews
Environment: Google Colab
- PyTorch
- Hugging Face Transformers
- Hugging Face Datasets
Hardware:
- CPU runtime
- NVIDIA Tesla T4 GPU runtime
CPU and GPU experiments were done in separate Colab sessions. I literally switched the runtime type.
Preprocessing
Tokenization:
- DistilBERT tokenizer
- Truncation enabled
- Fixed max sequence length
Important detail: Tokenization always runs on CPU, even when inference is on GPU. This becomes important later.
Batch Sizes Tested
I didn't go crazy here. Kept it very controlled.
- Batch size = 1: More latency-focused
- Batch size = 32: More throughput-focused
Metrics Tracked
- Latency (ms): Batch-level timing
- Throughput (samples/sec): Batch-level
- GPU memory (MB): Peak allocated memory
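To make the relationship between these metrics concrete, here's a minimal sketch of how they can be derived per batch. The function name and signature are my own for illustration, not taken from the notebook:

```python
# Hypothetical helper relating the three tracked metrics for one batch
def batch_metrics(batch_size, elapsed_ms, peak_mem_bytes=None):
    latency_ms = elapsed_ms                          # batch-level latency
    throughput = batch_size / (elapsed_ms / 1000.0)  # samples per second
    mem_mb = peak_mem_bytes / 1e6 if peak_mem_bytes is not None else None
    return latency_ms, throughput, mem_mb

# A 32-sample batch taking 80 ms works out to 400 samples/sec
print(batch_metrics(32, 80.0))
```

Note that throughput here is just batch size divided by batch time, which is why the two metrics pull in opposite directions as batch size grows.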
Inference Setup
I used:
- The same inference function for both CPU and GPU
- Automatic device selection
- torch.no_grad() (no gradients)
- CUDA synchronization around GPU timing (very important for accurate measurements)
Basically, I tried to keep things fair.
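The setup above can be sketched roughly like this. This is a hypothetical helper mirroring the description, not the notebook's actual function; `model` is assumed to already live on `device`, and `inputs` is the dict of CPU tensors the tokenizer returns:

```python
import time
import torch

def timed_forward(model, inputs, device):
    """Time one no-grad forward pass; returns (output, elapsed_ms)."""
    # Move tokenizer output (CPU tensors) to the target device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    if device.type == "cuda":
        torch.cuda.synchronize()   # drain pending kernels before starting the clock
    start = time.perf_counter()
    with torch.no_grad():          # inference only, no gradient bookkeeping
        out = model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for the async GPU work to actually finish
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return out, elapsed_ms
```

Without the synchronize calls, CUDA kernels launch asynchronously and the timer would stop before the GPU is done, making the GPU look faster than it is.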
What I Observed
GPU with Batch Size 1
- Low latency
- GPU looks fast but overheads still exist
- End-to-end gains aren't crazy because of overhead
GPU with Batch Size 32
- Much higher throughput
- GPU memory usage goes up significantly
- GPU finally starts showing real advantage
The pattern was clear: batching matters a lot.
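For the memory numbers, peak allocated GPU memory can be captured around each batch roughly like this. The wrapper name is my own; it falls back to None on a CPU-only runtime:

```python
import torch

def run_with_peak_memory(fn):
    """Run fn() and return (result, peak GPU memory in MB, or None on CPU)."""
    if not torch.cuda.is_available():
        return fn(), None
    torch.cuda.reset_peak_memory_stats()   # clear stats from earlier batches
    result = fn()
    torch.cuda.synchronize()               # make sure allocations have happened
    return result, torch.cuda.max_memory_allocated() / 1e6
```

Resetting the peak stats before each batch matters; otherwise every measurement just reports the largest batch seen so far in the session.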
CPU vs GPU (In Simple Words)
CPU inference:
- Okay for very small workloads
- Bad for larger batch sizes
GPU inference:
- Shines for throughput-heavy workloads
- Small batch sizes don't fully justify GPU overhead
Batch size selection matters more than hardware alone.
This was honestly the biggest takeaway for me.
Key Observations
Some things became very clear:
- GPU latency is lower than CPU for batch size 1, but end-to-end gains aren't dramatic because of overhead
- Increasing batch size:
  - Improves GPU utilization
  - Increases memory usage significantly
- Throughput gains on GPU are very obvious when batching
- Tokenization is still a bottleneck: it's CPU-bound and not negligible
Limitations I'm Aware Of
- CPU and GPU runs were in different Colab sessions
- Throughput numbers are batch-level, not absolute truth
- Results are specific to Tesla T4
- Tokenization time is included in latency
So yeah, not a perfect benchmark, but realistic.
How to Run It
If someone wants to replicate this:
- Open the Colab notebook
- Select the runtime type:
  - CPU → CPU benchmarks
  - GPU → GPU benchmarks
- Run all cells
- Results print at the end
Final Thoughts
This wasn't about making the model better. It was about understanding reality.
Inference performance depends heavily on:
- Batch size
- Preprocessing
- Workload type
GPU is great, but only when you actually use it properly. And tokenization + data handling can easily mess up your latency numbers if you're not careful.
More benchmarks and experiments coming.