Text Classification Inference Benchmark: What Actually Happens on CPU vs GPU
A practical inference benchmark comparing DistilBERT performance on CPU vs GPU—measuring latency, throughput, and memory across different batch sizes to understand what actually happens in production.
What This Project Is About
I wanted to understand how inference actually behaves for transformer models. Not how to train them or optimize them—just what really happens when you run inference on CPU vs GPU.
The benchmark runs DistilBERT on a sentiment classification task with the IMDB movie reviews dataset. Binary sentiment: positive or negative. Nothing fancy.
Important clarifications:
- I did not fine-tune the model
- The goal was not accuracy improvement
- Only inference behavior
I wanted to look at:
- Latency
- Throughput
- Memory usage
And see how they change:
- On CPU vs GPU
- With different batch sizes
Setup
Model: distilbert-base-uncased-finetuned-sst-2-english (used as-is, no changes)
Dataset: IMDB movie reviews
Environment: Google Colab
- PyTorch
- Hugging Face Transformers
- Hugging Face Datasets
Hardware:
- CPU runtime
- NVIDIA Tesla T4 GPU runtime
CPU and GPU experiments were done in separate Colab sessions. I literally switched the runtime type.
Preprocessing
Tokenization:
- DistilBERT tokenizer
- Truncation enabled
- Fixed max sequence length
Important detail: Tokenization always runs on CPU, even when inference is on GPU. This becomes important later.
Batch Sizes Tested
I didn't go crazy here. Kept it very controlled.
- Batch size = 1: More latency-focused
- Batch size = 32: More throughput-focused
Metrics Tracked
- Latency (ms): Batch-level timing
- Throughput (samples/sec): Batch-level
- GPU memory (MB): Peak allocated memory
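To make the relationship between these metrics concrete, here's a minimal sketch of how they can be derived per batch. The function name and signature are my own for illustration, not taken from the notebook:

```python
# Hypothetical helper relating the three tracked metrics for one batch
def batch_metrics(batch_size, elapsed_ms, peak_mem_bytes=None):
    latency_ms = elapsed_ms                          # batch-level latency
    throughput = batch_size / (elapsed_ms / 1000.0)  # samples per second
    mem_mb = peak_mem_bytes / 1e6 if peak_mem_bytes is not None else None
    return latency_ms, throughput, mem_mb

# A 32-sample batch taking 80 ms works out to 400 samples/sec
print(batch_metrics(32, 80.0))
```

Note that throughput here is just batch size divided by batch time, which is why the two metrics pull in opposite directions as batch size grows.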
Inference Setup
I used:
- The same inference function for both CPU and GPU
- Automatic device selection
- torch.no_grad() (no gradients)
- CUDA synchronization around GPU timing (very important for accurate measurements)
Basically, I tried to keep things fair.
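The setup above can be sketched roughly like this. This is a hypothetical helper mirroring the description, not the notebook's actual function; `model` is assumed to already live on `device`, and `inputs` is the dict of CPU tensors the tokenizer returns:

```python
import time
import torch

def timed_forward(model, inputs, device):
    """Time one no-grad forward pass; returns (output, elapsed_ms)."""
    # Move tokenizer output (CPU tensors) to the target device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    if device.type == "cuda":
        torch.cuda.synchronize()   # drain pending kernels before starting the clock
    start = time.perf_counter()
    with torch.no_grad():          # inference only, no gradient bookkeeping
        out = model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for the async GPU work to actually finish
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return out, elapsed_ms
```

Without the synchronize calls, CUDA kernels launch asynchronously and the timer would stop before the GPU is done, making the GPU look faster than it is.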
What I Observed
GPU with Batch Size 1
- Low latency
- GPU looks fast but overheads still exist
- End-to-end gains aren't crazy because of overhead
GPU with Batch Size 32
- Much higher throughput
- GPU memory usage goes up significantly
- GPU finally starts showing real advantage
The pattern was clear: batching matters a lot.
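For the memory numbers, peak allocated GPU memory can be captured around each batch roughly like this. The wrapper name is my own; it falls back to None on a CPU-only runtime:

```python
import torch

def run_with_peak_memory(fn):
    """Run fn() and return (result, peak GPU memory in MB, or None on CPU)."""
    if not torch.cuda.is_available():
        return fn(), None
    torch.cuda.reset_peak_memory_stats()   # clear stats from earlier batches
    result = fn()
    torch.cuda.synchronize()               # make sure allocations have happened
    return result, torch.cuda.max_memory_allocated() / 1e6
```

Resetting the peak stats before each batch matters; otherwise every measurement just reports the largest batch seen so far in the session.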
CPU vs GPU (In Simple Words)
CPU inference:
- Okay for very small workloads
- Bad for larger batch sizes
GPU inference:
- Shines for throughput-heavy workloads
- Small batch sizes don't fully justify GPU overhead
Batch size selection matters more than hardware alone.
This was honestly the biggest takeaway for me.
Key Observations
Some things became very clear:
- GPU latency is lower than CPU for batch size 1, but end-to-end gains aren't dramatic because of overhead
- Increasing batch size:
  - Improves GPU utilization
  - Increases memory usage significantly
- Throughput gains on GPU are very obvious when batching
- Tokenization is still a bottleneck: it's CPU-bound and not negligible
Limitations I'm Aware Of
- CPU and GPU runs were in different Colab sessions
- Throughput numbers are batch-level, not absolute truth
- Results are specific to Tesla T4
- Tokenization time is included in latency
So yeah, not a perfect benchmark, but realistic.
How to Run It
If someone wants to replicate this:
- Open the Colab notebook
- Select the runtime type:
  - CPU → CPU benchmarks
  - GPU → GPU benchmarks
- Run all cells
- Results print at the end
Final Thoughts
This wasn't about making the model better. It was about understanding reality.
Inference performance depends heavily on:
- Batch size
- Preprocessing
- Workload type
GPU is great, but only when you actually use it properly. And tokenization + data handling can easily mess up your latency numbers if you're not careful.
More benchmarks and experiments coming.