2026-02-05

Text Classification Inference Benchmark: What Actually Happens on CPU vs GPU

A practical inference benchmark comparing DistilBERT performance on CPU vs GPU—measuring latency, throughput, and memory across different batch sizes to understand what actually happens in production.

Tags: Transformers, Inference, Machine Learning, DistilBERT, PyTorch

What This Project Is About

I wanted to understand how inference actually behaves for transformer models. Not how to train them or optimize them—just what really happens when you run inference on CPU vs GPU.

This uses DistilBERT for sentiment classification on the IMDB movie reviews dataset. Binary sentiment: positive or negative. Nothing fancy.

Important clarifications:

  • I did not fine-tune the model
  • The goal was not accuracy improvement
  • Only inference behavior

I wanted to look at:

  • Latency
  • Throughput
  • Memory usage

And see how they change:

  • On CPU vs GPU
  • With different batch sizes

Setup

Model: distilbert-base-uncased-finetuned-sst-2-english (used as-is, no changes)

Dataset: IMDB movie reviews

Environment: Google Colab

  • PyTorch
  • Hugging Face Transformers
  • Hugging Face Datasets

Hardware:

  • CPU runtime
  • NVIDIA Tesla T4 GPU runtime

CPU and GPU experiments were done in separate Colab sessions. I literally switched the runtime type.

Preprocessing

Tokenization:

  • DistilBERT tokenizer
  • Truncation enabled
  • Fixed max sequence length

Important detail: Tokenization always runs on CPU, even when inference is on GPU. This becomes important later.
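As a sketch, the tokenization step looks roughly like this. The exact max length used in the notebook isn't stated, so 128 below is an assumed value for illustration:

```python
from transformers import AutoTokenizer

# Tokenizer for the same checkpoint used for inference
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

texts = ["This movie was fantastic!", "Worst film I have ever seen."]

# Truncation on, fixed max sequence length (128 is an assumed value here)
encoded = tokenizer(
    texts,
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)

# These tensors live on CPU; moving them to the GPU is a separate, explicit step
print(encoded["input_ids"].shape)  # torch.Size([2, 128])
```

Note that `encoded` always comes back as CPU tensors. Even in the GPU runtime, the model only gets involved after an explicit `.to(device)` on these tensors, which is why tokenization stays CPU-bound.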

Batch Sizes Tested

I didn't go crazy here. Kept it very controlled.

  • Batch size = 1: More latency-focused
  • Batch size = 32: More throughput-focused

Metrics Tracked

  • Latency (ms): Batch-level timing
  • Throughput (samples/sec): Batch-level
  • GPU memory (MB): Peak allocated memory
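Latency and throughput are two views of the same batch-level measurement: throughput is just batch size divided by batch latency. A minimal helper (the function name is mine, not the notebook's):

```python
# Batch-level metrics: latency is wall-clock time for one forward pass,
# throughput is how many samples that batch processed per second.
def batch_metrics(batch_size: int, elapsed_seconds: float) -> dict:
    latency_ms = elapsed_seconds * 1000.0
    throughput = batch_size / elapsed_seconds  # samples/sec
    return {"latency_ms": latency_ms, "throughput": throughput}

# e.g. a batch of 32 that took 80 ms: ~80.0 ms latency, ~400.0 samples/sec
print(batch_metrics(32, 0.080))
```

This is also why batch size 1 and batch size 32 tell different stories: the same per-batch time divides over 32x more samples.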

Inference Setup

I used:

  • The same inference function for both CPU and GPU
  • Automatic device selection
  • torch.no_grad() (no gradients)
  • CUDA sync for GPU timing (very important for accurate measurements)

Basically, I tried to keep things fair.
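A minimal sketch of what a fair, device-agnostic timing loop can look like. Function and variable names here are my own illustration, not the notebook's exact code:

```python
import time

import torch


def timed_inference(model, batch):
    """Run one forward pass and return (logits, elapsed_seconds)."""
    # Automatic device selection: same function works in both runtimes
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():                  # no gradient bookkeeping
        if device == "cuda":
            torch.cuda.synchronize()       # flush pending GPU work before timing
        start = time.perf_counter()
        out = model(**batch)
        if device == "cuda":
            torch.cuda.synchronize()       # GPU kernels are async: wait for them
        elapsed = time.perf_counter() - start

    return out.logits, elapsed
```

The synchronize calls are the "CUDA sync" point: without them, `time.perf_counter()` would stop before the GPU kernels actually finish and the latency numbers would look unrealistically good. For the peak-memory metric, `torch.cuda.reset_peak_memory_stats()` before the run and `torch.cuda.max_memory_allocated()` after give the peak allocated bytes.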

What I Observed

GPU with Batch Size 1

  • Low per-batch latency
  • GPU looks fast, but fixed overheads (kernel launches, moving data to the device) still exist
  • End-to-end gains over CPU aren't crazy because of that overhead

GPU with Batch Size 32

  • Much higher throughput
  • GPU memory usage goes up significantly
  • GPU finally starts showing real advantage

The pattern was clear: batching matters a lot.

CPU vs GPU (In Simple Words)

CPU inference:

  • Okay for very small workloads
  • Bad for larger batch sizes

GPU inference:

  • Shines for throughput-heavy workloads
  • Small batch sizes don't fully justify GPU overhead

Batch size selection matters more than hardware alone.

This was honestly the biggest takeaway for me.

Key Observations

Some things became very clear:

  1. GPU latency is lower than CPU for batch size 1, but end-to-end gains aren't dramatic because of overhead

  2. Increasing batch size:

    • Improves GPU utilization
    • Increases memory usage significantly

  3. Throughput gains on GPU are very obvious when batching

  4. Tokenization is still a bottleneck—it's CPU-bound and not negligible
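One way to make that last point visible (a sketch under my own naming, not the notebook's code) is to time each pipeline stage separately instead of only timing end to end:

```python
import time


def staged_latency(stages, x):
    """Time each pipeline stage separately.

    stages: list of (name, fn) pairs applied in order to x.
    Returns {stage_name: elapsed_seconds}.
    """
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        x = fn(x)                                   # output feeds the next stage
        timings[name] = time.perf_counter() - start
    return timings

# e.g. stages = [("tokenize", tokenize_fn), ("forward", forward_fn)]
```

With stages like `("tokenize", ...)` and `("forward", ...)`, the tokenize entry stays roughly the same across CPU and GPU runs while the forward entry shrinks on GPU, which is the bottleneck effect described above.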

Limitations I'm Aware Of

  • CPU and GPU runs were in different Colab sessions
  • Throughput numbers are batch-level, not absolute truth
  • Results are specific to Tesla T4
  • Tokenization time is included in latency

So yeah, not a perfect benchmark, but realistic.

How to Run It

If someone wants to replicate this:

  1. Open the Colab notebook
  2. Select runtime:
    • CPU → CPU benchmarks
    • GPU → GPU benchmarks
  3. Run all cells
  4. Results print at the end

Colab Notebook Link

Final Thoughts

This wasn't about making the model better. It was about understanding reality.

Inference performance depends heavily on:

  • Batch size
  • Preprocessing
  • Workload type

GPU is great, but only when you actually use it properly. And tokenization + data handling can easily mess up your latency numbers if you're not careful.

More benchmarks and experiments coming.

