2026-06-18

Speculative Decoding — The Intern Trick That Makes LLMs 13x Faster

How speculative decoding makes LLM inference 2-13x faster by having a small draft model propose tokens and a big model verify them in parallel — with zero quality loss.

speculative-decoding, llm-inference, transformers, optimization, cursor, machine-learning, learning-in-public

I kept staring at Cursor's code completion speed and wondering — how is this thing generating 1,000 tokens per second on a 70-billion parameter model? That's not normal. Standard autoregressive generation on a model that size gives you maybe 30-80 tokens per second. Something else is happening.

Turns out, the answer is one of the most elegant optimization tricks in modern AI — speculative decoding. And once you understand it, you'll see it everywhere.

The Problem — Why LLMs Are Slow

Here's something that surprises people: LLMs generate text one token at a time. Every. Single. Token. Requires a full forward pass through the entire model.

For a 70B parameter model, that means loading 70 billion weights from memory, doing the matrix multiplications, getting one token, and then doing it all over again for the next one. This is autoregressive generation — each token depends on all the tokens before it.

The bottleneck isn't compute. Modern GPUs have insane compute power. When you're generating for one request at a time, the bottleneck is memory bandwidth: how fast you can load those 70 billion parameters from GPU memory for each token. Your GPU spends most of its time waiting for data to arrive, not actually doing math.

This is called being "memory-bound." And it means that most of your GPU's compute capacity is sitting idle during text generation. What a waste.
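Here's a rough back-of-the-envelope sketch of that ceiling. The numbers are illustrative assumptions, not measurements from any particular deployment: 16-bit weights and a single accelerator with about 3.35 TB/s of memory bandwidth. If every token requires reading all the weights once, tokens per second can't exceed bandwidth divided by model size in bytes.

```python
def max_tokens_per_second(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when every token must
    stream all weights from memory once (ignores KV cache and compute)."""
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Assumed figures, for illustration only.
PARAMS_70B = 70e9
FP16_BYTES = 2
BANDWIDTH = 3.35e12  # bytes per second on a hypothetical accelerator

ceiling = max_tokens_per_second(PARAMS_70B, FP16_BYTES, BANDWIDTH)
print(f"{ceiling:.1f} tokens/s at most")
```

Splitting the model across several GPUs or quantizing the weights raises the ceiling, but the shape of the limit stays the same: the per-token cost is dominated by moving weights, not by arithmetic.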

The Insight — Verification Is Cheaper Than Generation

Here's the key insight that makes speculative decoding work:

Generating one token requires one forward pass. But verifying five tokens ALSO requires just one forward pass.

Read that again. When you generate token-by-token, each token needs its own forward pass. But if someone hands you five candidate tokens and says "check if these are correct", the model can verify all five simultaneously in a single pass. A transformer's forward pass over a sequence produces a next-token prediction at every position at once, so each candidate can be compared against what the model would have put in that spot.

Generation is sequential. Verification is parallel. That asymmetry is everything.

The Trick — Draft and Verify

Speculative decoding works exactly like having an intern write a first draft and a senior engineer review it:

  1. Draft phase: A small, fast model (the "draft model" — maybe 1-7B parameters) generates K tokens quickly. It's fast because it's tiny. It's inaccurate because it's tiny. That's fine.

  2. Verify phase: The big model (the "target model" — 70B+) checks all K draft tokens in a single forward pass. It compares what the draft model predicted against what it would have generated itself.

  3. Accept or reject: Starting from the first draft token, accept every token that passes the target model's check. With greedy decoding that means an exact match with the target's own pick; with sampling it's a probabilistic test (see The Math section). The moment a token fails, reject that token and everything after it. Replace the rejected token with what the target model actually wants.

  4. Repeat: Start the next round from where you left off.

If the draft model gets 4 out of 5 tokens right, you keep those 4 plus the target model's replacement for the fifth. That's 5 tokens for the cost of 1 target model forward pass (plus the cheap draft model passes). That's a massive speedup.
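The four steps above can be sketched as a toy loop. This is a minimal greedy version (exact-match acceptance), and the two "models" are hypothetical stand-in functions invented for the example, not real LLMs. The verification list comprehension stands in for what a real model would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]


def speculative_generate(target_next: NextToken, draft_next: NextToken,
                         prompt: List[str], num_tokens: int,
                         k: int = 5) -> Tuple[List[str], int]:
    """Greedy draft-and-verify. Returns (new tokens, target passes used)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # Draft phase: k cheap sequential guesses.
        draft: List[str] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify phase: one target pass covers all k + 1 positions.
        # (The toy loops here; a real model scores them in one batch.)
        target_passes += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        # Accept the longest prefix on which both models agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        # Keep accepted tokens plus the target's own next token: a
        # correction after a mismatch, or a bonus token if all k matched.
        out.extend(draft[:n_ok] + [verified[n_ok]])
    return out[len(prompt):len(prompt) + num_tokens], target_passes


# Hypothetical toy models: the target recites a fixed sentence in a loop,
# the draft does the same but fumbles every fourth position.
TEXT = "the quick brown fox jumps over the lazy dog".split()


def toy_target(prefix: List[str]) -> str:
    return TEXT[len(prefix) % len(TEXT)]


def toy_draft(prefix: List[str]) -> str:
    i = len(prefix)
    return "???" if i % 4 == 3 else TEXT[i % len(TEXT)]


tokens, passes = speculative_generate(toy_target, toy_draft, ["the"], 12)

# Baseline: plain greedy decoding, one target pass per token.
baseline = ["the"]
for _ in range(12):
    baseline.append(toy_target(baseline))

print(tokens == baseline[1:], passes)
```

The output matches plain greedy decoding token for token, but the target model is called far fewer times than once per token.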

The Math — Why It's Lossless

This is the part that blows people's minds. Speculative decoding samples from exactly the same distribution as if you'd generated every token with the target model alone. With greedy decoding (temperature 0), that means the identical text. It's mathematically lossless.

How? Through rejection sampling. For each draft token, the acceptance probability is:

accept_probability = min(1, P_target(token) / P_draft(token))

If the target model thinks a token is MORE likely than the draft model does, it's an automatic accept. If the target model thinks it's LESS likely, accept with probability equal to the ratio.

When a token is rejected, you sample a replacement from an adjusted distribution, proportional to max(0, P_target(token) - P_draft(token)), that corrects for the draft model's bias. The math guarantees the final distribution over tokens is identical to the target model's distribution.

Not approximately identical. Exactly identical. The original paper proves this rigorously.
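Here's a small sanity check of that claim for a single token position, using exact fractions so there's no rounding to hide behind. The two distributions are made up for illustration, and the rule used after a rejection is resampling in proportion to max(0, P_target - P_draft). The function names are my own, not from any library.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]


def acceptance_probability(p_target: Fraction, p_draft: Fraction) -> Fraction:
    """min(1, P_target / P_draft) for one drafted token."""
    return min(Fraction(1), p_target / p_draft)


def residual_distribution(target: Dist, draft: Dist) -> Dist:
    """Distribution to resample from after a rejection:
    normalize(max(0, P_target - P_draft)). Assumes a shared vocabulary."""
    raw = {tok: max(Fraction(0), target[tok] - draft[tok]) for tok in target}
    total = sum(raw.values())
    return {tok: weight / total for tok, weight in raw.items()}


def output_distribution(target: Dist, draft: Dist) -> Dist:
    """Exact distribution of the token emitted by one accept/reject step."""
    out = {tok: Fraction(0) for tok in target}
    p_reject = Fraction(0)
    for tok in target:
        q = draft[tok]
        if q == 0:
            continue  # the draft never proposes this token
        accept = acceptance_probability(target[tok], q)
        out[tok] += q * accept          # drafted and accepted
        p_reject += q * (1 - accept)    # drafted and rejected
    if p_reject > 0:
        for tok, r in residual_distribution(target, draft).items():
            out[tok] += p_reject * r    # resampled after a rejection
    return out


# Made-up distributions over a three-token vocabulary.
target = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
draft = {"a": Fraction(7, 10), "b": Fraction(1, 10), "c": Fraction(1, 5)}

print(output_distribution(target, draft) == target)
```

The draft over-proposes "a", so "a" is sometimes rejected, and every rejection is rerouted to "b", the token the draft under-proposes. The two effects cancel exactly.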

Real Numbers — How Fast Is It?

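Typical acceptance rates: 60-80% per token. Because a round stops at the first rejection, the number of tokens kept per round is lower than that rate alone suggests. The formula below accounts for this.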

Expected speedup formula: If each token has acceptance probability α and you draft K tokens, the expected tokens per round is:

E[tokens] = (1 - α^(K+1)) / (1 - α)

With α = 0.7 and K = 5, that's about 2.9 tokens per target-model pass instead of 1. Ignoring the small cost of the draft model, that's close to a 3x speedup. No quality loss.
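Evaluating the formula directly is a quick way to get a feel for it. The second function below adds a simple cost model that is my assumption, not something stated above: each draft pass costs a fixed fraction of one target pass (`draft_cost_ratio`, a hypothetical parameter), so a round costs 1 + K × that fraction.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """E[tokens] = (1 - alpha^(k+1)) / (1 - alpha), assuming each draft
    token is accepted independently with probability alpha."""
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per round divided by the round's cost, measured in units of
    one target forward pass. draft_cost_ratio is an assumed relative cost."""
    round_cost = 1 + k * draft_cost_ratio
    return expected_tokens_per_round(alpha, k) / round_cost


print(round(expected_tokens_per_round(0.7, 5), 2))   # about 2.94
print(round(estimated_speedup(0.7, 5, 0.05), 2))     # about 2.35
```

Under this model, raising K only helps while the extra accepted tokens outweigh the extra drafting cost, which is why the acceptance rate matters so much.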

Production numbers:

  • NVIDIA H200 GPUs: 3.6x throughput improvement
  • vLLM with speculative decoding: 2-3x latency reduction, 19% cost savings
  • General production deployments: 2-3x is the realistic range

Cursor's Twist — Speculative Edits

Here's where it gets wild. Cursor took speculative decoding and cranked it to 13x.

Standard speculative decoding uses a small draft model to propose tokens. Cursor realized something — when you're editing code, you already have the original file. And most of the output will be the same as the original file with small changes.

So instead of using a draft model, they use the original file itself as the "speculation." The existing code is the draft. The fine-tuned Llama-3-70B model just needs to verify which parts stay the same and which parts change.

Their API call includes a prediction field containing the original file content. The server finds the longest prefix of the prediction that matches what the model would generate with temperature=0. For unchanged code regions — which is most of the file — every token is a match. The model only needs to actually "think" about the changed parts.
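To make the idea concrete, here's a toy sketch of using the original file as the draft. This is not Cursor's implementation: the "model" is a hypothetical stand-in that deterministically knows the edited file, and the edit keeps the token count unchanged so positions in the original and the output line up. A real system would have to realign the draft after insertions or deletions.

```python
from typing import Callable, List, Tuple


def speculative_edit(target_next: Callable[[List[str]], str],
                     original: List[str],
                     chunk: int = 8) -> Tuple[List[str], int]:
    """Rewrite `original` greedily, using its own tokens as the draft.
    Returns (rewritten tokens, number of verification passes)."""
    out: List[str] = []
    target_passes = 0
    while len(out) < len(original):
        # The "draft" is simply the next slice of the original file.
        draft = original[len(out):len(out) + chunk]
        # One verification pass per chunk (the toy loops over positions;
        # a real model would score the whole chunk in one forward pass).
        target_passes += 1
        verified = [target_next(out + draft[:i]) for i in range(len(draft))]
        # Longest prefix of the draft that the model agrees with.
        n_ok = 0
        while n_ok < len(draft) and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        if n_ok < len(draft):
            out.append(verified[n_ok])  # the model's replacement token
    return out, target_passes


original = "def add ( a , b ) : return a + b".split()
edited = "def sum_two ( a , b ) : return a + b".split()


def toy_model(prefix: List[str]) -> str:
    # Hypothetical temperature=0 model that always produces `edited`.
    return edited[len(prefix)]


result, passes = speculative_edit(toy_model, original)
print(result == edited, passes, len(edited))
```

Only the renamed function costs an extra round trip. Everything else is accepted in bulk, which is the intuition behind the big speedup on mostly unchanged files.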

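Result: ~1,000 tokens per second on a 70B model. That's ~3,500 characters per second. At roughly 10 tokens per line, a 400-line file is about 4,000 tokens, so it gets rewritten in a few seconds instead of the minute or more that plain decoding would take.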

They fine-tuned Llama-3-70B specifically for this "fast apply" task using synthetic data. Their custom model nearly matched Claude Opus and outperformed GPT-4 Turbo in code editing benchmarks — and runs at 13x the speed.

Why Not Just Use a Small Model?

Fair question. If the draft model is fast, why not just use it directly?

Because it's dumb. A 1B model will write plausible-looking code that has subtle bugs, wrong function names, and incorrect logic. The target model catches these. The draft model provides speed, the target model provides quality. You get both.

The acceptance rate depends on how similar the draft and target models are. If they agree on most tokens (high α), speculative decoding is incredibly efficient. If they disagree a lot (low α), you're doing extra work for little gain.

That's why the model pairing matters:

  • Same model family, different sizes (Llama-3-8B drafting for Llama-3-70B): High acceptance rate, great speedup
  • Different architectures (random small model for GPT-4): Lower acceptance, less benefit
  • Original file as draft (Cursor's approach): Very high acceptance for code edits specifically, hence the 13x
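One way to put a number on "how similar" two models are at a given position: under the min(1, P_target / P_draft) rule, the chance that a drafted token is accepted works out to the sum over tokens of min(P_target, P_draft). The sketch below computes that for made-up distributions; the token names and probabilities are invented for illustration.

```python
from typing import Dict


def acceptance_rate(target: Dict[str, float], draft: Dict[str, float]) -> float:
    """Probability that a token sampled from `draft` passes the
    min(1, p/q) test: sum over the vocabulary of min(p, q)."""
    return sum(min(p, draft.get(tok, 0.0)) for tok, p in target.items())


# Made-up next-token distributions at one position.
target = {"foo": 0.6, "bar": 0.3, "baz": 0.1}
similar_draft = {"foo": 0.5, "bar": 0.35, "baz": 0.15}
mismatched_draft = {"foo": 0.1, "bar": 0.2, "baz": 0.7}

print(round(acceptance_rate(target, similar_draft), 2))     # about 0.9
print(round(acceptance_rate(target, mismatched_draft), 2))  # about 0.4
```

A draft that spreads its probability the same way the target does gets nearly everything accepted. A draft that bets on different tokens wastes most of its guesses.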

Where You'll See It

Speculative decoding is everywhere now:

  • vLLM and TensorRT-LLM: Native support built in
  • LM Studio: Toggle it on for local models
  • Fireworks AI: Speculative decoding API for production deployments
  • Apple MLX: Supports it for on-device inference
  • Google's Gemini: Uses it internally (they co-authored the original paper)

NVIDIA released SPEED-Bench in February 2026 — the first standardized benchmark specifically for speculative decoding, testing across diverse prompts, sequence lengths, batch sizes, and concurrency levels. This tells you it's not experimental anymore. It's infrastructure.

The Mental Model

Think of it like this. You're proofreading a document. Reading it word-by-word (autoregressive generation) is slow. But if someone hands you a rough draft that's 70% correct, you can scan through it much faster — your eyes jump over the correct parts and only stop on the mistakes. Same document, same final quality, fraction of the time.

That's speculative decoding. The draft model writes fast. The target model reads fast. Nobody generates slow.

One-line summary: Speculative decoding lets a small model write the first draft and a big model verify it in parallel — same output quality, 2-13x faster, because checking is cheaper than creating.

peace. see you in the next one.
