KV Cache — Why LLMs Don't Reread the Entire Conversation Every Time
A clear, visual explanation of the KV Cache — the optimization that makes autoregressive text generation fast by storing Key and Value vectors instead of recomputing them for every new token.
Here's something that bothered me when I first learned about Transformers.
We know attention is computed like this:
Attention(Q, K, V) = softmax(QKᵀ / √dk) · V
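In NumPy, that formula is only a few lines. This is a toy sketch for intuition (illustrative names, no batching or masking, not any library's implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# toy example: 3 tokens, 4-dimensional heads
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4) — one output vector per query
```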
Every token attends to every other token. Clean, parallel, beautiful. But then I thought about generation — how GPT actually produces text — and something felt off.
During generation, the model outputs one token at a time. You give it a prompt, it produces token 1. Then it looks at the prompt + token 1 and produces token 2. Then prompt + token 1 + token 2 → token 3. And so on.
So for every new token, is it recomputing attention over the entire sequence from scratch?
Yes. Unless you use a KV Cache.
The Naive Problem
Say your prompt is 200 tokens long and the model has generated 100 tokens so far. To generate token 301, the naive approach runs full attention over all 300 tokens — recomputing Keys and Values for all 300 of them again.
Token 302? Full attention over 301 tokens. Recompute everything again.
Each decoding step redoes O(n) work, so generating n tokens costs O(n²) in total. It gets brutal fast.
Token 1: compute attention over 1 token
Token 2: compute attention over 2 tokens
Token 3: compute attention over 3 tokens
...
Token 500: compute attention over 500 tokens ← recomputing 499 tokens' work
There's a massive amount of repeated, unnecessary computation here. Keys and Values for token 1 don't change between step 2 and step 500. Why recompute them?
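To put a number on the waste, here's a quick back-of-the-envelope count of how many token positions get processed over 500 steps (illustrative arithmetic, not a benchmark):

```python
n = 500
# naive: step t reprocesses all t tokens, so total work is 1 + 2 + ... + n
naive_tokens_processed = sum(range(1, n + 1))
# cached: each step only processes the one new token
cached_tokens_processed = n
print(naive_tokens_processed, cached_tokens_processed)  # 125250 vs 500
```

That's a 250x difference at just 500 tokens, and the gap widens quadratically as the sequence grows.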
The Key Insight
Inside self-attention, every token produces three vectors: Query, Key, Value.
- The Query of a token changes depending on what it's asking — it's specific to the current decoding step.
- The Key and Value of a token depend only on that token's embedding and the weight matrices Wk, Wv. Once computed, they never change.
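That invariance is easy to see in code: K and V are pure functions of the token's embedding and the frozen weight matrices, so recomputing them at a later step gives the identical result (all names and values here are hypothetical toy data):

```python
import numpy as np

d_model, d_head = 8, 4
rng = np.random.default_rng(1)
Wk = rng.normal(size=(d_model, d_head))  # learned weights, frozen at inference
Wv = rng.normal(size=(d_model, d_head))

x1 = rng.normal(size=(d_model,))         # embedding of token 1

# K for token 1 computed at step 2 vs. step 500: same inputs, same output
k1_at_step_2 = x1 @ Wk
k1_at_step_500 = x1 @ Wk
assert np.allclose(k1_at_step_2, k1_at_step_500)
```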
So here's the idea: compute K and V once per token, then cache them.
When token 301 is being generated:
- Retrieve cached K and V for tokens 1–300 from memory
- Compute Q, K, V only for the new token 301
- Append new K and V to the cache
- Run attention: new Q against all 301 Ks (300 cached + the new one) → weights → weighted sum of all 301 Vs
That's it. One token's worth of new computation per step, not n tokens' worth.
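The steps above can be sketched in a few lines of NumPy. This is a single-head toy model (hypothetical names, no real tokenizer or model) that shows the core move: compute Q/K/V for the new token only, append K and V to the cache, and attend over the whole history:

```python
import numpy as np

def decode_step(x_new, cache, Wq, Wk, Wv):
    """One generation step with a KV cache."""
    q = x_new @ Wq                        # query for the new token only
    cache["K"].append(x_new @ Wk)         # one new key...
    cache["V"].append(x_new @ Wv)         # ...and one new value
    K = np.stack(cache["K"])              # (t, d_head): all keys so far
    V = np.stack(cache["V"])              # (t, d_head): all values so far
    scores = K @ q / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over the full history
    return w @ V                          # attention output for the new token

rng = np.random.default_rng(2)
d_model, d_head = 8, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):                        # "generate" 5 tokens
    out = decode_step(rng.normal(size=(d_model,)), cache, Wq, Wk, Wv)
print(len(cache["K"]))  # 5 — one cached K (and V) per generated token
```

Note that each call does a fixed amount of new projection work; only the attention itself grows with the cache length.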
Visualizing It
WITHOUT KV Cache (naive):
Step 1: [T1] → compute Q,K,V for T1
Step 2: [T1, T2] → recompute Q,K,V for T1 (wasted) + compute for T2
Step 3: [T1, T2, T3] → recompute Q,K,V for T1, T2 (wasted) + compute for T3
...very slow as sequence grows...
WITH KV Cache:
Step 1: compute Q,K,V for T1 → store K1,V1 in cache
cache = [K1,V1]
Step 2: compute Q,K,V for T2 → store K2,V2 in cache
attention: Q2 × [K1,K2] → weights → [V1,V2]
cache = [K1,V1 | K2,V2]
Step 3: compute Q,K,V for T3 → store K3,V3 in cache
attention: Q3 × [K1,K2,K3] → weights → [V1,V2,V3]
cache = [K1,V1 | K2,V2 | K3,V3]
...only 1 token of new compute per step, regardless of sequence length...
How Much Memory Does It Use?
Here's the honest tradeoff: you're trading compute for memory.
For a single sequence, the KV cache stores 2 vectors (K and V) per token, per layer, per attention head. For a model like LLaMA-2 7B:
KV cache size ≈ 2 × num_layers × num_heads × head_dim × sequence_length × bytes_per_value
For LLaMA-2 7B with a 4096-token context in float16:
≈ 2 × 32 layers × 32 heads × 128 head_dim × 4096 tokens × 2 bytes
≈ 2 GB just for the KV cache
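The formula above is simple enough to check directly (the LLaMA-2 7B shape parameters are the commonly cited ones; treat the helper name as illustrative):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    # leading 2 is for the two cached tensors per token: K and V
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_value

# LLaMA-2 7B: 32 layers, 32 heads, head_dim 128, 4096-token context, float16
size = kv_cache_bytes(32, 32, 128, 4096)
print(size / 2**30)  # → 2.0 (GiB)
```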
For longer contexts or larger models, this explodes. A 100K token context on a large model can need tens of GBs just for the cache. This is why memory, not compute, is the primary bottleneck during LLM inference.
And this is exactly why the next problem matters so much: how do you store all this cache efficiently? That's what paged attention solves. More on that in the next blog.
When Does KV Cache Not Help?
KV cache only helps during autoregressive generation (decoding one token at a time).
During the prefill phase — processing your initial prompt — the model still runs full attention over all prompt tokens in parallel. That's actually fast because it's a single big matrix operation on GPU.
The cache kicks in during the decode phase, where tokens are generated one by one. That's where naive recomputation would kill performance.
Also worth noting: if you're running batch inference (multiple prompts simultaneously), each prompt in the batch has its own KV cache. Memory usage multiplies by batch size. This is one reason why serving LLMs at scale is hard — you're constantly juggling compute vs. memory vs. batch size tradeoffs.
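To see how batch size compounds the memory pressure, here's a back-of-the-envelope sketch. The ~13 GiB weight figure is a rough fp16 estimate for a 7B-parameter model, not an official number, and the per-sequence cache size is the earlier 4096-token estimate:

```python
# hypothetical serving scenario: model weights are shared, KV caches are not
weights_gib = 7e9 * 2 / 2**30    # ~13 GiB of fp16 parameters, shared by all requests
cache_per_seq_gib = 2.0          # per-sequence KV cache at a 4096-token context

totals = {b: weights_gib + b * cache_per_seq_gib for b in (1, 8, 32)}
for batch, total in totals.items():
    print(f"batch {batch:2d}: {total:.1f} GiB")
```

The weights are a fixed cost, but the cache term scales linearly with batch size and quickly dominates — which is exactly the compute-vs-memory-vs-batch-size juggling act described above.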
The Mental Model That Stuck With Me
Think of it like reading a book to answer questions.
Without KV cache: Every time someone asks you a question about the book, you reread the entire book from page 1 before answering.
With KV cache: You read the book once, take detailed notes, and answer every subsequent question from your notes. You only read new pages when new content arrives.
Your notes are the KV cache. The "reading" is the Key and Value computation. You only do it once per token, store it, and refer back to it forever.
Key Takeaways
- During autoregressive generation, naive attention recomputes Keys and Values for the entire sequence on every step — O(n²) over the whole generation and extremely wasteful
- Keys and Values for a token are fixed once computed — only the Query changes per step
- KV cache stores K and V vectors in memory after first computation and reuses them, reducing per-step compute to just one new token
- The tradeoff is memory — cache size grows linearly with sequence length, layers, and heads
- For long contexts and large models, KV cache memory can reach tens of GBs per sequence
- Prefill (processing the prompt) still runs full attention; KV cache only accelerates the decode phase
What's Next
KV cache solves the compute problem. But storing it naively — as one big growing block of memory per sequence — creates a new mess: fragmentation, inefficient GPU memory use, and difficulty batching multiple sequences together.
The fix is PagedAttention — an OS-inspired trick that organizes the KV cache into fixed-size pages, the same way virtual memory works in operating systems. That's the next blog.
Part of an ongoing series on how Transformers actually work under the hood.