2026-04-14

Paged Self-Attention — The OS Trick That Makes LLM Memory Management Fast

How vLLM's paged attention borrows virtual memory concepts from operating systems to solve the KV cache memory fragmentation problem — making LLM inference faster and more memory-efficient at scale.

transformers, paged-attention, vllm, inference, llm, memory, optimization


In the last blog, we established that KV cache solves the compute problem in autoregressive generation. Instead of recomputing Keys and Values for every token on every step, we store them and reuse them.

But then I hit the next question: how do you actually store all that cache?

Turns out, naive storage creates its own mess. And the fix — paged attention — is one of those ideas that's obvious in hindsight but genuinely clever when you see it for the first time.


The New Problem: Memory Fragmentation

When a user sends a prompt and starts generating, the server allocates a block of GPU memory for that sequence's KV cache. The problem is you don't know in advance how long the sequence will be.

Maybe the user generates 50 tokens. Maybe 2000. You have to allocate memory upfront or keep reallocating as the sequence grows.

The naive approach — one big contiguous memory block per sequence — causes three real problems:

1. Internal fragmentation. You pre-allocate for the maximum possible length (say, 4096 tokens), but most sequences end at 200 tokens. The rest of that block sits empty and wasted.

2. External fragmentation. After many sequences finish and free their memory, GPU memory looks like Swiss cheese — lots of small free gaps between used blocks. A new long sequence might need 2GB of contiguous memory, but even though 3GB is technically free, none of it is contiguous. Allocation fails.

3. No memory sharing across sequences. If two users send the same system prompt, the KV cache for those shared tokens gets duplicated in memory for each user. Complete waste.

GPU Memory (naive approach):

[ Seq A: 4096 slots allocated, 300 used  |  WASTED  ]
[ Seq B: 4096 slots allocated, 800 used  |  WASTED  ]
[ FREE GAP ]
[ Seq C: 4096 slots allocated, 150 used  |  WASTED  ]
[ FREE GAP ]
[ Seq D: needs 2000 slots — can't fit! Even though total free > 2000 ]

This is the exact same problem operating systems solved decades ago with virtual memory and paging.
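To put rough numbers on the internal-fragmentation problem, here's a quick back-of-the-envelope calculation (the 4096-slot pre-allocation and the per-sequence token counts are the illustrative values from the diagram above):

```python
# Hypothetical numbers from the diagram: a server that pre-allocates
# a fixed 4096-token KV slot range for every sequence.
MAX_SLOTS = 4096

def wasted_fraction(used_tokens: list[int], max_slots: int = MAX_SLOTS) -> float:
    """Fraction of pre-allocated KV slots that are never used."""
    allocated = max_slots * len(used_tokens)
    used = sum(used_tokens)
    return 1 - used / allocated

# Sequences A, B, C from the diagram: 300, 800, and 150 tokens used.
print(round(wasted_fraction([300, 800, 150]), 3))  # → 0.898
```

Nearly 90% of the reserved KV memory sits idle in this (admittedly cherry-picked) scenario — that's the internal fragmentation cost in one number.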


The Fix: Borrow the OS Idea

If you've studied OS internals, this will feel familiar. If not — here's the gist.

Operating systems don't give processes one big contiguous block of RAM. Instead, they divide memory into fixed-size pages (usually 4KB). A process's memory is spread across many non-contiguous pages. A page table tracks which virtual address maps to which physical page. The process sees a clean, contiguous memory space. The OS handles the messy physical reality underneath.
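A minimal sketch of that OS mechanism, to make the analogy concrete (page size is the common x86 value; the page-table contents are made-up illustrative numbers):

```python
# Toy OS-style address translation: virtual pages map to scattered
# physical frames, but the process sees one contiguous address space.
PAGE_SIZE = 4096  # bytes, the usual x86 page size

# Page table: virtual page number -> physical frame number.
page_table = {0: 7, 1: 2, 2: 9}  # non-contiguous physical frames

def translate(virtual_addr: int) -> int:
    """Map a virtual address to a physical address via the page table."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    frame = page_table[vpn]           # the messy physical reality
    return frame * PAGE_SIZE + offset

# Virtual addresses 0..12287 look contiguous to the process,
# but actually land in physical frames 7, 2, and 9.
print(translate(5000))  # vpn=1, offset=904 -> 2*4096 + 904 = 9096
```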

Paged attention applies this exact idea to KV cache.


How Paged Attention Works

Instead of one contiguous block per sequence, the KV cache is divided into fixed-size blocks (pages). Each block holds K and V vectors for a fixed number of tokens — say, 16 tokens per block.

KV Cache (paged):

Physical memory:
[ Block 0: K,V for tokens 1-16   ]
[ Block 1: K,V for tokens 17-32  ]
[ Block 2: K,V for tokens 1-16   ]  ← used by a different sequence
[ Block 3: K,V for tokens 33-48  ]
[ Block 4: FREE                  ]
[ Block 5: K,V for tokens 1-16   ]  ← shared system prompt
...

Block table (logical → physical mapping):
Seq A: [block 0] → [block 1] → [block 3]
Seq B: [block 2] → [block 4 when needed]
Seq C: [block 5] → ...        ← shares block 5 with Seq D (same system prompt)
Seq D: [block 5] → ...        ← physical block 5 shared, no duplication

Each sequence has a block table — a mapping from logical token positions to physical memory blocks. This is the page table equivalent.

When a new token is generated and the current block is full, the system simply allocates a new free block and appends it to the sequence's block table. No reallocation. No copying. Just point to a new page.
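Here's a toy sketch of that allocation logic — not vLLM's actual implementation, just the core idea of a block table that grows one page at a time (the `BlockManager` class and its names are hypothetical):

```python
# Toy block allocator in the spirit of paged attention.
BLOCK_SIZE = 16  # tokens per block, as in the example above

class BlockManager:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))     # pool of free physical blocks
        self.tables: dict[str, list[int]] = {}  # sequence id -> block table

    def append_token(self, seq: str, n_tokens_so_far: int) -> list[int]:
        """Called once per generated token; grabs a new block only when needed."""
        table = self.tables.setdefault(seq, [])
        if n_tokens_so_far % BLOCK_SIZE == 0:   # current block is full
            table.append(self.free.pop(0))      # no copying — just point to a new page
        return table

mgr = BlockManager(num_blocks=8)
for t in range(40):                             # generate 40 tokens
    mgr.append_token("seq_a", t)
print(mgr.tables["seq_a"])  # → [0, 1, 2]: ceil(40/16) = 3 blocks
```

Note what's absent: there is no `realloc`-style copy anywhere. Growth is always a constant-time append to the block table.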


What Changes in the Attention Computation?

Mathematically — nothing. The attention formula is identical:

Attention(Q, K, V) = softmax(QKᵀ / √dk) · V

The difference is purely in how K and V are read during computation. Instead of reading from one contiguous array, the attention kernel reads block by block, following the block table to find where each chunk lives in physical memory.

The GPU kernel handles this lookup transparently. From the model's perspective, it's still attending over all previous tokens. Under the hood, it's hopping across non-contiguous memory pages.
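A rough sketch of that gather step, using NumPy arrays as stand-ins for the physical KV pool (real kernels fuse this lookup into the attention computation rather than materializing a copy; all shapes here are toy values):

```python
import numpy as np

BLOCK_SIZE, D = 4, 8                           # tokens per block, head dim
kv_pool = np.random.randn(16, BLOCK_SIZE, D)   # physical blocks of K vectors
block_table = [0, 5, 2]                        # logical order -> physical blocks
seq_len = 10                                   # tokens actually in the sequence

# Follow the block table across non-contiguous blocks,
# then trim the partially filled last block.
K = np.concatenate([kv_pool[b] for b in block_table])[:seq_len]
print(K.shape)  # → (10, 8): the same contiguous view the attention math expects
```

The softmax(QKᵀ/√dk)·V computation downstream never knows the difference — it just sees K of shape (seq_len, d).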


The Real Wins

Near-zero memory waste. Blocks are only allocated when needed. A sequence using 47 tokens wastes at most one block (the partially filled last one) instead of thousands of empty slots.

No external fragmentation. Any free block can serve any sequence. There's no requirement for contiguous allocation. That Swiss-cheese GPU memory? Fully usable now.

Memory sharing (prefix caching). If multiple sequences share a common prefix — like the same system prompt — their KV cache blocks for that prefix can point to the same physical blocks. No duplication. This is called prefix caching and it's a huge win for production serving where thousands of users share the same system prompt.

Without paged attention:
User 1: [system prompt KV] [user message KV]  ← full copy
User 2: [system prompt KV] [user message KV]  ← full copy (duplicate!)
User 3: [system prompt KV] [user message KV]  ← full copy (duplicate!)

With paged attention + prefix caching:
User 1: → [shared block] → [user 1 block]
User 2: → [shared block] → [user 2 block]   ← same physical block
User 3: → [shared block] → [user 3 block]   ← same physical block
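A toy sketch of the bookkeeping behind this — reference-counted blocks keyed by the shared prefix (illustrative only; vLLM's real data structures are more involved):

```python
# Hypothetical prefix-cache bookkeeping: one physical block per unique
# prefix, with a reference count so shared blocks are freed safely.
from collections import defaultdict

refcount: dict[int, int] = defaultdict(int)
prefix_index: dict[str, int] = {}   # prefix text -> physical block id
next_block = 0

def block_for(system_prompt: str) -> int:
    """Return the physical block for a prefix, allocating on first use only."""
    global next_block
    if system_prompt not in prefix_index:
        prefix_index[system_prompt] = next_block
        next_block += 1
    block = prefix_index[system_prompt]
    refcount[block] += 1            # block is freed only when this hits zero
    return block

users = [block_for("You are a helpful assistant.") for _ in range(3)]
print(users, refcount[users[0]])  # → [0, 0, 0] 3: one physical block, three users
```

The reference count matters: a sequence finishing can't blindly free its blocks, because another live sequence may still be pointing at the shared prefix.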

Better batching. Serving systems like vLLM can pack more sequences into the same GPU memory because they're not over-allocating per sequence. More sequences in a batch = better GPU utilization = higher throughput.


vLLM: Where This Was Introduced

Paged attention was introduced in the vLLM paper (Kwon et al., 2023) — one of the most practically impactful LLM systems papers in recent years.

Before vLLM, systems like Hugging Face Transformers' generate() used contiguous KV cache allocation. vLLM's paged attention achieved 2–4× higher throughput on the same hardware just from better memory management — no changes to the model itself.

Today, paged attention (or variants of it) is used in virtually every serious LLM serving framework: vLLM, TGI (Text Generation Inference), TensorRT-LLM, and others.


Putting It Together: KV Cache + Paged Attention

These two optimizations solve two different layers of the same problem:

PROBLEM: Attention over growing sequences is expensive to compute AND store

KV Cache:
  → Don't recompute K and V for tokens you've already processed
  → Reduces per-step K and V computation from O(n) to O(1) (each step still reads all n cached entries, but recomputes none of them)
  → Tradeoff: memory grows linearly with sequence length

Paged Attention:
  → Don't store the growing KV cache in one fragmented, wasteful block
  → Organize into fixed pages with a block table
  → Enables sharing, eliminates waste, makes batching efficient

Together they're what makes LLMs fast enough to serve millions of requests in production. Neither is particularly complicated conceptually — both are clever applications of ideas that already existed. KV cache is just memoization. Paged attention is just virtual memory.

That's what I find most interesting about this. The big wins in ML systems often come not from new math, but from recognizing that a problem you're staring at is actually an old problem in disguise.


Key Takeaways

  • Naive contiguous KV cache allocation causes internal fragmentation and external fragmentation, and prevents memory sharing — all of which waste GPU memory
  • Paged attention divides the KV cache into fixed-size blocks (pages) managed via a block table — exactly like virtual memory in operating systems
  • Attention math doesn't change — only how K and V are physically read in memory
  • Key wins: near-zero waste, no fragmentation, prefix caching (shared physical blocks for common prefixes), better batching
  • vLLM introduced this and achieved 2–4× throughput improvements with no model changes
  • Every serious LLM serving framework today uses paged attention or a variant

Part of an ongoing series on how Transformers actually work under the hood. Previous: KV Cache
