Context Windows Are a Lie — Why 1 Million Tokens Doesn't Mean What You Think
Context windows keep getting bigger — 200K, 1M, 1.5M tokens. But models forget stuff in the middle, attention scales quadratically, and bigger often means worse. Here's what actually happens inside.
GPT-5.6 is rumored to have a 1.5 million token context window. Gemini already has 1 million. Claude sits at 200K. Every model release, the number goes up and everyone loses their mind.
But here's what nobody tells you: a bigger context window doesn't mean the model actually uses all of it. In fact, research shows models straight up forget stuff in the middle — and the bigger the window, the worse this problem gets.
I kept seeing people dump entire codebases into Claude and wonder why it missed obvious things. So I dug into what context windows actually are, why they're expensive, and why bigger is often worse.
What a Context Window Actually Is
A context window is everything the model can see in a single request. Every token of input — your question, the system prompt, conversation history, pasted files, tool results — all of it has to fit inside this window.
Think of it like RAM, not storage. It's volatile. When the conversation ends, it's gone. But unlike RAM, the model doesn't have equal access to everything in it.
Quick math on tokens: one token is roughly 4 characters in English, or about 0.75 words. So:
- 128K tokens ≈ 96,000 words ≈ a 300-page book
- 200K tokens ≈ 150,000 words ≈ the entire Harry Potter and the Order of the Phoenix
- 1M tokens ≈ 750,000 words ≈ the Lord of the Rings trilogy plus the Silmarillion
- 1.5M tokens ≈ 1.1 million words ≈ the entire Game of Thrones series
Sounds amazing. But "fitting" isn't "understanding."
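If you want to sanity-check those numbers yourself, here's a minimal sketch of the arithmetic. The constants are the rules of thumb from above and only hold loosely for English prose. The function names are mine, and a real tokenizer will give different counts.

```python
# Rule-of-thumb conversion factors from the text (English prose only).
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Approximate English word count for a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count; real tokenizers will differ."""
    return round(len(text) / CHARS_PER_TOKEN)


for window in (128_000, 200_000, 1_000_000, 1_500_000):
    print(f"{window:>9,} tokens ~ {tokens_to_words(window):>9,} words")
```

Treat the result as a budget estimate, not a measurement. Code, non-English text, and unusual formatting can tokenize very differently.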
The Dirty Secret: Lost in the Middle
In 2023, researchers at Stanford published a paper called "Lost in the Middle" that changed how we think about long-context models. The finding was brutal:
Models reliably use information at the beginning and end of the context window — but accuracy drops 20-40% for information buried in the middle.
Imagine reading a 300-page book where you remember chapter 1 and chapter 30, but chapters 10-20 are a blur. That's what your model does with a full context window.
This isn't a bug. It's a consequence of how attention behaves in practice. Self-attention in Transformers ends up biased toward positions at the edges: tokens at the start get disproportionate attention (primacy effect), tokens at the end get a recency boost, and everything in the middle competes for what's left.
A production team at Delty (YC X25) confirmed this in 2026 — running conversations spanning millions of tokens, they found LLMs "notoriously ignore the details that live in the middle of the context." Their fix? Don't rely on raw context. Use retrieval tools to grab specific snippets and inject them at the end of the conversation where the model actually pays attention.
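Here's roughly what that "inject it at the end" pattern could look like. This is a sketch under my own assumptions, not Delty's actual code: `build_prompt` and its parameters are made-up names, and the snippets are assumed to come from whatever retrieval tool you already have.

```python
def build_prompt(system: str, history: list[str], snippets: list[str], question: str) -> str:
    """Assemble a prompt with instructions first and retrieved snippets plus the question last."""
    parts = [system]          # start of the context: instructions
    parts.extend(history)     # middle: older conversation turns
    if snippets:
        # Retrieved material goes near the end, right before the question.
        parts.append("Relevant excerpts:")
        parts.extend(f"- {snippet}" for snippet in snippets)
    parts.append(question)    # very end of the context
    return "\n\n".join(parts)


prompt = build_prompt(
    system="You are a code reviewer.",
    history=["(earlier turns of the conversation)"],
    snippets=["def retry(times=3): ..."],
    question="How many times does retry() run by default?",
)
print(prompt)
```

The only design choice here is ordering: whatever the model must not miss sits at the start or right before the question, never in the long middle stretch.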
Why Bigger Windows Cost Quadratically More
Here's the part nobody wants to hear.
Self-attention — the core mechanism in every Transformer — compares every token to every other token. If your context has n tokens, the attention computation is O(n²).
What that means in plain numbers:
| Context Length | Attention Operations | Relative Cost |
|---|---|---|
| 4K tokens | 16 million | 1x |
| 32K tokens | 1 billion | 64x |
| 128K tokens | 16 billion | 1,024x |
| 1M tokens | 1 trillion | 62,500x |
Going from 4K to 1M tokens doesn't make things 250x more expensive. It makes them 62,500x more expensive in raw attention compute.
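You can reproduce the table with a few lines of Python. This is a sketch that counts raw token pairs only (treating 4K as 4,000 tokens, like the table does) and ignores heads, layers, and every optimization real models use.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token comparisons in naive full self-attention."""
    return n_tokens * n_tokens


BASELINE = 4_000  # the table's 4K row

for n in (4_000, 32_000, 128_000, 1_000_000):
    pairs = attention_pairs(n)
    relative = pairs // attention_pairs(BASELINE)
    print(f"{n:>9,} tokens  {pairs:>20,} pairs  {relative:>6,}x")
```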
In practice, models use tricks to reduce this — KV caching, sparse attention, sliding windows — but the fundamental scaling pressure remains. Every token you add makes the entire computation more expensive, not just proportionally, but quadratically.
This is why API pricing scales with input tokens. More context = more cost per request. It's also why inference latency goes up — the model has to attend to more stuff before generating each output token.
The Techniques That Make Long Context Possible
Models rarely pay the full naive O(n²) cost anymore. Here's how they cheat, or sidestep the problem entirely:
Sliding Window Attention (SWA): Each token only attends to a fixed local window instead of the entire sequence. With a window size of 512 and a 10,000-token sequence, that's about 5 million operations instead of 100 million, a nearly 20x reduction. Mistral popularized this.
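To make the sliding-window idea concrete, here's a toy sketch. I'm assuming the causal flavor, where each token sees itself and the tokens just before it. The function names are illustrative, and real implementations work on tensors, not nested lists.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Causal variant: token i sees itself and the (window - 1) tokens before it.
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def approx_ops(seq_len: int, window: int) -> int:
    """Upper bound on attended pairs: every token looks at no more than `window` positions."""
    return seq_len * min(window, seq_len)


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

full = 10_000 * 10_000
windowed = approx_ops(10_000, 512)
print(f"full: {full:,}  windowed: {windowed:,}  ratio: {full / windowed:.1f}x")
```

The cost now grows linearly with sequence length for a fixed window. The trade-off is that distant tokens can only influence each other indirectly, through stacked layers.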
RoPE (Rotary Position Embedding): Instead of fixed position encodings, RoPE encodes positions using rotation matrices. The clever part — it can be rescaled after training to extend the context window without full retraining. LongRoPE2 extended LLaMA3-8B from its training length to 128K tokens with minimal quality loss.
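Here's a minimal pure-Python sketch of the rotation idea, with the assumptions spelled out: `rope`, `base`, and `scale` are my own names, and the `scale` knob is plain linear position interpolation (the simplest rescaling trick), not the LongRoPE2 method.

```python
import math


def rope(vec: list[float], position: float, base: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by angles that depend on the position.

    scale > 1 squeezes positions closer together (linear position interpolation).
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    pos = position / scale
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower-index pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (3), very different absolute positions.
print(dot(rope(q, 5), rope(k, 2)))
print(dot(rope(q, 103), rope(k, 100)))
```

The two printed scores should agree up to floating-point error, because the rotated dot product depends only on the distance between the two positions. That relative behavior is what makes rescaling positions after training plausible at all.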
KV Cache: Instead of recomputing the keys and values for the entire conversation every time the model generates a token, you cache the key-value pairs from previous tokens and only compute them for the new one. I wrote a whole blog post on this. It's how models avoid re-reading the entire conversation for every single output token.
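A toy single-head decoder shows the saving. Everything here is a simplification I'm assuming for illustration: `to_key_value` stands in for the learned projections, the token vector doubles as the query, and there's only one layer.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_key_value(x: list[float]) -> tuple[list[float], list[float]]:
    """Toy stand-in for the learned key/value projections."""
    return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]


def attend(query, keys, values):
    """Attention output for one query over all keys/values seen so far."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]


def decode_without_cache(tokens):
    outputs, projections = [], 0
    for step in range(len(tokens)):
        # Recompute keys and values for the whole prefix at every step.
        pairs = [to_key_value(x) for x in tokens[: step + 1]]
        projections += len(pairs)
        keys = [key for key, _ in pairs]
        values = [value for _, value in pairs]
        outputs.append(attend(tokens[step], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        key, value = to_key_value(x)  # only the new token is projected
        projections += 1
        keys.append(key)
        values.append(value)
        outputs.append(attend(x, keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow, slow_calls = decode_without_cache(tokens)
fast, fast_calls = decode_with_cache(tokens)
print(slow_calls, fast_calls)
```

Both paths produce the same outputs, but the uncached one does n(n+1)/2 projections while the cached one does n. Note what the cache doesn't fix: each new token still attends over every cached entry, and the cache itself grows with context length.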
Sparse Attention: Only compute attention for a subset of token pairs — local neighbors, global anchor tokens, and randomly sampled positions. Longformer and BigBird pioneered this.
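Here's a small sketch of how such a pattern can be built. It's in the spirit of Longformer and BigBird, not either paper's exact layout, and `sparse_mask` with its parameters is a name I made up for the example.

```python
import random


def sparse_mask(seq_len: int, window: int, global_tokens: set[int],
                n_random: int, seed: int = 0) -> list[set[int]]:
    """For each token, return the set of positions it is allowed to attend to."""
    rng = random.Random(seed)
    allowed = []
    for i in range(seq_len):
        # Local neighbors on both sides, plus the token itself.
        cols = set(range(max(0, i - window), min(seq_len, i + window + 1)))
        # Every token can see the global anchor tokens.
        cols.update(global_tokens)
        # Global anchor tokens can see everything.
        if i in global_tokens:
            cols.update(range(seq_len))
        # A few randomly sampled positions.
        cols.update(rng.sample(range(seq_len), n_random))
        allowed.append(cols)
    return allowed


mask = sparse_mask(seq_len=64, window=2, global_tokens={0}, n_random=2)
sparse_pairs = sum(len(cols) for cols in mask)
print(f"sparse: {sparse_pairs} pairs, full: {64 * 64} pairs")
```

Each ordinary token attends to a roughly constant number of positions, so total work grows about linearly with sequence length, while the global and random links keep distant tokens a few hops apart.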
RAG (Retrieval-Augmented Generation): Don't put everything in the context. Store documents in a vector database, retrieve only the relevant chunks, and inject them into a small context window. Often outperforms dumping everything into a million-token window.
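The retrieval step can be sketched in a few lines. To keep it self-contained I'm cheating heavily: a word-count vector stands in for a real embedding model and a sorted list stands in for a vector database, so treat `embed`, `cosine`, and `retrieve` as illustrative names, not a real library's API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    shared = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return shared / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)[:k]


chunks = [
    "The billing service retries failed payments three times.",
    "Our office dog is named Biscuit.",
    "Deployment uses blue-green releases on Kubernetes.",
]
print(retrieve("How many times are failed payments retried?", chunks, k=1))
```

Only the top-ranked chunks go into the prompt, so the context stays small no matter how large the document store gets.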
The Needle-in-a-Haystack Test
This is how people actually test whether a model uses its full context. You bury a random fact (the "needle") at various positions in a huge block of text (the "haystack") and ask the model to recall it.
The results are humbling. Most models score near-perfectly when the needle is at the beginning or end. But stick it at 40-60% depth and accuracy crumbles. Some models drop to 50-60% recall on information placed right in the middle of their advertised context window.
When someone tells you their model has a 1M token context, ask them how it performs on needle-in-a-haystack at 500K depth. That's the real number.
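A bare-bones version of the test harness might look like this. The assumptions are loud ones: `ask_model` is a hypothetical callable you'd wire to your own API client, and `forgetful_model` is a fake stub that only "reads" the edges of the prompt so the harness has something to run against. It is not a measurement of any real model.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end) of the filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def run_needle_test(ask_model, filler, needle, question, expected, depths):
    """Return {depth: True/False} for whether the model's answer contains `expected`."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, depth) + "\n\n" + question
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results


def forgetful_model(prompt: str) -> str:
    """Fake stand-in model: it only 'sees' the first and last 20% of the prompt."""
    edge = len(prompt) // 5
    visible = prompt[:edge] + prompt[-edge:]
    return "The code is 7421." if "7421" in visible else "I don't know."


filler = ["The sky was grey that morning."] * 200
results = run_needle_test(
    forgetful_model,
    filler,
    needle="The secret code is 7421.",
    question="What is the secret code?",
    expected="7421",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
print(results)
```

Swap the stub for a real model call, scale the filler up toward the advertised window, and sweep depth against context length to get the usual heatmap.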
What This Means in Practice
Don't dump everything into context. Just because it fits doesn't mean the model will use it. A 10K token context with precisely the right information will outperform a 500K token context where the answer is buried on page 200.
Put important stuff at the edges. System prompts go at the start. Critical constraints go at the end (recency zone). Never bury your most important instructions in the middle of a long conversation.
Use retrieval for large datasets. If you're working with codebases, documentation, or legal documents, use RAG or tool-based retrieval to pull specific chunks instead of pasting everything in.
Watch the cost. At flat per-token pricing, a 200K token request costs roughly 50x as much as a 4K token request. If you're building a product with LLM calls in a loop, context length is your biggest cost driver. Aggressive summarization and context management pay for themselves.
Effective context < advertised context. The number on the model card is the maximum. The actual useful context — where the model reliably uses information — is usually 30-60% of that number, depending on the task.
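For the last two points, the back-of-the-envelope math fits in a few lines. The price below is a hypothetical placeholder (check your provider's actual rates), the 30-60% band is just the rough range quoted above, and the function names are mine.

```python
def request_cost(input_tokens: int, price_per_million: float) -> float:
    """Input cost of one request under flat per-token pricing."""
    return input_tokens / 1_000_000 * price_per_million


def effective_context_range(advertised_tokens: int, low: float = 0.30, high: float = 0.60) -> tuple[int, int]:
    """Rough band of 'reliably usable' tokens, using the 30-60% rule of thumb."""
    return round(advertised_tokens * low), round(advertised_tokens * high)


PRICE = 3.00  # hypothetical dollars per million input tokens

small = request_cost(4_000, PRICE)
large = request_cost(200_000, PRICE)
print(f"4K: ${small:.3f}  200K: ${large:.2f}  ratio: {large / small:.0f}x")
print(effective_context_range(1_000_000))
```

Multiply the per-request cost by the number of calls in your loop and the case for trimming context usually makes itself.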
The One-Liner
A context window isn't memory. It's a spotlight — bright at the edges, dim in the middle, and the bigger it gets, the more expensive it is to keep lit.
Next time someone brags about their model's 1.5M token context, ask them what happens at token 750K. That's where the real story is.
Go build something that fits in 32K tokens and actually works.