Why Tokenization Is More Important Than You Think
Why tokenization is the most underrated part of LLMs: how tokens aren't words, why they affect cost and performance, and why bad tokenization breaks everything downstream.
What I Used to Think
Okay, so today I was thinking about this one thing.
We always talk about:
- Embeddings
- LLMs
- Transformers
- Attention
But almost no one talks about tokenization properly.
And honestly... I think tokenization is way more important than people think.
Earlier, I thought:
Tokenization = split sentence into words
Like:
"I love AI" → ["I", "love", "AI"]
Simple. Done. Move on.
What Actually Happens
But in reality, it's not like that.
Models don't see words. They see tokens.
And tokens are not always words.
Example:
"unbelievable" → ["un", "believ", "able"]
Or even weirder:
"ChatGPT" → ["Chat", "G", "PT"]
Sometimes even spaces matter.
So tokenization is not:
- ❌ Word splitting
It's:
- ✅ Pattern-based segmentation learned from data
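A toy version of that idea can be sketched as a greedy longest-match over a subword vocabulary. This is a simplified stand-in for real algorithms like BPE or WordPiece, and the vocab here is completely made up for illustration (it also lowercases everything, which real tokenizers don't necessarily do):

```python
# Toy greedy longest-match subword tokenizer. The vocab is invented
# just to reproduce the examples above; real vocabs are learned from data.
VOCAB = {"un", "believ", "able", "chat", "g", "pt", "hello", "i", "love", "ai"}

def tokenize(word, vocab=VOCAB):
    """Greedily match the longest known subword from the left."""
    word = word.lower()
    tokens = []
    i = 0
    while i < len(word):
        # try the longest possible match first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # unknown character: fall back to a single-char token
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("ChatGPT"))       # ['chat', 'g', 'pt']
```

Real tokenizers are fancier (merge rules, byte fallbacks, special tokens), but the core idea is the same: segmentation driven by a learned vocabulary, not by word boundaries.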
Why This Matters (This Clicked Late for Me)
Everything depends on tokens.
- Embeddings → generated per token
- Context length → measured in tokens
- Cost → based on tokens
- Latency → depends on tokens
So if tokenization is inefficient:
- More tokens → more compute
- More tokens → higher cost
- More tokens → slower response
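The cost part is easy to sketch, because pricing is typically linear in token count. The price below is a made-up placeholder, not any real provider's rate:

```python
# Rough cost sketch: price scales linearly with token count.
# PRICE_PER_1K_TOKENS is a hypothetical number for illustration only.
PRICE_PER_1K_TOKENS = 0.01

def estimate_cost(n_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    return n_tokens * price_per_1k / 1000

# Same text, two hypothetical tokenizations:
efficient = 100     # tokens from a well-matched tokenizer
inefficient = 250   # tokens if the text fragments badly

print(estimate_cost(efficient))    # 0.001
print(estimate_cost(inefficient))  # 0.0025
```

Same text, 2.5x the bill, purely because of how it tokenized.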
Example That Made Me Think
Take this:
"hello"
vs
"h e l l o"
The second one = way more tokens.
Same meaning for us. Completely different for the model.
So tokenization directly affects performance.
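You can see the "hello" vs "h e l l o" gap with a toy tokenizer whose (made-up) vocab contains "hello" and a space token, with single characters as the fallback:

```python
# Toy illustration: "hello" is one known token, but "h e l l o"
# fragments into characters and spaces. The vocab is invented.
VOCAB = {"hello", " "}

def count_tokens(text, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown char -> its own token
            i += 1
    return len(tokens)

print(count_tokens("hello"))      # 1
print(count_tokens("h e l l o"))  # 9
```

One token versus nine, for the same meaning.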
Tokenization and Embeddings
This was another realization.
Embeddings are not created from raw text.
The pipeline is:
text → tokens → embeddings → model
So if tokenization changes:
- Embedding changes
- Meaning representation changes
Tokenization is literally the first transformation layer.
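The pipeline above can be sketched in a few lines. The tokens, IDs, and vectors here are all made up; in a real model the embedding matrix is learned and has thousands of rows:

```python
# Sketch of text -> tokens -> embeddings. Every number here is invented.
vocab = {"un": 0, "believ": 1, "able": 2}

# One small vector per token ID (the rows of an "embedding matrix").
embedding_matrix = [
    [0.1, 0.3],   # id 0: "un"
    [0.7, 0.2],   # id 1: "believ"
    [0.4, 0.9],   # id 2: "able"
]

tokens = ["un", "believ", "able"]                 # output of the tokenizer
ids = [vocab[t] for t in tokens]                  # tokens -> integer IDs
embeddings = [embedding_matrix[i] for i in ids]   # IDs -> vectors

print(ids)         # [0, 1, 2]
print(embeddings)  # [[0.1, 0.3], [0.7, 0.2], [0.4, 0.9]]
```

If the tokenizer had split the word differently, the IDs change, and so do the vectors the model actually sees.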
Weird Thing About Languages
English usually works fine, since most tokenizer vocabularies are trained on English-heavy data.
But for:
- Hindi
- Japanese
- Code
- Emojis
Tokenization becomes messy.
Sometimes:
- One word = multiple tokens
- One token = weird fragment
So multilingual models struggle partly because of tokenization.
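One concrete reason: byte-level tokenizers (like byte-level BPE) work on UTF-8 bytes, and non-English characters take more bytes each. A quick check of UTF-8 byte counts:

```python
# UTF-8 byte counts hint at why byte-level tokenizers fragment
# non-English text: each character below is multiple bytes, and a
# byte sequence rare in the training data can end up as several tokens.
for s in ["a", "न", "日", "👍"]:
    print(repr(s), len(s.encode("utf-8")), "bytes")
```

An English letter is 1 byte; a Devanagari or CJK character is 3; an emoji is 4. If those byte sequences weren't merged into vocab entries during training, one character can cost several tokens.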
Tokenization and Context Window
When people say:
"This model supports 8k context"
That's 8k tokens, not words.
And tokens ≠ words.
So:
- Long text → might exceed limit faster than expected
- Inefficient tokenization → wastes context
This is why the same model can handle different amounts of "content" depending on the language or writing style.
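A rough budget sketch makes this concrete. The chars-per-token ratios below are loose rules of thumb I'm assuming for illustration (English text is often around 4 characters per token; badly tokenized text can be far fewer):

```python
# Sketch: the same character count turns into very different token counts
# depending on how efficiently the text tokenizes. Ratios are assumptions.
CONTEXT_LIMIT = 8000  # tokens, as in "8k context"

def fits(n_tokens, limit=CONTEXT_LIMIT):
    return n_tokens <= limit

chars = 30_000
english_tokens = chars // 4      # ~4 chars/token -> 7500 tokens
fragmented_tokens = chars // 2   # ~2 chars/token -> 15000 tokens

print(fits(english_tokens))      # True: fits the window...
print(fits(fragmented_tokens))   # False: ...or blows past it
```

Same 30,000 characters; one version fits in the window, the other doesn't.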
One More Thing (Important)
Tokenization is effectively fixed.
The model is trained with a specific tokenizer, and you can't easily change it later.
So:
- Tokenizer choice = design decision
- Affects everything downstream
This is why you can't just swap tokenizers between models. The entire model was trained expecting specific token IDs.
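Here's a toy demonstration of why. Both tokenizers below are invented; the point is just that the same text maps to different IDs, and a model trained on one set of IDs would misread the other:

```python
# Two made-up tokenizers assign different IDs to the same text.
# A model trained with tokenizer A would misinterpret IDs from tokenizer B.
tokenizer_a = {"hel": 0, "lo": 1}
tokenizer_b = {"h": 0, "ello": 1, "hel": 7, "lo": 42}

def encode(text, vocab):
    """Greedy longest-match encode into token IDs (toy version)."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(encode("hello", tokenizer_a))  # [0, 1]
print(encode("hello", tokenizer_b))  # [7, 42]
```

Even when the segmentation happens to match, the IDs don't, and IDs are all the model ever learned from.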
What I Realized
Tokenization is not preprocessing.
It's part of the model.
Like:
- Weights matter
- Architecture matters
- Tokenization also matters
Simple Mental Model
Think like this:
Encoder-decoder, LLM, everything... starts from:
👉 Tokens, not text
The model never sees the word "unbelievable." It sees ["un", "believ", "able"].
That's its reality.
Final Thought
Earlier I used to jump directly into:
- Embeddings
- Vector DB
- MCP
- Agents
Now I feel:
If tokenization is bad, everything on top of it is slightly broken.
It's like building a house on a wonky foundation. Everything works, but there's always a subtle inefficiency you can't quite fix later.
Still exploring this. But yeah... tokenization is underrated.