2026-03-21

Why Tokenization Is More Important Than You Think

Why tokenization is the most underrated part of LLMs—how tokens aren't words, why they affect cost and performance, and why bad tokenization breaks everything downstream.

Tokenization · LLM · NLP · Machine Learning · Embeddings · Learning In Public

What I Used to Think

Okay, so today I was thinking about this one thing.

We always talk about:

  • Embeddings
  • LLMs
  • Transformers
  • Attention

But almost no one talks about tokenization properly.

And honestly... I think tokenization is way more important than people think.

Earlier, I thought:

Tokenization = split sentence into words

Like:

"I love AI" → ["I", "love", "AI"]

Simple. Done. Move on.


What Actually Happens

But in reality, it's not like that.

Models don't see words. They see tokens.

And tokens are not always words.

Example:

"unbelievable" → ["un", "believ", "able"]

Or even weirder:

"ChatGPT" → ["Chat", "G", "PT"]

Sometimes even spaces matter: in GPT-style tokenizers, a leading space is part of the token, so "hello" and " hello" are different tokens.

So tokenization is not:

  • ❌ Word splitting

It's:

  • Pattern-based segmentation learned from data
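Here's a tiny sketch of what "pattern-based segmentation" looks like in practice: a greedy longest-match tokenizer over a made-up vocabulary. Real tokenizers like BPE *learn* their vocabulary from data, so this is just the flavor, not the real algorithm.

```python
# Toy greedy longest-match subword tokenizer.
# The vocab here is invented for illustration; real BPE vocabs are learned.
def tokenize(text, vocab):
    """Split text into the longest vocab entries, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest possible match first
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no vocab entry matched: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"un", "believ", "able"}
print(tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
```

Notice that the output pieces aren't words at all, yet they cover the whole string.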

Why This Matters (This Clicked Late for Me)

Everything depends on tokens.

  • Embeddings → generated per token
  • Context length → measured in tokens
  • Cost → based on tokens
  • Latency → depends on tokens

So if tokenization is inefficient:

  • More tokens → more compute
  • More tokens → higher cost
  • More tokens → slower response
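The cost part is just linear math. A quick sketch (the per-token rate below is a made-up placeholder, not any real provider's pricing):

```python
# Back-of-the-envelope cost: price scales linearly with token count.
# The rate is a hypothetical placeholder, not a real provider's pricing.
def prompt_cost(n_tokens, usd_per_1k_tokens=0.01):
    return n_tokens * usd_per_1k_tokens / 1000

# same text, two hypothetical tokenizers
efficient_tokens = 120
inefficient_tokens = 300

print(f"${prompt_cost(efficient_tokens):.4f}")    # $0.0012
print(f"${prompt_cost(inefficient_tokens):.4f}")  # $0.0030
```

Same text, same model quality, 2.5x the bill, purely because of segmentation.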

Example That Made Me Think

Take this:

"hello"

vs

"h e l l o"

The second one = way more tokens.

Same meaning for us. Completely different for the model.

So tokenization directly affects performance.
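You can see the blow-up with a toy longest-match counter (the vocab is invented, but real BPE vocabs behave the same way: spaced-out letters fall back to single-character tokens):

```python
# Count tokens with a greedy longest-match over a made-up vocab.
# Characters not covered by the vocab fall back to one token each.
def count_tokens(text, vocab):
    n, i = 0, 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                i = j
                break
        else:
            i += 1  # single-character fallback
        n += 1
    return n

vocab = {"hello"}
print(count_tokens("hello", vocab))      # 1
print(count_tokens("h e l l o", vocab))  # 9
```

One token versus nine, for the same five letters.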


Tokenization and Embeddings

This was another realization.

Embeddings are not created for raw text.

The pipeline is:

text → tokens → embeddings → model

So if tokenization changes:

  • Embedding changes
  • Meaning representation changes

Tokenization is literally the first transformation layer.
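A minimal sketch of that pipeline, with an invented three-entry vocab and random 4-dim embeddings (real models use vocabularies of tens of thousands of entries and hundreds of dimensions):

```python
import random
random.seed(0)  # deterministic for the example

# Invented vocab and embedding table, purely for illustration.
vocab = {"un": 0, "believ": 1, "able": 2}
dim = 4
embedding_table = [[random.random() for _ in range(dim)] for _ in vocab]

# text -> tokens -> IDs -> embedding vectors
tokens = ["un", "believ", "able"]            # tokenizer output
ids = [vocab[t] for t in tokens]             # tokens become integer IDs
vectors = [embedding_table[i] for i in ids]  # IDs index rows of the table

print(ids)                            # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

The model only ever sees those rows. Change the tokenizer and you change which rows get looked up, which changes everything after.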


Weird Thing About Languages

English mostly works fine, because tokenizer vocabularies tend to be trained on English-heavy data.

But for:

  • Hindi
  • Japanese
  • Code
  • Emojis

Tokenization becomes messy.

Sometimes:

  • One word = multiple tokens
  • One token = weird fragment

So multilingual models struggle partly because of tokenization.
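One concrete way to see why: many modern tokenizers (byte-level BPE, for example) start from UTF-8 bytes, and non-Latin scripts simply need more bytes per character before any merging even begins.

```python
# Byte-level view: the tokenizer's raw input is UTF-8 bytes,
# and non-Latin scripts need more bytes per character.
for text in ["hello", "नमस्ते", "こんにちは"]:
    chars = len(text)                   # code points
    nbytes = len(text.encode("utf-8"))  # raw bytes the tokenizer starts from
    print(f"{text!r}: {chars} chars, {nbytes} bytes")
```

"hello" is 5 chars and 5 bytes, but "こんにちは" is 5 chars and 15 bytes. More raw symbols in, usually more tokens out.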


Tokenization and Context Window

When people say:

"This model supports 8k context"

That's 8k tokens, not words.

And tokens ≠ words.

So:

  • Long text → might exceed limit faster than expected
  • Inefficient tokenization → wastes context

This is why the same model can handle different amounts of "content" depending on the language or writing style.
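A rough way to sanity-check this is the common "~4 characters per token" heuristic for English. It's only an estimate (the real count depends entirely on the model's tokenizer, and other languages can be far worse), but it shows the shape of the check:

```python
# Rough fit check using the ~4-chars-per-token English heuristic.
# This is an estimate only; the real count comes from the model's tokenizer.
def fits_in_context(text, context_tokens=8000, chars_per_token=4):
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_tokens

print(fits_in_context("hello " * 1000))  # ~1500 estimated tokens -> True
```

For anything that matters, count with the actual tokenizer, not a heuristic.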


One More Thing (Important)

Tokenization is fixed.

The model is trained with a specific tokenizer. You can't just change it later easily.

So:

  • Tokenizer choice = design decision
  • Affects everything downstream

This is why you can't just swap tokenizers between models. The entire model was trained expecting specific token IDs.
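A toy illustration of why swapping fails: the same text maps to different IDs under two made-up vocabularies, and a model's embedding table only makes sense for the IDs it was trained with.

```python
# Two hypothetical tokenizer vocabularies for the same language.
vocab_a = {"un": 7, "believ": 42, "able": 3}
vocab_b = {"unbeliev": 7, "able": 42}

# "unbelievable", hand-tokenized under each vocab for illustration
ids_a = [vocab_a[t] for t in ["un", "believ", "able"]]
ids_b = [vocab_b[t] for t in ["unbeliev", "able"]]

print(ids_a)  # [7, 42, 3]
print(ids_b)  # [7, 42]
```

ID 7 means "un" to model A but "unbeliev" to model B. Feed model A the IDs from tokenizer B and it looks up the wrong rows of its embedding table; the output is garbage.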


What I Realized

Tokenization is not preprocessing.

It's part of the model.

Like:

  • Weights matter
  • Architecture matters
  • Tokenization also matters

Simple Mental Model

Think like this:

Encoder-decoder, LLM, everything... starts from:

👉 Tokens, not text

The model never sees the word "unbelievable." It sees ["un", "believ", "able"].

That's its reality.


Final Thought

Earlier I used to jump directly into:

  • Embeddings
  • Vector DB
  • MCP
  • Agents

Now I feel:

If tokenization is bad, everything on top of it is slightly broken.

It's like building a house on a wonky foundation. Everything works, but there's always a subtle inefficiency you can't quite fix later.


Still exploring this. But yeah... tokenization is underrated.

