Quantization Isn't Scary: What I Wish Someone Told Me Earlier
Breaking down quantization from scary optimization technique to simple concept—how reducing bit precision makes models smaller and faster, and why calibration matters more than the math.
The Initial Fear
Okay, so when I first heard the word "quantization," honestly it sounded scary. Like one of those deep systems/low-level optimization things you're supposed to avoid until you're "senior enough."
But it turns out you really can't understand model deployment (or even fine-tuning) properly without at least getting this idea straight.
The Simple Truth
At a very basic level, quantization is not some black magic.
Neural networks are just full of numbers—weights, biases, activations, everything is numbers. And most of the time, those numbers don't need to be 32-bit floating point.
That's it. That's the motivation.
Fewer bits →
- Smaller models
- Faster inference
- Cheaper to run
That's literally why quantization exists.
What Actually Happens
The thing that clicked for me is this: quantization is basically a linear mapping. Nothing fancy.
You take a float value and:
- Scale it
- Maybe shift it with something called a zero-point
- Store it as an integer
During inference, most of the math is integer math, and you apply scaling at the edges.
So yeah, it's not magical. It's just very careful bookkeeping.
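The whole scale / shift / round / clamp pipeline above fits in a few lines. Here's a minimal sketch (function names and the uint8 range are illustrative, not from any particular library):

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer: scale, shift by zero_point, round, clamp."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Map the integer back to (an approximation of) the original float."""
    return (q - zero_point) * scale

# Example: map floats in roughly [-1.0, 1.0] onto uint8.
scale = 2.0 / 255     # range width / number of integer levels
zero_point = 128      # where 0.0 lands in integer space

q = quantize(0.5, scale, zero_point)           # an integer in [0, 255]
x = dequantize(q, scale, zero_point)           # close to 0.5, not exact
```

The roundtrip doesn't give you exactly 0.5 back; the small gap is the rounding error you're trading for fewer bits.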
Two Main Types
There are mainly two types, which sounds complicated but isn't:
Symmetric Quantization
Here:
- Zero stays exactly zero
- You use signed integers
- Assumes positive and negative values are roughly balanced
This is clean and fast, which is why it's great for:
- Weights
- Matrix multiplication
This one just feels nice mathematically.
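As a sketch of that "zero stays zero" property: with symmetric int8, the scale comes from the largest magnitude and the zero-point is implicitly 0 (helper name is mine, not a library's):

```python
def symmetric_quantize(weights):
    """Symmetric int8: scale from the largest magnitude, zero_point = 0."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

weights = [-0.8, -0.1, 0.0, 0.4, 1.0]
q, scale = symmetric_quantize(weights)
# 0.0 maps to integer 0 exactly -- no zero_point bookkeeping needed
```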
Asymmetric Quantization
Here:
- You introduce a zero-point offset
- You usually use unsigned integers
- Works better when values are mostly positive (like activations after ReLU)
But yeah, it's:
- Slower
- Messier for math
So tradeoffs.
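To make the zero-point concrete, here's a sketch of deriving asymmetric uint8 parameters from a value range (the [0, 6] range is a made-up post-ReLU example):

```python
def asymmetric_params(xmin, xmax, qmin=0, qmax=255):
    """Derive scale and zero_point so [xmin, xmax] fills [qmin, qmax]."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)   # the integer where 0.0 lands
    return scale, zero_point

# Post-ReLU activations living in [0.0, 6.0] use the whole uint8 range.
scale, zp = asymmetric_params(0.0, 6.0)
# zp is 0 here because the range already starts at 0.0;
# a range like [-0.5, 6.0] would give a nonzero zero_point.
```

That zero-point is exactly the "messier math" part: every multiply now has an offset term to carry around.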
Calibration: The Part Nobody Talks About
One big realization for me: quantization without calibration is useless.
Calibration is basically just... observing reality.
You run the model on some representative data and see:
- What values tensors actually take
- What ranges they live in
Those stats decide:
- Scale
- Zero-point
And if your calibration data is bad, your quantization will also be bad—even if the math is "correct."
That part really stuck with me.
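"Observing reality" can be as simple as min/max tracking. This is a toy sketch (real toolkits use histogram or percentile observers, but the idea is the same):

```python
def calibrate(observed_batches):
    """Derive scale/zero_point for uint8 from observed activation ranges."""
    lo = min(min(batch) for batch in observed_batches)
    hi = max(max(batch) for batch in observed_batches)
    scale = (hi - lo) / 255
    zero_point = round(-lo / scale)
    return scale, zero_point

# "Representative data": a few batches of activation values seen in practice.
batches = [[0.1, 2.3, 4.0], [0.0, 1.7, 3.2], [0.5, 2.9, 3.8]]
scale, zp = calibrate(batches)
# If these batches weren't representative (say, only tiny values), the
# range would be too narrow and real activations would get clipped later --
# the math is still "correct", but the statistics are wrong.
```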
Different Modes (And When to Use Them)
Then there are modes of quantization, and this actually matters a lot in practice:
Dynamic Quantization
- Quick and dirty
- Weights are quantized beforehand, activations at runtime
Static Quantization
- Faster inference
- But needs proper calibration
Quantization-Aware Training (QAT)
- This is next level
- You train the model knowing it'll live in low precision
Weight-Only Quantization (for LLMs)
- This one makes a lot of sense
- Compress the huge weights
- Keep activations higher precision
Honestly, for LLMs, this feels like the sweet spot.
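Here's what weight-only quantization looks like in miniature: int8 weights with one scale per row, while the activation vector stays in float and the weights are dequantized on the fly. Purely illustrative, not any specific LLM runtime:

```python
def quantize_rows(matrix):
    """Per-row symmetric int8 quantization of a weight matrix."""
    q_rows, scales = [], []
    for row in matrix:
        scale = (max(abs(w) for w in row) / 127) or 1.0  # guard all-zero rows
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

def matvec_weight_only(q_rows, scales, x):
    """y = W @ x with int8 W: accumulate per row, rescale once at the end."""
    return [s * sum(q * xi for q, xi in zip(row, x))
            for row, s in zip(q_rows, scales)]

W = [[0.5, -1.0], [0.25, 0.75]]   # full-precision weights (made up)
x = [2.0, 1.0]                    # activation stays in float
q_rows, scales = quantize_rows(W)
y = matvec_weight_only(q_rows, scales, x)   # close to the float result [0.0, 1.25]
```

The storage win is the point: the big matrix shrinks 4x (int8 vs float32), and only tiny per-row scales are kept in float.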
What Surprised Me Most
Almost nothing about the model logic changes.
- Architecture stays the same
- Math stays the same
- Training ideas stay the same
We're just changing:
- How numbers are represented
- When scaling happens
Most of the difficulty isn't math. It's systems thinking.
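"When scaling happens" is the whole trick: with symmetric quantization on both sides, the scales factor out of the dot product, so the inner loop is pure integer math and one float multiply happens at the edge. A sketch with made-up values:

```python
def int_dot_with_rescale(qw, sw, qx, sx):
    """Symmetric case (zero_point = 0 on both sides): the float dot product
    (sw*qw) . (sx*qx) equals sw*sx * (qw . qx), so scales move to the edge."""
    acc = sum(a * b for a, b in zip(qw, qx))  # pure integer accumulation
    return acc * (sw * sx)                    # single float rescale at the end

qw, sw = [64, -127], 1.0 / 127   # quantized weights + their scale
qx, sx = [100, 50], 2.0 / 100    # quantized input + its scale (illustrative)
y = int_dot_with_rescale(qw, sw, qx, sx)
```

The systems-thinking part is deciding *where* these rescale points live in a whole network, not deriving the formula.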
The Real Insight
Quantization isn't some hack. It's a design decision.
And once this clicked, things like:
- Fine-tuning
- Inference optimization
- Deployment tradeoffs
All started feeling way less mysterious.
That's basically it.