Softmax Demystified — How Raw Scores Become Attention Weights
A deep dive into the softmax function — why it's used in self-attention, how it converts raw dot product scores into probabilities, and why the numerically stable variant (subtracting the max) matters in practice.
In the previous blog, we computed raw attention scores by taking dot products between query and key vectors. For "please", those scores came out as [1, 0, 1]. For "study": [2, 4, 4]. For "man": [2, 2, 3].
But raw numbers aren't useful on their own. We need to know: out of 100%, how much attention should each word pay to every other word?
That's exactly what softmax does. And understanding it properly — not just mechanically applying the formula — will make you a better practitioner.
Why Not Just Normalize Directly?
The simplest way to convert scores to percentages is regular normalization: divide each score by the sum of all scores.
For [1, 0, 1]: sum = 2, so weights = [0.5, 0, 0.5]
But there's a problem. The score 0 gives weight 0 — meaning "please" would completely ignore "study." In practice, we want every word to have some influence, even if small. Attention should be soft, not hard.
Also, negative scores (which happen all the time in real models) would produce negative weights — which don't make sense as probabilities.
Softmax solves both problems.
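A quick sketch of both failure modes (the `naive_normalize` helper below is illustrative, not something from the series):

```python
# Naive normalization: divide each score by the sum.
# Illustrative helper showing why this breaks down.
def naive_normalize(scores):
    total = sum(scores)
    return [s / total for s in scores]

print(naive_normalize([1, 0, 1]))   # [0.5, 0.0, 0.5] -- the 0 score is hard-masked out
print(naive_normalize([2, -1, 1]))  # [1.0, -0.5, 0.5] -- a negative "probability"
```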
The Softmax Formula
softmax(xᵢ) = e^xᵢ / Σⱼ e^xⱼ
Where e ≈ 2.718 is Euler's number.
What this does:
- Exponentiates every score — turning negatives into small positives, amplifying large values
- Divides by the sum — normalizing to probabilities that sum to 1
Key properties:
- Output is always between 0 and 1
- All outputs sum to 1
- Larger inputs get exponentially more weight (sharpening)
- No input ever maps to exactly 0 (softness)
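The formula translates almost line by line into Python. This is a minimal sketch using only the standard library:

```python
import math

# A minimal softmax, following the formula above directly.
def softmax(scores):
    exps = [math.exp(s) for s in scores]   # exponentiate every score
    total = sum(exps)                      # sum of exponentials
    return [e / total for e in exps]       # normalize so the weights sum to 1

print([round(w, 3) for w in softmax([1, 0, 1])])   # [0.422, 0.155, 0.422]
```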
Worked Example: "please" scores [1, 0, 1]
Step 1 — Compute exponentials:
e^1 = 2.718
e^0 = 1
e^1 = 2.718
Step 2 — Sum:
2.718 + 1 + 2.718 = 6.436
Step 3 — Normalize:
2.718 / 6.436 ≈ 0.422
1 / 6.436 ≈ 0.155
2.718 / 6.436 ≈ 0.422
Result: [0.422, 0.155, 0.422]
Notice: even though "study" had a raw score of 0, it still gets 15.5% attention — not zero. This is the "soft" in softmax.
Worked Example: "study" scores [2, 4, 4]
Step 1 — Exponentials:
e^2 ≈ 7.389
e^4 ≈ 54.598
e^4 ≈ 54.598
Step 2 — Sum:
7.389 + 54.598 + 54.598 = 116.585
Step 3 — Normalize:
7.389 / 116.585 ≈ 0.063
54.598 / 116.585 ≈ 0.468
54.598 / 116.585 ≈ 0.468
Result: [0.063, 0.468, 0.468]
"study" attends to itself and to "man" at 46.8% each, while "please" gets only 6.3% — the higher scores are amplified disproportionately by the exponential.
Worked Example: "man" scores [2, 2, 3]
Step 1 — Exponentials:
e^2 ≈ 7.389
e^2 ≈ 7.389
e^3 ≈ 20.086
Step 2 — Sum:
7.389 + 7.389 + 20.086 = 34.864
Step 3 — Normalize:
7.389 / 34.864 ≈ 0.212
7.389 / 34.864 ≈ 0.212
20.086 / 34.864 ≈ 0.576
Result: [0.212, 0.212, 0.576]
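All three worked examples can be re-checked in a few lines, using the same minimal softmax and the scores from the previous post:

```python
import math

# Re-checking the three worked examples above.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]

for word, scores in [("please", [1, 0, 1]),
                     ("study",  [2, 4, 4]),
                     ("man",    [2, 2, 3])]:
    print(word, [round(w, 3) for w in softmax(scores)])
# please [0.422, 0.155, 0.422]
# study [0.063, 0.468, 0.468]
# man [0.212, 0.212, 0.576]
```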
The Numerically Stable Version
In real implementations, scores can be very large — especially in high-dimensional models where embeddings have 512 or 1024 dimensions. e^1000 overflows to infinity in float32 (anything much above e^88 already does). Your loss becomes NaN and training explodes.
The fix: subtract the maximum value before exponentiating.
Mathematically, softmax is shift-invariant — subtracting a constant from all inputs doesn't change the output:
softmax(x - max(x)) = softmax(x)
Proof:
e^(xᵢ - c) / Σⱼ e^(xⱼ - c)
= (e^(-c) · e^xᵢ) / (e^(-c) · Σⱼ e^xⱼ)
= e^xᵢ / Σⱼ e^xⱼ ✓
So for "man" with scores [2, 2, 3]:
max = 3
Shifted: [2-3, 2-3, 3-3] = [-1, -1, 0]
e^-1 ≈ 0.368
e^-1 ≈ 0.368
e^0 = 1
Sum = 1.736
0.368 / 1.736 ≈ 0.212
0.368 / 1.736 ≈ 0.212
1 / 1.736 ≈ 0.576
Same result — but now the largest exponent is always e^0 = 1, so no overflow is possible. Every deep learning framework uses this under the hood.
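Both variants can be sketched side by side; the function names below are mine, for illustration. On large scores the naive version raises an overflow error, while the shifted version is safe:

```python
import math

# Naive vs. numerically stable softmax.
def softmax_naive(scores):
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]

def softmax_stable(scores):
    m = max(scores)                           # shift so the largest exponent is e^0 = 1
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

print(softmax_stable([2, 2, 3]))            # same weights as the unshifted formula
print(softmax_stable([1000, 1000, 1001]))   # still [~0.212, ~0.212, ~0.576]
# softmax_naive([1000, 1000, 1001])         # raises OverflowError
```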
The √dk Scaling Factor
In the original Transformer paper, scores are scaled before softmax:
Attention(Q, K, V) = softmax(QKᵀ / √dk) · V
Why divide by √dk?
As the dimension dk grows, dot products grow with it: each extra dimension adds another term to the sum, so their variance grows in proportion to dk and their typical magnitude grows like √dk. For dk = 64, dot products can easily reach values like 50 or 60. When those go through softmax:
- e^60 is astronomically large
- One score completely dominates — weights become [~0, ~0, ~1.0]
- The gradient of softmax near these extremes is near zero
- Training stalls — the vanishing gradient problem resurfaces inside attention
Dividing by √dk keeps scores in a reasonable range regardless of embedding size.
For dk = 2 (our toy example), √2 ≈ 1.41. In practice with dk = 512, you'd divide by √512 ≈ 22.6.
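This growth is easy to check empirically. The sketch below draws random unit-variance query/key vectors (purely illustrative, not the model's actual Q and K) and measures the spread of their dot products with and without the √dk division:

```python
import math
import random

# Empirical check: with unit-variance components, the spread (std) of q·k
# grows like sqrt(dk); dividing by sqrt(dk) keeps it roughly constant.
# Illustrative random vectors only.
random.seed(0)

def dot_std(dk, trials=1000):
    """Standard deviation of dot products between random dk-dim vectors."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(dk)]
        k = [random.gauss(0, 1) for _ in range(dk)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return (sum((d - mean) ** 2 for d in dots) / trials) ** 0.5

for dk in [2, 64, 512]:
    s = dot_std(dk)
    print(dk, round(s, 1), round(s / math.sqrt(dk), 2))  # raw spread grows; scaled stays ~1
```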
Sharp vs. Flat Attention
Softmax has an interesting sharpening property. Compare two score vectors:
Flat scores: [1, 1, 1] → softmax → [0.333, 0.333, 0.333]
Uniform attention — no preference.
Sharp scores: [1, 1, 10] → softmax → [~0.0001, ~0.0001, ~0.9998]
Almost all attention goes to the third token.
This means the model can learn to be focused (attend mostly to one word) or diffuse (spread attention broadly) just by adjusting the magnitude of its scores through Wq and Wk. The sharpness of attention is an emergent property learned during training.
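The same comparison in code (a minimal sketch):

```python
import math

# Same softmax, different score magnitudes: flat vs. sharp attention.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]

print([round(w, 4) for w in softmax([1, 1, 1])])    # [0.3333, 0.3333, 0.3333]
print([round(w, 4) for w in softmax([1, 1, 10])])   # [0.0001, 0.0001, 0.9998]
```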
Softmax in the Full Pipeline
To recap where softmax sits in the self-attention pipeline:
1. Embeddings → multiply by Wq, Wk, Wv → Q, K, V vectors
2. Q · Kᵀ → raw attention scores
3. scores / √dk → scaled scores
4. softmax(scaled scores) → attention weights ← YOU ARE HERE
5. weights · V → contextual output vectors
Softmax is the step that converts mathematical similarity into interpretable probabilities — the bridge between "how similar are these two vectors" and "how much should I mix their information."
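The five steps can be run end to end on a toy example. The Q and K rows below are hypothetical values I chose so that Q·Kᵀ reproduces the raw scores from the previous post ([1, 0, 1], [2, 4, 4], [2, 2, 3]); the V rows are made up:

```python
import math

# End-to-end sketch of steps 1-5 for a three-token, dk = 2 example.
# Q and K are chosen so Q·Kᵀ matches the series' raw scores; V is made up.
def softmax(scores):
    m = max(scores)                                  # numerically stable variant
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

dk = 2
Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]             # step 1 (assumed already computed)
K = [[1.0, 1.0], [0.0, 2.0], [1.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs = []
for q in Q:
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]   # step 2: Q·Kᵀ
    scaled = [s / math.sqrt(dk) for s in scores]             # step 3: /√dk
    weights = softmax(scaled)                                # step 4: softmax
    out = [sum(w * v[j] for w, v in zip(weights, V))         # step 5: weights·V
           for j in range(dk)]
    outputs.append([round(x, 3) for x in out])

print(outputs)
```

Note that step 3 divides the scores by √2 here, so the resulting weights differ slightly from the unscaled worked examples above.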
Key Takeaways
- Softmax converts raw dot product scores to probabilities that sum to 1
- It's "soft" — no input ever maps to exactly 0, so every token has some influence
- The numerically stable version (subtract max) prevents overflow and is always used in practice
- The √dk scaling factor prevents gradient vanishing as embedding dimensions grow
- Larger scores get amplified exponentially — the model can learn sharp (focused) or flat (diffuse) attention patterns
What's Next
Now we understand scores → weights. Next: Contextual Embeddings — how those weights get multiplied by value vectors to produce each token's final, context-aware representation, and what it means that a word's embedding changes based on the sentence it's in.
This is part 4 of a series on Transformer architecture.