Featured · 2026-03-28

Softmax Demystified — How Raw Scores Become Attention Weights

A deep dive into the softmax function — why it's used in self-attention, how it converts raw dot product scores into probabilities, and why the numerically stable variant (subtracting the max) matters in practice.

transformers · softmax · self-attention · deep-learning · nlp · math

In the previous blog, we computed raw attention scores by taking dot products between query and key vectors. For "please", those scores came out as [1, 0, 1]. For "study": [2, 4, 4]. For "man": [2, 2, 3].

But raw numbers aren't useful on their own. We need to know: out of 100%, how much attention should each word pay to every other word?

That's exactly what softmax does. And understanding it properly — not just mechanically applying the formula — will make you a better practitioner.


Why Not Just Normalize Directly?

The simplest way to convert scores to percentages is regular normalization: divide each score by the sum of all scores.

For [1, 0, 1]: sum = 2, so weights = [0.5, 0, 0.5]

But there's a problem. The score 0 gives weight 0 — meaning "please" would completely ignore "study." In practice, we want every word to have some influence, even if small. Attention should be soft, not hard.

Also, negative scores (which happen all the time in real models) would produce negative weights — which don't make sense as probabilities.

Softmax solves both problems.
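A minimal NumPy sketch (not from the original post) makes both failure modes concrete — a zero score is ignored entirely, and a negative score yields a nonsensical negative "probability":

```python
import numpy as np

def naive_normalize(scores):
    """Divide each score by the sum -- the 'obvious' normalization."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

# A zero score gets exactly zero weight: the token is ignored entirely.
print(naive_normalize([1, 0, 1]))   # weights: [0.5, 0.0, 0.5]

# Negative scores produce negative "weights" -- not valid probabilities.
print(naive_normalize([2, -1, 1]))  # weights: [1.0, -0.5, 0.5]
```

(And if the scores summed to zero, the division would blow up entirely.)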


The Softmax Formula

softmax(xᵢ) = e^xᵢ / Σⱼ e^xⱼ

Where e ≈ 2.718 is Euler's number.

What this does:

  1. Exponentiates every score — turning negatives into small positives, amplifying large values
  2. Divides by the sum — normalizing to probabilities that sum to 1

Key properties:

  • Output is always between 0 and 1
  • All outputs sum to 1
  • Larger inputs get exponentially more weight (sharpening)
  • No input ever maps to exactly 0 (softness)

Worked Example: "please" scores [1, 0, 1]

Step 1 — Compute exponentials:

e^1 = 2.718
e^0 = 1
e^1 = 2.718

Step 2 — Sum:

2.718 + 1 + 2.718 = 6.436

Step 3 — Normalize:

2.718 / 6.436 ≈ 0.422
1     / 6.436 ≈ 0.155
2.718 / 6.436 ≈ 0.422

Result: [0.422, 0.155, 0.422]

Notice: even though "study" had a raw score of 0, it still gets 15.5% attention — not zero. This is the "soft" in softmax.
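The three steps above translate directly into a few lines of NumPy (this helper is my own sketch, not code from the post); running it on the "please" scores reproduces the hand computation:

```python
import numpy as np

def softmax(x):
    """softmax(x_i) = e^x_i / sum_j e^x_j"""
    e = np.exp(np.asarray(x, dtype=float))  # step 1: exponentiate
    return e / e.sum()                      # steps 2-3: sum and normalize

weights = softmax([1, 0, 1])
print(weights.round(3))  # [0.422 0.155 0.422]
```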


Worked Example: "study" scores [2, 4, 4]

Step 1 — Exponentials:

e^2 ≈ 7.389
e^4 ≈ 54.598
e^4 ≈ 54.598

Step 2 — Sum:

7.389 + 54.598 + 54.598 = 116.585

Step 3 — Normalize:

7.389  / 116.585 ≈ 0.063
54.598 / 116.585 ≈ 0.468
54.598 / 116.585 ≈ 0.468

Result: [0.063, 0.468, 0.468]

The two score-4 tokens each get 46.8% of the attention from "study", while the score-2 token gets just 6.3%. A gap of only 2 in the raw scores becomes a factor of e² ≈ 7.4 in the weights: the exponent amplifies higher scores disproportionately.


Worked Example: "man" scores [2, 2, 3]

Step 1 — Exponentials:

e^2 ≈ 7.389
e^2 ≈ 7.389
e^3 ≈ 20.086

Step 2 — Sum:

7.389 + 7.389 + 20.086 = 34.864

Step 3 — Normalize:

7.389  / 34.864 ≈ 0.212
7.389  / 34.864 ≈ 0.212
20.086 / 34.864 ≈ 0.576

Result: [0.212, 0.212, 0.576]


The Numerically Stable Version

In real implementations, scores can be very large — especially in high-dimensional models where embeddings have 512 or 1024 dimensions. In float32, e^x overflows to infinity once x exceeds roughly 88. Your loss becomes NaN and training explodes.

The fix: subtract the maximum value before exponentiating.

Mathematically, softmax is shift-invariant — subtracting a constant from all inputs doesn't change the output:

softmax(x - max(x)) = softmax(x)

Proof:

e^(xᵢ - c) / Σ e^(xⱼ - c)
= e^xᵢ · e^(-c) / Σ e^xⱼ · e^(-c)
= e^xᵢ / Σ e^xⱼ   ✓

So for "man" with scores [2, 2, 3]:

max = 3
Shifted: [2-3, 2-3, 3-3] = [-1, -1, 0]

e^-1 ≈ 0.368
e^-1 ≈ 0.368
e^0  = 1

Sum = 1.736

0.368 / 1.736 ≈ 0.212
0.368 / 1.736 ≈ 0.212
1     / 1.736 ≈ 0.576

Same result — but now the largest exponent is always e^0 = 1, so no overflow is possible. Every deep learning framework uses this under the hood.
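Here's a small NumPy demonstration of the difference (my own sketch, not the post's code). The naive version returns NaN for large scores; the stable version handles them fine and, by shift-invariance, gives exactly the same answer as the naive version would on small inputs:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(np.asarray(x, dtype=float))
    return e / e.sum()

def softmax_stable(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())   # largest exponent is now e^0 = 1
    return e / e.sum()

big = np.array([1000.0, 1000.0, 999.0])

with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(big))       # [nan nan nan] -- inf / inf

print(softmax_stable(big).round(3))  # [0.422 0.422 0.155]
```

Note the stable result is just the [1, 0, 1] softmax with the entries permuted — shifting all scores by -999 changed nothing.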


The √dk Scaling Factor

In the original Transformer paper, scores are scaled before softmax:

Attention(Q, K, V) = softmax(QKᵀ / √dk) · V

Why divide by √dk?

As the dimension dk grows, so do the dot products: if query and key components have roughly unit variance, the variance of the dot product grows linearly with dk, so typical magnitudes grow like √dk. For large dk, raw scores in the tens are common. When those go through softmax:

  • e^60 is astronomically large
  • One score completely dominates — weights become [~0, ~0, ~1.0]
  • The gradient of softmax near these extremes is near zero
  • Training stalls — the vanishing gradient problem resurfaces inside attention

Dividing by √dk keeps scores in a reasonable range regardless of embedding size.

For dk = 2 (our toy example), √2 ≈ 1.41. In practice with dk = 512, you'd divide by √512 ≈ 22.6.
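A quick NumPy experiment (my own, with random unit-variance vectors) shows the effect: unscaled dot products have a spread of about √dk ≈ 22.6 at dk = 512, while the scaled scores stay near unit spread no matter the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 random query/key pairs with unit-variance components
Q = rng.standard_normal((10_000, d_k))
K = rng.standard_normal((10_000, d_k))
dots = (Q * K).sum(axis=1)       # raw dot-product scores

print(dots.std())                # ~ sqrt(512) ~ 22.6
print((dots / np.sqrt(d_k)).std())  # ~ 1.0 after scaling
```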


Sharp vs. Flat Attention

Softmax has an interesting sharpening property. Compare two score vectors:

Flat scores: [1, 1, 1] → softmax → [0.333, 0.333, 0.333]
Uniform attention — no preference.

Sharp scores: [1, 1, 10] → softmax → [~0.0001, ~0.0001, ~0.9998]
Almost all attention goes to the third token.

This means the model can learn to be focused (attend mostly to one word) or diffuse (spread attention broadly) just by adjusting the magnitude of its scores through Wq and Wk. The sharpness of attention is an emergent property learned during training.
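You can see the sharpening directly by running both score vectors through a (numerically stable) softmax — same function, wildly different attention patterns:

```python
import numpy as np

def softmax(x):
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

print(softmax([1, 1, 1]).round(3))   # [0.333 0.333 0.333]  flat / diffuse
print(softmax([1, 1, 10]).round(4))  # [0.0001 0.0001 0.9998]  sharp / focused
```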


Softmax in the Full Pipeline

To recap where softmax sits in the self-attention pipeline:

1. Embeddings → multiply by Wq, Wk, Wv → Q, K, V vectors
2. Q · Kᵀ → raw attention scores
3. scores / √dk → scaled scores
4. softmax(scaled scores) → attention weights  ← YOU ARE HERE
5. weights · V → contextual output vectors

Softmax is the step that converts mathematical similarity into interpretable probabilities — the bridge between "how similar are these two vectors" and "how much should I mix their information."
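The whole five-step pipeline fits in a few lines. The sketch below is a toy single-head version with made-up dimensions (3 tokens, embedding dim 4, dk = 2) and random weight matrices, just to show where softmax sits — not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable variant
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # 1. project embeddings to Q, K, V
    d_k = K.shape[-1]
    scores = Q @ K.T                        # 2. raw attention scores
    scaled = scores / np.sqrt(d_k)          # 3. scale by sqrt(d_k)
    weights = softmax(scaled, axis=-1)      # 4. softmax -> attention weights
    return weights @ V                      # 5. mix value vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))             # 3 tokens, embedding dim 4
Wq, Wk, Wv = (rng.standard_normal((4, 2)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 2): one contextual vector per token
```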


Key Takeaways

  • Softmax converts raw dot product scores to probabilities that sum to 1
  • It's "soft" — no input ever maps to exactly 0, so every token has some influence
  • The numerically stable version (subtract max) prevents overflow and is always used in practice
  • The √dk scaling factor prevents gradient vanishing as embedding dimensions grow
  • Larger scores get amplified exponentially — the model can learn sharp (focused) or flat (diffuse) attention patterns

What's Next

Now we understand scores → weights. Next: Contextual Embeddings — how those weights get multiplied by value vectors to produce each token's final, context-aware representation, and what it means that a word's embedding changes based on the sentence it's in.


This is part 4 of a series on Transformer architecture.
