Contextual Embeddings — How Transformers Make Words Context-Aware
How self-attention produces contextual embeddings by computing a weighted sum of value vectors — and what it means that the same word gets a different representation depending on the sentence it appears in.
Static word embeddings like Word2Vec give every word a single fixed vector. "bank" always maps to the same point in space — whether you're talking about a river bank or a savings bank.
Transformers don't work this way. After self-attention, the word "bank" has a different representation depending on every other word in the sentence. This is what "contextual embeddings" means, and it's one of the most powerful properties of the Transformer architecture.
In this blog, we'll see exactly how they're computed.
The Final Step of Self-Attention
Recap of where we are in the pipeline:
1. Embeddings → Wq, Wk, Wv → Q, K, V vectors
2. Q · Kᵀ → raw scores
3. softmax(scores / √dk) → attention weights
4. weights × V → contextual output ← THIS BLOG
We have our attention weights from the previous blog:
please → [0.422, 0.155, 0.422] (how much it attends to please, study, man)
study → [0.106, 0.787, 0.106]
man → [0.212, 0.212, 0.576]
And our value vectors:
please → [1, 0]
study → [2, 2]
man → [2, 1]
Now we compute the context vector for each token.
The Formula: Weighted Sum of Values
context_vector = Σᵢ (attention_weightᵢ × valueᵢ)
For a token with attention weights [w1, w2, w3] over tokens with values V1, V2, V3:
context = w1×V1 + w2×V2 + w3×V3
This is a weighted average of all value vectors, where the weights are the attention probabilities.
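As a quick sanity check, the weighted sum can be written directly in NumPy (a sketch using the toy numbers from this series, not code from the original paper):

```python
import numpy as np

# Attention weights for "please" over (please, study, man), from the previous blog.
weights = np.array([0.422, 0.155, 0.422])

# Value vectors for please, study, man (toy 2-d example).
V = np.array([[1.0, 0.0],
              [2.0, 2.0],
              [2.0, 1.0]])

# context = w1*V1 + w2*V2 + w3*V3, expressed as a vector-matrix product
context = weights @ V
print(context.round(3))  # → [1.576 0.732]
```

Note that `weights @ V` is exactly the sum w1×V1 + w2×V2 + w3×V3; we'll reuse this below for all tokens at once.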
Computing Context Vector for "please"
Weights: [0.422, 0.155, 0.422] Values: [1,0], [2,2], [2,1]
0.422 × [1, 0] = [0.422, 0]
0.155 × [2, 2] = [0.310, 0.310]
0.422 × [2, 1] = [0.844, 0.422]
Add dimension by dimension:
dim 1: 0.422 + 0.310 + 0.844 = 1.576
dim 2: 0 + 0.310 + 0.422 = 0.732
Context vector for "please": [1.576, 0.732]
How is "please" influenced?
- 42.2% by itself → pulls in its own value [1,0]
- 15.5% by "study" → pulls in [2,2]
- 42.2% by "man" → pulls in [2,1]
The final vector is a blend of all three, with "please" and "man" contributing equally and dominating the mix.
Computing Context Vector for "study"
Weights: [0.106, 0.787, 0.106] Values: [1,0], [2,2], [2,1]
0.106 × [1, 0] = [0.106, 0]
0.787 × [2, 2] = [1.574, 1.574]
0.106 × [2, 1] = [0.212, 0.106]
Add:
dim 1: 0.106 + 1.574 + 0.212 = 1.892
dim 2: 0 + 1.574 + 0.106 = 1.680
Context vector for "study": [1.892, 1.680]
"study" is heavily self-influenced (78.7%) — its output mostly reflects its own value vector, with modest contributions from "please" and "man."
Computing Context Vector for "man"
Weights: [0.212, 0.212, 0.576] Values: [1,0], [2,2], [2,1]
0.212 × [1, 0] = [0.212, 0]
0.212 × [2, 2] = [0.424, 0.424]
0.576 × [2, 1] = [1.152, 0.576]
Add:
dim 1: 0.212 + 0.424 + 1.152 = 1.788
dim 2: 0 + 0.424 + 0.576 = 1.000
Context vector for "man": [1.788, 1.000]
The Final Output Matrix
This matrix is the complete output of one self-attention head:
dim1 dim2
please [ 1.576, 0.732 ]
study [ 1.892, 1.680 ]
man [ 1.788, 1.000 ]
Three rows, one per input token. Each row is a contextual embedding — shaped by every other token in the sentence.
Why This Is Powerful: The Same Word, Different Vectors
Imagine the word "bank" in two sentences:
"I deposited money in the bank." "The boat drifted toward the river bank."
In a static embedding (Word2Vec, GloVe), "bank" has the same vector in both sentences. The model can't distinguish the meaning.
In a Transformer, "bank" generates different attention patterns in each sentence:
- In sentence 1: high attention to "money", "deposited" → the value blend pulls toward financial semantics
- In sentence 2: high attention to "river", "boat" → the value blend pulls toward geographical semantics
The output context vector for "bank" is different in each sentence, even though the initial embedding was the same. The self-attention mechanism dynamically adjusts representations based on context.
This is why BERT-style models beat Word2Vec on virtually every NLP task — their representations are contextual, not static.
Formal Notation
In the original paper, this entire process is written as:
Attention(Q, K, V) = softmax(QKᵀ / √dk) · V
The · V at the end is the step we computed in this blog — matrix multiply of the attention weight matrix with the value matrix, producing the contextual output matrix.
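The whole formula fits in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention (the function name and numerically stable softmax form are my choices, not from the paper):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q @ K.T / sqrt(dk)) @ V."""
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                              # raw scores
    # row-wise softmax, shifted by the row max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                           # weighted sum of values
```

Each row of the result is one token's contextual embedding, computed exactly as in the hand calculations above.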
In our case:
Attention weights matrix (3×3):
[0.422, 0.155, 0.422]
[0.106, 0.787, 0.106]
[0.212, 0.212, 0.576]
Value matrix (3×2):
[1, 0]
[2, 2]
[2, 1]
Output (3×2) = attention_weights × values:
[1.576, 0.732]
[1.892, 1.680]
[1.788, 1.000]
One matrix multiply. That's the entire aggregation step.
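We can confirm the numbers with a single NumPy matrix multiply (a sketch using the toy matrices above):

```python
import numpy as np

# Attention weight matrix (rows: please, study, man) from the previous blog.
A = np.array([[0.422, 0.155, 0.422],
              [0.106, 0.787, 0.106],
              [0.212, 0.212, 0.576]])

# Value matrix (one row per token).
V = np.array([[1.0, 0.0],
              [2.0, 2.0],
              [2.0, 1.0]])

# One matrix multiply aggregates the values for all tokens at once.
output = A @ V
print(output.round(3))
```

The printed matrix matches the hand-computed context vectors row for row.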
What Changes in Multi-Head Attention?
In real Transformers, this computation doesn't happen just once — it happens h times in parallel, with different Wq, Wk, Wv matrices for each head.
Each head learns to attend to different types of relationships:
- Head 1 might learn syntactic dependencies (subject ↔ verb)
- Head 2 might learn coreference (pronoun ↔ noun)
- Head 3 might learn positional proximity
- Head 4 might learn semantic similarity
The outputs from all heads are concatenated and projected back to the original dimension:
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · Wₒ
The Wₒ matrix learns how to combine information from all heads into a single coherent representation.
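A toy sketch of the multi-head wiring, with random matrices standing in for learned projections (the sizes d_model=8 and h=2 are arbitrary choices for illustration; real models use e.g. d_model=512, h=8):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, h = 8, 2            # toy sizes, chosen for illustration
d_k = d_model // h           # per-head dimension
X = rng.normal(size=(3, d_model))  # 3 tokens, as in our example sentence

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    # each head has its own projections (random here, learned in a real model)
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    heads.append(weights @ V)            # each head outputs (3, d_k)

# concatenate all heads, then project back to d_model with Wo
Wo = rng.normal(size=(h * d_k, d_model))
out = np.concatenate(heads, axis=-1) @ Wo
print(out.shape)  # → (3, 8): one d_model-sized contextual vector per token
```

Each head runs the exact same aggregation we computed by hand, just with its own Q, K, V projections; Wo then mixes the heads back into one representation per token.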
Key Takeaways
- The contextual output vector for each token is a weighted sum of all value vectors, where weights come from softmax attention
- This is what makes Transformer embeddings contextual — the same token gets a different representation in different sentences
- The entire aggregation step is a single matrix multiply: attention_weights × V
- In multi-head attention, this runs in parallel h times with different learned projections, each capturing different relationship types
- Static embeddings (Word2Vec) can't do this — contextual embeddings are a fundamental upgrade
What's Next
We've now fully traced a token through a single self-attention head. Next: Encoder vs Decoder — we'll look at how these single-head blocks are assembled into multi-head attention, stacked into encoder and decoder layers, and how cross-attention connects them.
This is part 5 of a series on Transformer architecture.