2026-03-29

Contextual Embeddings — How Transformers Make Words Context-Aware

How self-attention produces contextual embeddings by computing a weighted sum of value vectors — and what it means that the same word gets a different representation depending on the sentence it appears in.

Tags: transformers, embeddings, self-attention, deep-learning, nlp, ai

Static word embeddings like Word2Vec give every word a single fixed vector. "bank" always maps to the same point in space — whether you're talking about a river bank or a savings bank.

Transformers don't work this way. After self-attention, the word "bank" has a different representation depending on every other word in the sentence. This is what "contextual embeddings" means — and it's one of the most powerful properties of the Transformer architecture.

In this blog, we'll see exactly how they're computed.


The Final Step of Self-Attention

Recap of where we are in the pipeline:

1. Embeddings → Wq, Wk, Wv → Q, K, V vectors
2. Q · Kᵀ → raw scores
3. softmax(scores / √dk) → attention weights
4. weights × V → contextual output  ← THIS BLOG
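The four steps above can be sketched end to end in NumPy. This is a minimal sketch, with random vectors standing in for the real Q, K, V projections from the earlier blogs:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ / √dk) · V."""
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                   # steps 2–3: scaled raw scores
    # row-wise softmax (subtract the max for numerical stability)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)  # step 3: attention weights
    return weights @ V, weights                      # step 4: contextual output

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 2))   # one query per token
K = rng.normal(size=(3, 2))   # one key per token
V = rng.normal(size=(3, 2))   # one value per token
out, weights = self_attention(Q, K, V)

print(out.shape)              # (3, 2): one context vector per token
print(weights.sum(axis=1))    # each row of attention weights sums to 1
```

The function names and shapes here are illustrative; the point is that the whole pipeline is a handful of matrix operations.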

We have our attention weights from the previous blog:

please → [0.422, 0.155, 0.422]   (how much it attends to please, study, man)
study  → [0.106, 0.787, 0.106]
man    → [0.212, 0.212, 0.576]

And our value vectors:

please → [1, 0]
study  → [2, 2]
man    → [2, 1]

Now we compute the context vector for each token.


The Formula: Weighted Sum of Values

context_vector = Σᵢ (attention_weightᵢ × valueᵢ)

For a token with attention weights [w1, w2, w3] over tokens with values V1, V2, V3:

context = w1×V1 + w2×V2 + w3×V3

This is a weighted average of all value vectors, where the weights are the attention probabilities.
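For one token, this is just three scalar multiplications and an addition per dimension. Using the weights and values from above for "please":

```python
weights = [0.422, 0.155, 0.422]        # attention weights for "please"
values = [[1, 0], [2, 2], [2, 1]]      # value vectors for please, study, man

# context[d] = Σᵢ weightᵢ × valueᵢ[d]
context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]
print([round(c, 3) for c in context])  # [1.576, 0.732]
```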


Computing Context Vector for "please"

Weights: [0.422, 0.155, 0.422] Values: [1,0], [2,2], [2,1]

0.422 × [1, 0] = [0.422, 0]
0.155 × [2, 2] = [0.310, 0.310]
0.422 × [2, 1] = [0.844, 0.422]

Add dimension by dimension:

dim 1: 0.422 + 0.310 + 0.844 = 1.576
dim 2: 0     + 0.310 + 0.422 = 0.732

Context vector for "please": [1.576, 0.732]

How is "please" influenced?

  • 42.2% by itself → pulls in its own value [1,0]
  • 15.5% by "study" → pulls in [2,2]
  • 42.2% by "man" → pulls in [2,1]

The final vector is a blend of all three — influenced equally by "please" and "man" (42.2% each), with a smaller contribution from "study."


Computing Context Vector for "study"

Weights: [0.106, 0.787, 0.106] Values: [1,0], [2,2], [2,1]

0.106 × [1, 0] = [0.106, 0]
0.787 × [2, 2] = [1.574, 1.574]
0.106 × [2, 1] = [0.212, 0.106]

Add:

dim 1: 0.106 + 1.574 + 0.212 = 1.892
dim 2: 0     + 1.574 + 0.106 = 1.680

Context vector for "study": [1.892, 1.680]

"study" is heavily self-influenced (78.7%) — its output mostly reflects its own value vector, with modest contributions from "please" and "man."


Computing Context Vector for "man"

Weights: [0.212, 0.212, 0.576] Values: [1,0], [2,2], [2,1]

0.212 × [1, 0] = [0.212, 0]
0.212 × [2, 2] = [0.424, 0.424]
0.576 × [2, 1] = [1.152, 0.576]

Add:

dim 1: 0.212 + 0.424 + 1.152 = 1.788
dim 2: 0     + 0.424 + 0.576 = 1.000

Context vector for "man": [1.788, 1.000]


The Final Output Matrix

This matrix is the complete output of one self-attention head:

         dim1    dim2
please [  1.576,  0.732 ]
study  [  1.892,  1.680 ]
man    [  1.788,  1.000 ]

Three rows, one per input token. Each row is a contextual embedding — shaped by every other token in the sentence.


Why This Is Powerful: The Same Word, Different Vectors

Imagine the word "bank" in two sentences:

"I deposited money in the bank."
"The boat drifted toward the river bank."

In a static embedding (Word2Vec, GloVe), "bank" has the same vector in both sentences. The model can't distinguish the meaning.

In a Transformer, "bank" generates different attention patterns in each sentence:

  • In sentence 1: high attention to "money", "deposited" → the value blend pulls toward financial semantics
  • In sentence 2: high attention to "river", "boat" → the value blend pulls toward geographical semantics

The output context vector for "bank" is different in each sentence, even though the initial embedding was the same. The self-attention mechanism dynamically adjusts representations based on context.

This is why BERT-style models beat Word2Vec on virtually every NLP task — their representations are contextual, not static.


Formal Notation

In the original paper, this entire process is written as:

Attention(Q, K, V) = softmax(QKᵀ / √dk) · V

The · V at the end is the step we computed in this blog — matrix multiply of the attention weight matrix with the value matrix, producing the contextual output matrix.

In our case:

Attention weights matrix (3×3):
[0.422, 0.155, 0.422]
[0.106, 0.787, 0.106]
[0.212, 0.212, 0.576]

Value matrix (3×2):
[1, 0]
[2, 2]
[2, 1]

Output (3×2) = attention_weights × values:
[1.576, 0.732]
[1.892, 1.680]
[1.788, 1.000]

One matrix multiply. That's the entire aggregation step.
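As a sanity check, the matrices above can be multiplied in NumPy and compared against the worked examples:

```python
import numpy as np

weights = np.array([[0.422, 0.155, 0.422],   # 3×3 attention weights
                    [0.106, 0.787, 0.106],
                    [0.212, 0.212, 0.576]])
V = np.array([[1, 0],                        # 3×2 value matrix
              [2, 2],
              [2, 1]])

output = weights @ V                         # one matrix multiply
print(np.round(output, 3))
# rows match the per-token results above:
# [1.576, 0.732], [1.892, 1.680], [1.788, 1.000]
```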


What Changes in Multi-Head Attention?

In real Transformers, this computation doesn't happen just once — it happens h times in parallel, with different Wq, Wk, Wv matrices for each head.

Each head learns to attend to different types of relationships:

  • Head 1 might learn syntactic dependencies (subject ↔ verb)
  • Head 2 might learn coreference (pronoun ↔ noun)
  • Head 3 might learn positional proximity
  • Head 4 might learn semantic similarity

The outputs from all heads are concatenated and projected back to the original dimension:

MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · Wₒ

The Wₒ matrix learns how to combine information from all heads into a single coherent representation.
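The multi-head recipe can be sketched in a few lines. This is a toy illustration: the projection matrices here are random stand-ins, whereas real models learn Wq, Wk, Wv, and Wₒ during training, and the dimensions are made up for the example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, d_head, rng):
    outputs = []
    for _ in range(heads):
        # each head gets its own (here: random stand-in) projection matrices
        Wq, Wk, Wv = (rng.normal(size=(X.shape[1], d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)            # one head's contextual output
    concat = np.concatenate(outputs, axis=-1)  # Concat(head₁, ..., headₕ)
    Wo = rng.normal(size=(concat.shape[1], X.shape[1]))
    return concat @ Wo                         # project back to model dimension

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                    # 3 tokens, model dimension 8
out = multi_head_attention(X, heads=4, d_head=2, rng=rng)
print(out.shape)                               # (3, 8): same shape as the input
```

Note the shape bookkeeping: 4 heads × 2 dimensions each concatenate to 8, and Wₒ maps that back to the model dimension, so stacked layers can feed into each other.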


Key Takeaways

  • The contextual output vector for each token is a weighted sum of all value vectors, where weights come from softmax attention
  • This is what makes Transformer embeddings contextual — the same token gets a different representation in different sentences
  • The entire aggregation step is a single matrix multiply: attention_weights × V
  • In multi-head attention, this runs in parallel h times with different learned projections, each capturing different relationship types
  • Static embeddings (Word2Vec) can't do this — contextual embeddings are a fundamental upgrade

What's Next

We've now fully traced a token through a single self-attention head. Next: Encoder vs Decoder — we'll look at how these single-head blocks are assembled into multi-head attention, stacked into encoder and decoder layers, and how cross-attention connects them.


This is part 5 of a series on Transformer architecture.
