Query, Key, Value — The Database Analogy That Makes Self-Attention Click
A deep intuitive breakdown of the Q, K, V mechanism in self-attention — using a database retrieval analogy and real weight matrix math to show exactly how Transformers decide which words to attend to.
In the last blog, we saw that the Transformer's big idea is letting every word directly attend to every other word simultaneously. But how does a word know which other words to pay attention to? How does it decide relevance?
That's where Query, Key, and Value come in.
Q, K, V is the mechanism underneath self-attention. Once you truly get this, the rest of the Transformer makes sense.
The Problem We're Solving
When a model processes the sentence "please study man", each word needs to figure out: which other words in this sentence are most relevant to understanding me?
The word "man" needs to know whether "study" is more important context than "please." The word "please" needs to decide how much attention to give to "man."
We need a way to compute relevance scores between every pair of words. Q, K, V is how we do it.
The Database Analogy
Think of self-attention as a soft database lookup.
In a traditional database:
- You send a query (what you're looking for)
- The database checks your query against keys (labels/descriptions of each row)
- When it finds a match, it returns the value (the actual content of that row)
It's a hard match — either the key matches your query or it doesn't.
Self-attention does the same thing, but softly. Instead of a binary match, every key partially matches every query to some degree. The model computes a similarity score between the query and every key, then uses those scores to retrieve a weighted blend of all values.
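To make the hard-vs-soft distinction concrete, here is a toy sketch in NumPy. All names and numbers are illustrative, not from any real model: a hard lookup returns exactly one value, while a soft lookup returns a similarity-weighted blend of every value.

```python
import numpy as np

# Hard lookup: the key either matches or it doesn't.
table = {"color": "red", "size": "large"}
hard_result = table["color"]     # exactly one value comes back

# Soft lookup: every key matches the query to some degree.
keys = np.array([[1.0, 0.0],     # key for "color"
                 [0.0, 1.0]])    # key for "size"
values = np.array([[0.9, 0.1],   # value stored under "color"
                   [0.2, 0.8]])  # value stored under "size"
query = np.array([0.8, 0.2])     # mostly asking about "color"

scores = keys @ query                            # similarity of query to each key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax turns scores into weights
soft_result = weights @ values                   # weighted blend of ALL values
```

Because the query leans toward "color", the blend is dominated by the "color" value, but the "size" value still contributes a little. That partial contribution is exactly what "soft" means here.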
This is the core loop:
Query → "What am I looking for?"
Key → "What does each word advertise it contains?"
Value → "What does each word actually give back once matched?"
Where Q, K, V Come From
Here's what a lot of explanations skip: Q, K, V are not the word embeddings themselves. They are learned linear transformations of the embeddings.
Every token's embedding gets projected through three separate weight matrices:
Q = embedding × Wq (weight matrix for queries)
K = embedding × Wk (weight matrix for keys)
V = embedding × Wv (weight matrix for values)
These weight matrices — Wq, Wk, Wv — are learned during training. The model learns what kind of "questions" to ask (Wq), what kind of "labels" to advertise (Wk), and what kind of "content" to return (Wv).
This is crucial: the same word embedding produces three different vectors — one for when it's asking a question (query role), one for when it's being asked about (key role), and one for when it's contributing content (value role).
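A minimal NumPy sketch of these three projections. The dimensions and weight matrices below are random stand-ins; in a real Transformer, Wq, Wk, and Wv are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4                        # embedding size (illustrative)
x = rng.normal(size=(3, d_model))  # embeddings for a 3-token sentence

# Three separate weight matrices (random here, learned in practice)
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

Q = x @ Wq  # what each token is asking for
K = x @ Wk  # what each token advertises
V = x @ Wv  # what each token contributes
```

Note that the same embeddings `x` feed all three projections; only the weight matrices differ, which is what lets one token play three different roles.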
A Concrete Example
Let's use the sentence: "please study man"
Say our input embeddings are:
please → [1, 0]
study → [0, 2]
man → [1, 1]
And our weight matrices are:
Wq = [[1, 0],
      [0, 1]]

Wk = [[1, 1],
      [0, 1]]

Wv = [[1, 0],
      [0, 1]]
Step 1: Calculating Query Vectors
Each embedding is multiplied by Wq:
please query = [1,0] · Wq = [1×1 + 0×0, 1×0 + 0×1] = [1, 0]
study query = [0,2] · Wq = [0×1 + 2×0, 0×0 + 2×1] = [0, 2]
man query = [1,1] · Wq = [1×1 + 1×0, 1×0 + 1×1] = [1, 1]
Query vectors:
please → [1, 0]
study → [0, 2]
man → [1, 1]
Step 2: Calculating Key Vectors
Each embedding is multiplied by Wk:
please key = [1,0] · Wk = [1×1 + 0×0, 1×1 + 0×1] = [1, 1]
study key = [0,2] · Wk = [0×1 + 2×0, 0×1 + 2×1] = [0, 2]
man key = [1,1] · Wk = [1×1 + 1×0, 1×1 + 1×1] = [1, 2]
Key vectors:
please → [1, 1]
study → [0, 2]
man → [1, 2]
Step 3: Calculating Value Vectors
Each embedding is multiplied by Wv:
please value = [1,0] · Wv = [1×1 + 0×0, 1×0 + 0×1] = [1, 0]
study value = [0,2] · Wv = [0×1 + 2×0, 0×0 + 2×1] = [0, 2]
man value = [1,1] · Wv = [1×1 + 1×0, 1×0 + 1×1] = [1, 1]
Value vectors:
please → [1, 0]
study → [0, 2]
man → [1, 1]
Notice that because Wq and Wv are identity matrices in this toy example, the query and value vectors equal the embeddings; only Wk changes anything. Real learned matrices would transform all three.
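The whole worked example can be checked with a few lines of NumPy. The numbers below are exactly the embeddings and weight matrices from above.

```python
import numpy as np

X = np.array([[1, 0],   # please
              [0, 2],   # study
              [1, 1]])  # man

Wq = np.array([[1, 0], [0, 1]])
Wk = np.array([[1, 1], [0, 1]])
Wv = np.array([[1, 0], [0, 1]])

Q = X @ Wq  # queries: [[1,0], [0,2], [1,1]]
K = X @ Wk  # keys:    [[1,1], [0,2], [1,2]]
V = X @ Wv  # values:  [[1,0], [0,2], [1,1]]
```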
Why Three Separate Matrices?
You might ask — why not just use the embedding directly for all three roles? Why learn separate Wq, Wk, Wv?
The reason is role separation. A word needs to behave differently depending on its role in a given moment:
- When "man" is asking for context (query role), it should express what it's looking for — maybe syntactic subjects, maybe semantic roles.
- When "man" is being asked about (key role), it should advertise what information it can provide.
- When "man" is contributing content (value role), it should return its actual semantic content.
These are three different jobs. Learning separate weight matrices lets the model independently optimize how each word plays each role.
If you used the same vector for all three, the model couldn't distinguish between "what I'm looking for" and "what I'm offering." The separation is what gives the mechanism its expressiveness.
The Role of Each Component
Let's be precise about what each component does in the full pipeline:
Query (Q): Acts like a search request from the current word. It's asking: "What information am I looking for from the other words in this sequence?" Each word generates its own query vector when it needs to gather context.
Key (K): Functions like a label or description for each word — advertising the information it contains. The query of one word is compared to the keys of all other words (including itself) to determine a similarity or relevance score.
Value (V): Contains the actual content or meaning of the word that should be retrieved once the relevance is determined. After computing how much to attend to each word (via Q·K scores), the values are the things that actually get mixed together.
What Happens Next
Once we have Q, K, V vectors for every word, the next step is:
- Compute raw attention scores → dot product of each query with every key: score = Q · Kᵀ
- Scale the scores by √d_k to prevent large values from dominating softmax
- Apply softmax to turn scores into probabilities (attention weights)
- Multiply weights by V to get a weighted blend — the contextual output vector
In formal notation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
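The formula translates almost line-for-line into NumPy. This is a bare sketch, not a batched or numerically stabilized implementation, using the Q, K, V from the worked example above.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of every query to every key
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                       # weighted blend of the values

# Q, K, V from the "please study man" example
Q = np.array([[1, 0], [0, 2], [1, 1]])
K = np.array([[1, 1], [0, 2], [1, 2]])
V = np.array([[1, 0], [0, 2], [1, 1]])
out = attention(Q, K, V)  # one contextual vector per word
```

Because softmax makes each row of weights a convex combination, every output vector lies inside the range spanned by the value vectors: no word's output can drift outside what the values actually contain.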
We'll work through this entire computation with real numbers in the next blog — every dot product, every softmax step, every contextual vector — for all three words in "please study man."
Intuition Check
Before we move on, let's make sure the intuition is solid.
Imagine you're "please" in the sentence "please study man." You generate a query: "I'm a polite request word — who in this sentence tells me what I'm requesting?"
You compare your query against every word's key. "study" has a key that says "I'm an action/verb." "man" has a key that says "I'm the subject." Your query-key similarity scores tell you how much to weight each word's value when building your own contextual representation.
The higher the score between your query and a word's key, the more of that word's value you pull into your final representation.
That's it. That's all of self-attention at the conceptual level.
Key Takeaways
- Q, K, V are not the embeddings themselves — they are learned projections of the embeddings through Wq, Wk, Wv
- Query = what this word is asking for from others
- Key = what this word advertises it can provide
- Value = what this word actually contributes once selected
- The separation of roles is what makes the mechanism expressive — the same word behaves differently when asking vs. being asked vs. contributing
- Self-attention is essentially a soft, differentiable database lookup that is fully parallelizable
What's Next
Now that we have our Q, K, V vectors computed for every word, we're ready to run the full self-attention calculation.
Next up: Self-Attention From Scratch — A Complete Numerical Walkthrough, where we take "please study man" all the way through dot products, scaling, softmax, and weighted value aggregation — step by step, number by number.
This is part 2 of a series on Transformer architecture. Start from part 1 if you haven't already.