Encoder vs Decoder — What Each Half of the Transformer Actually Does
A clear breakdown of what the encoder and decoder each do in a Transformer — their internal structure, how multi-head self-attention works, what cross-attention is, and when you'd use encoder-only vs decoder-only vs full encoder-decoder models.
We've spent five blogs understanding self-attention from the ground up. Now let's zoom out and see how that mechanism fits into the full Transformer architecture — specifically, what the encoder and decoder each do, and why they're structured differently.
This is where everything connects.
The Big Picture
The Transformer is an encoder-decoder architecture. At the highest level:
- The encoder reads and understands the input
- The decoder generates the output
For machine translation (the original use case):
- Encoder reads: "please study man" (English)
- Decoder generates: "por favor estudia hombre" (Spanish)
The encoder produces rich contextual representations of the input. The decoder uses those representations — plus what it has generated so far — to produce the next token.
Stacked Layers
Both encoder and decoder are stacked — the original paper used 6 of each. Modern models use far more.
The intuition behind stacking:
- Lower layers capture basic syntactic relationships — which words are near each other, subject-verb agreement, part-of-speech patterns
- Middle layers understand broader context — pronoun resolution, entity relationships, semantic groupings
- Higher layers capture abstract meaning — sentiment, discourse structure, world knowledge
This emergent specialization isn't explicitly programmed. It arises from the model learning to minimize prediction error across billions of examples.
Each layer takes the output of the previous layer as input — so by the time a token's representation reaches the final encoder layer, it has been contextualized through 6 rounds of attention.
Inside the Encoder Block
Each encoder layer has two sub-layers:
Input (contextual embeddings from previous layer)
↓
┌────────────────────────┐
│ Multi-Head │
│ Self-Attention │
└────────────────────────┘
↓
Add & Norm (residual connection + layer normalization)
↓
┌────────────────────────┐
│ Feed-Forward │
│ Neural Network │
└────────────────────────┘
↓
Add & Norm
↓
Output (richer contextual embeddings)
Multi-Head Self-Attention
This is the full version of the single-head attention we computed across the last four blogs — but run in parallel with h different sets of Wq, Wk, Wv matrices.
Each "head" independently computes attention and produces a contextual output. The outputs are concatenated and projected:
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · Wₒ
where headᵢ = Attention(Q·Wqᵢ, K·Wkᵢ, V·Wvᵢ)
With 8 heads (as in the original paper) and dk = 64, each head operates in 64-dimensional space, and the concatenated output is projected back to 512 dimensions via Wₒ.
Different heads learn to track different relationship types simultaneously. The encoder can attend to syntactic structure in one head while tracking semantic similarity in another — all from the same layer.
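A minimal NumPy sketch of this computation, with toy shapes and random weights purely for illustration (a real implementation would batch all heads into a single projection for speed):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention for one head
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    # W_q, W_k, W_v: lists of h per-head projection matrices
    heads = []
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]
        heads.append(attention(Q, K, V))        # each head: (tokens, d_k)
    return np.concatenate(heads, axis=-1) @ W_o  # project back to d_model

# toy setup: 3 tokens, d_model = 512, h = 8, d_k = 64
rng = np.random.default_rng(0)
d_model, h, d_k = 512, 8, 64
X = rng.normal(size=(3, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (3, 512)
```

Note how the shapes work out: 8 heads of 64 dimensions concatenate to 512, and Wₒ maps that back to the model dimension.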
Feed-Forward Network
After multi-head attention, each token's representation is passed independently through a position-wise feed-forward network (no cross-token interaction here):
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
This is a 2-layer MLP with a ReLU activation. In the original paper, the inner dimension is 2048 (4× the model dimension of 512). This step adds non-linearity and lets the model transform the attended representation before passing it to the next layer.
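This formula transcribes directly into NumPy. Weights here are random and scaled down only to keep the toy activations small:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # position-wise: the same 2-layer MLP is applied to every token row
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU between the layers

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                       # paper's dimensions
x = rng.normal(size=(3, d_model))               # 3 tokens
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (3, 512)
```

Because the FFN sees one token at a time, it can be computed for all positions in parallel with a single matrix multiply.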
Residual Connections + Layer Norm
Both sub-layers use residual connections — the input is added to the output before normalization:
output = LayerNorm(x + Sublayer(x))
This is borrowed from ResNets. It helps gradients flow during training and lets the model learn incremental refinements at each layer rather than complete transformations. Without residuals, very deep networks are hard to train.
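The post-norm pattern can be sketched in a few lines of NumPy. The lambda below is a dummy stand-in for a real attention or FFN sub-layer, and the learnable scale/shift parameters of layer norm are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # post-norm, as in the original paper: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(3, 512))
out = add_and_norm(x, lambda t: t * 0.1)  # dummy sub-layer for illustration
print(out.shape)  # (3, 512)
```

The `x +` term is the residual path: even if the sub-layer contributes nothing useful early in training, the identity signal still flows through.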
Inside the Decoder Block
The decoder is more complex — it has three sub-layers per block:
Input (output tokens generated so far)
↓
┌────────────────────────┐
│ Masked Multi-Head │
│ Self-Attention │
└────────────────────────┘
↓
Add & Norm
↓
┌────────────────────────┐
│ Cross-Attention │
│ (Encoder → Decoder) │
└────────────────────────┘
↓
Add & Norm
↓
┌────────────────────────┐
│ Feed-Forward │
│ Neural Network │
└────────────────────────┘
↓
Add & Norm
↓
Output
1. Masked Multi-Head Self-Attention
The decoder attends to its own output sequence — but with a causal mask that prevents it from looking at future tokens.
Why? Because during generation, the decoder produces tokens one at a time. When generating token 5, it can only use tokens 1–4, since tokens 5 and beyond don't exist yet. The mask enforces this by setting the attention scores for future positions to -∞ before softmax, which drives their attention weights to zero.
During training, this is done all at once (the entire target sequence is fed in), but the mask simulates the autoregressive constraint.
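One way to build that mask in NumPy. With all-zero scores, each row's surviving weights come out uniform, which makes the causal structure easy to see:

```python
import numpy as np

def causal_mask(n):
    # future positions (upper triangle above the diagonal) get -inf
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention_weights(scores):
    scores = scores + causal_mask(scores.shape[0])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    e = np.exp(scores)                                    # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

w = masked_attention_weights(np.zeros((4, 4)))
print(np.round(w, 2))
# row i attends uniformly over positions 0..i; all future weights are 0
```

This is why training can process the whole target sequence in one pass: the mask, not the data feeding order, is what prevents information leaking from the future.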
2. Cross-Attention (Encoder-Decoder Attention)
This is the bridge between encoder and decoder — and it's what makes the full encoder-decoder architecture special.
In cross-attention:
- Queries come from the decoder (current generation state)
- Keys and Values come from the encoder output (the encoded input sequence)
CrossAttention(Q_decoder, K_encoder, V_encoder)
This is how the decoder "looks up" information from the input. At each step, the decoder asks: "Given what I've generated so far, which parts of the input are most relevant for my next token?"
For translation, this is how the model learns alignment — which source words correspond to which target words.
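A single-head cross-attention sketch in NumPy, with toy dimensions (in the full model this runs per head at dk = 64, inside the multi-head machinery shown earlier):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, encoder_out, W_q, W_k, W_v):
    Q = decoder_state @ W_q   # queries: from the decoder
    K = encoder_out @ W_k     # keys:    from the encoder output
    V = encoder_out @ W_v     # values:  from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 64
dec = rng.normal(size=(2, d))   # 2 target tokens generated so far
enc = rng.normal(size=(3, d))   # 3 encoded source tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)
print(out.shape)  # (2, 64): one source-informed context vector per target token
```

The only difference from self-attention is where Q versus K and V come from; the arithmetic is identical.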
3. Feed-Forward Network
Same as in the encoder — a position-wise 2-layer MLP applied independently to each token.
Three Types of Transformer Models
Not all Transformers use both encoder and decoder. Modern models tend to use one or the other depending on the task:
Encoder-only models (e.g. BERT, RoBERTa)
- Only the encoder stack
- Produces rich contextual embeddings for every input token
- Good for understanding tasks: classification, NER, question answering, sentence similarity
- Bidirectional — can attend to tokens both before and after the current position
Decoder-only models (e.g. GPT-4, LLaMA, Claude)
- Only the decoder stack (with causal masking)
- Generates output autoregressively, one token at a time
- Good for generation tasks: text completion, chat, code generation
- Unidirectional — each token only attends to previous tokens
Encoder-decoder models (e.g. T5, BART, original Transformer)
- Full architecture with both components
- Encoder reads the input, decoder generates the output
- Good for seq2seq tasks: translation, summarization, question generation
Which architecture to use is a design decision driven by the task, not a ranking of capability: each variant simply keeps the parts of the Transformer it actually needs.
What Stacked Layers Actually Learn
Research into what individual attention heads and layers learn has produced some fascinating patterns.
In encoder models like BERT, different heads in different layers tend to specialize:
- Some heads learn positional relationships (attending to adjacent tokens)
- Some heads learn syntactic structure (subject ↔ verb, noun ↔ modifier)
- Some heads learn coreference (attending from a pronoun back to its antecedent)
- Some heads seem to attend to punctuation or sentence boundaries
This specialization isn't programmed — it's entirely emergent from training on language modeling objectives.
This is part of what makes Transformers so powerful and so hard to fully interpret.
The Information Flow, End to End
Let's trace a token from raw input to final output:
1. Input text → tokenizer → token IDs
2. Token IDs → embedding lookup → static embeddings
3. Static embeddings + positional encodings → input to encoder
4. Encoder layer 1 → multi-head self-attention → FFN → contextual embeddings (level 1)
5. Encoder layer 2 → ... → contextual embeddings (level 2)
...repeat for all N encoder layers...
6. Final encoder output = rich contextual representations of input
7. Decoder: attends to previous outputs (masked self-attention)
8. Decoder: cross-attends to encoder output (cross-attention)
9. Decoder: FFN → refined decoder state
...repeat for all M decoder layers...
10. Final decoder output → linear projection + softmax → probability over vocabulary
11. Sample/argmax → next token → feed back into decoder → repeat
Every piece of this pipeline — from the attention scores to the residual connections to the feed-forward networks — is differentiable and learned end-to-end from data.
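Steps 7–11 form the generation loop, which can be sketched as a greedy decoder. The `encode` and `decode_step` callables here are hypothetical stand-ins for a real model (a tiny dummy is wired in so the loop actually runs):

```python
import numpy as np

def greedy_decode(encode, decode_step, src_ids, bos_id, eos_id, max_len=10):
    enc = encode(src_ids)                 # encode the input once (steps 1-6)
    out = [bos_id]
    for _ in range(max_len):
        logits = decode_step(enc, out)    # one decoder pass (steps 7-10)
        next_id = int(np.argmax(logits))  # greedy pick (step 11)
        out.append(next_id)               # feed back and repeat
        if next_id == eos_id:
            break
    return out

# dummy stand-in "model": predicts token = number generated so far, EOS at 3
encode = lambda src: np.array(src)
def decode_step(enc, out):
    logits = np.zeros(5)
    logits[min(len(out), 3)] = 1.0
    return logits

print(greedy_decode(encode, decode_step, [7, 8], bos_id=0, eos_id=3))
# [0, 1, 2, 3]
```

The key structural point: the encoder runs once, but the decoder loop runs once per generated token, which is why decoding dominates inference cost.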
Key Takeaways
- The encoder produces contextual representations of the input; the decoder generates output autoregressively using those representations
- Each encoder layer has 2 sub-layers: multi-head self-attention + FFN
- Each decoder layer has 3 sub-layers: masked self-attention + cross-attention + FFN
- Residual connections and layer normalization are essential for training deep stacks
- Different layers specialize — lower for syntax, middle for context, higher for semantics
- Modern models tend to use encoder-only (BERT) or decoder-only (GPT) rather than the full encoder-decoder, depending on task type
- Cross-attention is the mechanism that connects encoder knowledge to decoder generation
Wrapping Up the Series
This is the final blog in the Transformer series. Here's what we covered:
- ✅ Attention Is All You Need — Why Transformers replaced RNNs and the full architecture overview
- ✅ Query, Key, Value — The database analogy and how Q, K, V are computed
- ✅ Self-Attention From Scratch — Full numerical walkthrough with "please study man"
- ✅ Softmax Demystified — How raw scores become attention weights, numerical stability, scaling
- ✅ Contextual Embeddings — How value-weighted sums produce context-aware representations
- ✅ Encoder vs Decoder — Architecture internals, multi-head attention, cross-attention, model types
You now have a complete bottom-up understanding of how Transformers work — from a single dot product all the way to the architecture powering every major AI model in use today.
This is part 6 of a series on Transformer architecture. Start from part 1 if you're new here.