The Only Blog You Need to Understand Encoder-Decoder Architecture
A complete breakdown of encoder-decoder architectures—how they compress sequences into context vectors, generate outputs step-by-step, why teacher forcing matters, and the four key limitations that led to attention mechanisms.
The Problem We're Solving
Before encoder-decoder models, we had three main types of neural networks:
- Input → Output (Simple ML, ANNs, basic RNNs)
  - Example: Age + Role → Salary prediction
  - Works for fixed inputs and outputs
- Sequence → Label (RNN, LSTM, GRU)
  - Example: Email text → Spam or Ham
  - Example: Sentence → Emotion classification
  - Input is a sequence, output is a single label
- Sequence → Sequence (Encoder-Decoder)
  - Example: English question → Hindi answer
  - Example: Long article → Short summary
  - Both input AND output are sequences
The third type is where encoder-decoder models come in.
What Actually Happens
An encoder-decoder model first understands the input sequence, then generates an output sequence based on that understanding.
The flow:
Input → Encoder → Context Vector → Decoder → Output
Let's break this down.
What Is an Encoder?
An encoder converts an input sequence into a meaningful representation (context vector).
Think of it like this:
- You give it a sentence: "I love AI"
- It processes word by word: X₁ (I) → X₂ (love) → X₃ (AI)
- At the end, it produces a context vector — a fixed-size numerical representation that captures the "meaning" of the entire input
Encoder Characteristics:
- Can use RNN, LSTM, or GRU architecture
- Reads input from left → right
- No output generation — its only job is to understand
- Creates a context vector at the end
For example, with LSTM:
X₁ (I) → T₁ → h₁
X₂ (love) → T₂ → h₂
X₃ (AI) → T₃ → h₃ (context vector)
That final hidden state h₃ becomes the context vector passed to the decoder.
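To make this concrete, here is a minimal pure-Python sketch of an encoder. The weights (the 0.5 constants) and the "embeddings" are made-up illustrative numbers, not trained values, and a real encoder would use full weight matrices and LSTM gates; the point is only that the encoder reads left to right, produces no output along the way, and collapses any input length into one final hidden state.

```python
import math

def rnn_encoder(inputs, hidden_size=4):
    """Toy RNN encoder: reads a sequence of vectors left to right and
    returns the final hidden state as the context vector.
    Weights are fixed constants purely for illustration."""
    h = [0.0] * hidden_size                   # h₀: initial hidden state
    for x in inputs:                          # one timestep per input vector
        h = [math.tanh(0.5 * xi + 0.5 * hi)   # h_t = tanh(W·x_t + U·h_{t-1}),
             for xi, hi in zip(x, h)]         # with W and U collapsed to 0.5
        # no output is produced here -- the encoder's only job is to understand
    return h                                  # final hidden state = context vector

# "I love AI" as three made-up embedding vectors X₁, X₂, X₃
embeddings = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.1, 0.9, 0.2],
              [0.3, 0.7, 0.2, 0.8]]
context = rnn_encoder(embeddings)
print(len(context))  # fixed size (4), however long the sentence is
```

Note the design choice being illustrated: the loop keeps overwriting `h`, so only the last hidden state survives to become the context vector.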
What Is a Decoder?
A decoder generates the output sequence one element at a time, using the encoder's context vector.
For translation:
- Context vector: [representation of "I love AI"]
- Decoder generates: मैं → एआई → से → प्यार → करता → हूं (Hindi translation)
Decoder Characteristics:
- Probability-based model — uses softmax at each step
- Depends on previous output — each word depends on what came before
- Behaves differently in training vs prediction
This last point is crucial.
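Here is what "probability-based" means in code: a sketch of a single decoder step, where softmax turns raw scores into a probability distribution over a toy vocabulary. The weights and the fake output layer are illustrative constants standing in for a trained model.

```python
import math

VOCAB = ["<START>", "<END>", "मैं", "एआई", "प्यार", "करता", "हूं"]

def softmax(scores):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(prev_word_vec, hidden, context):
    """One decoder timestep: update the hidden state from the previous
    word and the context vector, then score every vocabulary word.
    The 0.4/0.3/0.3 mixing weights and the output layer are fake."""
    hidden = [math.tanh(0.4 * p + 0.3 * h + 0.3 * c)
              for p, h, c in zip(prev_word_vec, hidden, context)]
    scores = [sum(hidden) * (i + 1) * 0.1 for i in range(len(VOCAB))]
    return hidden, softmax(scores)

hidden, probs = decoder_step([0.1, 0.2, 0.3], [0.0] * 3, [0.5, 0.5, 0.5])
print(round(sum(probs), 6))  # softmax always yields a valid distribution
```

At each real decoding step, the word with the highest probability (or a sampled word) becomes the next output.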
Training vs Prediction: The Teacher Forcing Problem
Decoders work very differently during training and prediction.
Training Phase (Teacher Forcing):
- Decoder already knows the correct output
- At every step, it's given the correct previous word
- Even if it predicts wrong, the next step uses the correct word
Example:
- Target: "once upon a time"
- Step 1: <START> + context → predict "once" (even if wrong, next step gets "once")
- Step 2: "once" + context → predict "upon"
- Step 3: "upon" + context → predict "a"
- Step 4: "a" + context → predict "time"
This is called teacher forcing — the model is "helped" at every step.
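The teacher-forcing steps above can be sketched as a simple pairing of decoder inputs and targets. This is only the data-preparation side of training, with loss computation omitted:

```python
def teacher_forced_steps(target_sentence):
    """During training, the decoder's input at step t is the CORRECT
    word from step t-1, never the model's own (possibly wrong) guess."""
    words = target_sentence.split()
    inputs = ["<START>"] + words[:-1]   # what the decoder is fed
    targets = words                     # what it must predict at each step
    return list(zip(inputs, targets))

steps = teacher_forced_steps("once upon a time")
for fed, expected in steps:
    print(f"fed: {fed:7s} -> must predict: {expected}")
```

Even if the model predicts "twice" at step 1, step 2 is still fed the ground-truth "once".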
Prediction Phase:
- Decoder does NOT know the output
- It uses its own previous predictions
- If it makes a mistake early, errors can compound
Example:
- Step 1: <START> + context → predict "once"
- Step 2: "once" + context → predict "upon" (if Step 1 was wrong, this suffers)
- And so on...
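The prediction loop is the mirror image: the model feeds its own previous output back in. A minimal greedy-decoding sketch, with a lookup table standing in for a trained model:

```python
def greedy_decode(context, predict, max_steps=10):
    """At prediction time the decoder consumes its OWN previous output,
    so an early mistake propagates to every later step.
    `predict` maps (previous word, context) -> next word."""
    word, output = "<START>", []
    for _ in range(max_steps):           # cap length in case <END> never comes
        word = predict(word, context)
        if word == "<END>":
            break
        output.append(word)
    return output

# A toy "model" that just follows a lookup table
table = {"<START>": "once", "once": "upon", "upon": "a",
         "a": "time", "time": "<END>"}
result = greedy_decode("ctx", lambda w, c: table[w])
print(result)  # ['once', 'upon', 'a', 'time']
```

If the toy model mapped `"<START>"` to the wrong word, every subsequent lookup would follow that wrong path, which is exactly the compounding-error problem.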
The Architecture in Detail
Here's what happens internally:
Encoder side:
X₁ → Embedding → LSTM (T₁) → h₁
X₂ → Embedding → LSTM (T₂) → h₂
X₃ → Embedding → LSTM (T₃) → Context Vector (h₃, c₃)
Each word gets embedded, then processed through LSTM cells. The final hidden state becomes the context vector.
Decoder side:
<START> → LSTM → softmax → word1
word1 → LSTM → softmax → word2
word2 → LSTM → softmax → word3
word3 → LSTM → softmax → word4 → <END>
Context vector is fed into every decoder timestep. At each step, softmax produces a probability distribution over the vocabulary.
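The "fed into every decoder timestep" detail is easy to miss in diagrams, so here is a sketch that makes it explicit. The mixing constants are illustrative, and initializing the hidden state from the context is one common choice, not the only one:

```python
import math

def decoder_cell(x, h, context):
    """One decoder step: mix previous output x, hidden state h, and the
    (always re-supplied) context vector. Constants are illustrative."""
    return [math.tanh(0.4 * xi + 0.3 * hi + 0.3 * ci)
            for xi, hi, ci in zip(x, h, context)]

def decode(context, n_steps=4):
    """The SAME context vector enters every timestep; only the hidden
    state and the previous output change as decoding proceeds."""
    h = context[:]                    # init hidden state from the context
    x = [0.0] * len(context)          # stand-in embedding for <START>
    outputs = []
    for _ in range(n_steps):
        h = decoder_cell(x, h, context)  # context re-used here every step
        outputs.append(h)                # real model: softmax(W·h) -> word
        x = h                            # feed own output forward
    return outputs

steps = decode([0.5, -0.2, 0.1])
print(len(steps), len(steps[0]))  # 4 timesteps, each sized like the context
```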
Key Problems with Encoder-Decoder Models
While this architecture works, it has limitations:
1. Information Bottleneck
- The entire input sequence gets compressed into a fixed-size context vector
- For long sentences, this vector can't capture everything
- Information gets lost
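The bottleneck is easy to demonstrate: whatever the input length, the context vector has the same fixed size. A toy encoder (stand-in constants, not trained weights) makes the point:

```python
import math

def encode(sequence, hidden_size=4):
    """Toy encoder over scalar 'embeddings': the output is always one
    fixed-size vector, regardless of how long the input is."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = [math.tanh(0.5 * x + 0.5 * hi) for hi in h]
    return h

short = encode([0.1, 0.2, 0.3])   # a 3-word sentence
long = encode([0.1] * 50)         # a 50-word sentence
print(len(short), len(long))      # same capacity for both
```

Fifty words must squeeze through the same four numbers as three words, so detail from early in the long input is inevitably lost.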
2. No Word-to-Word Alignment
- The decoder doesn't know which input words correspond to which output words
- It just has one context vector for the entire input
3. Teacher Forcing Creates a Gap
- During training, the model is always given correct previous words
- During prediction, it has to use its own (possibly wrong) predictions
- This mismatch can hurt performance
4. Long Sentence Failure
- As sentences get longer, the fixed context vector becomes a bigger bottleneck
- Early words in the input are "forgotten" by the time the decoder runs
(Note: These problems led to the invention of attention mechanisms, which we'll cover in another post.)
What I Learned
Encoder-decoder models are elegant in their simplicity:
- Encoders understand (compress input into meaning)
- Decoders generate (expand meaning into output)
- Context vector bridges the two
But the architecture has real limitations — especially the information bottleneck and lack of alignment. Understanding these problems is key to appreciating why attention mechanisms became so important.
The beauty is in the design: two separate networks, one for understanding and one for generation, connected by a single vector. Simple, powerful, and the foundation for modern sequence-to-sequence models.