2026-03-20

The Only Blog You Need to Understand Encoder-Decoder Architecture

A complete breakdown of encoder-decoder architectures—how they compress sequences into context vectors, generate outputs step-by-step, why teacher forcing matters, and the four key limitations that led to attention mechanisms.

Encoder-Decoder · Sequence to Sequence · LSTM · RNN · NLP · Machine Learning · Deep Learning · Learning In Public

The Problem We're Solving

Before encoder-decoder models, we had three main types of neural networks:

  1. Input → Output (Simple ML, ANNs, basic RNNs)

    • Example: Age + Role → Salary prediction
    • Works for fixed inputs and outputs
  2. Sequence → Label (RNN, LSTM, GRU)

    • Example: Email text → Spam or Ham
    • Example: Sentence → Emotion classification
    • Input is a sequence, output is a single label
  3. Sequence → Sequence (Encoder-Decoder)

    • Example: English question → Hindi answer
    • Example: Long article → Short summary
    • Both input AND output are sequences

The third type is where encoder-decoder models come in.


What Actually Happens

An encoder-decoder model first understands the input sequence, then generates an output sequence based on that understanding.

The flow:

Input → Encoder → Context Vector → Decoder → Output

Let's break this down.


What Is an Encoder?

An encoder converts an input sequence into a meaningful representation (context vector).

Think of it like this:

  • You give it a sentence: "I love AI"
  • It processes word by word: X₁ (I) → X₂ (love) → X₃ (AI)
  • At the end, it produces a context vector — a fixed-size numerical representation that captures the "meaning" of the entire input

Encoder Characteristics:

  • Can use RNN, LSTM, or GRU architecture
  • Reads input from left → right
  • No output generation — its only job is to understand
  • Creates a context vector at the end

For example, with LSTM:

X₁ (I) → T₁ → h₁
X₂ (love) → T₂ → h₂
X₃ (AI) → T₃ → h₃ (context vector)

That final hidden state h₃ becomes the context vector passed to the decoder.
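The left-to-right update above can be sketched in plain Python. This is a toy single-unit RNN with made-up scalar weights (`w_x`, `w_h` are illustrative, not from any trained model): each step folds the current word into the running hidden state, and only the final state survives as the context vector.

```python
import math

def encode(inputs, w_x=0.5, w_h=0.8):
    """Toy single-unit RNN encoder: reads the sequence left to right
    and returns only the final hidden state (the context vector).
    w_x and w_h are made-up scalar weights for illustration."""
    h = 0.0  # initial hidden state
    for x in inputs:  # X1 -> X2 -> X3, one timestep per token
        h = math.tanh(w_x * x + w_h * h)  # h_t depends on x_t and h_{t-1}
    return h  # h3: the context vector

# "I love AI" as toy one-number embeddings (purely illustrative)
context = encode([0.2, 0.9, 0.4])
```

Note there is no output at any timestep; whatever the encoder learned about the whole sentence has to fit into that one returned number (or, in a real model, one fixed-size vector).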


What Is a Decoder?

A decoder generates the output sequence one element at a time, using the encoder's context vector.

For translation:

  • Context vector: [representation of "I love AI"]
  • Decoder generates: मैं → एआई → से → प्यार → करता → हूं (the Hindi translation)

Decoder Characteristics:

  1. Probability-based model — uses softmax at each step
  2. Depends on previous output — each word depends on what came before
  3. Behaves differently in training vs prediction

This last point is crucial.
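Point 1 is worth making concrete. At every step the decoder produces one raw score (logit) per vocabulary word, and softmax turns those scores into probabilities. A minimal sketch, with a hypothetical 4-word vocabulary and made-up scores:

```python
import math

def softmax(logits):
    """Turn raw decoder scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decoder scores over a tiny Hindi vocabulary
vocab = ["मैं", "एआई", "प्यार", "हूं"]
probs = softmax([2.1, 0.3, 1.0, -0.5])
best = vocab[probs.index(max(probs))]  # greedy pick of the next word
```

The probabilities always sum to 1, and greedy decoding simply takes the highest-probability word at each step.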


Training vs Prediction: The Teacher Forcing Problem

Decoders work very differently during training and prediction.

Training Phase (Teacher Forcing):

  • Decoder already knows the correct output
  • At every step, it's given the correct previous word
  • Even if it predicts wrong, the next step uses the correct word

Example:

  • Target: "once upon a time"
  • Step 1: <START> + context → predict "once" (even if wrong, next step gets "once")
  • Step 2: "once" + context → predict "upon"
  • Step 3: "upon" + context → predict "a"
  • Step 4: "a" + context → predict "time"

This is called teacher forcing — the model is "helped" at every step.
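The steps above can be sketched as a loop. `decoder_step` here is a hypothetical stand-in for a real LSTM-plus-softmax timestep; it deliberately mispredicts after "once" so you can see teacher forcing at work: the mistake at step 2 does not leak into step 3, because step 3 is fed the correct word "upon" anyway.

```python
def decoder_step(prev_word, context):
    """Hypothetical stand-in for one decoder timestep (in a real model:
    an LSTM cell + softmax). Deliberately wrong after "once"."""
    guesses = {"<START>": "once", "once": "a", "upon": "a", "a": "time"}
    return guesses.get(prev_word, "<UNK>")

target = ["once", "upon", "a", "time"]
inputs = ["<START>"] + target[:-1]  # always the CORRECT previous words
preds = [decoder_step(w, None) for w in inputs]
# preds[1] is wrong ("a" instead of "upon"), but step 3 still received
# the correct word "upon", so later predictions are unaffected.
```

During training, `preds` is compared against `target` to compute the loss; the wrong prediction is penalized, but it never contaminates the inputs of later steps.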

Prediction Phase:

  • Decoder does NOT know the output
  • It uses its own previous predictions
  • If it makes a mistake early, errors can compound

Example:

  • Step 1: <START> + context → predict "once"
  • Step 2: "once" + context → predict "upon" (if Step 1 was wrong, this suffers)
  • And so on...
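Contrast that with free-running decoding, sketched below with the same hypothetical `decoder_step`. Now each step consumes the model's own previous prediction, so the single mistake after "once" derails everything downstream:

```python
def decoder_step(prev_word, context):
    """Same hypothetical stand-in as before: mispredicts after "once"."""
    guesses = {"<START>": "once", "once": "a", "upon": "a", "a": "time"}
    return guesses.get(prev_word, "<UNK>")

def generate(context, max_len=4):
    """Free-running decoding: each step is fed the model's OWN
    previous prediction, so one early mistake compounds."""
    word, output = "<START>", []
    for _ in range(max_len):
        word = decoder_step(word, context)
        output.append(word)
    return output

seq = generate(None)
# The step-2 mistake ("a") becomes the input to step 3, which then
# predicts "time" too early and finally falls off into "<UNK>".
```

This is exactly the training/prediction mismatch: the model never practiced recovering from its own errors, because teacher forcing shielded it from them.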

The Architecture in Detail

Here's what happens internally:

Encoder side:

X₁ → X₂ → X₃          (embedded inputs)
 ↓    ↓    ↓
T₁ → T₂ → T₃ → Context Vector (h₃, c₃)

Each word gets embedded, then processed through an LSTM cell at each timestep. The final hidden state h₃ and cell state c₃ together form the context vector.

Decoder side:

<START> → LSTM → LSTM → LSTM → LSTM → <END>
           ↓       ↓       ↓       ↓
        softmax softmax softmax softmax
           ↓       ↓       ↓       ↓
         word1   word2   word3   word4

Context vector is fed into every decoder timestep. At each step, softmax produces a probability distribution over the vocabulary.
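Putting the decoder side together: one timestep takes the previous word's embedding plus the context vector, updates its hidden state, scores every vocabulary word, and softmaxes. A toy sketch with made-up scalar weights (`W` maps the hidden state to one logit per vocabulary word; none of these numbers come from a real model):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_timestep(prev_embedding, context, h, W):
    """One illustrative decoder step. The context vector is part of the
    input at EVERY timestep, not just the first. W holds made-up weights
    mapping the new hidden state to vocabulary logits."""
    h = math.tanh(prev_embedding + context + 0.5 * h)  # toy recurrence
    logits = [w * h for w in W]  # one raw score per vocabulary word
    return h, softmax(logits)

vocab = ["word1", "word2", "word3", "<END>"]
h, probs = decoder_timestep(prev_embedding=0.3, context=0.7, h=0.0,
                            W=[1.5, -0.2, 0.4, 0.1])
```

Run this in a loop, feeding each step's chosen word back in, and you have the whole decoder column from the diagram, ending when `<END>` wins the softmax.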


Key Problems with Encoder-Decoder Models

While this architecture works, it has limitations:

1. Information Bottleneck

  • The entire input sequence gets compressed into a fixed-size context vector
  • For long sentences, this vector can't capture everything
  • Information gets lost

2. No Word-to-Word Alignment

  • The decoder doesn't know which input words correspond to which output words
  • It just has one context vector for the entire input

3. Teacher Forcing Creates a Gap

  • During training, the model is always given correct previous words
  • During prediction, it has to use its own (possibly wrong) predictions
  • This mismatch can hurt performance

4. Long Sentence Failure

  • As sentences get longer, the fixed context vector becomes a bigger bottleneck
  • Early words in the input are "forgotten" by the time the decoder runs

(Note: These problems led to the invention of attention mechanisms, which we'll cover in another post.)


What I Learned

Encoder-decoder models are elegant in their simplicity:

  • Encoders understand (compress input into meaning)
  • Decoders generate (expand meaning into output)
  • Context vector bridges the two

But the architecture has real limitations — especially the information bottleneck and lack of alignment. Understanding these problems is key to appreciating why attention mechanisms became so important.

The beauty is in the design: two separate networks, one for understanding and one for generation, connected by a single vector. Simple, powerful, and the foundation for modern sequence-to-sequence models.
