Encoder-Decoder & Seq2Seq
Sequence-to-sequence (seq2seq) tasks require mapping a variable-length input to a variable-length output — machine translation, summarisation, speech recognition, and code generation from natural language specifications all fit this pattern. Feedforward networks cannot handle variable-length inputs and outputs without rigid padding assumptions. The encoder-decoder architecture, introduced in 2014, solved this elegantly and became the direct precursor to the transformer.
Seq2Seq: The Original Architecture
Sutskever, Vinyals, and Le (2014) introduced seq2seq with a simple decomposition: use one RNN (the encoder) to read the entire input sequence and compress it into a single fixed-size vector, then use a second RNN (the decoder) to generate the output sequence from that vector, one token at a time.
Classic seq2seq: encoder compresses input to context vector c; decoder generates output from c
The encoder processes the source tokens one by one, updating its hidden state at each step. The final hidden state h_n, the context vector c, summarises the entire input. The decoder is initialised with c as its starting hidden state and generates the output sequence autoregressively: at each step it reads the previously generated token, updates its hidden state, and samples the next output token.
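The mechanics above can be sketched in a few lines of NumPy. The tanh RNN cells, dimensions, and greedy decoding here are illustrative choices, not the configuration of the original paper (which used deep LSTMs):

```python
# Minimal seq2seq sketch: one RNN encodes, a second RNN decodes from c.
# All names, sizes, and weights are illustrative (random, untrained).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 8           # embedding / hidden / vocab sizes

# Encoder parameters
W_xh = rng.normal(0, 0.1, (d_h, d_in))
W_hh = rng.normal(0, 0.1, (d_h, d_h))
# Decoder parameters
U_xh = rng.normal(0, 0.1, (d_h, d_in))
U_hh = rng.normal(0, 0.1, (d_h, d_h))
W_out = rng.normal(0, 0.1, (d_out, d_h))
embed = rng.normal(0, 0.1, (d_out, d_in))  # shared token embeddings

def encode(src_tokens):
    """Fold the whole source into one fixed-size context vector c."""
    h = np.zeros(d_h)
    for tok in src_tokens:
        h = np.tanh(W_xh @ embed[tok] + W_hh @ h)
    return h                          # c = final hidden state

def decode(c, bos=0, max_len=5):
    """Generate greedily, conditioned on the input only through c."""
    h, tok, out = c, bos, []
    for _ in range(max_len):
        h = np.tanh(U_xh @ embed[tok] + U_hh @ h)
        tok = int(np.argmax(W_out @ h))   # greedy next-token choice
        out.append(tok)
    return out

c = encode([3, 1, 4, 1, 5])
print(decode(c))                      # five greedy tokens (untrained weights)
```

Note that `decode` never sees the source tokens again: everything it knows about the input must fit inside `c`, which is exactly the bottleneck discussed below.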
This architecture was trained end-to-end with a single objective — maximum likelihood of the correct output sequence given the input — and achieved state-of-the-art results on English-to-French translation at the time. The elegance of the approach: any RNN could serve as encoder or decoder, and the same training procedure applied regardless of the specific languages or task.
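The objective is ordinary teacher-forced maximum likelihood: sum the negative log-probability of each gold target token given the gold prefix and the input. A minimal sketch, with random placeholder logits standing in for real decoder outputs:

```python
# Teacher-forcing maximum-likelihood loss, sketched with NumPy.
# loss = -sum_t log p(y_t | y_<t, x); logits here are random placeholders.
import numpy as np

rng = np.random.default_rng(4)
vocab, T = 10, 4
target = [3, 7, 1, 9]                 # gold output tokens y_1..y_T
logits = rng.normal(size=(T, vocab))  # one decoder output per target step

# log-softmax over the vocabulary at each step
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
# teacher forcing: score each gold token against its step's distribution
nll = -sum(log_probs[t, target[t]] for t in range(T))
print(float(nll))                     # total negative log-likelihood
```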
The Fixed-Size Bottleneck
The context vector c must carry all information about the entire source sequence — its vocabulary, syntax, semantics, and pragmatics — compressed into a single vector of fixed dimensionality (typically 512 or 1024 floats). This bottleneck is the architecture's fundamental limitation, with several practical consequences:
- Long sentences: encoder must overwrite early information to process later tokens
- BLEU score degrades sharply for source sentences longer than 30 words
- Document-level translation is essentially impossible with a single vector
- Decoder has no way to know which part of the source to focus on when generating each word
The bottleneck matters less in settings where a global summary suffices; plain seq2seq remained adequate for:
- Short sentence translation (under 20 tokens)
- Fixed-length sequence classification
- Tasks where global summary is sufficient and local alignment is not needed
- Transfer learning: encoder output as a feature extractor for downstream classifiers
Attention in Seq2Seq
Bahdanau, Cho, and Bengio (2015) identified the bottleneck and introduced the attention mechanism as the solution. Rather than forcing the decoder to use a single context vector, their architecture gives the decoder access to all encoder hidden states at every generation step — it computes a different context vector for each decoder step by attending selectively to the encoder outputs.
The attention weight α_it represents how much the decoder should look at encoder position i when generating decoder position t. For English-to-French translation, when generating a French adjective, the model learns to attend to the corresponding English adjective and its noun — aligning the structures of both languages. These alignment matrices, when visualised, show near-diagonal patterns for similar word-order languages and cross-diagonal patterns for language pairs with different word order.
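The additive (Bahdanau-style) scoring step can be sketched as follows; the shapes and randomly initialised weights are illustrative:

```python
# Bahdanau-style additive attention: one context vector per decoder step.
import numpy as np

rng = np.random.default_rng(1)
n_src, d_h = 6, 16
enc_states = rng.normal(size=(n_src, d_h))  # h_1 ... h_n, all retained
s_t = rng.normal(size=d_h)                  # decoder state at step t

W_s = rng.normal(0, 0.1, (d_h, d_h))
W_h = rng.normal(0, 0.1, (d_h, d_h))
v = rng.normal(0, 0.1, d_h)

# e_ti = v . tanh(W_s s_t + W_h h_i): one alignment score per source position
scores = np.tanh(s_t @ W_s.T + enc_states @ W_h.T) @ v
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                        # attention weights, sum to 1

c_t = alpha @ enc_states                    # fresh context vector for step t
print(alpha.round(3))
```

Unlike the original seq2seq, `c_t` is recomputed at every decoder step, so no single vector has to hold the whole source.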
Critically, this attention mechanism is the direct conceptual predecessor to the transformer's self-attention. The "Attention Is All You Need" paper (Vaswani et al., 2017) removed the recurrent connections entirely, computing the alignment scores in parallel across all positions using matrix operations rather than step-by-step RNN updates.
Encoder-Decoder in Transformers
The original transformer (Vaswani et al., 2017) was an encoder-decoder model designed for machine translation. It retains the encoder/decoder split but replaces all recurrence with attention and feedforward layers.
| Component | Attention Type | Behaviour |
|---|---|---|
| Encoder | Bidirectional self-attention | Every source token attends to every other source token; full context at each position |
| Decoder (self) | Causal self-attention | Each target position attends only to previous target positions (masked to prevent cheating) |
| Decoder (cross) | Cross-attention | Target positions attend to all encoder output positions; Q from decoder, K/V from encoder |
The encoder stack processes the full source sequence in parallel, producing a rich contextualised representation at each position. The decoder generates the target sequence autoregressively: at each step it uses causal self-attention over its own previously generated tokens, then uses cross-attention to query the encoder's representations, then applies an FFN. This is the direct generalisation of Bahdanau attention, now fully parallelised during training.
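All three attention patterns in the table can be expressed with one scaled dot-product function; the per-head linear Q/K/V projections are omitted here for brevity:

```python
# The three attention operations of a transformer encoder-decoder,
# sketched with a single scaled dot-product function (projections omitted).
import numpy as np

def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d)) V, with optional boolean mask (True = keep)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
n_src, n_tgt, d = 5, 4, 8
src = rng.normal(size=(n_src, d))   # source token representations
tgt = rng.normal(size=(n_tgt, d))   # decoder hidden states

# 1. Encoder self-attention: bidirectional, no mask; Q, K, V all from source.
enc = attention(src, src, src)
# 2. Decoder self-attention: causal mask; position t sees only positions <= t.
causal = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
dec_self = attention(tgt, tgt, tgt, mask=causal)
# 3. Cross-attention: Q from the decoder, K and V from the encoder output.
cross = attention(tgt, enc, enc)
print(enc.shape, dec_self.shape, cross.shape)
```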
Cross-Attention in Depth
Cross-attention is the mechanism by which the decoder reads the encoder's output. It is structurally identical to self-attention except that queries and key-value pairs come from different sequences.
Because K and V are computed from the encoder output once (not re-computed at each decoder step during inference), cross-attention is computationally efficient: the encoder runs once, and its K and V projections are cached for the entire generation. Each decoder step only requires computing Q for the new token and attending to the cached encoder K/V.
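A minimal sketch of this caching pattern (single head; the projection matrices and shapes are illustrative):

```python
# Cross-attention K/V caching: the encoder's keys and values are computed
# once, then reused unchanged at every decoder step during generation.
import numpy as np

rng = np.random.default_rng(3)
d = 8
W_q = rng.normal(0, 0.1, (d, d))
W_k = rng.normal(0, 0.1, (d, d))
W_v = rng.normal(0, 0.1, (d, d))

enc_out = rng.normal(size=(6, d))      # encoder runs once per input

# K and V depend only on the encoder output, so compute them once and cache.
K_cache = enc_out @ W_k.T
V_cache = enc_out @ W_v.T

def cross_attend(dec_state):
    """Per decoder step: only the new token's query is computed."""
    q = dec_state @ W_q.T
    scores = q @ K_cache.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache

# Each generation step reuses the cached K/V; only q is fresh.
for _ in range(3):
    ctx = cross_attend(rng.normal(size=d))
print(ctx.shape)
```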
Three Paradigms: Encoder-Only, Decoder-Only, Encoder-Decoder
Encoder-only: bidirectional attention over the full input. Produces rich contextualised embeddings at every position. No autoregressive generation — requires a task head for classification, span extraction, or other NLU tasks. Pre-trained with masked language modelling.
Examples: BERT, RoBERTa, DeBERTa, E5, BGE
Decoder-only: causal (unidirectional) attention. Generates text autoregressively token-by-token. Excels at open-ended generation, instruction following, and in-context learning. No explicit encoder — the "input" is simply prepended to the output as a prompt.
Examples: GPT-4, Claude, Llama 3, Gemini, Mistral
Encoder-decoder: bidirectional encoder + causal decoder with cross-attention. Natural for conditional generation: translation, summarisation, code generation from specs. The encoder processes the input fully; the decoder generates output conditioned on the encoder state.
Examples: T5, BART, MarianMT, Whisper, mT5
T5 and BART
T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) took the encoder-decoder architecture to its logical extreme: every NLP task is reformulated as text-in, text-out. Summarisation, translation, classification, question answering — all framed identically. The model is pre-trained on a span corruption objective (masking spans of text and asking the model to fill them in) on the C4 dataset (750 GB of cleaned Common Crawl text). The 11B parameter variant achieved state-of-the-art across multiple benchmarks. T5 demonstrated that framing breadth — treating every task as the same sequence-to-sequence problem — is as important as architectural novelty.
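In practice the framing is just string formatting: every task becomes an (input string, target string) pair. The translation and summarisation prefixes below follow the conventions in the T5 paper; the NLI pair is invented here for illustration:

```python
# T5's text-to-text framing: every task is the same seq2seq problem,
# distinguished only by a textual task prefix.
examples = [
    # translation
    ("translate English to German: That is good.", "Das ist gut."),
    # NLI-style classification: the label itself is emitted as text
    ("mnli premise: The cat sat. hypothesis: An animal sat.", "entailment"),
    # summarisation (placeholders stand in for real article/summary text)
    ("summarize: <article text>", "<summary text>"),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Because even classification labels are generated as text, a single model with a single decoding procedure covers every task.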
BART (Lewis et al., 2020) uses a different pretraining strategy specifically designed for generation tasks: it corrupts documents with a variety of noise functions (token masking, sentence permutation, document rotation, text infilling) and trains the encoder-decoder to reconstruct the original. This pretraining is particularly effective for summarisation and text generation tasks where the output must maintain coherence over longer spans. BART fine-tuned on CNN/DailyMail achieved the best single-model summarisation ROUGE scores at the time of publication.
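Two of these noise functions are easy to sketch. The infilling below is simplified: BART replaces spans whose lengths are sampled from a Poisson distribution with a single mask token, whereas this sketch takes the span as an argument:

```python
# Simplified sketches of two BART corruption functions:
# sentence permutation and text infilling.
import random

random.seed(0)

def permute_sentences(doc):
    """Shuffle the document's sentences (split naively on '. ')."""
    sents = doc.split(". ")
    random.shuffle(sents)
    return ". ".join(sents)

def infill(tokens, start, length, mask="<mask>"):
    # Replace a whole span with a single mask token; the model must
    # reconstruct both the span's content and its length.
    return tokens[:start] + [mask] + tokens[start + length:]

print(permute_sentences("A first. B second. C third"))
print(infill("the cat sat on the mat".split(), 2, 3))
# -> ['the', 'cat', '<mask>', 'mat']
```

The encoder-decoder is then trained to map each corrupted document back to the original.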
Checklist: Do You Understand This?
- Can you draw the original seq2seq architecture (Sutskever et al., 2014) and explain what the context vector c contains and why it is a bottleneck?
- Can you explain how Bahdanau attention solves the bottleneck — specifically, what the attention weights α_it represent and how the per-step context vector c_t differs from the original seq2seq's single fixed c?
- Can you describe the three attention operations in a transformer encoder-decoder block (encoder self-attention, decoder self-attention, cross-attention) and state where Q, K, and V come from in each?
- Can you explain why cross-attention K and V can be cached during inference while Q cannot?
- Can you state which architecture (encoder-only, decoder-only, or encoder-decoder) is most natural for each of: sentence classification, open-ended text generation, and machine translation?
- Can you describe what T5's "text-to-text" framing means in practice — give two examples of how different tasks are reformulated?