Encoder-Decoder & Seq2Seq
Sequence-to-sequence (seq2seq) tasks require mapping a variable-length input to a variable-length output — machine translation, summarisation, speech recognition, and code generation from natural language specifications all fit this pattern. Feedforward networks cannot handle variable-length inputs and outputs without rigid padding assumptions. The encoder-decoder architecture, introduced in 2014, solved this elegantly and became the direct precursor to the transformer.
Seq2Seq: The Original Architecture
Sutskever, Vinyals, and Le (2014) introduced seq2seq with a simple decomposition: use one RNN (the encoder) to read the entire input sequence and compress it into a single fixed-size vector, then use a second RNN (the decoder) to generate the output sequence from that vector, one token at a time.
Classic seq2seq: encoder compresses input to context vector c; decoder generates output from c
The encoder processes the source tokens one by one, updating its hidden state at each step. The final hidden state h_n, the context vector c, summarises the entire input. The decoder is initialised with c as its starting hidden state and generates the output sequence autoregressively: at each step it reads the previously generated token, updates its hidden state, and samples the next output token.
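The mechanics above can be sketched in a few lines of NumPy. The tanh RNN cells, dimensions, and greedy decoding here are illustrative choices, not the configuration of the original paper (which used deep LSTMs):

```python
# Minimal seq2seq sketch: one RNN encodes, a second RNN decodes from c.
# All names, sizes, and weights are illustrative (random, untrained).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 8           # embedding / hidden / vocab sizes

# Encoder parameters
W_xh = rng.normal(0, 0.1, (d_h, d_in))
W_hh = rng.normal(0, 0.1, (d_h, d_h))
# Decoder parameters
U_xh = rng.normal(0, 0.1, (d_h, d_in))
U_hh = rng.normal(0, 0.1, (d_h, d_h))
W_out = rng.normal(0, 0.1, (d_out, d_h))
embed = rng.normal(0, 0.1, (d_out, d_in))  # shared token embeddings

def encode(src_tokens):
    """Fold the whole source into one fixed-size context vector c."""
    h = np.zeros(d_h)
    for tok in src_tokens:
        h = np.tanh(W_xh @ embed[tok] + W_hh @ h)
    return h                          # c = final hidden state

def decode(c, bos=0, max_len=5):
    """Generate greedily, conditioned on the input only through c."""
    h, tok, out = c, bos, []
    for _ in range(max_len):
        h = np.tanh(U_xh @ embed[tok] + U_hh @ h)
        tok = int(np.argmax(W_out @ h))   # greedy next-token choice
        out.append(tok)
    return out

c = encode([3, 1, 4, 1, 5])
print(decode(c))                      # five greedy tokens (untrained weights)
```

Note that `decode` never sees the source tokens again: everything it knows about the input must fit inside `c`, which is exactly the bottleneck discussed below.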
This architecture was trained end-to-end with a single objective — maximum likelihood of the correct output sequence given the input — and achieved state-of-the-art results on English-to-French translation at the time. The elegance of the approach: any RNN could serve as encoder or decoder, and the same training procedure applied regardless of the specific languages or task.
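The objective is ordinary teacher-forced maximum likelihood: sum the negative log-probability of each gold target token given the gold prefix and the input. A minimal sketch, with random placeholder logits standing in for real decoder outputs:

```python
# Teacher-forcing maximum-likelihood loss, sketched with NumPy.
# loss = -sum_t log p(y_t | y_<t, x); logits here are random placeholders.
import numpy as np

rng = np.random.default_rng(4)
vocab, T = 10, 4
target = [3, 7, 1, 9]                 # gold output tokens y_1..y_T
logits = rng.normal(size=(T, vocab))  # one decoder output per target step

# log-softmax over the vocabulary at each step
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
# teacher forcing: score each gold token against its step's distribution
nll = -sum(log_probs[t, target[t]] for t in range(T))
print(float(nll))                     # total negative log-likelihood
```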
The Fixed-Size Bottleneck
The context vector c must carry all information about the entire source sequence — its vocabulary, syntax, semantics, and pragmatics — compressed into a single vector of fixed dimensionality (typically 512 or 1024 floats). This bottleneck is the architecture's fundamental limitation, with several practical consequences:
- Long sentences: encoder must overwrite early information to process later tokens
- BLEU score degrades sharply for source sentences longer than 30 words
- Document-level translation is essentially impossible with a single vector
- Decoder has no way to know which part of the source to focus on when generating each word
The bottleneck matters less in settings where a global summary suffices; plain seq2seq remained adequate for:
- Short sentence translation (under 20 tokens)
- Fixed-length sequence classification
- Tasks where global summary is sufficient and local alignment is not needed
- Transfer learning: encoder output as a feature extractor for downstream classifiers
Attention in Seq2Seq
Bahdanau, Cho, and Bengio (2015) identified the bottleneck and introduced the attention mechanism as the solution. Rather than forcing the decoder to use a single context vector, their architecture gives the decoder access to all encoder hidden states at every generation step — it computes a different context vector for each decoder step by attending selectively to the encoder outputs.
The attention weight α_it represents how much the decoder should look at encoder position i when generating decoder position t. For English-to-French translation, when generating a French adjective, the model learns to attend to the corresponding English adjective and its noun — aligning the structures of both languages. These alignment matrices, when visualised, show near-diagonal patterns for similar word-order languages and cross-diagonal patterns for language pairs with different word order.
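The additive (Bahdanau-style) scoring step can be sketched as follows; the shapes and randomly initialised weights are illustrative:

```python
# Bahdanau-style additive attention: one context vector per decoder step.
import numpy as np

rng = np.random.default_rng(1)
n_src, d_h = 6, 16
enc_states = rng.normal(size=(n_src, d_h))  # h_1 ... h_n, all retained
s_t = rng.normal(size=d_h)                  # decoder state at step t

W_s = rng.normal(0, 0.1, (d_h, d_h))
W_h = rng.normal(0, 0.1, (d_h, d_h))
v = rng.normal(0, 0.1, d_h)

# e_ti = v . tanh(W_s s_t + W_h h_i): one alignment score per source position
scores = np.tanh(s_t @ W_s.T + enc_states @ W_h.T) @ v
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                        # attention weights, sum to 1

c_t = alpha @ enc_states                    # fresh context vector for step t
print(alpha.round(3))
```

Unlike the original seq2seq, `c_t` is recomputed at every decoder step, so no single vector has to hold the whole source.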
Critically, this attention mechanism is the direct conceptual predecessor to the transformer's self-attention. The "Attention Is All You Need" paper (Vaswani et al., 2017) removed the recurrent connections entirely, computing the alignment scores in parallel across all positions using matrix operations rather than step-by-step RNN updates.
Encoder-Decoder in Transformers
The original transformer (Vaswani et al., 2017) was an encoder-decoder model designed for machine translation. It retains the encoder/decoder split but replaces all recurrence with attention and feedforward layers.
| Component | Attention Type | Behaviour |
|---|---|---|
| Encoder | Bidirectional self-attention | Every source token attends to every other source token; full context at each position |
| Decoder (self) | Causal self-attention | Each target position attends only to previous target positions (masked to prevent cheating) |
| Decoder (cross) | Cross-attention | Target positions attend to all encoder output positions; Q from decoder, K/V from encoder |
The encoder stack processes the full source sequence in parallel, producing a rich contextualised representation at each position. The decoder generates the target sequence autoregressively: at each step it uses causal self-attention over its own previously generated tokens, then uses cross-attention to query the encoder's representations, then applies an FFN. This is the direct generalisation of Bahdanau attention, now fully parallelised during training.
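All three attention patterns in the table can be expressed with one scaled dot-product function; the per-head linear Q/K/V projections are omitted here for brevity:

```python
# The three attention operations of a transformer encoder-decoder,
# sketched with a single scaled dot-product function (projections omitted).
import numpy as np

def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d)) V, with optional boolean mask (True = keep)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
n_src, n_tgt, d = 5, 4, 8
src = rng.normal(size=(n_src, d))   # source token representations
tgt = rng.normal(size=(n_tgt, d))   # decoder hidden states

# 1. Encoder self-attention: bidirectional, no mask; Q, K, V all from source.
enc = attention(src, src, src)
# 2. Decoder self-attention: causal mask; position t sees only positions <= t.
causal = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
dec_self = attention(tgt, tgt, tgt, mask=causal)
# 3. Cross-attention: Q from the decoder, K and V from the encoder output.
cross = attention(tgt, enc, enc)
print(enc.shape, dec_self.shape, cross.shape)
```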
Cross-Attention in Depth
Cross-attention is the mechanism by which the decoder reads the encoder's output. It is structurally identical to self-attention except that queries and key-value pairs come from different sequences.
Because K and V are computed from the encoder output once (not re-computed at each decoder step during inference), cross-attention is computationally efficient: the encoder runs once, and its K and V projections are cached for the entire generation. Each decoder step only requires computing Q for the new token and attending to the cached encoder K/V.
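A minimal sketch of this caching pattern (single head; the projection matrices and shapes are illustrative):

```python
# Cross-attention K/V caching: the encoder's keys and values are computed
# once, then reused unchanged at every decoder step during generation.
import numpy as np

rng = np.random.default_rng(3)
d = 8
W_q = rng.normal(0, 0.1, (d, d))
W_k = rng.normal(0, 0.1, (d, d))
W_v = rng.normal(0, 0.1, (d, d))

enc_out = rng.normal(size=(6, d))      # encoder runs once per input

# K and V depend only on the encoder output, so compute them once and cache.
K_cache = enc_out @ W_k.T
V_cache = enc_out @ W_v.T

def cross_attend(dec_state):
    """Per decoder step: only the new token's query is computed."""
    q = dec_state @ W_q.T
    scores = q @ K_cache.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache

# Each generation step reuses the cached K/V; only q is fresh.
for _ in range(3):
    ctx = cross_attend(rng.normal(size=d))
print(ctx.shape)
```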
Three Paradigms: Encoder-Only, Decoder-Only, Encoder-Decoder
Encoder-only: bidirectional attention over the full input. Produces rich contextualised embeddings at every position. No autoregressive generation — requires a task head for classification, span extraction, or other NLU tasks. Pre-trained with masked language modelling.
Examples: BERT, RoBERTa, DeBERTa, E5, BGE
Decoder-only: causal (unidirectional) attention. Generates text autoregressively token-by-token. Excels at open-ended generation, instruction following, and in-context learning. No explicit encoder — the "input" is simply prepended to the output as a prompt.
Examples: GPT-4, Claude, Llama 3, Gemini, Mistral
Encoder-decoder: bidirectional encoder + causal decoder with cross-attention. Natural for conditional generation: translation, summarisation, code generation from specs. The encoder processes the input fully; the decoder generates output conditioned on the encoder state.
Examples: T5, BART, MarianMT, Whisper, mT5
T5 and BART
T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) took the encoder-decoder architecture to its logical extreme: every NLP task is reformulated as text-in, text-out. Summarisation, translation, classification, question answering — all framed identically. The model is pre-trained on a span corruption objective (masking spans of text and asking the model to fill them in) on the C4 dataset (750 GB of cleaned Common Crawl text). The 11B parameter variant achieved state-of-the-art across multiple benchmarks. T5 demonstrated that framing breadth — treating every task as the same sequence-to-sequence problem — is as important as architectural novelty.
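In practice the framing is just string formatting: every task becomes an (input string, target string) pair. The translation and summarisation prefixes below follow the conventions in the T5 paper; the NLI pair is invented here for illustration:

```python
# T5's text-to-text framing: every task is the same seq2seq problem,
# distinguished only by a textual task prefix.
examples = [
    # translation
    ("translate English to German: That is good.", "Das ist gut."),
    # NLI-style classification: the label itself is emitted as text
    ("mnli premise: The cat sat. hypothesis: An animal sat.", "entailment"),
    # summarisation (placeholders stand in for real article/summary text)
    ("summarize: <article text>", "<summary text>"),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Because even classification labels are generated as text, a single model with a single decoding procedure covers every task.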
BART (Lewis et al., 2020) uses a different pretraining strategy specifically designed for generation tasks: it corrupts documents with a variety of noise functions (token masking, sentence permutation, document rotation, text infilling) and trains the encoder-decoder to reconstruct the original. This pretraining is particularly effective for summarisation and text generation tasks where the output must maintain coherence over longer spans. BART fine-tuned on CNN/DailyMail achieved the best single-model summarisation ROUGE scores at the time of publication.
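Two of these noise functions are easy to sketch. The infilling below is simplified: BART replaces spans whose lengths are sampled from a Poisson distribution with a single mask token, whereas this sketch takes the span as an argument:

```python
# Simplified sketches of two BART corruption functions:
# sentence permutation and text infilling.
import random

random.seed(0)

def permute_sentences(doc):
    """Shuffle the document's sentences (split naively on '. ')."""
    sents = doc.split(". ")
    random.shuffle(sents)
    return ". ".join(sents)

def infill(tokens, start, length, mask="<mask>"):
    # Replace a whole span with a single mask token; the model must
    # reconstruct both the span's content and its length.
    return tokens[:start] + [mask] + tokens[start + length:]

print(permute_sentences("A first. B second. C third"))
print(infill("the cat sat on the mat".split(), 2, 3))
# -> ['the', 'cat', '<mask>', 'mat']
```

The encoder-decoder is then trained to map each corrupted document back to the original.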
Checklist: Do You Understand This?
- Can you draw the original seq2seq architecture (Sutskever et al., 2014) and explain what the context vector c contains and why it is a bottleneck?
- Can you explain how Bahdanau attention solves the bottleneck — specifically, what the attention weights α_it represent and how the per-step context vector c_t differs from the original seq2seq's single fixed c?
- Can you describe the three attention operations in a transformer encoder-decoder block (encoder self-attention, decoder self-attention, cross-attention) and state where Q, K, and V come from in each?
- Can you explain why cross-attention K and V can be cached during inference while Q cannot?
- Can you state which architecture (encoder-only, decoder-only, or encoder-decoder) is most natural for each of: sentence classification, open-ended text generation, and machine translation?
- Can you describe what T5's "text-to-text" framing means in practice — give two examples of how different tasks are reformulated?