Transformer Variants: BERT, GPT, T5
The transformer block (attention + FFN + residual + layernorm) is a general-purpose module. How you stack it, what attention mask you use, and how you pre-train it determine what the resulting model is good at. Three paradigms emerged between 2018 and 2020, and between them they cover virtually all uses of transformers in NLP: encoder-only models for understanding tasks, decoder-only models for generation, and encoder-decoder models for conditional generation. Understanding which paradigm to reach for, and why, is foundational knowledge for anyone building with language models.
The Three Paradigms
| Paradigm | Attention | Pre-training Objective | Best-Fit Tasks | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional (full) | Masked LM + NSP | Classification, NER, QA, retrieval embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (masked) | Next-token prediction (CLM) | Text generation, chat, instruction following, coding | GPT-2/3/4, Llama 3, Claude, Gemini |
| Encoder-decoder | Bidir encoder + causal decoder | Span corruption (T5) / document corruption (BART) | Translation, summarisation, structured generation | T5, BART, Whisper, MarianMT |
BERT and the Encoder-Only Family
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) demonstrated that pre-training a transformer encoder on unlabelled text followed by fine-tuning on labelled downstream tasks outperformed task-specific architectures across 11 NLU benchmarks simultaneously. This "pre-train then fine-tune" paradigm transformed the field.
BERT uses full bidirectional self-attention: every token attends to every other token in the sequence. This means BERT produces rich contextualised representations at every position, incorporating both left and right context. The pre-training uses two objectives:
Masked language modelling (MLM). 15% of input tokens are randomly selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The model predicts the original tokens at the masked positions. This forces the model to build bidirectional context representations: it must use both left and right context to reconstruct a masked token.
Next sentence prediction (NSP). Given two sentence segments, predict whether the second follows the first in the original document. It was added to help with tasks requiring sentence-pair reasoning (NLI, QA). Later research (RoBERTa) showed NSP was not helpful and could even hurt performance; it was dropped in subsequent models.
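The 80/10/10 masking scheme is easy to get subtly wrong. A minimal sketch of MLM input preparation, illustrative only: it works on a toy token list rather than BERT's WordPiece IDs, and the function name is ours:

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", vocab=None, p=0.15):
    """BERT-style masking sketch: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted input, per-position labels; None = no loss)."""
    vocab = vocab or tokens  # toy fallback vocabulary for random swaps
    inp, labels = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if random.random() < p:
            labels[i] = tokens[i]  # loss is computed only here
            r = random.random()
            if r < 0.8:
                inp[i] = mask_token           # 80%: [MASK]
            elif r < 0.9:
                inp[i] = random.choice(vocab)  # 10%: random token
            # else 10%: leave the token unchanged
    return inp, labels
```

Note that the model must predict the original token even at the 10% of selected positions that were left unchanged, which is what discourages it from trusting the visible input blindly.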
BERT-base has 12 layers, d_model = 768, 12 heads, and 110M parameters. BERT-large has 24 layers, d_model = 1024, 16 heads, and 340M parameters. For downstream tasks, a classification head is added on top of the [CLS] token embedding and fine-tuned with task-specific labelled data.
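The classification head is a small amount of code. A minimal PyTorch sketch, where `encoder` stands in for any pre-trained BERT-style encoder returning per-token hidden states; the names and shapes are illustrative assumptions, not the Hugging Face API:

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    """Linear classification head on the [CLS] representation.
    `encoder` is assumed to map (input_ids, attention_mask) to
    hidden states of shape (batch, seq_len, d_model)."""
    def __init__(self, encoder, d_model=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask)  # (B, T, d_model)
        cls = hidden[:, 0]               # [CLS] is the first position
        return self.classifier(self.dropout(cls))  # (B, num_labels) logits
```

During fine-tuning the whole stack, encoder included, is typically updated with a small learning rate.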
BERT Variants
RoBERTa (Robustly Optimised BERT). Same architecture, better training: larger batches, longer training, more data (160GB vs 16GB of text), dynamic masking (a fresh mask each time a sequence is sampled rather than one fixed mask), and no NSP. It showed that the original BERT was significantly undertrained. RoBERTa dominated the GLUE and SQuAD leaderboards for over two years.
DeBERTa (Decoding-enhanced BERT with disentangled attention). Instead of a single vector encoding content plus position, each token is represented by two separate vectors, one for content and one for position. Each attention score is then computed from three terms: content-to-content, content-to-position, and position-to-content. DeBERTa achieved SOTA on GLUE and SuperGLUE, briefly surpassing the human baseline on SuperGLUE.
ALBERT (A Lite BERT). Parameter reduction via cross-layer parameter sharing (the same weights reused across all layers) and a factorised embedding parameterisation (vocabulary embedding size decoupled from hidden size). 18× fewer parameters than BERT-large but comparable performance. Suitable for memory-constrained deployments.
GPT and the Decoder-Only Family
GPT (Generative Pre-trained Transformer, Radford et al., OpenAI, 2018) established the decoder-only paradigm: train a transformer with causal (left-to-right) attention on next-token prediction, then fine-tune on downstream tasks. GPT showed that unsupervised pre-training on large text corpora could produce a generalist language representation that transfers to many NLP tasks.
The causal attention mask means each token can only attend to previous tokens. This is essential for autoregressive generation: the model predicts token t using only tokens 1 through t-1, so the same model can be used for both training (compute loss at every position in parallel) and inference (generate one token at a time autoregressively).
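The causal mask itself is just a lower-triangular boolean matrix over positions; a NumPy sketch:

```python
import numpy as np

def causal_mask(T):
    """Causal (autoregressive) attention mask for a sequence of length T.
    mask[t, s] is True iff position t may attend to position s (s <= t)."""
    return np.tril(np.ones((T, T), dtype=bool))

m = causal_mask(4)
# Row t of m lists what token t can see: token 0 sees only itself,
# token 3 sees positions 0..3. No row ever looks into the future,
# which is what lets training compute the loss at all T positions
# in one parallel pass.
```

In practice the mask is applied by adding negative infinity to the disallowed attention logits before the softmax, which zeroes their attention weights.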
| Model | Year | Parameters | Key Development |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Established pre-train + fine-tune paradigm for generative models |
| GPT-2 | 2019 | 1.5B | Zero-shot task performance; released in stages due to misuse concerns |
| GPT-3 | 2020 | 175B | In-context learning: few-shot examples in prompt without gradient updates |
| InstructGPT / GPT-3.5 | 2022 | 175B (RLHF) | RLHF alignment; instruction following; basis for ChatGPT |
| GPT-4 | 2023 | Undisclosed | Multimodal input; expert-level benchmark performance across domains |
The decoder-only paradigm has come to dominate general-purpose LLMs: GPT, Claude, Llama, Mistral, Gemini, DeepSeek, and Qwen are all decoder-only. The simplicity of the training objective (predict the next token) combined with scale produces models that generalise far beyond what their training distribution might suggest.
T5 and the Encoder-Decoder Family
T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) restated every NLP task as a text-to-text problem. Classification becomes outputting the class label as text; summarisation becomes outputting the summary; translation outputs the target sentence. A single model architecture and training procedure handles all of them.
T5 pre-trains with a span corruption objective: replace contiguous spans of tokens with sentinel tokens (<extra_id_0>, <extra_id_1>, ...) and train the model to reconstruct the masked spans. This is more efficient than BERT's token-level MLM because the decoder only needs to predict the masked spans rather than the full sequence.
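A minimal sketch of span corruption on a toy token list. The helper and its signature are illustrative; T5's actual implementation operates on subword IDs and samples span positions and lengths randomly:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption sketch. `spans` is a sorted list of
    non-overlapping (start, end) index ranges to mask out. Returns the
    corrupted encoder input and the decoder target, linked by
    <extra_id_N> sentinel tokens."""
    inp, tgt, prev = [], [], 0
    for n, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inp += tokens[prev:s] + [sentinel]  # span replaced by one sentinel
        tgt += [sentinel] + tokens[s:e]     # target = sentinel + span tokens
        prev = e
    inp += tokens[prev:]
    return inp, tgt

toks = "Thank you for inviting me to your party last week".split()
x, y = span_corrupt(toks, [(2, 4), (7, 8)])
# x: Thank you <extra_id_0> me to your <extra_id_1> last week
# y: <extra_id_0> for inviting <extra_id_1> party
```

Note that the decoder target contains only the sentinels and the dropped spans, which is why the objective is cheaper than regenerating the whole sequence.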
BART (Lewis et al., 2020) uses a similar encoder-decoder structure but pre-trains with more aggressive document corruption: sentence permutation, token masking, token deletion, text infilling, and document rotation. This diverse noise makes BART particularly strong for generation tasks where the output must maintain coherence across many sentences. BART fine-tuned on CNN/DailyMail set summarisation ROUGE records at the time.
Encoder vs Decoder for Embeddings
A practically important question: which architecture produces better sentence or document embeddings for retrieval?
Encoder models. Bidirectional attention means every token sees the full sequence before the final representation is computed. The [CLS] token embedding (or a mean pool over all tokens) has access to complete context. Models fine-tuned for dense retrieval (E5, BGE, GTE) are all encoder-only BERT variants. For semantic similarity and retrieval, encoder models outperform decoder models of similar size.
Decoder models. Decoder models can produce embeddings by taking the last token's hidden state or by mean-pooling over all positions. Causal attention means early tokens have not "seen" later tokens, which makes mean-pooling suboptimal: only the last position is conditioned on the full input. Instruction-tuned decoder models (E5-mistral-7b, LLM2Vec) can achieve strong retrieval via clever prompting, but require more compute than a small encoder model.
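The two pooling strategies can be sketched in NumPy; these helpers are illustrative, assuming per-token hidden states and a 0/1 padding mask:

```python
import numpy as np

def mean_pool(hidden, mask):
    """hidden: (T, d) token states; mask: (T,) with 1 for real tokens.
    Averages over non-padding positions only."""
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / m.sum()

def last_token_pool(hidden, mask):
    """Hidden state of the last real (non-padding) token: in a causal
    model, the only position that has attended to the whole input."""
    last = int(mask.nonzero()[0][-1])
    return hidden[last]
```

For encoder models either pooling works because every position already carries full context; for decoder models last-token pooling is the usual default for exactly the causality reason above.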
PrefixLM and UL2
Several models blend the encoder and decoder paradigms through hybrid training objectives. PrefixLM treats a prefix of the input as bidirectional (encoder-style) and the suffix as causal (decoder-style), by modifying the attention mask rather than the architecture. This allows the model to be both a strong encoder (for the prefix) and a generative decoder (for the continuation).
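Because PrefixLM changes only the attention mask, it can be sketched as a two-step NumPy construction: start from a causal mask and open up the prefix block:

```python
import numpy as np

def prefix_lm_mask(T, prefix_len):
    """PrefixLM attention mask for a length-T sequence whose first
    `prefix_len` positions form the bidirectional prefix.
    mask[t, s] is True iff position t may attend to position s."""
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal baseline
    mask[:prefix_len, :prefix_len] = True        # prefix attends bidirectionally
    return mask
```

Suffix positions keep the ordinary causal pattern (full prefix plus earlier suffix tokens), while prefix positions see the whole prefix but never the suffix, so generation of the continuation remains autoregressive.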
UL2 (Tay et al., 2022, Google) proposes a unified pretraining objective combining three modes: "R-Denoiser" (standard span corruption, like T5), "S-Denoiser" (prefix LM, generate continuations from a prefix), and "X-Denoiser" (extreme corruption, reconstruct heavily masked documents). By mixing these objectives, UL2 produces a model that is simultaneously strong at generation, classification, and conditional generation without any architectural changes β just different training signals. Flan-UL2 (fine-tuned on instruction data) is available as an open-weight model and performs competitively with GPT-3.5 on many benchmarks.
Checklist: Do You Understand This?
- Can you describe the attention mask pattern (bidirectional vs causal) for encoder-only and decoder-only models, and explain why the mask choice determines whether a model can generate text autoregressively?
- Can you describe BERT's masked language modelling objective and explain why it forces the model to build bidirectional context representations?
- Can you explain what RoBERTa changed versus BERT and which changes had the largest empirical impact?
- Can you describe the T5 "text-to-text" framing and give two examples of how different NLP tasks are reformulated as text-in, text-out problems?
- Can you explain why encoder-only models are generally preferred over decoder-only models for dense retrieval tasks at equivalent parameter count?
- If you needed to choose between a BERT-base encoder and a small decoder-only LLM for a document classification task with 1000 labelled examples, which would you choose and why?