
GPT Series — Architecture Evolution

The GPT (Generative Pre-trained Transformer) lineage from OpenAI is the most influential series of language models in AI history. Each generation introduced something genuinely new — not just more parameters, but different ideas about training paradigms, alignment, and capability. Understanding this lineage reveals why modern LLMs are built the way they are.

GPT-1 (2018) — Establishing the Paradigm

GPT-1 was modest by today's standards: 117 million parameters, 12 transformer decoder layers, trained on BooksCorpus (approximately 800M words from unpublished books). But its contribution was conceptual, not numerical.

The key insight was the unsupervised pretraining + supervised fine-tuning two-phase paradigm. Train a language model on large amounts of unlabeled text to develop a general language understanding, then fine-tune that same model on small labeled datasets for specific tasks. This was not obvious at the time — the dominant NLP approach was training task-specific architectures from scratch.

Architecture
  • 117M parameters
  • 12 transformer decoder layers
  • 768 hidden dimensions
  • 12 attention heads
  • Trained on BooksCorpus (800M words)
Key Innovation
  • Decoder-only transformer for language modeling
  • Unsupervised pretraining on raw text
  • Task-agnostic representations transfer to downstream tasks
  • Fine-tuning beats task-specific architectures
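A detail worth knowing: GPT-1's fine-tuning objective combined the task loss with an auxiliary language-modeling loss (the paper weighted it at 0.5), so phase 2 keeps optimizing the phase-1 objective. A toy numpy sketch of the two losses — all shapes and values below are made up for illustration:

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one softmax distribution against an index target."""
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    return -np.log(probs[target])

# Phase 1 (pretraining): average next-token loss over a toy sequence.
lm_logits = np.array([[2.0, 0.1, -1.0], [0.3, 1.5, 0.0]])  # (seq, vocab)
lm_targets = [0, 1]                                        # next-token ids
lm_loss = np.mean([cross_entropy(l, t) for l, t in zip(lm_logits, lm_targets)])

# Phase 2 (fine-tuning): task loss + 0.5 * auxiliary LM loss, per the paper.
clf_logits = np.array([1.2, -0.4])  # (num_classes,) for a toy classifier
ft_loss = cross_entropy(clf_logits, 0) + 0.5 * lm_loss
```

The same weights serve both phases; only the small classification head on top is new at fine-tuning time.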

GPT-2 (2019) — Scaling and Zero-Shot Transfer

GPT-2 scaled to 1.5 billion parameters across 48 layers, trained on WebText — a dataset of approximately 40GB of text scraped from web pages linked from Reddit posts with 3+ upvotes (about 8 million documents). The data quality filtering was a significant step: popular links correlate with readable, coherent text.

The main discovery was zero-shot task transfer: without any fine-tuning, GPT-2 could perform reading comprehension, translation, and question answering simply by framing the task correctly in the prompt. Language models at sufficient scale develop task-solving abilities implicitly from pretraining.
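"Framing the task in the prompt" is, mechanically, just string construction. A sketch with illustrative templates (not the exact formats used in the GPT-2 paper):

```python
# Zero-shot task framing: no weight updates, no examples — the task is
# specified purely by the prompt's surface form. Templates are illustrative.

def frame_translation(sentence: str) -> str:
    return f"English: {sentence}\nFrench:"

def frame_qa(passage: str, question: str) -> str:
    return f"{passage}\nQ: {question}\nA:"

prompt = frame_translation("The cat sat on the mat.")
```

At sufficient scale, the model completes such prompts with a plausible translation or answer because similar patterns appeared in its pretraining data.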

OpenAI made the unusual decision to delay releasing the full 1.5B model due to concerns about misuse for generating disinformation — the first high-profile instance of a staged, responsible release policy in AI. Smaller versions were released first; the full model followed about nine months later.

GPT-3 (2020) — In-Context Learning at Scale

GPT-3 was a step change: 175 billion parameters, 96 transformer layers, 96 attention heads, 12,288 hidden dimensions. It was trained on a mixture of Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia — approximately 300 billion tokens. Training used FP16 precision across thousands of V100 GPUs.

Model   Params  Layers  Heads  Hidden Dim
GPT-1   117M    12      12     768
GPT-2   1.5B    48      25     1,600
GPT-3   175B    96      96     12,288
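The table's figures are mutually consistent with a standard back-of-the-envelope count: each decoder layer holds roughly 12·d² weights (4d² for the Q/K/V/output attention projections, 8d² for the d → 4d → d FFN), plus a vocab × d embedding matrix. A quick check in Python, assuming the ~50K BPE vocabulary used by GPT-2/3:

```python
# Rough parameter count for a GPT-style decoder stack:
#   per layer: ~4*d^2 (attention projections) + ~8*d^2 (FFN) = 12*d^2
#   plus:      vocab_size * d token embeddings (tied with the output head)

def approx_params(n_layers: int, d_model: int, vocab: int = 50_257) -> int:
    return 12 * n_layers * d_model**2 + vocab * d_model

print(f"GPT-2: {approx_params(48, 1_600) / 1e9:.2f}B")   # ~1.5B
print(f"GPT-3: {approx_params(96, 12_288) / 1e9:.0f}B")  # ~175B
```

The estimate ignores LayerNorm parameters and biases, which are negligible at these scales.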

GPT-3's defining contribution was demonstrating in-context learning (ICL): give the model a few examples (few-shot prompting) inside the prompt itself and it adapts its behavior without any weight update. This was qualitatively different from what GPT-2 showed with zero-shot. The combination of scale + few-shot prompting made GPT-3 a general-purpose API, spawning the modern era of LLM product development. The OpenAI API (launched 2020) enabled developers to build on GPT-3 without ever touching model weights — the birth of the "LLM API economy."
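A few-shot prompt is nothing more than labeled examples concatenated ahead of the query; the template below is illustrative, not taken from the GPT-3 paper:

```python
# Few-shot (in-context) prompting: task examples live inside the prompt and
# the model infers the pattern with no weight update.

def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nOutput:"

prompt = few_shot_prompt(
    [("great movie!", "positive"), ("total waste of time", "negative")],
    "loved every minute",
)
```

The model's continuation after the final "Output:" is the prediction — the examples condition its behavior the way training data would, but only for the duration of that prompt.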

GPT-3 also established prompt engineering as a discipline. The format of the prompt — examples, instructions, role framing — dramatically changes model behavior. This was a new kind of programming.

InstructGPT (2022) — Aligning with Human Intent via RLHF

Raw GPT-3, despite its capabilities, was not reliably helpful or safe. It would follow the statistical pattern of its training data, which meant generating content that was unhelpful, toxic, or just wrong when users asked natural-language questions. InstructGPT was OpenAI's answer: align the model to follow instructions using Reinforcement Learning from Human Feedback (RLHF).

The SFT → RM → PPO Pipeline
Step 1 — Supervised Fine-Tuning (SFT)

Human labelers write ideal responses to prompts. GPT-3 is fine-tuned on these examples. This teaches the model the format of helpful responses.

Step 2 — Reward Model (RM)

Human labelers rank multiple model outputs from best to worst. A separate model is trained to predict these rankings — the reward model learns what humans prefer.
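The reward model is trained with a pairwise ranking loss: for each human comparison, it should assign the preferred response a higher scalar score. A toy numpy sketch (the reward values are made up):

```python
import numpy as np

# Pairwise reward-model loss from the InstructGPT recipe: given scalar
# rewards for a preferred ("chosen") and dispreferred ("rejected") response,
# minimize -log sigmoid(r_chosen - r_rejected).

def rm_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Loss shrinks as the reward model scores the preferred answer higher.
print(rm_pairwise_loss(2.0, 0.5))  # small: correct ordering
print(rm_pairwise_loss(0.5, 2.0))  # large: wrong ordering
```

Because rankings over k outputs yield k·(k-1)/2 pairs, each labeled prompt produces many training comparisons.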

Step 3 — PPO Reinforcement Learning

The SFT model is further optimized using PPO (Proximal Policy Optimization) to maximize the reward model's score, with a KL-divergence penalty to prevent it from drifting too far from the original SFT model.
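The quantity PPO actually optimizes combines the RM score with that KL penalty. A minimal sketch — the beta value and log-probabilities below are illustrative numbers, not InstructGPT's actual hyperparameters:

```python
# Reward shaped by a KL penalty, as in the InstructGPT objective:
#   R = r_RM - beta * (log pi_policy - log pi_sft)
# keeping the PPO policy close to the SFT model it started from.

def ppo_reward(rm_score: float, logp_policy: float, logp_sft: float,
               beta: float = 0.02) -> float:
    return rm_score - beta * (logp_policy - logp_sft)

print(ppo_reward(1.0, -1.0, -1.0))  # no drift: full RM score, 1.0
print(ppo_reward(1.0, -0.5, -3.0))  # large drift: penalized below 1.0
```

Without the penalty, the policy can "reward hack" — find degenerate outputs the reward model scores highly but humans would not.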

The striking result: a 1.3B InstructGPT model was preferred over 175B raw GPT-3 by human evaluators. Alignment, not just scale, determines real-world usefulness. This finding reshaped the entire field — every major frontier model since has used some form of RLHF or preference optimization (DPO, GRPO, etc.).

GPT-4 (2023) — Undisclosed Architecture, Multimodal, Frontier Reasoning

OpenAI chose not to publish GPT-4's architecture, parameter count, or training details — a deliberate decision motivated by competitive and safety concerns, and marking a shift away from the open publication norm that characterized GPT-1 through GPT-3.

What is known: GPT-4 introduced image input (vision understanding); context windows of 8K and 32K tokens at launch, later extended to 128K with GPT-4 Turbo (far beyond GPT-3's 2K); substantially better reasoning on professional and academic benchmarks (passing the bar exam and medical licensing exams); and it formalized the system prompt as a first-class concept for controlling model behavior.

Credible reporting suggests GPT-4 uses a Mixture of Experts architecture — multiple specialized sub-models, with a routing mechanism selecting which experts process each token. This would explain how OpenAI achieves GPT-4's capability at economically viable inference costs.

Architectural Choices That Evolved Across the Series

Positional Embeddings

GPT-1/2/3 used learned absolute positional embeddings — a lookup table of position vectors added to token embeddings. Later models (and GPT-4 very likely) moved to RoPE (Rotary Position Embeddings), which encodes position by rotating query/key vectors, extending better to long contexts.
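A minimal 2-D sketch of the rotation (real models rotate many dimension pairs, each at a different frequency). The property to notice: the rotated dot product depends only on the relative offset between positions, not their absolute values:

```python
import numpy as np

# 2-D RoPE sketch: encode position by rotating query/key vectors.
# Because rotation matrices compose, rotate(q, n) . rotate(k, m) depends
# only on n - m — the relative position.

def rotate(vec: np.ndarray, pos: int, theta: float = 0.1) -> np.ndarray:
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

q = np.array([1.0, 0.5])
k = np.array([0.3, 2.0])

# Same relative offset (3) at different absolute positions -> same score.
s1 = rotate(q, 10) @ rotate(k, 7)
s2 = rotate(q, 100) @ rotate(k, 97)
assert np.isclose(s1, s2)
```

A learned absolute table has no entry for positions beyond training length; a rotation is defined for any position, which is why RoPE-style schemes extend more gracefully.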

Layer Normalization Placement

Original transformers used post-LayerNorm (normalize after attention + FFN). GPT-2 and beyond adopted pre-LayerNorm (normalize before attention + FFN, then add residual). Pre-LN is more stable at scale — gradients flow more reliably through deep networks.
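The difference is simply where the normalization sits relative to the residual connection. A toy numpy sketch with a generic stand-in sublayer (LayerNorm simplified — no learned scale or bias):

```python
import numpy as np

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-5)

def post_ln_block(x, sublayer):
    # Original transformer: residual add, THEN normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # GPT-2 onward: normalize the sublayer's input; the residual path
    # stays an identity, so gradients flow unimpeded through deep stacks.
    return x + sublayer(layer_norm(x))

x = np.array([1.0, 2.0, 3.0, 4.0])
f = lambda h: 0.5 * h  # toy stand-in for attention or FFN
out_post = post_ln_block(x, f)
out_pre = pre_ln_block(x, f)
```

Note that the post-LN output is always re-normalized, while the pre-LN output preserves the running residual stream — the structural reason pre-LN trains more stably at depth.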

Context Window Expansion

GPT-1: 512 tokens. GPT-2: 1,024 tokens. GPT-3: 2,048 tokens. GPT-3.5-turbo: 4K → 16K. GPT-4: 8K → 128K. Window expansion requires positional embedding solutions that generalize — a key reason absolute learned positions were abandoned.

Training Paradigm

GPT-1: pretraining + task-specific fine-tuning. GPT-3: pretraining + in-context learning (no fine-tuning). InstructGPT: pretraining + SFT + RLHF. GPT-4: adds multimodal pretraining, RLHF at scale, system prompt training.

Checklist: Do You Understand This?

  • What was the two-phase training paradigm GPT-1 established, and why was it novel?
  • What is in-context learning (ICL), and which GPT model demonstrated it convincingly at scale?
  • Explain the three steps of the InstructGPT RLHF pipeline: SFT, reward model, PPO.
  • Why did a 1.3B InstructGPT model outperform 175B raw GPT-3 in human evaluations?
  • What is the difference between pre-LayerNorm and post-LayerNorm, and which is preferred in large models?
  • Why did absolute positional embeddings get replaced by RoPE in later models?
  • What does GPT-4 add that GPT-3 did not have, and what is deliberately undisclosed about its architecture?