RLHF: Mechanics & Pipeline
Reinforcement Learning from Human Feedback (RLHF) is the technique that turned raw pre-trained language models into assistant-like systems that follow instructions, decline harmful requests, and give coherent, helpful answers. It is not a single algorithm but a three-stage pipeline: a base model is first taught to follow instructions via supervised learning, then a reward model is trained on human comparisons, and finally the language model is fine-tuned with reinforcement learning to maximize that learned reward. InstructGPT (Ouyang et al., 2022) established this pipeline; it is the direct ancestor of ChatGPT, Claude, and virtually every aligned foundation model since.
Three-Stage Pipeline Overview
The RLHF pipeline: each stage consumes the output of the previous one
Each stage has its own dataset, objective, and failure mode. Weakness at any stage propagates forward: a poorly curated SFT dataset undermines the reward model; a noisy reward model sabotages the RL phase.
Stage 1: Supervised Fine-Tuning (SFT)
The base model, a pre-trained LLM that has learned language structure but not instruction-following, is fine-tuned on a curated dataset of (prompt, ideal response) pairs. Human contractors write or select responses demonstrating the desired behavior: following instructions, being factually accurate, declining inappropriate requests politely, and formatting answers clearly.
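The SFT objective itself is ordinary next-token cross-entropy over the demonstration data; a common convention (assumed here, not stated in the source) is to mask out the prompt tokens so that only the response tokens carry loss. A minimal pure-Python sketch, with `sft_loss` as a hypothetical helper name:

```python
def sft_loss(token_logps, prompt_len):
    """Supervised fine-tuning loss for one (prompt, response) pair.

    token_logps: per-token log-probabilities the model assigns to the
        target sequence (prompt tokens followed by response tokens).
    prompt_len: number of prompt tokens. These are masked out so the
        loss is computed only on the response (an assumed convention;
        some codebases train on the full sequence instead).
    """
    response_logps = token_logps[prompt_len:]
    # Mean negative log-likelihood over the response tokens only.
    return -sum(response_logps) / len(response_logps)

# Two prompt tokens are ignored; the loss averages the last two tokens.
print(sft_loss([-1.0, -2.0, -3.0, -4.0], prompt_len=2))
```

Framework implementations express the same masking by setting prompt-token labels to an ignore index before the cross-entropy call.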
What SFT achieves
- Model learns the format and tone of assistant responses
- Establishes a prior for instruction-following behavior
- Provides the starting checkpoint for RL (the reference policy)
- Reduces the exploration space the RL phase must cover
SFT limitations
- Expensive to collect high-quality demonstrations at scale
- Demonstrates what good looks like, but does not encode relative quality judgments
- Behavior is constrained to patterns seen in demonstrations
- Cannot handle cases where humans disagree about what a good response is
The SFT model is also saved as the reference policy: a frozen copy used later in Stage 3 as a KL-divergence anchor to prevent the RL phase from drifting too far from safe, coherent text.
Stage 2: Reward Model Training
Human preference comparisons, not demonstrations, drive Stage 2. For a given prompt, the SFT model generates several candidate responses. Human labelers read them and rank them (or select which of a pair they prefer). These comparisons are converted into a scalar reward signal by training a reward model.
Reward Model Architecture
The reward model starts from the SFT model checkpoint. The language-modeling head (vocabulary projection) is replaced with a linear scalar head that outputs a single number: the predicted reward for that response. The model reads the full context (system prompt + user message + response) and outputs one scalar value.
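Stripped of the transformer itself, the scalar head is just a linear projection of the final token's hidden state. A pure-Python sketch (hypothetical names; real implementations use a framework linear layer over batched tensors):

```python
def reward_model_score(last_hidden_state, head_weights, head_bias=0.0):
    """Scalar reward head: the LM's vocabulary projection is replaced
    by a single linear unit applied to the hidden state at the final
    token of (prompt + response).

    last_hidden_state: hidden vector at the last token (list of floats).
    head_weights / head_bias: the new head's learned parameters.
    """
    return sum(h * w for h, w in zip(last_hidden_state, head_weights)) + head_bias

# Toy 3-dimensional "hidden state" mapped to one scalar reward.
print(reward_model_score([0.5, -1.0, 2.0], [1.0, 0.5, 0.25]))
```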
Bradley-Terry Model for Rankings
Human comparisons are not directly a regression target. The Bradley-Terry model is applied: if response A is preferred over response B, the probability of that preference is modeled as a sigmoid over the difference in predicted rewards. The reward model is trained with a binary cross-entropy loss on these pairwise comparisons:
Loss = -log σ(r(x, y_w) - r(x, y_l))
r = reward model output | y_w = preferred response | y_l = rejected response | x = prompt
This loss pushes the reward model to assign a higher scalar to preferred responses than to rejected ones, without needing absolute scores, only relative rankings.
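The pairwise loss above is small Python. A minimal sketch, written in the numerically stable softplus form of -log σ(d) (the function name `bt_loss` is illustrative, not from the source):

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r_w - r_l).

    r_w: reward-model scalar for the preferred response y_w.
    r_l: reward-model scalar for the rejected response y_l.
    """
    d = r_w - r_l
    # -log sigma(d) = log(1 + exp(-d)), computed stably for either sign of d.
    return math.log1p(math.exp(-d)) if d >= 0 else -d + math.log1p(math.exp(d))

print(bt_loss(2.0, 0.0))  # correct ranking with a wide margin: small loss
print(bt_loss(0.0, 0.0))  # equal scores: loss is log 2 (no preference learned)
print(bt_loss(0.0, 2.0))  # inverted ranking: large loss
```

During training, gradients of this loss flow through both forward passes of the reward model, widening the score gap between chosen and rejected responses.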
Human Annotation in Practice
| Dimension | What annotators evaluate | Key challenge |
|---|---|---|
| Helpfulness | Does the response address what the user actually wanted? | Subjective; depends on annotator background and context |
| Harmlessness | Does the response avoid dangerous, offensive, or illegal content? | Cultural variation; edge cases require difficult judgment calls |
| Honesty | Is the response factually accurate and appropriately uncertain? | Annotators may lack domain expertise to verify claims |
Inter-annotator agreement is a persistent challenge: even with detailed rubrics, different labelers rate the same pair differently 20-30% of the time. Cost per comparison ranges from roughly $1 to $10 USD depending on task complexity and contractor location. Large-scale RLHF runs require millions of comparisons, making human annotation a significant operational cost.
Stage 3: RL with PPO
The SFT model (now called the policy) is fine-tuned using Proximal Policy Optimization (PPO), a stable on-policy RL algorithm. At each training step:
1. Draw a prompt from the RL training dataset
2. The current policy LM generates a response token-by-token
3. The frozen reward model assigns a scalar reward to the (prompt, response) pair
4. Compute the KL divergence between the current policy and the frozen reference policy; subtract it from the reward
5. Update the policy parameters to maximize the KL-penalized reward using PPO's clipped objective
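The clipped objective in the final step can be sketched per token as follows. This is a minimal pure-Python illustration of PPO's surrogate (names are hypothetical; real implementations batch this over tensors and add value-function and entropy terms):

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """One token's contribution to PPO's clipped surrogate (maximized).

    logp_new / logp_old: log-prob of the sampled token under the current
        policy and under the policy that generated the rollout.
    advantage: estimated advantage for that token (from the value network).
    eps: clipping range; 0.2 is a conventional default.
    """
    ratio = math.exp(logp_new - logp_old)             # importance ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))   # clip to [1-eps, 1+eps]
    # Taking the min makes the objective pessimistic: a large policy
    # update earns no extra credit, which is what keeps PPO stable.
    return min(ratio * advantage, clipped * advantage)

print(ppo_clipped_term(-0.5, -0.5, 2.0))   # ratio 1: no clipping, term = advantage
print(ppo_clipped_term(0.0, -1.0, 1.0))    # ratio e > 1.2: clipped
```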
The KL Divergence Penalty: Why It Exists
Without a KL penalty, RL optimization rapidly finds degenerate solutions: the model discovers that certain phrase patterns, repetitive structures, or even incoherent token sequences score highly on the learned reward model even though they are not genuinely helpful. This is called reward hacking.
Why reward hacking happens
The reward model is trained on a finite sample of human comparisons and is only a proxy for true human preference. Maximizing this proxy with unlimited RL steps is equivalent to overfitting to the reward model's quirks rather than learning genuinely better behavior.
The KL penalty, β · KL(π_θ ‖ π_ref), is subtracted from the reward at every step. It measures how much the current policy distribution diverges from the frozen reference policy (the SFT model). A higher β tightens the leash, keeping generated text close to the SFT distribution; a lower β allows more exploration but increases reward-hacking risk. This term is the primary mechanism preventing catastrophic drift.
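In practice the penalty is often applied per token using the sampled-token estimate log π_θ(y_t) - log π_ref(y_t), with the reward model's scalar credited at the final token. A minimal sketch assuming that InstructGPT-style shaping (exact placement varies by codebase; `shaped_rewards` is an illustrative name):

```python
def shaped_rewards(policy_logps, ref_logps, rm_score, beta=0.1):
    """Per-token RL rewards with the KL penalty folded in.

    policy_logps / ref_logps: log-probs of each sampled response token
        under the current policy and under the frozen SFT reference.
    rm_score: scalar from the reward model, credited at the last token
        (an assumed convention; some implementations normalize or clip it).
    beta: KL coefficient controlling how tight the leash is.
    """
    # Sampled-token KL estimate at step t: log pi_theta(y_t) - log pi_ref(y_t)
    rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards

# A policy placing more mass than the reference on every sampled token
# pays a small penalty at each position; the RM reward lands at the end.
print(shaped_rewards([-0.5, -0.4, -0.3], [-0.9, -0.9, -0.9], rm_score=1.0))
```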
InstructGPT: The Foundational Result
Ouyang et al. (2022) applied this three-stage pipeline to GPT-3 (175B parameters). The resulting InstructGPT model at 1.3B parameters was preferred by human evaluators over the raw 175B GPT-3 on the vast majority of prompts. A model 100× smaller, but trained with human feedback, outperformed a much larger model trained purely on next-token prediction. This established RLHF as the standard post-training alignment technique and is the direct ancestor of ChatGPT.
Limitations of RLHF
Operational costs
- Human annotation is slow, expensive, and hard to scale
- PPO training is computationally unstable and requires careful hyperparameter tuning
- Requires running four models simultaneously: policy, reference policy, reward model, and value function network
Alignment quality
- Reward hacking: RL finds proxy-reward exploits not fully blocked by KL penalty
- Human preferences are inconsistent, culturally variable, and can encode annotator biases
- Sycophancy: models learn to tell evaluators what they want to hear
- Annotation rubrics shape model behavior in ways that may not generalize
These limitations motivated DPO, which eliminates the RL training loop entirely, and Constitutional AI, which replaces human harmlessness labels with AI-generated feedback at scale. Both are covered in adjacent pages.
Checklist: Do You Understand This?
- Can you name the three stages of RLHF and what each one produces?
- What is the role of the reference policy (frozen SFT model) in Stage 3?
- What is the Bradley-Terry model and why is it used to train the reward model?
- Why does the KL penalty exist, and what concrete failure mode does it prevent?
- What did InstructGPT demonstrate about model size vs. alignment quality?
- What are three concrete failure modes or cost drivers of the RLHF pipeline?