RLHF: Mechanics & Pipeline
Reinforcement Learning from Human Feedback (RLHF) is the technique that turned raw pre-trained language models into assistant-like systems that follow instructions, decline harmful requests, and give coherent, helpful answers. It is not a single algorithm but a three-stage pipeline: a base model is first taught to follow instructions via supervised learning, then a reward model is trained on human comparisons, and finally the language model is fine-tuned with reinforcement learning to maximize that learned reward. InstructGPT (Ouyang et al., 2022) established this pipeline; it is the direct ancestor of ChatGPT, Claude, and virtually every aligned foundation model since.
Three-Stage Pipeline Overview
The RLHF pipeline: each stage consumes the output of the previous one
Each stage has its own dataset, objective, and failure mode. Weakness at any stage propagates forward: a poorly curated SFT dataset undermines the reward model; a noisy reward model sabotages the RL phase.
Stage 1: Supervised Fine-Tuning (SFT)
The base model, a pre-trained LLM that has learned language structure but not instruction-following, is fine-tuned on a curated dataset of (prompt, ideal response) pairs. Human contractors write or select responses demonstrating the desired behavior: following instructions, being factually accurate, declining inappropriate requests politely, and formatting answers clearly.
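The SFT objective itself is ordinary next-token cross-entropy over the demonstration data; a common convention (assumed here, not stated in the source) is to mask out the prompt tokens so that only the response tokens carry loss. A minimal pure-Python sketch, with `sft_loss` as a hypothetical helper name:

```python
def sft_loss(token_logps, prompt_len):
    """Supervised fine-tuning loss for one (prompt, response) pair.

    token_logps: per-token log-probabilities the model assigns to the
        target sequence (prompt tokens followed by response tokens).
    prompt_len: number of prompt tokens. These are masked out so the
        loss is computed only on the response (an assumed convention;
        some codebases train on the full sequence instead).
    """
    response_logps = token_logps[prompt_len:]
    # Mean negative log-likelihood over the response tokens only.
    return -sum(response_logps) / len(response_logps)

# Two prompt tokens are ignored; the loss averages the last two tokens.
print(sft_loss([-1.0, -2.0, -3.0, -4.0], prompt_len=2))
```

Framework implementations express the same masking by setting prompt-token labels to an ignore index before the cross-entropy call.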
What SFT achieves
- Model learns the format and tone of assistant responses
- Establishes a prior for instruction-following behavior
- Provides the starting checkpoint for RL (the reference policy)
- Reduces the exploration space the RL phase must cover
SFT limitations
- Expensive to collect high-quality demonstrations at scale
- Demonstrates what good looks like, but does not encode relative quality judgments
- Behavior is constrained to patterns seen in demonstrations
- Cannot handle cases where humans disagree about what a good response is
The SFT model is also saved as the reference policy: a frozen copy used later in Stage 3 as a KL-divergence anchor to prevent the RL phase from drifting too far from safe, coherent text.
Stage 2: Reward Model Training
Human preference comparisons, not demonstrations, drive Stage 2. For a given prompt, the SFT model generates several candidate responses. Human labelers read them and rank them (or select which of a pair they prefer). These comparisons are converted into a scalar reward signal by training a reward model.
Reward Model Architecture
The reward model starts from the SFT model checkpoint. The language-modeling head (vocabulary projection) is replaced with a linear scalar head that outputs a single number: the predicted reward for that response. The model reads the full context (system prompt + user message + response) and outputs one scalar value.
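Stripped of the transformer itself, the scalar head is just a linear projection of the final token's hidden state. A pure-Python sketch (hypothetical names; real implementations use a framework linear layer over batched tensors):

```python
def reward_model_score(last_hidden_state, head_weights, head_bias=0.0):
    """Scalar reward head: the LM's vocabulary projection is replaced
    by a single linear unit applied to the hidden state at the final
    token of (prompt + response).

    last_hidden_state: hidden vector at the last token (list of floats).
    head_weights / head_bias: the new head's learned parameters.
    """
    return sum(h * w for h, w in zip(last_hidden_state, head_weights)) + head_bias

# Toy 3-dimensional "hidden state" mapped to one scalar reward.
print(reward_model_score([0.5, -1.0, 2.0], [1.0, 0.5, 0.25]))
```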
Bradley-Terry Model for Rankings
Human comparisons are not directly a regression target. The Bradley-Terry model is applied: if response A is preferred over response B, the probability of that preference is modeled as a sigmoid over the difference in predicted rewards. The reward model is trained with a binary cross-entropy loss on these pairwise comparisons:
Loss = -log σ(r(x, y_w) - r(x, y_l))
r = reward model output | y_w = preferred response | y_l = rejected response | x = prompt
This loss pushes the reward model to assign a higher scalar to preferred responses than to rejected ones, without needing absolute scores, only relative rankings.
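The pairwise loss above is small Python. A minimal sketch, written in the numerically stable softplus form of -log σ(d) (the function name `bt_loss` is illustrative, not from the source):

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r_w - r_l).

    r_w: reward-model scalar for the preferred response y_w.
    r_l: reward-model scalar for the rejected response y_l.
    """
    d = r_w - r_l
    # -log sigma(d) = log(1 + exp(-d)), computed stably for either sign of d.
    return math.log1p(math.exp(-d)) if d >= 0 else -d + math.log1p(math.exp(d))

print(bt_loss(2.0, 0.0))  # correct ranking with a wide margin: small loss
print(bt_loss(0.0, 0.0))  # equal scores: loss is log 2 (no preference learned)
print(bt_loss(0.0, 2.0))  # inverted ranking: large loss
```

During training, gradients of this loss flow through both forward passes of the reward model, widening the score gap between chosen and rejected responses.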
Human Annotation in Practice
| Dimension | What annotators evaluate | Key challenge |
|---|---|---|
| Helpfulness | Does the response address what the user actually wanted? | Subjective; depends on annotator background and context |
| Harmlessness | Does the response avoid dangerous, offensive, or illegal content? | Cultural variation; edge cases require difficult judgment calls |
| Honesty | Is the response factually accurate and appropriately uncertain? | Annotators may lack domain expertise to verify claims |
Inter-annotator agreement is a persistent challenge: even with detailed rubrics, different labelers rate the same pair differently 20-30% of the time. Cost per comparison ranges from roughly $1 to $10 USD depending on task complexity and contractor location. Large-scale RLHF runs require millions of comparisons, making human annotation a significant operational cost.
Stage 3: RL with PPO
The SFT model (now called the policy) is fine-tuned using Proximal Policy Optimization (PPO), a stable on-policy RL algorithm. At each training step:
1. Draw a prompt from the RL training dataset
2. The current policy LM generates a response token-by-token
3. The frozen reward model assigns a scalar reward to the (prompt, response) pair
4. Compute the KL divergence between the current policy and the frozen reference policy; subtract it from the reward
5. Update the policy parameters to maximize the KL-penalized reward using PPO's clipped objective
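The clipped objective in the final step can be sketched per token as follows. This is a minimal pure-Python illustration of PPO's surrogate (names are hypothetical; real implementations batch this over tensors and add value-function and entropy terms):

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """One token's contribution to PPO's clipped surrogate (maximized).

    logp_new / logp_old: log-prob of the sampled token under the current
        policy and under the policy that generated the rollout.
    advantage: estimated advantage for that token (from the value network).
    eps: clipping range; 0.2 is a conventional default.
    """
    ratio = math.exp(logp_new - logp_old)             # importance ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))   # clip to [1-eps, 1+eps]
    # Taking the min makes the objective pessimistic: a large policy
    # update earns no extra credit, which is what keeps PPO stable.
    return min(ratio * advantage, clipped * advantage)

print(ppo_clipped_term(-0.5, -0.5, 2.0))   # ratio 1: no clipping, term = advantage
print(ppo_clipped_term(0.0, -1.0, 1.0))    # ratio e > 1.2: clipped
```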
The KL Divergence Penalty: Why It Exists
Without a KL penalty, RL optimization rapidly finds degenerate solutions: the model discovers that certain phrase patterns, repetitive structures, or even incoherent token sequences score highly on the learned reward model even though they are not genuinely helpful. This is called reward hacking.
Why reward hacking happens
The reward model is trained on a finite sample of human comparisons and is only a proxy for true human preference. Maximizing this proxy with unlimited RL steps is equivalent to overfitting to the reward model's quirks rather than learning genuinely better behavior.
The KL penalty, β · KL(π_θ ‖ π_ref), is subtracted from the reward at every step. It measures how much the current policy distribution diverges from the frozen reference policy (the SFT model). A higher β tightens the leash, keeping generated text close to the SFT distribution; a lower β allows more exploration but increases reward-hacking risk. This term is the primary mechanism preventing catastrophic drift.
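In practice the penalty is often applied per token using the sampled-token estimate log π_θ(y_t) - log π_ref(y_t), with the reward model's scalar credited at the final token. A minimal sketch assuming that InstructGPT-style shaping (exact placement varies by codebase; `shaped_rewards` is an illustrative name):

```python
def shaped_rewards(policy_logps, ref_logps, rm_score, beta=0.1):
    """Per-token RL rewards with the KL penalty folded in.

    policy_logps / ref_logps: log-probs of each sampled response token
        under the current policy and under the frozen SFT reference.
    rm_score: scalar from the reward model, credited at the last token
        (an assumed convention; some implementations normalize or clip it).
    beta: KL coefficient controlling how tight the leash is.
    """
    # Sampled-token KL estimate at step t: log pi_theta(y_t) - log pi_ref(y_t)
    rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards

# A policy placing more mass than the reference on every sampled token
# pays a small penalty at each position; the RM reward lands at the end.
print(shaped_rewards([-0.5, -0.4, -0.3], [-0.9, -0.9, -0.9], rm_score=1.0))
```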
InstructGPT: The Foundational Result
Ouyang et al. (2022) applied this three-stage pipeline to GPT-3 (175B parameters). The resulting InstructGPT model at 1.3B parameters was preferred by human evaluators over the raw 175B GPT-3 on the vast majority of prompts. A model 100× smaller, but trained with human feedback, outperformed a much larger model trained purely on next-token prediction. This established RLHF as the standard post-training alignment technique and is the direct ancestor of ChatGPT.
Limitations of RLHF
Operational costs
- Human annotation is slow, expensive, and hard to scale
- PPO training is computationally unstable and requires careful hyperparameter tuning
- Requires running four models simultaneously: policy, reference policy, reward model, and value function network
Alignment quality
- Reward hacking: RL finds proxy-reward exploits not fully blocked by KL penalty
- Human preferences are inconsistent, culturally variable, and can encode annotator biases
- Sycophancy: models learn to tell evaluators what they want to hear
- Annotation rubrics shape model behavior in ways that may not generalize
These limitations motivated DPO, which eliminates the RL training loop entirely, and Constitutional AI, which replaces human harmlessness labels with AI-generated feedback at scale. Both are covered in adjacent pages.
Checklist: Do You Understand This?
- Can you name the three stages of RLHF and what each one produces?
- What is the role of the reference policy (frozen SFT model) in Stage 3?
- What is the Bradley-Terry model and why is it used to train the reward model?
- Why does the KL penalty exist, and what concrete failure mode does it prevent?
- What did InstructGPT demonstrate about model size vs. alignment quality?
- What are three concrete failure modes or cost drivers of the RLHF pipeline?