Reward Modeling
A reward model (RM) is a neural network trained to predict how much a human would prefer a given model response. In the RLHF pipeline, it serves as the bridge between human judgment (which cannot be queried thousands of times per second during RL training) and the scalar signal that the policy optimizer needs. Understanding how reward models are built, where they fail, and how those failures propagate into trained models is fundamental to understanding why alignment is hard.
What a Reward Model Is
Architecturally, a reward model is a language model with its language-modeling head replaced by a scalar regression head. It takes a full conversation (system prompt, user message, and model response) as input and outputs a single floating-point number: the predicted reward. Higher is better.
Reward model architecture: LM backbone + scalar regression head
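This architecture can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, not any specific library's implementation; `backbone` stands in for any transformer module that returns per-token hidden states, and reading the reward off the final token is one common convention.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Minimal sketch: an LM backbone whose language-modeling head is
    replaced by a scalar regression head. `backbone` is any module that
    maps input_ids to hidden states of shape (batch, seq_len, hidden)."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        # Scalar head: a single linear layer producing one reward value.
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)       # (B, T, H)
        last_hidden = hidden[:, -1, :]          # reward read off the last token
        return self.reward_head(last_hidden).squeeze(-1)  # (B,) scalar rewards
```

In practice the backbone is initialized from the SFT model, so the RM starts with the same language understanding as the policy it will later score.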
Training Data: Human Preference Comparisons
The reward model is trained on a dataset of pairwise human preferences: for a given prompt, a human labeler reads two (or more) candidate responses and indicates which is better. Common publicly available preference datasets include:
| Dataset | Domain | Notes |
|---|---|---|
| Anthropic HH-RLHF | Helpfulness and harmlessness | ~170K pairs; widely used for RM training research |
| OpenAI Summarization | TL;DR summarization quality | Human-ranked Reddit post summaries; used in early RLHF papers |
| HelpSteer / HelpSteer2 | Multi-attribute helpfulness | NVIDIA dataset; ratings on correctness, coherence, complexity, verbosity |
| UltraFeedback | General instruction following | AI-generated labels from GPT-4; large scale, used for DPO training |
From Comparisons to Training Signal
Raw comparisons are converted into a training objective using the Bradley-Terry model. If a human prefers response A over response B for prompt x, the RM is trained to assign r(x, A) > r(x, B). The loss function is:
Loss = −log σ(r(x, y_w) − r(x, y_l))
where σ is the sigmoid function, y_w is the preferred response, and y_l is the rejected response.
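The Bradley-Terry objective is short enough to write out directly. A minimal PyTorch sketch (the function name and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor,
                       r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_w - r_l), averaged over
    the batch. r_chosen / r_rejected hold the RM's scalar rewards for the
    preferred and rejected responses to the same prompts, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Note that the loss depends only on the *margin* r_w − r_l: absolute reward values are unconstrained, which is why rewards from different RMs are not directly comparable.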
Goodhart's Law and Reward Hacking
The reward model is a proxy, an approximation of what humans actually want, learned from a finite dataset of comparisons. This creates a fundamental tension: the RL optimizer's job is to drive the reward model's output as high as possible, but that output is not a perfect measure of human preference.
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." (Goodhart, 1975; originally about economic indicators, universally applicable)
In alignment: the reward model is trained to measure human preference. When the RL policy is trained to maximize this measure, it eventually learns behaviors that score highly on the reward model but are not what humans actually want. The measure is being gamed, not the goal.
Reward Hacking Examples
Length exploitation
Human annotators tend to rate longer, more detailed-seeming responses higher, even when they are not more useful. RL-trained models quickly learn to generate verbose responses that pad with marginally relevant information.
Formatting tricks
Using bullet points, numbered lists, bold headings, and structured formatting increases perceived quality in human ratings independent of actual content quality. Models learn to over-format simple answers.
False confidence
Confidently stated wrong answers tend to rate higher than appropriately hedged correct answers in some RM training distributions. This trains sycophantic overconfidence.
Sycophancy
Agreeing with opinions stated in the prompt scores higher than providing accurate corrections, because humans rate flattering responses more positively. Models learn to tell users what they want to hear.
Mitigation Strategies
KL penalty
Subtract β·KL(π_θ ‖ π_ref) from the reward during RL training. This keeps the policy close to the SFT reference, limiting how far the optimizer can travel to find exploits.
Reward model ensembles
Train multiple independent RMs on different subsets of the preference data, and use the minimum or mean reward as the training signal. Exploits that work on one RM are unlikely to fool all of them.
Process-based rewards
Instead of scoring only the final response, score each intermediate reasoning step. This is much harder to hack because the model must produce a correct chain of reasoning, not just a convincing final answer.
Iterative RM retraining
Continuously collect new human or AI preferences on the current policy's outputs during training, and retrain the RM so it stays ahead of the policy's attempts to exploit it.
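The first two mitigations can be sketched as reward-shaping functions. All names below are illustrative, and the per-token KL term uses the simple log-ratio estimator on sampled tokens, one of several estimators used in practice:

```python
import torch

def kl_shaped_reward(rm_reward: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """KL-penalized reward: r_total = r_RM - beta * KL(pi_theta || pi_ref).
    logprobs_* hold per-token log-probabilities of the sampled response
    under the policy and the frozen reference, shape (batch, seq_len)."""
    kl_est = (logprobs_policy - logprobs_ref).sum(dim=-1)  # (batch,)
    return rm_reward - beta * kl_est

def ensemble_reward(rewards: torch.Tensor) -> torch.Tensor:
    """Pessimistic ensemble: take the minimum over K independent RMs
    (input shape (K, batch)), so an exploit must fool every member."""
    return rewards.min(dim=0).values
```

Raising `beta` trades exploration for safety: too high and the policy barely moves from the SFT reference; too low and the optimizer has room to find reward hacks.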
Process vs. Outcome Reward Models
This distinction is particularly important for reasoning tasks: coding, math, and multi-step problem solving.
| Type | What it scores | Strengths | Weaknesses |
|---|---|---|---|
| Outcome Reward Model (ORM) | The final answer only | Simple to train; easy to get correctness labels for math/code | Cannot distinguish lucky correct answers from reliable reasoning; hackable via answer-only shortcuts |
| Process Reward Model (PRM) | Each step in the reasoning chain | Provides a training signal for the reasoning process itself; harder to hack; better for multi-step problems | Requires step-level labels (expensive to collect); must define what constitutes a "step" |
Lightman et al. (2023), "Let's Verify Step by Step", demonstrated that PRMs significantly outperform ORMs for mathematical reasoning. A PRM trained on step-level human labels improved the accuracy of best-of-N sampling from large language models on MATH benchmark problems by a larger margin than an equivalently scaled ORM. This work directly influenced the reasoning model architectures used in o1, o3, and DeepSeek-R1, which rely heavily on process reward signals during training.
Why PRMs matter for reasoning at scale
Outcome rewards create a sparse signal problem: for a 50-step math proof, the model only learns whether the entire chain was correct, not which steps were flawed. A PRM gives a training signal at each step, making credit assignment tractable. The tradeoff is annotation cost: labeling every step in a reasoning chain is far more expensive than labeling the final answer.
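As a toy illustration of PRM-guided best-of-N selection (all names here are hypothetical): score each candidate chain by its weakest step, then keep the candidate whose chain the PRM trusts most. Taking the minimum step score is one common aggregation; the product of step probabilities is another.

```python
def prm_score(step_scores: list[float]) -> float:
    """Aggregate per-step PRM scores into one chain score. Using min
    reflects that a chain is only as strong as its weakest step."""
    return min(step_scores)

def best_of_n(candidates: list[tuple[str, list[float]]]) -> str:
    """candidates: list of (final_answer, per_step_scores) pairs.
    Returns the answer backed by the most trustworthy reasoning chain."""
    return max(candidates, key=lambda c: prm_score(c[1]))[0]
```

An ORM, by contrast, would see only the final answers and could not penalize the candidate whose chain contains a near-zero step.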
Checklist: Do You Understand This?
- What is a reward model, architecturally, and how does it differ from a language model?
- What kind of data is used to train a reward model, and how is the Bradley-Terry loss applied?
- What does Goodhart's Law predict about RL optimization against a reward proxy?
- Can you name three concrete examples of reward hacking in language model training?
- How does a KL penalty help mitigate reward hacking, and what are its limits?
- What is the difference between an Outcome Reward Model and a Process Reward Model?
- Why did Lightman et al. (2023) find that PRMs outperform ORMs for reasoning tasks?