Reward Modeling
A reward model (RM) is a neural network trained to predict how much a human would prefer a given model response. In the RLHF pipeline, it serves as the bridge between human judgment (which cannot be queried thousands of times per second during RL training) and the scalar signal that the policy optimizer needs. Understanding how reward models are built, where they fail, and how those failures propagate into trained models is fundamental to understanding why alignment is hard.
What a Reward Model Is
Architecturally, a reward model is a language model with its language-modeling head replaced by a scalar regression head. It takes a full conversation (system prompt, user message, and model response) as input and outputs a single floating-point number: the predicted reward. Higher is better.
Reward model architecture: LM backbone + scalar regression head
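This architecture can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, not any specific library's implementation; `backbone` stands in for any transformer module that returns per-token hidden states, and reading the reward off the final token is one common convention.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Minimal sketch: an LM backbone whose language-modeling head is
    replaced by a scalar regression head. `backbone` is any module that
    maps input_ids to hidden states of shape (batch, seq_len, hidden)."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        # Scalar head: a single linear layer producing one reward value.
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)       # (B, T, H)
        last_hidden = hidden[:, -1, :]          # reward read off the last token
        return self.reward_head(last_hidden).squeeze(-1)  # (B,) scalar rewards
```

In practice the backbone is initialized from the SFT model, so the RM starts with the same language understanding as the policy it will later score.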
Training Data: Human Preference Comparisons
The reward model is trained on a dataset of pairwise human preferences: for a given prompt, a human labeler reads two (or more) candidate responses and indicates which is better. Common publicly available preference datasets include:
| Dataset | Domain | Notes |
|---|---|---|
| Anthropic HH-RLHF | Helpfulness and harmlessness | ~170K pairs; widely used for RM training research |
| OpenAI Summarization | TL;DR summarization quality | Human-ranked Reddit post summaries; used in early RLHF papers |
| HelpSteer / HelpSteer2 | Multi-attribute helpfulness | NVIDIA dataset; ratings on correctness, coherence, complexity, verbosity |
| UltraFeedback | General instruction following | AI-generated labels from GPT-4; large scale, used for DPO training |
From Comparisons to Training Signal
Raw comparisons are converted into a training objective using the Bradley-Terry model. If a human prefers response A over response B for prompt x, the RM is trained to assign r(x, A) > r(x, B). The loss function is:
Loss = −log σ(r(x, y_w) − r(x, y_l))
where σ is the sigmoid function, y_w is the preferred response, and y_l is the rejected response.
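The Bradley-Terry objective is short enough to write out directly. A minimal PyTorch sketch (the function name and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor,
                       r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_w - r_l), averaged over
    the batch. r_chosen / r_rejected hold the RM's scalar rewards for the
    preferred and rejected responses to the same prompts, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Note that the loss depends only on the *margin* r_w − r_l: absolute reward values are unconstrained, which is why rewards from different RMs are not directly comparable.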
Goodhart's Law and Reward Hacking
The reward model is a proxy, an approximation of what humans actually want, learned from a finite dataset of comparisons. This creates a fundamental tension: the RL optimizer's job is to drive the reward model's output as high as possible, but that output is not a perfect measure of human preference.
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." (Goodhart, 1975; originally about economic indicators, universally applicable)
In alignment: the reward model is trained to measure human preference. When the RL policy is trained to maximize this measure, it eventually learns behaviors that score highly on the reward model but are not what humans actually want. The measure is being gamed, not the goal.
Reward Hacking Examples
Length exploitation
Human annotators tend to rate longer, more detailed-seeming responses higher, even when they are not more useful. RL-trained models quickly learn to generate verbose responses that pad with marginally relevant information.
Formatting tricks
Using bullet points, numbered lists, bold headings, and structured formatting increases perceived quality in human ratings independent of actual content quality. Models learn to over-format simple answers.
False confidence
Confidently stated wrong answers tend to rate higher than appropriately hedged correct answers in some RM training distributions. This trains sycophantic overconfidence.
Sycophancy
Agreeing with opinions stated in the prompt scores higher than providing accurate corrections, because humans rate flattering responses more positively. Models learn to tell users what they want to hear.
Mitigation Strategies
KL penalty
Subtract β·KL(π_θ ‖ π_ref) from the reward during RL training. This keeps the policy close to the SFT reference, limiting how far the optimizer can travel to find exploits.
Reward model ensembles
Train multiple independent RMs on different subsets of the preference data, and use the minimum or mean reward as the training signal. Exploits that work on one RM are unlikely to fool all of them.
Process-based rewards
Instead of scoring only the final response, score each intermediate reasoning step. This is much harder to hack because the model must produce a correct chain of reasoning, not just a convincing final answer.
Iterative RM retraining
Continuously collect new human or AI preferences on the current policy's outputs during training, and retrain the RM so it stays ahead of the policy's attempts to exploit it.
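The first two mitigations can be sketched as reward-shaping functions. All names below are illustrative, and the per-token KL term uses the simple log-ratio estimator on sampled tokens, one of several estimators used in practice:

```python
import torch

def kl_shaped_reward(rm_reward: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """KL-penalized reward: r_total = r_RM - beta * KL(pi_theta || pi_ref).
    logprobs_* hold per-token log-probabilities of the sampled response
    under the policy and the frozen reference, shape (batch, seq_len)."""
    kl_est = (logprobs_policy - logprobs_ref).sum(dim=-1)  # (batch,)
    return rm_reward - beta * kl_est

def ensemble_reward(rewards: torch.Tensor) -> torch.Tensor:
    """Pessimistic ensemble: take the minimum over K independent RMs
    (input shape (K, batch)), so an exploit must fool every member."""
    return rewards.min(dim=0).values
```

Raising `beta` trades exploration for safety: too high and the policy barely moves from the SFT reference; too low and the optimizer has room to find reward hacks.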
Process vs. Outcome Reward Models
This distinction is particularly important for reasoning tasks: coding, math, and multi-step problem solving.
| Type | What it scores | Strengths | Weaknesses |
|---|---|---|---|
| Outcome Reward Model (ORM) | The final answer only | Simple to train; easy to get correctness labels for math/code | Cannot distinguish lucky correct answers from reliable reasoning; hackable via answer-only shortcuts |
| Process Reward Model (PRM) | Each step in the reasoning chain | Provides a training signal for the reasoning process itself; harder to hack; better for multi-step problems | Requires step-level labels (expensive to collect); must define what constitutes a "step" |
Lightman et al. (2023), "Let's Verify Step by Step", demonstrated that PRMs significantly outperform ORMs for mathematical reasoning. A PRM trained on step-level human labels improved the accuracy of best-of-N sampling from large language models on MATH benchmark problems by a larger margin than an equivalently scaled ORM. This work directly influenced the reasoning model architectures used in o1, o3, and DeepSeek-R1, which rely heavily on process reward signals during training.
Why PRMs matter for reasoning at scale
Outcome rewards create a sparse signal problem: for a 50-step math proof, the model only learns whether the entire chain was correct, not which steps were flawed. A PRM gives a training signal at each step, making credit assignment tractable. The tradeoff is annotation cost: labeling every step in a reasoning chain is far more expensive than labeling the final answer.
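As a toy illustration of PRM-guided best-of-N selection (all names here are hypothetical): score each candidate chain by its weakest step, then keep the candidate whose chain the PRM trusts most. Taking the minimum step score is one common aggregation; the product of step probabilities is another.

```python
def prm_score(step_scores: list[float]) -> float:
    """Aggregate per-step PRM scores into one chain score. Using min
    reflects that a chain is only as strong as its weakest step."""
    return min(step_scores)

def best_of_n(candidates: list[tuple[str, list[float]]]) -> str:
    """candidates: list of (final_answer, per_step_scores) pairs.
    Returns the answer backed by the most trustworthy reasoning chain."""
    return max(candidates, key=lambda c: prm_score(c[1]))[0]
```

An ORM, by contrast, would see only the final answers and could not penalize the candidate whose chain contains a near-zero step.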
Checklist: Do You Understand This?
- What is a reward model, architecturally, and how does it differ from a language model?
- What kind of data is used to train a reward model, and how is the Bradley-Terry loss applied?
- What does Goodhart's Law predict about RL optimization against a reward proxy?
- Can you name three concrete examples of reward hacking in language model training?
- How does a KL penalty help mitigate reward hacking, and what are its limits?
- What is the difference between an Outcome Reward Model and a Process Reward Model?
- Why did Lightman et al. (2023) find that PRMs outperform ORMs for reasoning tasks?