Alignment Techniques
Methods for training AI systems to follow instructions and act in accordance with human preferences — from RLHF to direct preference optimization.
In This Section
RLHF — Mechanics & Pipeline
The SFT → reward model → PPO pipeline, KL penalty, and InstructGPT.
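The KL penalty in that pipeline keeps the PPO policy from drifting too far from the SFT reference model. A minimal sketch of the InstructGPT-style shaped reward, assuming sequence-level log-probabilities and a hypothetical `kl_shaped_reward` helper (the `beta` value is illustrative):

```python
def kl_shaped_reward(rm_score, policy_logps, ref_logps, beta=0.02):
    """RLHF training reward: reward-model score minus a KL penalty
    toward the frozen SFT reference policy (a common estimator sums
    per-token log-prob differences)."""
    kl_estimate = sum(p - r for p, r in zip(policy_logps, ref_logps))
    return rm_score - beta * kl_estimate
```

When the policy matches the reference exactly, the penalty vanishes and the reward model's score passes through unchanged; as the policy diverges, the penalty grows and offsets gains from reward hacking.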
DPO — Direct Preference Optimization
How DPO replaces PPO with a simpler loss function derived from the same objective.
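As a preview of that derivation: DPO's loss is a logistic loss on the difference of implicit rewards, each defined as the scaled log-ratio of policy to reference probability. A minimal sketch for a single preference pair, assuming precomputed sequence log-probabilities (function name and `beta` value are illustrative):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the gap
    between implicit rewards beta * log(pi / pi_ref)."""
    chosen_margin = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (policy_logp_rejected - ref_logp_rejected)
    logits = chosen_margin - rejected_margin
    # Numerically stable -log(sigmoid(logits)).
    return math.log1p(math.exp(-logits))
```

The loss falls as the policy raises the chosen response's probability relative to the reference, and rises if it favors the rejected one — no reward model or PPO rollout is needed.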
Constitutional AI & Self-Critique
Training models to critique and revise outputs against a written constitution.
Reward Modeling
How reward models are trained, Goodhart's Law, and process vs outcome reward models.
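The standard pairwise training objective for a reward model is the Bradley–Terry loss: the model is pushed to score the human-preferred response above the rejected one. A minimal sketch (function name is illustrative; real training batches these over many comparisons):

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss on one comparison: -log sigmoid of the
    score gap between chosen and rejected responses."""
    gap = reward_chosen - reward_rejected
    return math.log1p(math.exp(-gap))
```

The loss depends only on the score *gap*, which is why reward-model scores are meaningful relatively, not absolutely — and why over-optimizing that gap invites Goodhart's Law.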