Alignment Techniques
Methods for training AI systems to follow instructions and act in accordance with human preferences — from RLHF to direct preference optimization.
In This Section
RLHF — Mechanics & Pipeline
The SFT → reward model → PPO pipeline, KL penalty, and InstructGPT.
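The KL penalty in that pipeline keeps the PPO policy from drifting too far from the SFT reference model. A minimal sketch of the InstructGPT-style shaped reward, assuming sequence-level log-probabilities and a hypothetical `kl_shaped_reward` helper (the `beta` value is illustrative):

```python
def kl_shaped_reward(rm_score, policy_logps, ref_logps, beta=0.02):
    """RLHF training reward: reward-model score minus a KL penalty
    toward the frozen SFT reference policy (a common estimator sums
    per-token log-prob differences)."""
    kl_estimate = sum(p - r for p, r in zip(policy_logps, ref_logps))
    return rm_score - beta * kl_estimate
```

When the policy matches the reference exactly, the penalty vanishes and the reward model's score passes through unchanged; as the policy diverges, the penalty grows and offsets gains from reward hacking.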
DPO — Direct Preference Optimization
How DPO replaces PPO with a simpler loss function derived from the same objective.
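As a preview of that derivation: DPO's loss is a logistic loss on the difference of implicit rewards, each defined as the scaled log-ratio of policy to reference probability. A minimal sketch for a single preference pair, assuming precomputed sequence log-probabilities (function name and `beta` value are illustrative):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the gap
    between implicit rewards beta * log(pi / pi_ref)."""
    chosen_margin = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (policy_logp_rejected - ref_logp_rejected)
    logits = chosen_margin - rejected_margin
    # Numerically stable -log(sigmoid(logits)).
    return math.log1p(math.exp(-logits))
```

The loss falls as the policy raises the chosen response's probability relative to the reference, and rises if it favors the rejected one — no reward model or PPO rollout is needed.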
Constitutional AI & Self-Critique
Training models to critique and revise outputs against a written constitution.
Reward Modeling
How reward models are trained, Goodhart's Law, and process vs outcome reward models.
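The standard pairwise training objective for a reward model is the Bradley–Terry loss: the model is pushed to score the human-preferred response above the rejected one. A minimal sketch (function name is illustrative; real training batches these over many comparisons):

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss on one comparison: -log sigmoid of the
    score gap between chosen and rejected responses."""
    gap = reward_chosen - reward_rejected
    return math.log1p(math.exp(-gap))
```

The loss depends only on the score *gap*, which is why reward-model scores are meaningful relatively, not absolutely — and why over-optimizing that gap invites Goodhart's Law.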