Constitutional AI & Self-Critique
Constitutional AI (CAI) is Anthropic's approach to alignment, introduced in Bai et al. (2022). The central idea is to train a model to critique and revise its own outputs according to a written set of principles — the "constitution" — rather than relying exclusively on human labelers to rate the harmlessness of every response. This reduces the cost and inconsistency of human annotation for harmlessness while maintaining or improving the quality of the alignment signal.
CAI is a two-phase process: a supervised learning phase in which the model revises harmful outputs using AI-generated critiques, followed by a reinforcement learning phase in which AI-generated preference labels (rather than human labels) train a reward model. The result is a model that is simultaneously helpful and harmless without requiring human feedback on every possible harmful scenario.
The Constitution
The constitution is a document — typically 10 to 20 principles — that defines what a "harmless, helpful, and honest" response looks like. Principles are drawn from multiple sources to provide broad coverage:
UN Declaration of Human Rights
Principles around dignity, fairness, and protection from harm that should be reflected in AI responses.
DeepMind Sparrow Rules
Explicit rules about what constitutes harmful content — hate speech, bioweapons guidance, CSAM — and how an assistant should decline.
Helpfulness principles
Principles drawn from industry content guidelines, such as Apple's terms of service, about what makes a useful, accurate, and trustworthy assistant.
During CAI training, the model is given a principle sampled at random from the constitution and asked to evaluate its output against that specific principle. Sampling a different principle for each critique prevents the model from over-fitting to any single definition of "harmless."
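The per-critique sampling can be sketched in a few lines. The principles below are paraphrased illustrations, not Anthropic's actual constitution text:

```python
import random

# Hypothetical mini-constitution; the real document contains many more
# principles, drawn from the sources listed above.
CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response most supportive of human dignity and rights.",
    "Choose the response that is most honest about its own limitations.",
]

def sample_principle(rng=random):
    """Pick one principle at random for a single critique pass.

    Using a different principle per example keeps the model from
    over-fitting to one narrow definition of harmlessness.
    """
    return rng.choice(CONSTITUTION)
```

Each training example gets its own draw, so over a large dataset every principle contributes to the harmlessness signal.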
Phase 1 — Supervised Learning from AI Feedback (SL-CAI)
The first phase produces a revised dataset of (harmful prompt, harmless response) pairs using the model's own self-critique ability — no human labels are involved for this phase's harmlessness signal.
1. The initial model responds to a red-teaming prompt, often producing a harmful or boundary-violating answer.
2. The model is given a constitutional principle and asked: "Identify specific ways in which this response is harmful, unethical, or violates the principle."
3. The model is asked to rewrite its response to address the identified problems while remaining helpful.
4. The (prompt, revised response) pairs form a supervised fine-tuning dataset; the model is trained to produce these revisions directly.
The chain-of-thought structure — critique first, then revise — is important. Asking the model to identify specific problems before rewriting produces higher-quality revisions than asking for a direct rewrite. The critique step forces the model to articulate what is wrong, which then guides the revision.
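The respond-critique-revise loop can be sketched as follows. Here `generate` is a hypothetical stand-in for a call to the model being trained, and the prompt templates are illustrative, not the paper's exact wording:

```python
import random

# Illustrative templates; the actual CAI prompts are more detailed.
CRITIQUE_TEMPLATE = (
    "Identify specific ways in which this response is harmful, unethical, "
    "or violates the following principle:\n{principle}\n\n"
    "Response:\n{response}"
)
REVISION_TEMPLATE = (
    "Rewrite the response to address the problems in the critique below, "
    "while remaining as helpful as possible.\n\n"
    "Critique:\n{critique}\n\nOriginal response:\n{response}"
)

def sl_cai_example(generate, prompt, principles, rng=random):
    """Produce one SL-CAI training pair: respond, critique, then revise.

    `generate` takes a prompt string and returns a completion string.
    Returns the (prompt, revised_response) pair used for fine-tuning.
    """
    response = generate(prompt)                       # initial (possibly harmful) answer
    principle = rng.choice(principles)                # one principle per critique
    critique = generate(CRITIQUE_TEMPLATE.format(
        principle=principle, response=response))      # articulate what is wrong
    revision = generate(REVISION_TEMPLATE.format(
        critique=critique, response=response))        # rewrite guided by the critique
    return prompt, revision
```

Note that the critique is an intermediate artifact: only the (prompt, revision) pair enters the fine-tuning dataset, so the deployed model learns to produce the revised behavior directly.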
Phase 2 — Reinforcement Learning from AI Feedback (RLAIF)
Phase 2 replaces human preference labelers for harmlessness with AI-generated preference labels — while human feedback may still be used for helpfulness.
*Figure: RLAIF pipeline, in which an AI evaluator replaces human labelers for harmlessness preferences.*
The AI evaluator is given a constitutional principle and asked a direct comparison question: "Which response is less harmful and more in line with the principle?" The answer — along with a chain-of-thought rationale — is used as the preference label. This produces a large-scale preference dataset at a fraction of the cost of human annotation.
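A minimal sketch of one AI-labeled comparison, again using a hypothetical `generate` callable for the evaluator model and an illustrative prompt template:

```python
# Illustrative template; the paper's comparison prompts differ in wording.
COMPARISON_TEMPLATE = (
    "Consider the following principle:\n{principle}\n\n"
    "Which response is less harmful and more in line with the principle?\n"
    "(A) {a}\n(B) {b}\n"
    "Explain your reasoning, then end your answer with A or B."
)

def ai_preference_label(generate, principle, response_a, response_b):
    """Label one pairwise comparison with an AI evaluator.

    Returns (label, rationale): label 0 means A is preferred, 1 means B.
    The chain-of-thought rationale is kept alongside the label for auditing.
    """
    rationale = generate(COMPARISON_TEMPLATE.format(
        principle=principle, a=response_a, b=response_b))
    # Toy verdict parse: real pipelines extract the answer more robustly.
    label = 1 if rationale.strip().endswith("B") else 0
    return label, rationale
```

Because each label costs one model call rather than a paid human comparison, the same loop can be run over millions of prompt pairs to build the preference dataset that trains the reward model.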
RLAIF vs. RLHF — What Changes
| Dimension | RLHF (human feedback) | RLAIF (AI feedback) |
|---|---|---|
| Cost per label | $1–$10 per comparison | Fraction of a cent (LLM API call) |
| Scale | Limited by contractor capacity | Scales to millions of comparisons easily |
| Consistency | 20–30% inter-annotator disagreement | Highly consistent within the same evaluator model |
| Coverage of harm types | Limited by what red-teamers think to probe | Can systematically cover principle categories |
| Helpfulness signal | Human-rated; captures nuance well | AI-rated; may miss subtle helpfulness distinctions |
Anthropic's experiments found that RLAIF-trained models achieved comparable or better harmlessness scores than human-feedback-trained models, without sacrificing helpfulness. The key benefit is scale: the AI evaluator can be applied to far more prompts than a human labeling budget allows.
Scalable Oversight Connection
Constitutional AI connects to a broader alignment research question: how do we evaluate AI outputs on tasks where humans cannot easily verify correctness? RLAIF shows a path where a capable AI evaluator (using a clear principle as its rubric) can generate reliable training signal for another model — potentially scaling to domains where human judgment is bottlenecked by expertise or time. This is sometimes called scalable oversight: using AI assistance to allow humans (or AI evaluators) to supervise harder tasks than they could alone.
Open question
RLAIF assumes the AI evaluator is capable enough to reliably judge responses against the constitution. If the evaluator has its own biases or blind spots — which it will — those biases propagate into the trained model at scale. The quality of the constitution and the evaluator model sets a ceiling on alignment quality.
Practical Impact
Constitutional AI is the alignment approach underlying Anthropic's Claude models. The technique has been influential beyond Anthropic: RLAIF as a general method (using AI preference labels) is now widely used in alignment pipelines where human annotation is a bottleneck. Google's research on RLAIF (Lee et al., 2023) confirmed that AI-generated labels can match human-generated labels for preference training on many tasks, significantly reducing the need for large-scale human feedback for every new model version.
Checklist: Do You Understand This?
- What is the "constitution" in Constitutional AI — what kinds of principles does it contain?
- What happens in Phase 1 (SL-CAI) — what dataset does it produce and how?
- Why is the critique step before revision important — what does it improve?
- How does RLAIF differ from RLHF in terms of where the preference labels come from?
- What are three practical advantages of RLAIF over human-labeled preference data?
- What is scalable oversight, and how does Constitutional AI relate to it?
- What is the main limitation or risk of using AI-generated preference labels?