
LLM-as-Judge Evaluation

Human evaluation of LLM outputs is expensive, slow, and doesn't scale. Automated metrics like BLEU and ROUGE correlate poorly with human judgement for open-ended tasks. LLM-as-Judge is the middle path: use a capable language model to evaluate other models' outputs at scale. Done right, it correlates strongly with human preference. Done wrong, it amplifies the judge model's biases as ground truth.

Why LLM-as-Judge Emerged

Traditional evaluation approaches break down for modern LLM systems:

Human annotation

Expensive ($0.05–$1 per label), slow (hours to days), hard to scale, inconsistent across annotators

String-match metrics (BLEU/ROUGE)

They measure token overlap, not quality. A paraphrase can score near zero; a verbatim wrong answer scores high. Useless for open-ended generation.
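
A toy overlap score makes the failure mode concrete. This is a simplified unigram F1 in the spirit of ROUGE-1, not the full metric:

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style score: F1 over unigram overlap (not the full metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the meeting was postponed until next week"
paraphrase = "they pushed back the gathering by seven days"
verbatim_wrong = "the meeting was postponed until next year"

print(unigram_f1(paraphrase, reference))      # low (~0.13) despite same meaning
print(unigram_f1(verbatim_wrong, reference))  # high (~0.86) despite wrong answer
```

The paraphrase shares only "the" with the reference, so overlap metrics punish it; the wrong answer shares six of seven tokens and is rewarded.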

LLM-as-Judge

Cheap (~$0.001 per eval), fast (seconds), scalable to millions of outputs, ~80–90% agreement with human preference on well-designed rubrics

Two Scoring Modes

Absolute scoring (single output)
  • System prompt + rubric: define criteria explicitly
  • Response to evaluate: the output being judged
  • Score + reasoning: 1–5 scale with explanation

Pairwise scoring (two outputs)
  • System prompt + criteria: what to optimise for
  • Response A vs Response B: anonymous, randomised order
  • Winner + explanation: A wins / B wins / tie

Pairwise is more reliable but 2× the cost; absolute scoring is faster and more scalable.

Mode | How it works | Strengths | Weaknesses
Absolute (pointwise) | Judge scores one response on a 1–5 or 1–10 rubric | Scalable to large sets; produces ordinal scores for regression | Scores drift without calibration; sensitive to rubric wording
Pairwise | Judge picks the better of two responses (A or B) | Stronger agreement with human preference; easier to reason about | 2× cost; O(n²) comparisons for rankings; position bias
Reference-based | Judge compares response against a ground-truth reference answer | Factual accuracy easier to assess with reference | Needs gold answers; judge may over-favour verbatim similarity

Designing the Judge Prompt

The judge prompt is the most critical component. Vague criteria produce inconsistent scores. A well-designed judge prompt has:

1. Role definition: Establish the judge as an expert evaluator for the specific domain. E.g.: 'You are an expert software engineer evaluating Python code quality.'

2. Explicit criteria: List each dimension to evaluate and what constitutes each score level. Ambiguity in criteria = noise in scores.

3. Scoring scale with anchors: Define what 1, 3, and 5 mean concretely (not just 'bad, ok, good'). Anchor with examples where possible.

4. Chain-of-thought before score: Instruct the judge to reason first, then output the score. 'First, analyse the response against each criterion. Then output your score.' Reduces score variance by ~30%.

5. Structured output format: Require JSON output: { reasoning: string, score: number }. Parseable output is essential for automated pipelines.

Example judge prompt structure:

You are an expert evaluator for customer support responses.

Evaluate the response on three criteria:

1. Accuracy (1–5): Does it correctly answer the question?

2. Completeness (1–5): Does it address all parts of the question?

3. Tone (1–5): Is it professional and empathetic?

First write your analysis for each criterion.

Then output JSON: { "accuracy": N, "completeness": N, "tone": N, "overall": N }
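
Once the judge returns its completion, the pipeline must extract and validate that JSON. A minimal parsing sketch, assuming the judge call itself happens elsewhere; `parse_judge_output` and `REQUIRED_KEYS` are illustrative names, not a library API:

```python
import json

# Keys the rubric above asks the judge to emit, each on a 1-5 scale.
REQUIRED_KEYS = {"accuracy", "completeness", "tone", "overall"}

def parse_judge_output(raw: str) -> dict:
    """Extract and validate JSON scores from a judge completion.

    Judges often wrap the JSON in chain-of-thought text, so take the
    last {...} span rather than assuming the whole output is JSON.
    """
    start, end = raw.rfind("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in judge output")
    scores = json.loads(raw[start:end + 1])
    missing = REQUIRED_KEYS - scores.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in REQUIRED_KEYS:
        if not 1 <= scores[key] <= 5:
            raise ValueError(f"{key}={scores[key]} outside the 1-5 scale")
    return scores

raw = ('The response is accurate and complete but brusque in tone. '
       '{"accuracy": 5, "completeness": 4, "tone": 2, "overall": 3}')
scores = parse_judge_output(raw)
```

Rejecting malformed output (rather than guessing) keeps bad parses from silently polluting aggregate scores.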

Known Biases — and How to Mitigate Them

Position bias

In pairwise, the judge favours whichever response appears first (A over B) regardless of quality. Effect size: ~5–15% of decisions.

Mitigation: Evaluate both orderings (A,B) and (B,A); only count confident wins where the winner is consistent.
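
The both-orderings check can be wrapped in a small helper. A sketch, where `judge` stands for any callable implementing your pairwise prompt and returning "first", "second", or "tie" for the pair as presented:

```python
def debiased_winner(judge, resp_a: str, resp_b: str) -> str:
    """Run the pairwise judge in both orders; count only consistent wins.

    `judge(first, second)` returns "first", "second", or "tie" for the
    responses in the order they were shown.
    """
    ab = judge(resp_a, resp_b)   # A shown in the first position
    ba = judge(resp_b, resp_a)   # B shown in the first position
    if ab == "first" and ba == "second":
        return "A"               # A wins regardless of position
    if ab == "second" and ba == "first":
        return "B"               # B wins regardless of position
    return "tie"                 # inconsistent verdicts: position bias suspected

# A maximally position-biased judge always picks whichever response is
# shown first; the both-orderings check neutralises it to a tie.
biased = lambda first, second: "first"
print(debiased_winner(biased, "response 1", "response 2"))  # → tie
```
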

Verbosity bias

Longer responses are rated higher even when the shorter response is more accurate. The judge confuses length with quality.

Mitigation: Include explicit rubric guidance like "do not reward length — rate on accuracy and completeness, not word count."

Self-preference bias

A model used as judge disproportionately favours outputs from models in its own family: e.g., GPT-4 judging GPT-4o against Claude tends to favour GPT-4o.

Mitigation: Use a different model family as judge, or use multiple judge models and take majority vote.

Sycophancy / authority bias

If the judge knows or infers which model produced a response, it may favour the "prestigious" model. Also, judges can agree with confident-sounding wrong answers.

Mitigation: Anonymise responses in the prompt; do not reveal model names to the judge.

Calibrating Against Human Labels

LLM-as-Judge is only as good as its correlation with human judgement on your task. Before deploying it as an evaluation pipeline, calibrate:

1. Sample 200–500 outputs: representative of your task distribution
2. Human annotation: 2–3 annotators per sample; compute inter-annotator agreement
3. LLM-judge annotation: run the same samples through your judge pipeline
4. Compare agreement: Cohen's kappa or Pearson correlation; target >0.7
5. Iterate judge prompt: if agreement is low, refine criteria or scoring anchors
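
Cohen's kappa is simple enough to compute directly. A sketch for two raters scoring the same samples; the score lists are hypothetical, for illustration only:

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[c] * j_counts[c]
                   for c in set(human) | set(judge)) / n ** 2
    return (observed - expected) / (1 - expected)

human_scores = [5, 4, 4, 2, 3, 5, 1, 4, 2, 3]   # annotator labels (illustrative)
judge_scores = [5, 4, 3, 2, 3, 5, 2, 4, 2, 3]   # judge labels on same samples
kappa = cohens_kappa(human_scores, judge_scores)
print(f"kappa = {kappa:.2f}")  # → kappa = 0.74, above the 0.7 target
```

Raw percent agreement here is 0.80, but kappa discounts the agreement two raters would reach by chance, which is why it is preferred for calibration.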

Do not skip calibration — an uncalibrated judge can systematically reward the wrong things

Good LLM-judge systems achieve 80–90% agreement with expert human raters on well-defined criteria. Agreement below 70% indicates the judge prompt or criteria need revision.

Which Model to Use as Judge

Judge model | Cost | Reliability | Notes
GPT-4o / Claude Sonnet 4.6 | ~$0.001–0.003 per eval | High (strong rubric adherence) | Good default; use different family from model being evaluated
o3 / Claude Opus | ~$0.01–0.05 per eval | Very high | For high-stakes evals; overkill for routine regression
Gemini 2.5 Flash | ~$0.0003 per eval | Medium-high | Cost-effective for high-volume evaluation pipelines
Llama 3.1 70B (self-hosted) | ~$0.00005 per eval | Medium | Cheapest option; lower rubric adherence; fine-tune for specific criteria

When Not to Use LLM-as-Judge

Situations where LLM-as-Judge is unreliable

  • Factual accuracy without reference: Judge models hallucinate too — they will confidently score a wrong answer as correct if it sounds plausible
  • Evaluating tasks the judge can't do: If the judge can't solve the task itself, it can't evaluate solutions (e.g., evaluating o3-level proofs with a weaker judge)
  • Domain requiring expert knowledge: Medical diagnoses, legal reasoning, specialist engineering — the judge lacks domain knowledge to reliably score these
  • Subtle safety violations: Sophisticated jailbreaks or nuanced policy violations may not be detected by a judge without specific safety training
  • High-stakes decisions: Never use LLM-as-Judge as the sole gating signal for production model updates; always include human review in the loop

Production LLM-as-Judge Pipeline

In a production evaluation system, LLM-as-Judge is one layer of a broader eval stack:

Tier 1: Automated fast checks (every run)
  • Format validation: schema, length, structure
  • Rule-based safety: blocklist, regex
  • LLM-as-Judge (Gemini Flash): ~$0.0003/eval, high volume

Tier 2: LLM-judge deep eval (nightly / on regression)
  • GPT-4o / Claude judge: ~$0.002/eval, calibrated rubric
  • Pairwise vs baseline: current vs last release

Tier 3: Human review (sampled / on anomaly)
  • Expert annotation: 5–10% sample or triggered by judge flags
  • Calibration update: feed back into judge prompt refinement

Stack layers by cost — fast automated checks for everything, deep eval for regressions, humans for calibration
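
The automated tiers can be expressed as a single gate that stops at the first hard failure and escalates only when a cheaper check is inconclusive. A sketch under stated assumptions: the judge callables, the `answer`-key schema, and the score-4 threshold are all illustrative, and human review sits outside this function:

```python
import json

def format_check(output: str) -> bool:
    """Tier 1: cheap structural gate, e.g. valid JSON with an 'answer' key."""
    try:
        return "answer" in json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False

def safety_check(output: str) -> bool:
    """Tier 1: rule-based blocklist (stand-in for a real safety filter)."""
    blocklist = ("rm -rf", "DROP TABLE")
    return not any(term in output for term in blocklist)

def evaluate(output: str, cheap_judge, deep_judge=None) -> dict:
    """Run tiers in cost order, stopping at the first hard failure.

    `cheap_judge` and `deep_judge` are callables returning a 1-5 score;
    only borderline outputs are escalated to the expensive judge.
    """
    if not format_check(output):
        return {"tier": "format", "passed": False}
    if not safety_check(output):
        return {"tier": "safety", "passed": False}
    score = cheap_judge(output)              # high-volume judge, every run
    if score >= 4 or deep_judge is None:
        return {"tier": "cheap_judge", "passed": score >= 4, "score": score}
    deep = deep_judge(output)                # calibrated, more expensive judge
    return {"tier": "deep_judge", "passed": deep >= 4, "score": deep}

result = evaluate('{"answer": "42"}',
                  cheap_judge=lambda o: 3,   # borderline score: escalate
                  deep_judge=lambda o: 4)
```

Ordering the gates by cost means most outputs never touch the expensive judge, which is what makes the stack affordable at volume.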

Checklist: Do You Understand This?

  • What are the two main LLM-as-Judge scoring modes, and when would you choose each?
  • Name three known biases in LLM-as-Judge and a mitigation strategy for each.
  • What does calibration mean in this context, and what agreement threshold should you target?
  • Why is LLM-as-Judge unreliable for evaluating factual accuracy without a reference answer?
  • Which five elements should a well-designed judge prompt include?
  • Describe a three-tier evaluation pipeline using LLM-as-Judge at different levels.