LLM-as-Judge Evaluation
Human evaluation of LLM outputs is expensive, slow, and doesn't scale. Automated metrics like BLEU and ROUGE correlate poorly with human judgement for open-ended tasks. LLM-as-Judge is the middle path: use a capable language model to evaluate other models' outputs at scale. Done right, it correlates strongly with human preference. Done wrong, it amplifies the judge model's biases as ground truth.
Why LLM-as-Judge Emerged
Traditional evaluation approaches break down for modern LLM systems:
Human annotation
Expensive ($0.05–$1 per label), slow (hours to days), hard to scale, inconsistent across annotators
String-match metrics (BLEU/ROUGE)
Measure overlap, not quality: a correct paraphrase can score near zero, while a verbatim wrong answer scores high. Useless for open-ended generation.
LLM-as-Judge
Cheap (~$0.001 per eval), fast (seconds), scalable to millions of outputs, ~80–90% agreement with human preference on well-designed rubrics
Two Scoring Modes
Pairwise is more reliable but 2× the cost; absolute scoring is faster and more scalable
| Mode | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Absolute (pointwise) | Judge scores one response on a 1–5 or 1–10 rubric | Scalable to large sets; produces ordinal scores for regression | Scores drift without calibration; sensitive to rubric wording |
| Pairwise | Judge picks the better of two responses (A or B) | Stronger agreement with human preference; easier to reason about | 2× the cost; O(n²) comparisons for full rankings; position bias |
| Reference-based | Judge compares response against a ground-truth reference answer | Factual accuracy easier to assess with reference | Needs gold answers; judge may over-favour verbatim similarity |
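To illustrate the O(n²) cost of pairwise rankings, here is a minimal sketch of a round-robin tournament that turns pairwise verdicts into a ranking by win rate. The `judge` callable is a stand-in for a real judge-model call; its name and "A"/"B" verdict convention are assumptions for this example.

```python
from itertools import combinations
from collections import defaultdict

def rank_by_win_rate(models, judge):
    """Round-robin pairwise evaluation: n*(n-1)/2 comparisons.

    `judge(a, b)` is a hypothetical callable returning "A" if the first
    argument wins, "B" otherwise. Returns models sorted by win rate.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b in combinations(models, 2):
        winner = a if judge(a, b) == "A" else b
        wins[winner] += 1
        games[a] += 1
        games[b] += 1
    return sorted(models, key=lambda m: wins[m] / games[m], reverse=True)

# Toy judge that always prefers the lexicographically later model name.
ranking = rank_by_win_rate(["m1", "m2", "m3"], lambda a, b: "B" if b > a else "A")
```

For large model sets, practitioners often avoid the full round-robin by sampling comparisons or fitting a Bradley-Terry model to a subset.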
Designing the Judge Prompt
The judge prompt is the most critical component. Vague criteria produce inconsistent scores. A well-designed judge prompt has:
- Role: Establish the judge as an expert evaluator for the specific domain. E.g.: 'You are an expert software engineer evaluating Python code quality.'
- Explicit criteria: List each dimension to evaluate and what constitutes each score level. Ambiguity in criteria = noise in scores.
- Score anchors: Define what 1, 3, and 5 mean concretely (not just 'bad, ok, good'). Anchor with examples where possible.
- Reasoning before scoring: Instruct the judge to reason first, then output the score: 'First, analyse the response against each criterion. Then output your score.' Reduces score variance by ~30%.
- Structured output: Require JSON output: { reasoning: string, score: number }. Parseable output is essential for automated pipelines.
Example judge prompt structure:
You are an expert evaluator for customer support responses.
Evaluate the response on three criteria:
1. Accuracy (1–5): Does it correctly answer the question?
2. Completeness (1–5): Does it address all parts of the question?
3. Tone (1–5): Is it professional and empathetic?
First write your analysis for each criterion.
Then output JSON: { "accuracy": N, "completeness": N, "tone": N, "overall": N }
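Because the judge writes free-form reasoning before the JSON, the pipeline needs to extract and validate the scores. A minimal parser, assuming the criterion names from the example prompt above (the function name is illustrative):

```python
import json
import re

def parse_judge_output(reply: str) -> dict:
    """Extract the JSON object from a judge reply that contains
    free-form analysis followed by a JSON block, then validate it."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    scores = json.loads(match.group(0))
    # Validate against the rubric: all criteria present, scores in 1-5.
    for key in ("accuracy", "completeness", "tone", "overall"):
        if key not in scores:
            raise ValueError(f"missing criterion: {key}")
        if not 1 <= scores[key] <= 5:
            raise ValueError(f"score out of range: {key}={scores[key]}")
    return scores

reply = ('The response answers the question directly... '
         '{"accuracy": 5, "completeness": 4, "tone": 5, "overall": 5}')
scores = parse_judge_output(reply)
```

Rejecting malformed replies (and retrying the judge call) is usually safer than silently defaulting a score.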
Known Biases and How to Mitigate Them
Position bias
In pairwise mode, the judge favours whichever response appears first (A over B) regardless of quality. Effect size: ~5–15% of decisions.
Mitigation: Evaluate both orderings (A,B) and (B,A); only count confident wins where the winner is consistent.
Verbosity bias
Longer responses are rated higher even when the shorter response is more accurate. The judge confuses length with quality.
Mitigation: Include explicit rubric guidance like "do not reward length; rate on accuracy and completeness, not word count."
Self-preference bias
A model used as judge disproportionately favours outputs from models in its own family. GPT-4 judging GPT-4o vs Claude will favour GPT-4o.
Mitigation: Use a different model family as judge, or use multiple judge models and take majority vote.
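The majority-vote mitigation is a one-liner over the per-judge verdicts; a sketch, assuming each judge emits an "A"/"B" verdict string (an odd panel size avoids most ties):

```python
from collections import Counter

def majority_vote(verdicts: list) -> str:
    """Verdict backed by the majority of judge models, or "tie" when
    the top two verdicts are equally common. Using judges from
    different model families dilutes any one family's self-preference."""
    counts = Counter(verdicts)
    (top, top_n), *rest = counts.most_common()
    if rest and rest[0][1] == top_n:
        return "tie"
    return top
```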
Sycophancy / authority bias
If the judge knows or infers which model produced a response, it may favour the "prestigious" model. Also, judges can agree with confident-sounding wrong answers.
Mitigation: Anonymise responses in the prompt; do not reveal model names to the judge.
Calibrating Against Human Labels
LLM-as-Judge is only as good as its correlation with human judgement on your task. Before deploying it as an evaluation pipeline, calibrate: collect human labels on a representative sample, run the judge on the same sample, measure agreement, and revise the judge prompt until agreement is acceptable.
Do not skip calibration: an uncalibrated judge can systematically reward the wrong things
Good LLM-judge systems achieve 80–90% agreement with expert human raters on well-defined criteria. Agreement below 70% indicates the judge prompt or criteria need revision.
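Measuring that agreement is straightforward; a sketch of raw percent agreement plus Cohen's kappa, which corrects for chance and is the safer metric when one label dominates (function names are illustrative):

```python
def agreement(judge_labels: list, human_labels: list) -> float:
    """Raw percent agreement between judge and human labels."""
    assert len(judge_labels) == len(human_labels) and judge_labels
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def cohens_kappa(judge_labels: list, human_labels: list) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e), where p_e
    is the agreement expected from each rater's label frequencies."""
    n = len(judge_labels)
    p_o = agreement(judge_labels, human_labels)
    labels = set(judge_labels) | set(human_labels)
    p_e = sum(
        (judge_labels.count(l) / n) * (human_labels.count(l) / n)
        for l in labels
    )
    return (p_o - p_e) / (1 - p_e)
```

If 90% of your human labels are "pass", a judge that always says "pass" hits 90% raw agreement but kappa near zero; report both.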
Which Model to Use as Judge
| Judge model | Cost | Reliability | Notes |
|---|---|---|---|
| GPT-4o / Claude Sonnet 4.6 | ~$0.001–0.003 per eval | High; strong rubric adherence | Good default; use a different family from the model being evaluated |
| o3 / Claude Opus | ~$0.01–0.05 per eval | Very high | For high-stakes evals; overkill for routine regression |
| Gemini 2.5 Flash | ~$0.0003 per eval | Medium-high | Cost-effective for high-volume evaluation pipelines |
| Llama 3.1 70B (self-hosted) | ~$0.00005 per eval | Medium | Cheapest option; lower rubric adherence; fine-tune for specific criteria |
When Not to Use LLM-as-Judge
Situations where LLM-as-Judge is unreliable
- Factual accuracy without reference: Judge models hallucinate too; they will confidently score a wrong answer as correct if it sounds plausible
- Evaluating tasks the judge can't do: If the judge can't solve the task itself, it can't evaluate solutions (e.g., evaluating o3-level proofs with a weaker judge)
- Domain requiring expert knowledge: Medical diagnoses, legal reasoning, specialist engineering; the judge lacks the domain knowledge to reliably score these
- Subtle safety violations: Sophisticated jailbreaks or nuanced policy violations may not be detected by a judge without specific safety training
- High-stakes decisions: Never use LLM-as-Judge as the sole gating signal for production model updates; always include human review in the loop
Production LLM-as-Judge Pipeline
In a production evaluation system, LLM-as-Judge is one layer of a broader eval stack:
Stack layers by cost: fast automated checks for everything, deep eval for regressions, humans for calibration
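The layering can be sketched as a simple router. The tier functions below are stubs whose names and thresholds are assumptions; in a real pipeline, tier 2 would call the judge model with the rubric prompt, and tier 3 would queue the output for human review.

```python
def tier1_checks(output: str) -> bool:
    """Tier 1: cheap deterministic checks run on every output
    (non-empty, length bound; add format/regex checks as needed)."""
    return bool(output.strip()) and len(output) < 4000

def tier2_llm_judge(output: str) -> int:
    """Tier 2: LLM-as-Judge on outputs passing tier 1. Stubbed here
    with a fixed score; in practice this is a judge-model call."""
    return 4  # placeholder 1-5 rubric score

def evaluate(output: str, score_threshold: int = 3) -> str:
    """Route one output through the tiers; tier 3 (human review)
    is reached only when the judge score falls below threshold."""
    if not tier1_checks(output):
        return "fail:automated"
    if tier2_llm_judge(output) < score_threshold:
        return "escalate:human-review"
    return "pass"
```

The economics follow directly: the deterministic tier filters out obvious failures for free, so the judge-model spend is concentrated on plausible outputs, and humans see only the contested slice.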
Checklist: Do You Understand This?
- What are the two main LLM-as-Judge scoring modes, and when would you choose each?
- Name three known biases in LLM-as-Judge and a mitigation strategy for each.
- What does calibration mean in this context, and what agreement threshold should you target?
- Why is LLM-as-Judge unreliable for evaluating factual accuracy without a reference answer?
- Which five elements should a well-designed judge prompt include?
- Describe a three-tier evaluation pipeline using LLM-as-Judge at different levels.