
LLM-as-Judge Evaluation

Human evaluation of LLM outputs is expensive, slow, and doesn't scale. Automated metrics like BLEU and ROUGE correlate poorly with human judgement for open-ended tasks. LLM-as-Judge is the middle path: use a capable language model to evaluate other models' outputs at scale. Done right, it correlates strongly with human preference. Done wrong, it amplifies the judge model's biases as ground truth.

Why LLM-as-Judge Emerged

Traditional evaluation approaches break down for modern LLM systems:

Human annotation

Expensive ($0.05–$1 per label), slow (hours to days), hard to scale, inconsistent across annotators

String-match metrics (BLEU/ROUGE)

They measure token overlap, not quality. A paraphrase can score near zero; a verbatim wrong answer scores high. Useless for open-ended generation.
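
A toy overlap score makes the failure mode concrete. This is a simplified unigram F1 in the spirit of ROUGE-1, not the full metric:

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style score: F1 over unigram overlap (not the full metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the meeting was postponed until next week"
paraphrase = "they pushed back the gathering by seven days"
verbatim_wrong = "the meeting was postponed until next year"

print(unigram_f1(paraphrase, reference))      # low (~0.13) despite same meaning
print(unigram_f1(verbatim_wrong, reference))  # high (~0.86) despite wrong answer
```

The paraphrase shares only "the" with the reference, so overlap metrics punish it; the wrong answer shares six of seven tokens and is rewarded.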

LLM-as-Judge

Cheap (~$0.001 per eval), fast (seconds), scalable to millions of outputs, ~80–90% agreement with human preference on well-designed rubrics

Two Scoring Modes

Absolute scoring (single output)
  • System prompt + rubric: define criteria explicitly
  • Response to evaluate: the output being judged
  • Score + reasoning: 1–5 scale with explanation

Pairwise scoring (two outputs)
  • System prompt + criteria: what to optimise for
  • Response A vs Response B: anonymous, randomised order
  • Winner + explanation: A wins / B wins / tie

Pairwise is more reliable but 2× the cost; absolute scoring is faster and more scalable.

Mode | How it works | Strengths | Weaknesses
Absolute (pointwise) | Judge scores one response on a 1–5 or 1–10 rubric | Scalable to large sets; produces ordinal scores for regression | Scores drift without calibration; sensitive to rubric wording
Pairwise | Judge picks the better of two responses (A or B) | Stronger agreement with human preference; easier to reason about | 2× cost; O(n²) comparisons for rankings; position bias
Reference-based | Judge compares response against a ground-truth reference answer | Factual accuracy easier to assess with reference | Needs gold answers; judge may over-favour verbatim similarity

Designing the Judge Prompt

The judge prompt is the most critical component. Vague criteria produce inconsistent scores. A well-designed judge prompt has:

1. Role definition: Establish the judge as an expert evaluator for the specific domain. E.g.: 'You are an expert software engineer evaluating Python code quality.'

2. Explicit criteria: List each dimension to evaluate and what constitutes each score level. Ambiguity in criteria = noise in scores.

3. Scoring scale with anchors: Define what 1, 3, and 5 mean concretely (not just 'bad, ok, good'). Anchor with examples where possible.

4. Chain-of-thought before score: Instruct the judge to reason first, then output the score. 'First, analyse the response against each criterion. Then output your score.' Reduces score variance by ~30%.

5. Structured output format: Require JSON output: { reasoning: string, score: number }. Parseable output is essential for automated pipelines.

Example judge prompt structure:

You are an expert evaluator for customer support responses.

Evaluate the response on three criteria:

1. Accuracy (1–5): Does it correctly answer the question?

2. Completeness (1–5): Does it address all parts of the question?

3. Tone (1–5): Is it professional and empathetic?

First write your analysis for each criterion.

Then output JSON: { "accuracy": N, "completeness": N, "tone": N, "overall": N }
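
Once the judge returns its completion, the pipeline must extract and validate that JSON. A minimal parsing sketch, assuming the judge call itself happens elsewhere; `parse_judge_output` and `REQUIRED_KEYS` are illustrative names, not a library API:

```python
import json

# Keys the rubric above asks the judge to emit, each on a 1-5 scale.
REQUIRED_KEYS = {"accuracy", "completeness", "tone", "overall"}

def parse_judge_output(raw: str) -> dict:
    """Extract and validate JSON scores from a judge completion.

    Judges often wrap the JSON in chain-of-thought text, so take the
    last {...} span rather than assuming the whole output is JSON.
    """
    start, end = raw.rfind("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in judge output")
    scores = json.loads(raw[start:end + 1])
    missing = REQUIRED_KEYS - scores.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in REQUIRED_KEYS:
        if not 1 <= scores[key] <= 5:
            raise ValueError(f"{key}={scores[key]} outside the 1-5 scale")
    return scores

raw = ('The response is accurate and complete but brusque in tone. '
       '{"accuracy": 5, "completeness": 4, "tone": 2, "overall": 3}')
scores = parse_judge_output(raw)
```

Rejecting malformed output (rather than guessing) keeps bad parses from silently polluting aggregate scores.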

Known Biases — and How to Mitigate Them

Position bias

In pairwise, the judge favours whichever response appears first (A over B) regardless of quality. Effect size: ~5–15% of decisions.

Mitigation: Evaluate both orderings (A,B) and (B,A); only count confident wins where the winner is consistent.
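
The both-orderings check can be wrapped in a small helper. A sketch, where `judge` stands for any callable implementing your pairwise prompt and returning "first", "second", or "tie" for the pair as presented:

```python
def debiased_winner(judge, resp_a: str, resp_b: str) -> str:
    """Run the pairwise judge in both orders; count only consistent wins.

    `judge(first, second)` returns "first", "second", or "tie" for the
    responses in the order they were shown.
    """
    ab = judge(resp_a, resp_b)   # A shown in the first position
    ba = judge(resp_b, resp_a)   # B shown in the first position
    if ab == "first" and ba == "second":
        return "A"               # A wins regardless of position
    if ab == "second" and ba == "first":
        return "B"               # B wins regardless of position
    return "tie"                 # inconsistent verdicts: position bias suspected

# A maximally position-biased judge always picks whichever response is
# shown first; the both-orderings check neutralises it to a tie.
biased = lambda first, second: "first"
print(debiased_winner(biased, "response 1", "response 2"))  # → tie
```
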

Verbosity bias

Longer responses are rated higher even when the shorter response is more accurate. The judge confuses length with quality.

Mitigation: Include explicit rubric guidance like "do not reward length — rate on accuracy and completeness, not word count."

Self-preference bias

A model used as judge disproportionately favours outputs from models in its own family: e.g., GPT-4 judging GPT-4o against Claude tends to favour GPT-4o.

Mitigation: Use a different model family as judge, or use multiple judge models and take majority vote.

Sycophancy / authority bias

If the judge knows or infers which model produced a response, it may favour the "prestigious" model. Also, judges can agree with confident-sounding wrong answers.

Mitigation: Anonymise responses in the prompt; do not reveal model names to the judge.

Calibrating Against Human Labels

LLM-as-Judge is only as good as its correlation with human judgement on your task. Before deploying it as an evaluation pipeline, calibrate:

1. Sample 200–500 outputs: representative of your task distribution
2. Human annotation: 2–3 annotators per sample; compute inter-annotator agreement
3. LLM-judge annotation: run the same samples through your judge pipeline
4. Compare agreement: Cohen's kappa or Pearson correlation; target >0.7
5. Iterate judge prompt: if agreement is low, refine criteria or scoring anchors
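
Cohen's kappa is simple enough to compute directly. A sketch for two raters scoring the same samples; the score lists are hypothetical, for illustration only:

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[c] * j_counts[c]
                   for c in set(human) | set(judge)) / n ** 2
    return (observed - expected) / (1 - expected)

human_scores = [5, 4, 4, 2, 3, 5, 1, 4, 2, 3]   # annotator labels (illustrative)
judge_scores = [5, 4, 3, 2, 3, 5, 2, 4, 2, 3]   # judge labels on same samples
kappa = cohens_kappa(human_scores, judge_scores)
print(f"kappa = {kappa:.2f}")  # → kappa = 0.74, above the 0.7 target
```

Raw percent agreement here is 0.80, but kappa discounts the agreement two raters would reach by chance, which is why it is preferred for calibration.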

Do not skip calibration — an uncalibrated judge can systematically reward the wrong things

Good LLM-judge systems achieve 80–90% agreement with expert human raters on well-defined criteria. Agreement below 70% indicates the judge prompt or criteria need revision.

Which Model to Use as Judge

Judge model | Cost | Reliability | Notes
GPT-4o / Claude Sonnet 4.6 | ~$0.001–0.003 per eval | High (strong rubric adherence) | Good default; use different family from model being evaluated
o3 / Claude Opus | ~$0.01–0.05 per eval | Very high | For high-stakes evals; overkill for routine regression
Gemini 2.5 Flash | ~$0.0003 per eval | Medium-high | Cost-effective for high-volume evaluation pipelines
Llama 3.1 70B (self-hosted) | ~$0.00005 per eval | Medium | Cheapest option; lower rubric adherence; fine-tune for specific criteria

When Not to Use LLM-as-Judge

Situations where LLM-as-Judge is unreliable

  • Factual accuracy without reference: Judge models hallucinate too — they will confidently score a wrong answer as correct if it sounds plausible
  • Evaluating tasks the judge can't do: If the judge can't solve the task itself, it can't evaluate solutions (e.g., evaluating o3-level proofs with a weaker judge)
  • Domain requiring expert knowledge: Medical diagnoses, legal reasoning, specialist engineering — the judge lacks domain knowledge to reliably score these
  • Subtle safety violations: Sophisticated jailbreaks or nuanced policy violations may not be detected by a judge without specific safety training
  • High-stakes decisions: Never use LLM-as-Judge as the sole gating signal for production model updates; always include human review in the loop

Production LLM-as-Judge Pipeline

In a production evaluation system, LLM-as-Judge is one layer of a broader eval stack:

Tier 1: Automated fast checks (every run)
  • Format validation: schema, length, structure
  • Rule-based safety: blocklist, regex
  • LLM-as-Judge (Gemini Flash): ~$0.0003/eval, high volume

Tier 2: LLM-judge deep eval (nightly / on regression)
  • GPT-4o / Claude judge: ~$0.002/eval, calibrated rubric
  • Pairwise vs baseline: current vs last release

Tier 3: Human review (sampled / on anomaly)
  • Expert annotation: 5–10% sample or triggered by judge flags
  • Calibration update: feed back into judge prompt refinement

Stack layers by cost — fast automated checks for everything, deep eval for regressions, humans for calibration
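
The automated tiers can be expressed as a single gate that stops at the first hard failure and escalates only when a cheaper check is inconclusive. A sketch under stated assumptions: the judge callables, the `answer`-key schema, and the score-4 threshold are all illustrative, and human review sits outside this function:

```python
import json

def format_check(output: str) -> bool:
    """Tier 1: cheap structural gate, e.g. valid JSON with an 'answer' key."""
    try:
        return "answer" in json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False

def safety_check(output: str) -> bool:
    """Tier 1: rule-based blocklist (stand-in for a real safety filter)."""
    blocklist = ("rm -rf", "DROP TABLE")
    return not any(term in output for term in blocklist)

def evaluate(output: str, cheap_judge, deep_judge=None) -> dict:
    """Run tiers in cost order, stopping at the first hard failure.

    `cheap_judge` and `deep_judge` are callables returning a 1-5 score;
    only borderline outputs are escalated to the expensive judge.
    """
    if not format_check(output):
        return {"tier": "format", "passed": False}
    if not safety_check(output):
        return {"tier": "safety", "passed": False}
    score = cheap_judge(output)              # high-volume judge, every run
    if score >= 4 or deep_judge is None:
        return {"tier": "cheap_judge", "passed": score >= 4, "score": score}
    deep = deep_judge(output)                # calibrated, more expensive judge
    return {"tier": "deep_judge", "passed": deep >= 4, "score": deep}

result = evaluate('{"answer": "42"}',
                  cheap_judge=lambda o: 3,   # borderline score: escalate
                  deep_judge=lambda o: 4)
```

Ordering the gates by cost means most outputs never touch the expensive judge, which is what makes the stack affordable at volume.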

Checklist: Do You Understand This?

  • What are the two main LLM-as-Judge scoring modes, and when would you choose each?
  • Name three known biases in LLM-as-Judge and a mitigation strategy for each.
  • What does calibration mean in this context, and what agreement threshold should you target?
  • Why is LLM-as-Judge unreliable for evaluating factual accuracy without a reference answer?
  • Which five elements should a well-designed judge prompt include?
  • Describe a three-tier evaluation pipeline using LLM-as-Judge at different levels.