
Replication & Benchmarking Practice

A large fraction of published ML results fail to replicate. The numbers are real — in the sense that someone ran the code and got that output — but they don't transfer to different codebases, different seeds, or slightly different evaluation protocols. Understanding why this happens, which benchmarks are more or less trustworthy, and how to design your own fair evaluations is essential for anyone working seriously with AI systems.

The Replication Crisis in ML

The ML replication crisis is well-documented but under-discussed in practitioner communities. Studies attempting to reproduce results from top ML conferences have found failure rates ranging from 30% to over 60%, depending on how strictly "replication" is defined. The causes fall into several categories:

Common replication failure causes

  • Missing hyperparameters: papers report final values but omit the search range and selection criterion
  • Cherry-picked seeds: results from the best of many random seeds, not the median
  • Undisclosed tricks: gradient clipping, warmup schedules, early stopping criteria that change results significantly
  • Evaluation protocol differences: few-shot vs zero-shot, temperature settings, prompt formatting
  • Dataset version drift: benchmark datasets get updated; results on v1 don't reproduce on v2
  • Compute differences: mixed precision, batch size, hardware — all affect final numbers
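The "cherry-picked seeds" failure mode above is easy to see with a simulation. This is a made-up illustration, not real training data: `run_experiment` stands in for a full training run whose final accuracy varies with the random seed.

```python
import random
import statistics

# Hypothetical stand-in for a training run: a base accuracy of 0.80 plus
# seed-dependent noise. Real seed variance comes from initialization,
# data ordering, and nondeterministic kernels.
def run_experiment(seed: int) -> float:
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)

scores = [run_experiment(s) for s in range(10)]
best = max(scores)                 # what a cherry-picked paper reports
mean = statistics.mean(scores)     # what a replicator will typically see
std = statistics.stdev(scores)

print(f"best seed:     {best:.3f}")
print(f"mean +/- std:  {mean:.3f} +/- {std:.3f}")
```

The best-of-ten number is always at least as high as the mean, so a paper reporting only its best seed will look better than any faithful replication.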

What good reproducibility looks like

  • Full training code released (not just inference)
  • Model checkpoints at intermediate stages
  • Exact hyperparameter config file committed alongside code
  • Multiple random seeds reported with mean and standard deviation
  • Hardware and software environment (CUDA version, library versions) documented
  • Evaluation scripts shared separately from training scripts
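The environment-documentation item in the checklist can be partly automated. A minimal sketch using only the standard library is shown below; in a real project you would extend the dictionary with library versions (e.g. the installed torch and CUDA versions) and commit the resulting JSON alongside the results.

```python
import json
import platform
import sys

# Capture the runtime environment alongside experimental results so a
# replication attempt can diff environments instead of guessing.
def environment_report() -> dict:
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Extend with library versions, e.g. "torch": torch.__version__
    }

report = environment_report()
print(json.dumps(report, indent=2))
```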

Papers with Code — The Reproducibility Layer

Papers with Code (paperswithcode.com) is the most useful single resource for assessing whether a result is real. It links papers to open implementations, tracks benchmark leaderboards with linked code, and surfaces the state of the art on hundreds of tasks with provenance for each number.

How to use Papers with Code effectively:

  • Search for a paper — if no code is linked, treat the result with extra skepticism
  • Check the benchmark leaderboard for the task — is the paper's claimed result consistent with community-tracked numbers?
  • Look at how many independent implementations exist for a method — more implementations = higher confidence the method works as described
  • Check whether linked code was authored by the original authors or an independent reimplementation
  • For benchmarks: look at the evaluation notes — many leaderboards track which protocol each result used

Benchmark Methodology — Why Numbers Vary

The same model can produce dramatically different benchmark numbers depending on evaluation choices. These are not rounding differences — they can shift results by 10–20 percentage points on widely-used benchmarks.

Protocol choices, their options, and their impact:

  • Few-shot prompting: 0-shot, 1-shot, 3-shot, 5-shot, or 10-shot. Results can differ by 10–20 points on the same task; always state which is used.
  • Chain-of-thought: direct answer vs CoT reasoning before the answer. CoT dramatically improves performance on reasoning tasks; comparisons must match.
  • Scoring method: exact string match, likelihood-based, or LLM-as-judge. LLM-judge scores are typically higher than exact match and not directly comparable.
  • Prompt format: chat format, completion format, presence of a system prompt. Instruction-tuned models can score 10+ points differently depending on format.
  • Answer extraction: first token, parsed final answer, or regex over the full output. Different extraction methods produce different numbers even with identical outputs.
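The answer-extraction effect is concrete enough to demonstrate directly. In this made-up example, one model output is scored with two extraction methods and they disagree, even though the output itself never changed:

```python
import re

# One model output, scored two ways. The text mentions a wrong option ("B")
# before stating its final answer ("C").
output = "The answer could be B, but considering the units, the answer is C."
gold = "C"

# Method 1: take the first standalone A-D letter anywhere in the output.
first_letter = re.search(r"\b([ABCD])\b", output).group(1)

# Method 2: parse the last explicit "the answer is X" statement.
final_answer = re.findall(r"answer is ([ABCD])", output)[-1]

print(first_letter == gold)  # False: first-match extraction grabs "B"
print(final_answer == gold)  # True: final-answer parsing grabs "C"
```

Two evaluation harnesses differing only in this regex would report different accuracies for identical model behavior, which is why extraction logic must be held fixed across all compared systems.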

MMLU — The Saturated Standard

MMLU (Massive Multitask Language Understanding) covers 57 academic subjects in multiple-choice format, from high school mathematics to professional law and medicine. It was the dominant benchmark for frontier model comparisons from 2021 through 2024.

What MMLU measures

  • Broad knowledge recall across academic domains
  • Multiple-choice elimination strategies
  • At high performance levels: ability to reason through unfamiliar domain questions

Why MMLU is limited

  • Data contamination: MMLU questions are widely reproduced on the internet; models may have memorized answers
  • Performance has saturated: frontier models score 85–90%, leaving little room to differentiate
  • Multiple-choice format artificially inflates scores relative to open-ended generation of the same knowledge
  • Some questions have known errors or ambiguous correct answers
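Data contamination can be screened for with n-gram overlap checks against training data, in the spirit of the 13-gram filters described in several LLM training reports. The sketch below uses made-up documents and a made-up question; a real check would stream over the full training corpus.

```python
# Flag a benchmark question as contaminated if any n-gram from it also
# appears in a training document. Whitespace tokenization keeps the
# sketch simple; production filters normalize punctuation and casing.
def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, training_doc: str, n: int = 8) -> bool:
    return bool(ngrams(question, n) & ngrams(training_doc, n))

question = ("which of the following is a primary function of the "
            "mitochondria in eukaryotic cells")
clean_doc = "the cell nucleus stores genetic material and coordinates activity"
leaked_doc = ("quiz answers: which of the following is a primary function "
              "of the mitochondria in eukaryotic cells answer c")

print(is_contaminated(question, clean_doc))   # False
print(is_contaminated(question, leaked_doc))  # True
```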

SWE-bench — Task-Based Evaluation

SWE-bench evaluates models on real software engineering tasks drawn from GitHub issues in popular open-source Python repositories. Each task provides a repository, a failing test suite, and an issue description; the model must produce a patch that makes the tests pass. This makes SWE-bench significantly harder to game than multiple-choice formats.

SWE-bench variants:

  • SWE-bench (full): 2,294 tasks from 12 repositories; broad but includes tasks that are ambiguous or require environment-specific setup
  • SWE-bench Lite: 300 carefully filtered tasks; more reproducible evaluation, lower cost
  • SWE-bench Verified: 500 tasks reviewed by human annotators to confirm problem specifications are unambiguous; considered the most reliable subset for fair comparison

Frontier model scores on SWE-bench Verified have risen quickly: from under 5% for early submissions to 40–50%+ for the best agent scaffolds by early 2025. Compare results only on the same variant using the same scaffold.
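SWE-bench's pass/fail criterion is simple: a task counts as resolved only if the candidate patch makes the previously failing tests pass. The real harness applies a git patch and runs the repository's test suite in a controlled environment; the toy sketch below only mimics that control flow with in-memory "code", and the `slugify` task is invented for illustration.

```python
# Toy illustration of SWE-bench's resolved-if-tests-pass criterion.
# The "issue" says slugify should lowercase its input; the failing test
# encodes that requirement.
def run_tests(module: dict) -> bool:
    return module["slugify"]("Hello World") == "hello-world"

buggy = {"slugify": lambda s: s.replace(" ", "-")}            # keeps case
patched = {"slugify": lambda s: s.lower().replace(" ", "-")}  # candidate fix

print("before patch:", run_tests(buggy))    # False: task unresolved
print("after patch:", run_tests(patched))   # True: task resolved
```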

GPQA — Reasoning Under Expert Pressure

GPQA (Graduate-Level Google-Proof Q&A) consists of questions written by domain experts in biology, chemistry, and physics — questions that are intentionally designed so that web search alone cannot answer them. Non-expert humans with web access score around 34%; domain experts score around 65%; frontier models as of 2025 score in the 50–75% range.

GPQA is a more reliable benchmark than MMLU for measuring reasoning capability because: it is too hard to memorize (questions are new and expert-written), it resists web search (answers require integrating multiple specialized concepts), and it hasn't saturated yet. It is the primary benchmark used to detect reasoning capability in o1/o3-class models.

Chatbot Arena — Human Preference

Chatbot Arena (lmarena.ai, formerly LMSYS Chatbot Arena) takes a fundamentally different approach: instead of objective metrics, it measures human preference through pairwise comparison. Users have real conversations with two anonymous models and vote on which response was better. Elo-style ratings are computed from over one million pairwise votes.
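The rating mechanics behind pairwise-comparison leaderboards follow the standard Elo update (Arena has described fitting a Bradley-Terry model over all votes; plain sequential Elo is shown here for intuition only):

```python
# Standard Elo: the expected score follows a logistic curve in the rating
# gap, and each result moves both ratings by K times the surprise.
def expected_score(r_a: float, r_b: float) -> float:
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two equally rated models: the winner gains exactly what the loser loses.
r_a, r_b = elo_update(1200.0, 1200.0, a_won=True)
print(round(r_a), round(r_b))  # 1216 1184
```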

Why Arena scores matter

  • Human preference is ultimately what matters for product applications
  • Tasks are user-generated, not benchmark tasks — more representative of real use
  • Anonymous models eliminate hype bias (users don't know which model they're rating)
  • Large sample size (~1M+ votes) makes the rankings statistically stable
  • Currently considered the gold standard for relative model quality ranking

Arena limitations

  • Preference is not accuracy — users can prefer a confident wrong answer to a hedged correct one
  • User population is self-selected (mostly English-speaking tech-savvy users)
  • Task distribution skews toward coding and creative writing; medical or legal tasks underrepresented
  • Elo rating confidence intervals overlap for similarly-ranked models — fine-grained ranking differences are often not statistically significant

Running Your Own Benchmark

For production decisions — which model to use, whether a fine-tuned model is better — you should run your own evaluation on your specific task and data. General benchmarks tell you about average performance across a population of tasks, not about your task.

A minimal honest evaluation protocol:

  1. Define the task precisely: what inputs, what outputs, what constitutes success
  2. Collect a representative eval set: sample from your real distribution, not a convenient subset; aim for at least 100 examples for statistical stability
  3. Freeze the eval set: do not look at it while tuning; hold it out until final comparison
  4. Define the metric before running: exact match, F1, ROUGE, human score — fix this in advance
  5. Run all conditions with the same protocol: same prompts, same extraction logic, same scoring
  6. Report variance: run at least 3 seeds or 3 eval passes; report mean ± std, not just the best run
  7. Do not report only the winning condition: show the full comparison table
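Steps 5 and 6 of the protocol can be sketched as a tiny harness. Everything here is a placeholder: `model_answer` stands in for a real model call, and the one-question eval set stands in for your frozen set of 100+ examples.

```python
import statistics

# Hypothetical stand-in for a model API call; returns canned answers so the
# sketch is self-contained.
def model_answer(model: str, question: str, pass_idx: int) -> str:
    canned = {"model-a": "4", "model-b": "5"}
    return canned[model]

# Step 3: the eval set is frozen before any prompt tuning.
EVAL_SET = [("What is 2 + 2?", "4")]

# Step 5: identical prompts, extraction, and scoring for every condition.
def score(model: str, pass_idx: int) -> float:
    hits = sum(model_answer(model, q, pass_idx) == gold for q, gold in EVAL_SET)
    return hits / len(EVAL_SET)

# Step 6: at least 3 passes, reported as mean +/- std for every condition.
for model in ("model-a", "model-b"):
    runs = [score(model, i) for i in range(3)]
    print(f"{model}: {statistics.mean(runs):.2f} "
          f"+/- {statistics.pstdev(runs):.2f}")
```

Printing both models' numbers, not just the winner's, is step 7: the full comparison table is the result.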

Checklist: Do You Understand This?

  • Name three common causes of ML replication failures. Which do you think is hardest to address?
  • Why can the same model produce benchmark numbers that differ by 15+ points depending on evaluation protocol? Give a concrete example of a protocol choice that causes this.
  • What is data contamination in the context of MMLU, and why does it limit the benchmark's value for frontier models?
  • What makes SWE-bench Verified a more reliable benchmark than MMLU for comparing coding capability?
  • Why is Chatbot Arena considered the gold standard for relative model ranking, and what are two reasons it shouldn't be used as the sole evaluation?
  • If you were choosing between two LLMs for a customer support application, describe the evaluation protocol you would design, including how you would prevent your eval set from leaking into your prompt-tuning process.