Data Curation at Scale
"Data is the new oil" understates the case for language model pretraining. At frontier scale, the difference between a mediocre and a state-of-the-art model is often not architecture or training compute — it is data quality. A model trained on 10 trillion tokens of carefully filtered, deduplicated, and domain-balanced text will substantially outperform the same architecture trained on 10 trillion tokens of raw web crawl noise. This page covers how modern pretraining datasets are built, the techniques used to improve their quality, and the hidden data decisions that determine what a model will and will not be good at.
Scale of Pretraining Data
Modern frontier models train on data volumes that were considered implausible as recently as 2021. GPT-3 (2020) trained on roughly 300 billion tokens. LLaMA 1 (2023) trained on 1.4 trillion tokens. LLaMA 3 (2024) trained on 15 trillion tokens. Frontier models from Google, OpenAI, and Anthropic are believed to train on 20–100 trillion tokens, though exact figures are not publicly disclosed.
At this scale, no human-curated dataset is large enough. All major pretraining pipelines draw predominantly from automatically scraped web content, supplemented by higher-quality sources at elevated sampling rates.
| Dataset | Size (approx.) | Source | Notable for |
|---|---|---|---|
| Common Crawl | ~70 PB raw; trillions of tokens filtered | Web crawl (snapshots since 2008) | Largest available corpus; highly noisy without filtering |
| C4 | ~750 GB text | Cleaned Common Crawl (T5 project) | Early high-quality filtered web corpus; widely benchmarked |
| The Pile | ~825 GB / ~300B tokens | EleutherAI — 22 curated sub-datasets | Diverse domain mix: books, code, arXiv, PubMed, GitHub |
| FineWeb | ~15T tokens (FineWeb-Edu: ~1.3T) | HuggingFace — filtered Common Crawl | Open, well-documented pipeline; educational quality classifier |
| DCLM | ~3.8T tokens | DataComp for LMs (academic consortium) | Systematic filtering ablations; open weights trained on it |
| StarCoder data | ~250B tokens | The Stack — BigCode project | Permissively licensed code from 80+ programming languages |
Quality Filtering
Raw web crawl contains spam, boilerplate, duplicate navigation menus, adult content, hate speech, low-information pages, and vast quantities of machine-generated SEO content. Quality filtering removes this noise before training. Two broad approaches are usually applied in sequence.
Heuristic Filters
- Language detection: fastText or langdetect — discard non-target-language documents
- Length thresholds: minimum and maximum character or word count per document
- Perplexity filtering: score documents with a small KenLM model trained on clean text; discard high-perplexity documents (likely noise or foreign language fragments)
- Repetition removal: discard documents where the same line or paragraph appears many times (boilerplate templates, cookie notices)
- Punctuation/symbol ratio: discard documents with abnormal fractions of non-alphabetic characters
- Blocklist filtering: URL blocklists for known spam and adult domains; word-level toxicity filters
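A few of these heuristics can be sketched in a single document-level filter. This is an illustrative subset only; the thresholds below are made-up placeholders, not values from any published pipeline, and `passes_heuristics` is a hypothetical function name.

```python
from collections import Counter

def passes_heuristics(doc: str,
                      min_words: int = 50,
                      max_dup_line_frac: float = 0.3,
                      max_symbol_frac: float = 0.3) -> bool:
    """Apply length, repetition, and symbol-ratio heuristics to one document.

    Thresholds are illustrative placeholders for demonstration purposes.
    """
    # Length threshold: very short pages rarely carry useful signal
    if len(doc.split()) < min_words:
        return False

    # Repetition removal: fraction of lines that are exact repeats
    # (catches boilerplate templates and cookie notices)
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if lines:
        dup_lines = sum(c - 1 for c in Counter(lines).values() if c > 1)
        if dup_lines / len(lines) > max_dup_line_frac:
            return False

    # Symbol ratio: abnormal fraction of non-alphanumeric, non-space
    # characters suggests markup debris or spam
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(doc), 1) > max_symbol_frac:
        return False

    return True
```

In a real pipeline each heuristic is usually a separate, independently tunable stage with per-stage rejection statistics, rather than one combined predicate.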
Classifier-Based Filtering
A small classifier — typically a fastText or logistic regression model trained on a few thousand human-labelled examples of good vs. bad web text — scores each document. Documents below a threshold are discarded.
The LLaMA 1 pipeline trained a linear classifier to keep Common Crawl pages resembling those cited as references on Wikipedia; C4, by contrast, relied on heuristic rules rather than a learned classifier. Meta's LLaMA 3 reportedly used quality classifiers trained on curated reference documents (books, Wikipedia, arXiv).
HuggingFace's FineWeb-Edu uses an educational quality classifier to extract a subset of web data estimated to be as informative as textbook content — demonstrating that targeted classifier filtering can extract much denser signal from noisy crawls.
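The classifier itself can be very small. The sketch below trains a logistic regression over hashed bag-of-words features, in the spirit of a fastText-style linear model; the training examples, feature dimension, and function names are all illustrative assumptions, and production classifiers are trained on thousands of labelled documents rather than a handful.

```python
import math
import zlib

DIM = 256  # hashed feature dimension (illustrative)

def featurize(doc: str) -> list[float]:
    """Normalised hashed bag-of-words vector (stable across runs via crc32)."""
    v = [0.0] * DIM
    for w in doc.lower().split():
        v[zlib.crc32(w.encode()) % DIM] += 1.0
    total = sum(v) or 1.0
    return [x / total for x in v]

def train_quality_classifier(docs, labels, epochs=300, lr=0.5):
    """Plain SGD logistic regression: label 1 = good text, 0 = spam."""
    w, b = [0.0] * DIM, 0.0
    feats = [featurize(d) for d in docs]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log-loss w.r.t. z
            for i in range(DIM):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def quality_score(doc, w, b) -> float:
    x = featurize(doc)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Documents scoring below a chosen threshold are discarded; the threshold is tuned against downstream evaluations, not fixed a priori.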
Quality filtering involves real tradeoffs. Aggressive filtering improves average document quality but reduces diversity and may over-represent formal written English, disadvantaging informal registers, dialects, and non-English languages. Finding the right aggressiveness level is an empirical question answered through downstream evaluation sweeps.
Deduplication
Web data contains enormous amounts of duplicated content. News articles are reprinted across thousands of sites. Legal boilerplate appears in millions of contracts. README files appear verbatim across GitHub forks. Failing to deduplicate creates two compounding problems: the model disproportionately memorises frequently repeated text (increasing verbatim regurgitation risk and reducing generalisation), and compute is wasted seeing the same gradient signal many times rather than diverse examples.
Exact Deduplication
Hash each document (MD5 or SHA-256) and discard all but one instance of documents with matching hashes. Fast and exact for bit-identical duplicates, but blind to near-duplicates where even a single character differs — two GitHub files identical except for one variable name hash to entirely different values, so exact dedup alone is insufficient even for code.
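The mechanism is a few lines; `exact_dedup` is a hypothetical name for the obvious hash-and-keep-first pass:

```python
import hashlib

def exact_dedup(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each bit-identical document."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept
```

Note that even a whitespace difference defeats it, which is why exact dedup is always paired with near-duplicate detection.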
MinHash LSH
Locality-Sensitive Hashing on MinHash sketches. Each document is represented by a compact fingerprint derived from its set of character n-grams. Documents with similar fingerprints are candidate near-duplicates and are clustered for removal. The approach scales to billions of documents and catches content that has been lightly reformatted or paraphrased.
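A minimal sketch of the idea, assuming character 5-gram shingles, 64 hash permutations simulated with salted CRC32, and 16 LSH bands (all illustrative parameter choices; production systems use dedicated libraries and tuned parameters):

```python
import zlib
from itertools import combinations

def shingles(doc: str, n: int = 5) -> set[str]:
    """Character n-gram set after whitespace normalisation."""
    doc = " ".join(doc.lower().split())
    return {doc[i:i + n] for i in range(max(len(doc) - n + 1, 1))}

def minhash(sh: set[str], num_perm: int = 64) -> list[int]:
    """One min-hash per salted hash function; similar sets agree often."""
    return [min(zlib.crc32(f"{seed}:{s}".encode()) for s in sh)
            for seed in range(num_perm)]

def lsh_candidates(sigs: dict[str, list[int]], bands: int = 16) -> set[tuple]:
    """Bucket signature bands; docs sharing any bucket are candidate dupes."""
    rows = len(next(iter(sigs.values()))) // bands
    buckets: dict[tuple, list[str]] = {}
    for doc_id, sig in sigs.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    cands: set[tuple] = set()
    for ids in buckets.values():
        for pair in combinations(sorted(ids), 2):
            cands.add(pair)
    return cands
```

Candidate pairs are then verified with an exact Jaccard similarity check before removal, since banding trades a small false-positive rate for enormous speed.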
Suffix Array Substring Dedup
Suffix array construction over the full training corpus finds shared substrings above a minimum length, an approach introduced in Google's deduplication study (Lee et al., 2021) and adopted by pipelines such as RefinedWeb. More compute-intensive, but it catches cross-document duplicate passages that MinHash may miss — e.g. a frequently quoted paragraph appearing in many otherwise distinct articles.
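A real suffix-array implementation is involved; as a simplified stand-in, hashing every length-k character window exposes the same phenomenon — a passage shared verbatim across otherwise distinct documents. This hashed-window approximation is my own illustration, not the actual suffix-array algorithm, which finds *maximal* shared substrings far more efficiently.

```python
def shared_passages(docs: dict[str, str], k: int = 20) -> set[str]:
    """Return length-k character windows occurring in more than one document.

    A crude stand-in for suffix-array substring dedup: real pipelines
    build a suffix array over the concatenated corpus instead of
    enumerating every window.
    """
    seen: dict[str, str] = {}   # window -> first doc id that contained it
    shared: set[str] = set()
    for doc_id, doc in docs.items():
        # Collapse within-document repeats first
        windows = {doc[i:i + k] for i in range(len(doc) - k + 1)}
        for w in windows:
            if w in seen and seen[w] != doc_id:
                shared.add(w)
            else:
                seen.setdefault(w, doc_id)
    return shared
```

In a dedup pass, documents containing flagged passages either have the passage excised or are dropped entirely, depending on how much of the document the duplicate covers.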
Deduplication at scale is a distributed systems problem as much as a data science problem. Processing 15 trillion tokens requires careful sharding, approximate algorithms, and trade-offs between recall (catching more duplicates) and precision (avoiding false positives that remove genuinely distinct content).
Data Mixing and Domain Weights
Not all data sources are equal, and a uniform mixture of web, code, books, and academic papers is rarely optimal. The mixing ratio — what fraction of training tokens comes from each domain — is one of the most impactful hyperparameters in pretraining, yet it is rarely disclosed publicly because it encodes proprietary knowledge about pipeline design.
- Code overweighting dramatically improves reasoning. Even when evaluating on text-only benchmarks, models trained with 10–30% code tokens outperform those with 1–5%, likely because code enforces precise logical structure that generalises to structured reasoning in natural language.
- Books and long-form documents improve coherence. Web text is dominated by short pages; long-form documents (novels, textbooks, long-form journalism) teach the model to maintain coherent arguments across many paragraphs.
- Upsampling quality data repeatedly is beneficial. The LLaMA 1 report, for example, shows Wikipedia and books sampled for more than two epochs while most web data is seen roughly once. High-quality data can be repeated without the same overfitting penalty that applies to noisy web content.
- Domain weight effects are non-linear. Adding 10× more math data may improve MATH benchmark performance by 20% but reduce general language quality by 2%. Optimal mixing is found through multi-dimensional evaluation sweeps against held-out task suites.
Some research groups use learned domain weights — training a small "data selection model" to score documents by their estimated gradient benefit for a target distribution. DoReMi and DoGE are examples of this approach, which treats mixing ratios as optimisable parameters rather than fixed human decisions.
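Mechanically, applying a fixed mixture is just weighted sampling over domains. The weights below are illustrative, not any lab's published ratios, and `sample_mixture` is a hypothetical helper:

```python
import random

def sample_mixture(domain_docs: dict[str, list[str]],
                   weights: dict[str, float],
                   num_samples: int,
                   seed: int = 0) -> list[str]:
    """Draw training documents so each domain contributes in proportion
    to its mixing weight. Weights here are illustrative placeholders."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    batch = []
    for _ in range(num_samples):
        d = rng.choices(domains, weights=probs, k=1)[0]
        batch.append(rng.choice(domain_docs[d]))
    return batch
```

Real data loaders do this at the shard level with careful epoch accounting per domain (so upsampled domains repeat while web data does not), but the proportional-sampling core is the same.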
Synthetic Data
As high-quality natural text becomes increasingly scarce, synthetic data — content generated by AI models — has become a major component of both pretraining and fine-tuning datasets.
Productive Uses of Synthetic Data
- Instruction tuning: GPT-4 and Claude generate instruction/response pairs at scale (Self-Instruct, Alpaca, UltraChat approaches)
- Mathematical reasoning: DeepMind's AlphaProof generated formal proof traces used to train math-capable models
- Code synthesis: Microsoft's phi-1 ("Textbooks Are All You Need") trained largely on model-generated, textbook-style code to supplement real GitHub data
- Low-resource languages: Synthetic examples of medical, legal, or scientific text in underrepresented languages
- RLHF negatives: Generate bad answers to train reward models that distinguish quality
Model Collapse Risk
If a model is trained on its own outputs, and those outputs are used to generate training data for the next generation, errors and biases accumulate. The distribution gradually drifts away from the original human distribution. This is called model collapse (Shumailov et al., 2023).
In practice the risk is managed by mixing synthetic data with substantial verified human-written text, and by using stronger teacher models to generate data for weaker student models rather than self-training. Completely recursive synthetic training degrades quality across generations in controlled experiments.
The Data Flywheel
The most powerful data advantage in AI is not having the largest crawled dataset — it is having a deployed product that generates supervised signal from real users. This is the data flywheel: user interactions produce implicit or explicit feedback; that feedback becomes fine-tuning data; better fine-tuning improves the model; a better model attracts more users, who generate yet more feedback.
Why the Flywheel Creates Moats
OpenAI's ChatGPT reached 100 million users within two months of launch. Every thumbs-up/thumbs-down, every correction, every "regenerate" click produces signal about where the model fails and what users prefer. Competitors training in isolation on static datasets cannot replicate this continuously refreshed, user-validated, task-distribution-matched signal.
This is why product adoption and model quality are not independent. The most-used models accumulate the best fine-tuning data. Companies with large user bases can train RLHF reward models on real human preferences rather than contractor annotations alone, producing more representative and harder-to-fake quality signals.
Benchmark Contamination
Modern pretraining corpora are large enough to contain most publicly available benchmarks. If a model has seen MMLU questions and answers during pretraining, its MMLU score measures memorisation, not reasoning ability. This is benchmark contamination, and it is a systematic bias that inflates reported numbers across the field.
The problem is structural. Common Crawl contains web pages that discuss, explain, or directly copy questions from nearly every major English-language benchmark. Code benchmarks like HumanEval have been reproduced in blog posts and GitHub repos that appear in training data. MATH and GSM8K problems are widely discussed online and indexed by crawlers.
Decontamination Approaches
- N-gram overlap filtering: remove documents containing 13-gram matches with benchmark test sets
- Private held-out sets: embed unique test examples not available publicly
- Contamination probing: ask the model to reproduce benchmark examples; high recall indicates contamination
- Time-split evaluation: test only on benchmarks released after the training data cutoff
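The first of these, n-gram overlap filtering, can be sketched directly. The 13-gram window follows the convention reported for GPT-3; the function names are illustrative assumptions:

```python
def ngrams(text: str, n: int = 13) -> set[tuple]:
    """Set of word-level n-grams after lowercasing."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs: list[str],
                  benchmark_texts: list[str],
                  n: int = 13) -> list[str]:
    """Drop any training document sharing an n-gram with the benchmark."""
    bench: set[tuple] = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [d for d in train_docs if not (ngrams(d, n) & bench)]
```

As the "Why Contamination Persists" list below notes, paraphrased or reformatted benchmark text sails straight through this check — exact token matching is a floor on decontamination effort, not a ceiling.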
Why Contamination Persists
- Complete decontamination is computationally expensive at trillion-token scale
- Paraphrased or partially reformatted benchmark content evades n-gram matching
- Labs have incentive to report high benchmark numbers and may underinvest in decontamination
- There is no agreed-upon standard for what "contamination-free" means across the field
The practical implication for practitioners: treat published benchmark numbers as upper bounds, not precise measurements. When evaluating models for production use, complement public benchmark scores with internal evaluations on private, task-specific test sets that have never appeared in any public corpus.
Checklist: Do You Understand This?
- Why is raw Common Crawl not usable as-is for pretraining? Name three specific types of noise it contains.
- Explain the difference between exact deduplication (hash-based) and near-deduplication (MinHash LSH). When would each fail to catch duplicates?
- Why does deduplication matter for memorisation risk, not just training efficiency?
- What is perplexity filtering in the context of data quality, and what kind of model is typically used to compute the perplexity score?
- If you were building a pretraining mix and wanted to improve mathematical reasoning, what two domain weight adjustments would you make based on published evidence?
- Describe the data flywheel and why it creates a compounding competitive advantage for models with large deployed user bases.
- A model scores 85% on MMLU. What questions should you ask before trusting that number, and how would you check for contamination?
- What is model collapse, and under what specific condition does it occur during synthetic data generation?