LLM Pre-training
How large language models are trained from scratch — tokenization, training objectives, and the massive data pipelines that feed trillions of tokens.
In This Section
Tokenization & Vocabulary Design
BPE, WordPiece, SentencePiece, vocabulary-size tradeoffs, and how tokenization choices affect model capability.
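To make the BPE idea concrete, here is a minimal sketch of the merge-learning loop, in the style of the classic algorithm: repeatedly count adjacent symbol pairs and merge the most frequent one. The toy corpus, the `</w>` end-of-word marker, and the merge count are illustrative; real tokenizers add byte-level fallback, regex pre-splitting, and tens of thousands of merges.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace each occurrence of `pair` with its concatenation."""
    new_symbol = "".join(pair)
    out = {}
    for word, freq in words.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        out[" ".join(merged)] = freq
    return out

# Toy corpus: words pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):  # learn 5 merges for the demo
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
```

On this corpus the first merges build up the subword `est</w>` from the frequent suffix in "newest"/"widest", which is exactly the compression behavior that makes BPE vocabularies effective.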
Pre-training Objectives
Causal (CLM), masked (MLM), and span-corruption objectives, and why causal language modeling won at scale.
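The causal objective is just next-token cross-entropy with the targets shifted one position left relative to the inputs. A minimal NumPy sketch, with illustrative shapes and names rather than any particular framework's API:

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy.

    logits:    (seq_len, vocab) unnormalized scores from the model
    token_ids: (seq_len,) the input sequence itself -- in CLM the
               targets are the inputs shifted left by one position.
    """
    # Position t predicts token t+1: drop the last logit row
    # and the first target token.
    preds, targets = logits[:-1], token_ids[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    z = preds - preds.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each true next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab, seq_len = 50, 8
tokens = rng.integers(0, vocab, size=seq_len)
loss = causal_lm_loss(rng.normal(size=(seq_len, vocab)), tokens)
print(float(loss))  # near log(vocab) ~ 3.9 for untrained (random) logits
```

A useful sanity check: with all-zero logits the model is exactly uniform, so the loss equals log(vocab) — the starting point from which pre-training drives the loss down.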
Data Curation at Scale
Quality filtering, deduplication, data mixing, synthetic data, and benchmark contamination.
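Two dedup primitives recur throughout data-curation pipelines: exact dedup by hashing normalized text, and n-gram Jaccard similarity for near-duplicates. A minimal sketch, assuming a deliberately cheap whitespace/lowercase normalization; production pipelines typically scale the near-dup side with MinHash/LSH rather than pairwise comparison.

```python
import hashlib

def normalize(text):
    """Cheap normalization before hashing: lowercase, collapse whitespace."""
    return " ".join(text.lower().split())

def dedup_exact(docs):
    """Keep only the first document for each normalized-text hash."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def jaccard(a, b, n=2):
    """Jaccard similarity over word n-grams, for near-duplicate detection."""
    def shingles(text):
        words = normalize(text).split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",  # exact dup after normalization
    "The quick brown fox leaps over the lazy dog.",   # near-duplicate
]
print(len(dedup_exact(docs)))           # 2
print(jaccard(docs[0], docs[2]))        # 0.6
```

The same n-gram machinery doubles as a contamination check: overlap between training shards and benchmark test sets is commonly flagged by exactly this kind of shingle matching.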