LLM Pre-training
How large language models are trained from scratch — tokenization, training objectives, and the massive data pipelines that feed trillions of tokens.
In This Section
Tokenization & Vocabulary Design
BPE, WordPiece, SentencePiece, vocabulary-size tradeoffs, and how tokenization choices affect model capability.
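To make the BPE idea concrete, here is a minimal sketch of the merge-learning loop, in the style of the classic algorithm: repeatedly count adjacent symbol pairs and merge the most frequent one. The toy corpus, the `</w>` end-of-word marker, and the merge count are illustrative; real tokenizers add byte-level fallback, regex pre-splitting, and tens of thousands of merges.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace each occurrence of `pair` with its concatenation."""
    new_symbol = "".join(pair)
    out = {}
    for word, freq in words.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        out[" ".join(merged)] = freq
    return out

# Toy corpus: words pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):  # learn 5 merges for the demo
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
```

On this corpus the first merges build up the subword `est</w>` from the frequent suffix in "newest"/"widest", which is exactly the compression behavior that makes BPE vocabularies effective.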
Pre-training Objectives
Causal (CLM), masked (MLM), and span-corruption objectives, and why causal language modeling won at scale.
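The causal objective is just next-token cross-entropy with the targets shifted one position left relative to the inputs. A minimal NumPy sketch, with illustrative shapes and names rather than any particular framework's API:

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy.

    logits:    (seq_len, vocab) unnormalized scores from the model
    token_ids: (seq_len,) the input sequence itself -- in CLM the
               targets are the inputs shifted left by one position.
    """
    # Position t predicts token t+1: drop the last logit row
    # and the first target token.
    preds, targets = logits[:-1], token_ids[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    z = preds - preds.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each true next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab, seq_len = 50, 8
tokens = rng.integers(0, vocab, size=seq_len)
loss = causal_lm_loss(rng.normal(size=(seq_len, vocab)), tokens)
print(float(loss))  # near log(vocab) ~ 3.9 for untrained (random) logits
```

A useful sanity check: with all-zero logits the model is exactly uniform, so the loss equals log(vocab) — the starting point from which pre-training drives the loss down.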
Data Curation at Scale
Quality filtering, deduplication, data mixing, synthetic data, and benchmark contamination.
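Two dedup primitives recur throughout data-curation pipelines: exact dedup by hashing normalized text, and n-gram Jaccard similarity for near-duplicates. A minimal sketch, assuming a deliberately cheap whitespace/lowercase normalization; production pipelines typically scale the near-dup side with MinHash/LSH rather than pairwise comparison.

```python
import hashlib

def normalize(text):
    """Cheap normalization before hashing: lowercase, collapse whitespace."""
    return " ".join(text.lower().split())

def dedup_exact(docs):
    """Keep only the first document for each normalized-text hash."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def jaccard(a, b, n=2):
    """Jaccard similarity over word n-grams, for near-duplicate detection."""
    def shingles(text):
        words = normalize(text).split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",  # exact dup after normalization
    "The quick brown fox leaps over the lazy dog.",   # near-duplicate
]
print(len(dedup_exact(docs)))           # 2
print(jaccard(docs[0], docs[2]))        # 0.6
```

The same n-gram machinery doubles as a contamination check: overlap between training shards and benchmark test sets is commonly flagged by exactly this kind of shingle matching.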