🧠 All Things AI, by Subhojit Dey

LLM Pre-training

How large language models are trained from scratch: tokenization, training objectives, and the data pipelines that supply trillions of training tokens.

In This Section

Tokenization & Vocabulary Design

BPE, WordPiece, SentencePiece, vocabulary size tradeoffs, and how tokenization affects capability.

Pre-training Objectives

Causal language modeling (CLM), masked language modeling (MLM), span corruption, and why causal language modeling won at scale.

Data Curation at Scale

Quality filtering, deduplication, data mixing, synthetic data, and benchmark contamination.

Page built: 01 Jun 2026