Model Architectures Deep Dive
Inside the architectures of today's frontier models — how GPT, Llama, Mistral, and DeepSeek are designed, and how Mixture of Experts works.
In This Section
GPT Series — Architecture Evolution
GPT-1 through GPT-4: what changed, what scaled, and what the series established.
Llama 3 — Architecture & Design Choices
Grouped-query attention (GQA), rotary position embeddings (RoPE), training details, and why Llama became the open-model baseline.
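As a quick preview of that page, here is a minimal NumPy sketch of the key/value head sharing behind grouped-query attention. The head counts, sequence length, and dimensions are illustrative placeholders, not Llama 3's actual configuration, and the causal mask is omitted for brevity.

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each KV head serves n_q_heads // n_kv_heads query heads."""
    group = q.shape[0] // k.shape[0]
    # Repeat each KV head so every query head has a matching KV pair.
    k = np.repeat(k, group, axis=0)                 # -> (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16, 64))   # 8 query heads
k = rng.normal(size=(2, 16, 64))   # 2 KV heads: each serves 4 query heads
v = rng.normal(size=(2, 16, 64))
print(gqa_attention(q, k, v).shape)  # (8, 16, 64)
```

The payoff is memory: the KV cache shrinks by a factor of n_q_heads / n_kv_heads while staying close to full multi-head attention in quality.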
Mixture of Experts (MoE) — How It Works
Routing, sparsity, load balancing, and the compute-vs-memory tradeoff.
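As a preview of the mechanics, here is a minimal NumPy sketch of top-k routing for a single token. The gate, expert shapes, and k=2 are illustrative; production MoE layers add auxiliary load-balancing losses and batched expert dispatch.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """x: (d,) one token; gate_w: (n_experts, d);
    experts: list of (W, b) linear experts."""
    logits = gate_w @ x            # router score for each expert
    top = np.argsort(logits)[-k:]  # indices of the k best-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()           # softmax over the chosen k only
    out = np.zeros_like(x)
    for g, i in zip(gates, top):   # sparsity: only k of n_experts run
        W, b = experts[i]
        out += g * (W @ x + b)
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(n_experts, d))
experts = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(n_experts)]
print(moe_layer(rng.normal(size=d), gate_w, experts).shape)  # (16,)
```

That loop is the compute-vs-memory tradeoff in miniature: all n_experts sit in memory, but each token pays the FLOPs of only k of them.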
Mistral & Mixtral Internals
Sliding window attention, GQA, and Mixtral 8x7B vs dense models.
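As a taste of that page, here is a minimal NumPy sketch of the causal sliding-window attention mask. The window of 3 is illustrative (Mistral 7B shipped with a 4096-token window); each position attends only to itself and the previous window - 1 tokens.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # never attend to the future
    near = i - j < window            # never attend beyond the window
    return causal & near

print(sliding_window_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```

Because every layer widens the receptive field by another window, stacked layers still let information flow across sequences far longer than the window itself.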
DeepSeek Architecture & Training
Multi-head Latent Attention (MLA), DeepSeekMoE, FP8 training, and the reportedly ~$6M frontier model.
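As a preview of MLA's key idea, here is a minimal NumPy sketch of low-rank key/value compression. All dimensions are illustrative, and the real design also carries a decoupled positional (RoPE) path that this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 16

W_down = rng.normal(size=(d_latent, d_model))  # shared down-projection
W_uk = rng.normal(size=(d_model, d_latent))    # up-projection to keys
W_uv = rng.normal(size=(d_model, d_latent))    # up-projection to values

h = rng.normal(size=(seq, d_model))            # hidden states, one per token

# Cache only the small latent per token instead of full K and V:
latent = h @ W_down.T                          # (seq, d_latent)
k = latent @ W_uk.T                            # reconstructed keys   (seq, d_model)
v = latent @ W_uv.T                            # reconstructed values (seq, d_model)

full_cache = 2 * seq * d_model                 # entries to cache K and V
mla_cache = seq * d_latent                     # entries to cache the latent
print(f"KV-cache entries: {full_cache} -> {mla_cache} ({full_cache / mla_cache:.0f}x smaller)")
```

Caching the latent instead of full keys and values is what cuts inference memory; the up-projections can be absorbed into adjacent projection matrices at inference time.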