Model Architectures Deep Dive
Inside the architectures of today's frontier models — how GPT, Llama, Mistral, and DeepSeek are designed, and how Mixture of Experts works.
In This Section
GPT Series — Architecture Evolution
GPT-1 through GPT-4: what changed, what scaled, and what the series established.
Llama 3 — Architecture & Design Choices
Grouped-query attention (GQA), rotary position embeddings (RoPE), training details, and why Llama became the open-model baseline.
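As a quick preview of that page, here is a minimal NumPy sketch of the key/value head sharing behind grouped-query attention. The head counts, sequence length, and dimensions are illustrative placeholders, not Llama 3's actual configuration, and the causal mask is omitted for brevity.

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each KV head serves n_q_heads // n_kv_heads query heads."""
    group = q.shape[0] // k.shape[0]
    # Repeat each KV head so every query head has a matching KV pair.
    k = np.repeat(k, group, axis=0)                 # -> (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16, 64))   # 8 query heads
k = rng.normal(size=(2, 16, 64))   # 2 KV heads: each serves 4 query heads
v = rng.normal(size=(2, 16, 64))
print(gqa_attention(q, k, v).shape)  # (8, 16, 64)
```

The payoff is memory: the KV cache shrinks by a factor of n_q_heads / n_kv_heads while staying close to full multi-head attention in quality.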
Mixture of Experts (MoE) — How It Works
Routing, sparsity, load balancing, and the compute-vs-memory tradeoff.
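As a preview of the mechanics, here is a minimal NumPy sketch of top-k routing for a single token. The gate, expert shapes, and k=2 are illustrative; production MoE layers add auxiliary load-balancing losses and batched expert dispatch.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """x: (d,) one token; gate_w: (n_experts, d);
    experts: list of (W, b) linear experts."""
    logits = gate_w @ x            # router score for each expert
    top = np.argsort(logits)[-k:]  # indices of the k best-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()           # softmax over the chosen k only
    out = np.zeros_like(x)
    for g, i in zip(gates, top):   # sparsity: only k of n_experts run
        W, b = experts[i]
        out += g * (W @ x + b)
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(n_experts, d))
experts = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(n_experts)]
print(moe_layer(rng.normal(size=d), gate_w, experts).shape)  # (16,)
```

That loop is the compute-vs-memory tradeoff in miniature: all n_experts sit in memory, but each token pays the FLOPs of only k of them.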
Mistral & Mixtral Internals
Sliding window attention, GQA, and Mixtral 8x7B vs dense models.
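As a taste of that page, here is a minimal NumPy sketch of the causal sliding-window attention mask. The window of 3 is illustrative (Mistral 7B shipped with a 4096-token window); each position attends only to itself and the previous window - 1 tokens.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # never attend to the future
    near = i - j < window            # never attend beyond the window
    return causal & near

print(sliding_window_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```

Because every layer widens the receptive field by another window, stacked layers still let information flow across sequences far longer than the window itself.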
DeepSeek Architecture & Training
Multi-head Latent Attention (MLA), DeepSeekMoE, FP8 training, and the reportedly ~$6M frontier model.
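As a preview of MLA's key idea, here is a minimal NumPy sketch of low-rank key/value compression. All dimensions are illustrative, and the real design also carries a decoupled positional (RoPE) path that this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 16

W_down = rng.normal(size=(d_latent, d_model))  # shared down-projection
W_uk = rng.normal(size=(d_model, d_latent))    # up-projection to keys
W_uv = rng.normal(size=(d_model, d_latent))    # up-projection to values

h = rng.normal(size=(seq, d_model))            # hidden states, one per token

# Cache only the small latent per token instead of full K and V:
latent = h @ W_down.T                          # (seq, d_latent)
k = latent @ W_uk.T                            # reconstructed keys   (seq, d_model)
v = latent @ W_uv.T                            # reconstructed values (seq, d_model)

full_cache = 2 * seq * d_model                 # entries to cache K and V
mla_cache = seq * d_latent                     # entries to cache the latent
print(f"KV-cache entries: {full_cache} -> {mla_cache} ({full_cache / mla_cache:.0f}x smaller)")
```

Caching the latent instead of full keys and values is what cuts inference memory; the up-projections can be absorbed into adjacent projection matrices at inference time.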