Hardware & Compute
The physical substrate of AI — GPU architecture, specialized accelerators, memory bottlenecks, and how large models are distributed across thousands of chips.
In This Section
GPU Architecture — CUDA, Cores, Memory Hierarchy
Streaming multiprocessors (SMs), Tensor Cores, high-bandwidth memory (HBM), and why GPUs dominate AI workloads.
TPU vs GPU vs Custom Silicon
Google TPU, Groq LPU, Cerebras, Tenstorrent, and Apple Neural Engine.
Memory Bandwidth — The Real Bottleneck
The roofline model, arithmetic intensity, KV cache, and Flash Attention.
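As a taste of what that page covers, here is a minimal roofline sketch. The peak numbers are illustrative A100-like figures (assumptions for the example, not exact specs), and the byte count ignores caching and fusion:

```python
def attainable_flops(intensity, peak_flops, peak_bw):
    """Roofline model: achievable FLOP/s is capped by either compute or memory."""
    return min(peak_flops, intensity * peak_bw)

# Illustrative A100-like figures (assumptions, not exact specs):
PEAK_FLOPS = 312e12   # BF16 tensor-core FLOP/s
PEAK_BW = 1.6e12      # HBM bytes/s

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs per byte) of an m*k @ k*n matmul in fp16."""
    flops = 2 * m * n * k                               # multiply-accumulates
    bytes_moved = bytes_per_elem * (m*k + k*n + m*n)    # read A, B; write C
    return flops / bytes_moved

ridge = PEAK_FLOPS / PEAK_BW  # ~195 FLOP/byte: below this, memory-bound
```

A batch-1 decode matvec (`matmul_intensity(1, 4096, 4096)` ≈ 1 FLOP/byte) sits far below the ridge point, which is why inference decoding is bandwidth-bound and why KV-cache tricks and Flash Attention matter.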
FLOPS, MFU & Compute Efficiency
FLOP counting, model FLOP utilization, and how to measure training efficiency.
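The core calculation on that page fits in a few lines. This sketch uses the standard ~6 FLOPs per parameter per token approximation for training; the model size, throughput, and per-GPU peak below are made-up illustrative numbers:

```python
def training_flops(n_params, n_tokens):
    # Common approximation: forward + backward costs ~6 FLOPs per param per token
    return 6 * n_params * n_tokens

def mfu(n_params, tokens_per_sec, peak_flops_per_sec):
    """Model FLOP utilization: useful FLOP/s achieved divided by hardware peak."""
    return training_flops(n_params, tokens_per_sec) / peak_flops_per_sec

# e.g. a 7e9-param model at 12,000 tokens/s on 8 GPUs of 312 TFLOPS each
# (illustrative figures) lands around 20% MFU:
u = mfu(7e9, 12_000, 8 * 312e12)  # ~0.20
```

Note that MFU counts only the FLOPs the model *needs*, so activation recomputation and other overheads lower it even when the GPUs are busy.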
Distributed Training
Data, tensor, and pipeline parallelism, ZeRO, and 3D parallelism.
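The simplest of those strategies, data parallelism, can be sketched in pure Python: each worker computes gradients on its own shard of the batch, then an all-reduce averages them so every replica takes an identical optimizer step. This is a toy simulation of the collective, not a real communication library:

```python
def all_reduce_mean(grads_per_worker):
    """Average per-parameter gradients across workers (simulated all-reduce)."""
    n = len(grads_per_worker)
    return [sum(g) / n for g in zip(*grads_per_worker)]

# Gradients for a 3-parameter model computed on 2 workers' data shards:
g0 = [0.25, -0.5, 1.0]
g1 = [0.75,  0.5, 3.0]
avg = all_reduce_mean([g0, g1])  # [0.5, 0.0, 2.0], applied by both replicas
```

Tensor and pipeline parallelism instead split the model itself across devices, and ZeRO shards optimizer state, so in practice large runs combine all three ("3D parallelism").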