🧠 All Things AI
Advanced

GPU Architecture — CUDA, Cores, Memory Hierarchy

Modern AI runs on GPUs. Not because of raw clock speed — CPUs are faster per core — but because AI workloads are dominated by matrix multiplications, and matrix multiplications are embarrassingly parallel: every output element can be computed independently. A CPU has tens of high-clock cores optimized for sequential logic. An H100 GPU has 16,896 CUDA cores designed to execute thousands of parallel threads simultaneously. Understanding GPU internals is the foundation for understanding AI performance, cost, and engineering tradeoffs.

Why GPUs for AI?

The core operation in every transformer layer is matrix multiplication: multiply an activation matrix by a weight matrix. For a layer with 4096 hidden dimensions, a single forward pass through one attention head requires millions of multiply-add operations. These operations have no data dependencies between them — any element of the output matrix can be computed without knowing the others.
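The scale of "millions of multiply-add operations" is easy to verify with back-of-envelope arithmetic. A sketch, using the 4096 hidden dimension from the paragraph above (the batch size is an illustrative assumption):

```python
# Rough FLOP count for one dense-layer matmul. An (m x k) @ (k x n)
# product does m*n*k multiplies and m*n*k adds.

def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

# One token's activation vector (1 x 4096) through a 4096 x 4096 weight matrix:
per_token = matmul_flops(1, 4096, 4096)      # ~33.6 million FLOPs
# A batch of 2048 tokens through the same layer:
per_batch = matmul_flops(2048, 4096, 4096)   # ~68.7 billion FLOPs

print(f"{per_token:,} FLOPs per token, {per_batch:,} per 2048-token batch")
```

Every one of those multiply-adds targets an independent output element, which is exactly the shape of work a GPU is built for.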

CPU — Latency Optimized

  • 8–96 high-frequency cores (3–5 GHz)
  • Large caches (L1/L2/L3) and aggressive branch prediction
  • Out-of-order execution, speculative execution
  • Excels at sequential, branchy logic
  • Poor at parallel matrix math

GPU — Throughput Optimized

  • Thousands of simpler cores (1–2 GHz)
  • Small per-core cache, large shared memory pools
  • SIMT execution: same instruction, many threads
  • Excels at parallel, regular math (matrix multiply)
  • Poor at branchy, sequential code
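The throughput gap between the two designs can be sanity-checked from the core counts and clocks above. A rough sketch; the CPU's SIMD width and FMA issue rate are illustrative assumptions for a modern server chip:

```python
# Peak FP32 multiply-add throughput, CPU vs GPU, back-of-envelope.
# Assumed CPU: 64 cores x 4.5 GHz x FMA (2 FLOPs) x 16 FP32 lanes (AVX-512).
cpu_flops = 64 * 4.5e9 * 2 * 16
# H100: 16,896 CUDA cores x ~1.98 GHz boost x FMA (2 FLOPs), no Tensor Cores.
gpu_flops = 16896 * 1.98e9 * 2

print(f"CPU ~{cpu_flops/1e12:.1f} TFLOPS, GPU ~{gpu_flops/1e12:.1f} TFLOPS (FP32)")
```

Even before Tensor Cores enter the picture, the GPU's plain CUDA cores land near the H100's 67 TFLOPS FP32 rating, roughly 7× a strong server CPU.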

The CUDA Programming Model

CUDA (Compute Unified Device Architecture) is NVIDIA's programming framework for GPU computation. It defines a hierarchical execution model that maps computation onto GPU hardware.

  • Thread: 1 execution unit
  • Warp: 32 threads (lockstep)
  • Thread Block: multiple warps, shared memory
  • Streaming Multiprocessor (SM): executes blocks, has Tensor Cores
  • GPU: 132 SMs on H100

CUDA execution hierarchy: threads → warps → blocks → SMs → GPU

| Level | Unit Count | Memory | Key Detail |
|---|---|---|---|
| Thread | 1 | Registers | Fastest storage, per-thread, very limited (~256 per thread) |
| Warp | 32 threads | Registers | Execute in lockstep (SIMT); diverging branches serialized |
| Thread Block | Up to 1024 threads | Shared memory (L1) | All threads can share 128–228 KB of fast on-chip SRAM |
| SM | Multiple blocks | L2 cache | H100 has 132 SMs; each has 128 CUDA cores + 4 Tensor Cores |
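To make the hierarchy concrete, here is how a tiled matrix multiply might be carved up across it. A sketch only; the tile size and thread-block shape are illustrative assumptions, not a prescribed configuration:

```python
# Mapping a 4096 x 4096 output matrix onto the CUDA hierarchy: each
# thread block computes one output tile; blocks are scheduled across SMs.
import math

M, N = 4096, 4096          # output matrix shape
TILE = 128                 # output tile per thread block (assumption)
THREADS_PER_BLOCK = 256    # 8 warps of 32 threads (assumption)

blocks = math.ceil(M / TILE) * math.ceil(N / TILE)   # 32 x 32 = 1024 blocks
warps_per_block = THREADS_PER_BLOCK // 32
print(f"{blocks} blocks x {warps_per_block} warps, spread over 132 SMs "
      f"(~{blocks / 132:.1f} blocks per SM over the kernel's lifetime)")
```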

The critical implication of SIMT (Single Instruction, Multiple Threads): if threads within a warp take different code paths (branch divergence), the GPU must serialize them — half the warp idles while the other executes each branch. AI kernels are written to avoid divergence by keeping all threads in a warp doing the same work.
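The serialization rule can be captured in a toy cost model: a diverged warp pays the sum of both branch paths, while a uniform warp pays only one. A sketch under that simplified assumption:

```python
# Toy model of SIMT branch divergence cost for one 32-lane warp.

def warp_time(lane_takes_if: list[bool], t_if: float, t_else: float) -> float:
    """Execution time under the serialization rule: each branch path
    that ANY lane needs is executed by the whole warp, one after the other."""
    assert len(lane_takes_if) == 32
    time = 0.0
    if any(lane_takes_if):       # some lane needs the if-path
        time += t_if
    if not all(lane_takes_if):   # some lane needs the else-path
        time += t_else
    return time

uniform  = warp_time([True] * 32, t_if=10.0, t_else=10.0)     # one path: 10.0
diverged = warp_time([True] * 16 + [False] * 16, 10.0, 10.0)  # both paths: 20.0
print(uniform, diverged)
```

Even a single diverging lane forces the whole warp through both paths, which is why AI kernels branch on block- or warp-uniform values whenever possible.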

Tensor Cores — Purpose-Built Matrix Hardware

Tensor Cores are specialized hardware units introduced with Volta (2017) and refined through every generation since. They execute Matrix Multiply-Accumulate (MMA) operations — the exact operation at the heart of every transformer layer — at drastically higher throughput than standard CUDA cores.

What a Tensor Core does in one instruction

D = A × B + C — where A, B, C, D are small matrices (e.g., 4×4 or 8×4). A standard CUDA core computes one multiply-add per clock. One Tensor Core instruction computes 64 (4×4×4) or more multiply-adds per clock, depending on the generation and precision. 4th gen Tensor Cores on H100 support FP8, BF16, TF32, FP16, INT8, and FP64 formats.
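The MMA primitive itself is easy to write out in scalar form. A plain-Python sketch of the 4×4 case described above, counting the multiply-adds that the hardware collapses into a single instruction:

```python
# D = A x B + C for 4x4 matrices, with an explicit multiply-add count.

def mma_4x4(A, B, C):
    n = 4
    D = [[C[i][j] for j in range(n)] for i in range(n)]   # start from C
    muladds = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                D[i][j] += A[i][k] * B[k][j]   # one fused multiply-add
                muladds += 1
    return D, muladds

ident = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
D, ops = mma_4x4(ident, ident, zeros)
print(ops)   # 64 multiply-adds = 4 x 4 x 4, one instruction on a Tensor Core
```

A CUDA core retires one of those 64 multiply-adds per clock; a Tensor Core retires the whole loop nest at once.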

BF16 Tensor Core

989 TFLOPS on H100. Primary training format. Good numeric range (same exponent bits as FP32), lower precision than FP16 but more stable for training.

FP8 Tensor Core

~1,979 TFLOPS on H100 (2× BF16). New in H100. Requires careful scaling and calibration to train stably. Used by DeepSeek-V3 for frontier training efficiency.

FP32 (CUDA cores)

67 TFLOPS on H100. Used for optimizer state accumulation, loss computation, and gradient scaling — not for bulk matrix ops.

NVIDIA H100 — The AI Datacenter Standard

As of 2025, the H100 SXM5 is the primary chip used for frontier model training and high-throughput inference. Understanding its specifications helps interpret benchmark numbers and cost estimates.

| Specification | Value |
|---|---|
| CUDA Cores | 16,896 |
| Tensor Cores (4th gen) | 528 |
| HBM3 Memory | 80 GB |
| Memory Bandwidth | 3.35 TB/s |
| BF16 Tensor FLOPS | 989 TFLOPS |
| NVLink 4.0 bandwidth | 900 GB/s bidirectional (per GPU) |
| TDP | 700W |
| Process node | TSMC 4N (customized 4nm) |

NVLink 4.0 is what enables multi-GPU training within a node. With 8× H100s connected via NVSwitch, all-reduce operations happen at 900 GB/s — critical for keeping GPUs synchronized during distributed training without PCIe becoming the bottleneck.
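The cost of that synchronization can be estimated with the standard ring all-reduce volume formula. A sketch; the gradient size (a 7B-parameter model in BF16) is an illustrative assumption:

```python
# Rough all-reduce timing over NVLink within one 8-GPU node.

n_gpus = 8
grad_bytes = 7e9 * 2      # 7B params x 2 bytes (BF16) — assumption
nvlink_bw = 450e9         # ~450 GB/s per direction (900 GB/s bidirectional)

# A ring all-reduce moves 2*(n-1)/n of the buffer through each link.
volume = 2 * (n_gpus - 1) / n_gpus * grad_bytes
t = volume / nvlink_bw
print(f"~{t*1e3:.0f} ms per full-gradient all-reduce across 8 GPUs")
```

Run over PCIe 5.0 (~64 GB/s) instead, the same transfer takes roughly 7× longer, which is why NVLink is critical for keeping GPUs synchronized.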

Memory Hierarchy

GPU performance is constrained as much by data movement as by raw compute. The memory hierarchy exists to keep data close to the compute units that need it, but each level has dramatically different bandwidth and capacity tradeoffs.

  • Registers: ~256 per thread (fastest, smallest)
  • Shared Memory / L1 Cache: per-SM (~228 KB), ~19 TB/s effective
  • L2 Cache: whole GPU (50 MB), ~5 TB/s
  • HBM3 (High Bandwidth Memory): on-package (80 GB), 3.35 TB/s
  • CPU DRAM via PCIe: system RAM, ~64 GB/s on PCIe 5.0 (slowest)

GPU memory hierarchy — every level is 5-50× slower than the one above

Key insight: bandwidth drop-off is enormous

Shared memory: ~19 TB/s. HBM3: 3.35 TB/s. PCIe to the CPU: 64 GB/s. A kernel that must repeatedly read data from HBM instead of shared memory can easily run 5× slower. Flash Attention (now standard in transformer implementations) was invented specifically to avoid redundant HBM reads and writes during attention computation.
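The bandwidth numbers above imply a break-even "arithmetic intensity": how many FLOPs a kernel must perform per byte fetched from HBM before the Tensor Cores, rather than memory, become the limit. A back-of-envelope sketch using the H100 figures:

```python
# Break-even FLOPs-per-byte for an H100 (roofline-style estimate).

peak_flops = 989e12   # BF16 Tensor TFLOPS (dense)
hbm_bw = 3.35e12      # HBM3 bytes/s

breakeven = peak_flops / hbm_bw   # ~295 FLOPs per HBM byte
print(f"Need ~{breakeven:.0f} FLOPs per HBM byte to be compute-bound")

# Related floor: merely streaming all 80 GB of on-package memory once
# takes 80e9 / 3.35e12 seconds, regardless of how fast the math is.
print(f"Reading 80 GB from HBM takes at least {80e9 / hbm_bw * 1e3:.0f} ms")
```

Kernels below ~295 FLOPs/byte are memory-bound on this part, which is exactly the regime Flash Attention was designed to escape.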

FLOPS vs Effective FLOPS

The 989 TFLOPS BF16 rating for the H100 is a theoretical peak. Real training workloads achieve 30–65% of this — a metric called MFU (Model FLOP Utilization). The gap exists because:

  • Memory bottlenecks: if data isn't ready in shared memory, Tensor Cores stall waiting for HBM reads
  • Communication overhead: all-reduce synchronization between GPUs during distributed training introduces idle time
  • Kernel launch overhead: small operations that don't fill all SMs waste capacity
  • Memory allocation and fragmentation: dynamic shapes cause reallocations that interrupt pipelining

A 50% MFU on H100 — which top training runs achieve with careful optimization — represents excellent engineering. Getting from 30% to 50% MFU doubles effective training throughput without buying a single additional GPU.
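MFU itself is simple arithmetic: achieved FLOPs per second divided by aggregate theoretical peak. A sketch using the common ~6 FLOPs per parameter per token rule of thumb for training; the model size and token throughput below are illustrative assumptions, not measured numbers:

```python
# Model FLOP Utilization (MFU) from token throughput.

def mfu(params: float, tokens_per_sec: float, n_gpus: int,
        peak: float = 989e12) -> float:
    """Training costs ~6 FLOPs per parameter per token (forward + backward)."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (n_gpus * peak)

# A hypothetical 70B model at 300k tokens/s on 256 H100s (BF16 peak):
u = mfu(params=70e9, tokens_per_sec=3e5, n_gpus=256)
print(f"MFU = {u:.0%}")
```

Raising tokens/s at fixed hardware raises MFU linearly, which is why kernel and communication optimization translates directly into training throughput.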

NVIDIA Dominance and the CUDA Moat

NVIDIA holds ~80–90% of the AI training accelerator market as of 2025. The hardware specs matter, but the real moat is CUDA: nearly two decades (since 2007) of accumulated optimized libraries (cuBLAS, cuDNN, NCCL), tooling (Nsight profiler, CUDA-GDB), compiler infrastructure (PTX, NVCC), and ecosystem lock-in where every PyTorch, JAX, and TensorFlow kernel is CUDA-native.

AMD ROCm — The Challenger

The AMD MI300X has competitive specs (192 GB HBM3, 5.3 TB/s bandwidth — it beats the H100 on memory capacity and bandwidth), and the ROCm software stack improved significantly between 2023 and 2025.

Gap: custom CUDA kernels (Flash Attention variants, specialized ops) are not always available or optimized for ROCm. Large hyperscalers are adopting AMD for inference at scale, but NVIDIA still dominates training.

Intel Gaudi — The Enterprise Play

Intel Gaudi 3 (2024) is competitive on inference FLOPS per dollar. The Habana Labs acquisition (2019) gave Intel a credible deep learning accelerator line, used by some cloud providers as a lower-cost alternative.

Gap: the ecosystem is much smaller. Porting PyTorch code requires work, and Gaudi is not suitable for cutting-edge training runs where custom CUDA kernel work is required.

Checklist: Do You Understand This?

  • Can you explain why matrix multiplication maps naturally to GPU parallelism?
  • What is a warp, and what happens when threads in a warp diverge?
  • How do Tensor Cores differ from CUDA cores, and what operation do they accelerate?
  • What is the memory bandwidth of H100 HBM3, and why does bandwidth matter as much as FLOPS?
  • List the GPU memory hierarchy from fastest to slowest and give approximate bandwidths.
  • Why does achieved FLOPS typically fall to 30–65% of theoretical peak?
  • What is CUDA's role in NVIDIA's competitive moat beyond hardware specs?
  • What is the primary hardware advantage of AMD MI300X over H100, and where does AMD still trail?