🧠 All Things AI
Advanced

Distributed Training: Data, Model & Pipeline Parallelism

Training a modern frontier language model is a distributed systems problem at least as much as it is a machine learning problem. A 70B-parameter model requires approximately 140 GB of memory just for its parameters in BF16, far exceeding the 80 GB HBM of an H100 GPU. Adding gradients (another 140 GB) and Adam optimiser states (2× the parameter memory, for momentum and variance) brings the total memory requirement to roughly 560 GB, requiring partitioning across at least 7 H100s. At 400B+ parameters, training must span hundreds of nodes. Three complementary strategies are combined to solve this problem: data parallelism, tensor parallelism, and pipeline parallelism.

Why Distributed Training is Necessary

Memory Wall

A 70B model in BF16 = 140 GB of parameters. Gradients in BF16 add another 140 GB, and Adam optimiser states a further 280 GB, for a total training memory of ≈ 560 GB. Even eight H100s (640 GB combined HBM) are barely sufficient for 70B. For 400B+ models, single-node training is not possible.
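This accounting can be checked with a few lines of Python. The sketch below uses the simplified all-BF16 byte counts from this section (2 bytes per parameter for weights and for gradients, 4 bytes for the two Adam moments); real recipes often keep optimiser states and a master weight copy in FP32, which costs considerably more. The helper name is illustrative.

```python
def training_memory_gb(n_params: float) -> dict:
    """Simplified BF16 training-memory accounting:
    2 bytes/param weights, 2 bytes/param gradients,
    4 bytes/param for the two Adam moments (momentum + variance)."""
    gb = 1e9  # decimal gigabytes
    params = 2 * n_params / gb
    grads = 2 * n_params / gb
    adam = 4 * n_params / gb
    return {"params": params, "grads": grads, "adam": adam,
            "total": params + grads + adam}

mem = training_memory_gb(70e9)
# 140 GB params + 140 GB grads + 280 GB Adam states = 560 GB
```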

Compute Wall

Llama 3 405B required approximately 3.8 × 10²⁵ FLOPs to train. A single H100 at ~1,000 TFLOPS BF16 would take ~1,200 years. Parallelising across 16,000 H100s brings this to ~28 days (assuming perfect scaling and peak throughput), the practical minimum for frontier model training.
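The arithmetic behind these figures, assuming peak per-GPU throughput and perfect scaling:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

total_flops = 3.8e25      # reported Llama 3 405B training compute
flops_per_gpu = 1e15      # ~1000 TFLOPS BF16 per H100 (peak)

# one GPU: total work divided by one GPU's throughput
single_gpu_years = total_flops / flops_per_gpu / SECONDS_PER_YEAR
# 16,000 GPUs in parallel, assuming no communication or utilisation losses
cluster_days = total_flops / (16_000 * flops_per_gpu) / SECONDS_PER_DAY
```

In practice, real runs achieve well under 100% of peak FLOPS, so actual wall-clock time is longer than this lower bound.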

Solution: 3D Parallelism

Combine data parallelism (split the data across model replicas), tensor parallelism (split each layer across the GPUs within a node), and pipeline parallelism (split groups of layers across nodes) to distribute both memory and compute across thousands of GPUs simultaneously.

Data Parallelism

Data parallelism (DP) is the simplest form of distributed training and works when the entire model fits on a single GPU. Each GPU holds a complete copy of the model. The training batch is split across GPUs: each GPU processes a different mini-batch and computes gradients independently. After the backward pass, gradients are synchronised across all GPUs via an all-reduce operation, and each GPU updates its model copy identically.

Per training step with N data-parallel GPUs:

1. Split the global batch B into N micro-batches of size B/N
2. Each GPU i: forward(model, micro_batch_i) → loss_i
3. Each GPU i: backward(loss_i) → grads_i
4. All-reduce: avg_grads = (1/N) · Σᵢ grads_i (synchronised across all GPUs)
5. Each GPU: optimizer.step(avg_grads) → identical updated model

Effective batch size: N × (B/N) = B (same as a single GPU, but faster)

The all-reduce synchronisation is the bottleneck: it transfers the full gradient tensor (one value per model parameter) across all N GPUs each step; for a 7B model in BF16 that is 14 GB of gradients per step. With fast NVLink (600 GB/s on A100, 900 GB/s on H100) this is cheap; over slower inter-node Ethernet it can dominate training time. PyTorch DDP (DistributedDataParallel) overlaps gradient communication with backward computation to hide the communication latency.
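The per-step procedure can be simulated in a single process with NumPy: a toy linear model stands in for the network, and a Python sum stands in for the all-reduce collective (real training uses torch DDP; this is only a sketch of the algorithm).

```python
import numpy as np

def dp_step(w, X, y, n_gpus, lr=0.1):
    """One data-parallel SGD step: split the batch, compute per-replica
    gradients, all-reduce (average), apply the identical update."""
    micro_X = np.array_split(X, n_gpus)
    micro_y = np.array_split(y, n_gpus)
    # each "GPU" computes the least-squares gradient on its micro-batch
    grads = [2 * mx.T @ (mx @ w - my) / len(my)
             for mx, my in zip(micro_X, micro_y)]
    avg_grad = sum(grads) / n_gpus   # all-reduce: average across replicas
    return w - lr * avg_grad         # identical update on every replica

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w0 = np.zeros(4)
w_dp = dp_step(w0, X, y, n_gpus=4)
```

With equal-sized micro-batches, the averaged gradient equals the full-batch gradient, so the result matches a single-GPU step on the whole batch.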

ZeRO: Zero Redundancy Optimizer

ZeRO (Rajbhandari et al., 2020, Microsoft DeepSpeed) extends data parallelism by partitioning the model state rather than replicating it. In standard DP, each GPU stores a complete copy of the parameters, gradients, and optimiser states, a factor-of-N redundancy. ZeRO eliminates this redundancy in three stages:

Stage | What is partitioned | Memory reduction | Communication overhead
ZeRO-1 | Optimiser states only | ~4× vs baseline Adam | Same as DDP (one all-reduce per step)
ZeRO-2 | Optimiser states + gradients | ~8× vs baseline Adam | Same as DDP
ZeRO-3 | Optimiser states + gradients + parameters | N× (linear in GPU count) | ~1.5× DDP (extra all-gather for params)

ZeRO-3 (equivalent to PyTorch FSDP, Fully Sharded Data Parallel) partitions every parameter, gradient, and optimiser state across all N GPUs. GPU i only holds 1/N of each tensor. Before each forward/backward computation, the required parameters are gathered via an all-gather collective, used for the computation, then discarded. This means GPU memory scales as O(model_size / N), enabling training of arbitrarily large models given enough GPUs, at the cost of additional all-gather communication.
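A toy sketch of the ZeRO-3 flow, with NumPy arrays standing in for sharded tensors and plain Python functions standing in for the NCCL collectives (helper names are illustrative, not a real API):

```python
import numpy as np

N = 4                                   # data-parallel ranks
full_param = np.arange(16.0)            # one flat parameter tensor
shards = np.array_split(full_param, N)  # rank i persistently stores shards[i]

def all_gather(shards):
    """Reconstruct the full tensor from the per-rank shards."""
    return np.concatenate(shards)

def reduce_scatter(per_rank_grads):
    """Sum full gradients across ranks; each rank keeps only its shard."""
    total = sum(per_rank_grads)
    return np.array_split(total, len(per_rank_grads))

# Forward/backward: gather params just-in-time, compute, then discard the
# full copy; only the 1/N gradient shard is kept per rank.
gathered = all_gather(shards)                       # transient full copy
grads = [np.ones_like(gathered) for _ in range(N)]  # toy per-rank gradients
grad_shards = reduce_scatter(grads)                 # each rank keeps 1/N
```

Persistent memory per rank is len(shards[i]) = 4 elements instead of 16, which is the O(model_size / N) scaling described above.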

Tensor Parallelism

Tensor parallelism (TP), developed in Megatron-LM (Shoeybi et al., 2019, NVIDIA), splits individual layers across multiple GPUs rather than replicating them. Specifically, each weight matrix is partitioned column-wise or row-wise across T GPUs, so each GPU holds 1/T of each matrix.

Tensor parallelism: weight matrices split across T GPUs, one all-reduce per layer:

Input X (every GPU holds a full copy)
  ↓ Column-parallel linear: GPU t holds W₁[:, t·d/T : (t+1)·d/T], for t = 0, …, T−1
  ↓ Row-parallel linear: GPU t holds W₂[t·d/T : (t+1)·d/T, :]
  ↓ All-reduce sums the partial products
Output Y (full, synchronised on every GPU)

The FFN layer in a transformer performs two linear transformations (expand then project back). With column-parallel for the first and row-parallel for the second, the computation proceeds as: each GPU independently computes its portion of the expansion, applies the activation function, computes its portion of the projection, then an all-reduce sums the partial results. One all-reduce per pair of linear layers: the FFN requires one, multi-head attention requires one.

Tensor parallelism requires very fast inter-GPU communication (NVLink within a node) because the all-reduce happens in every forward and backward pass through every layer. It is typically applied only within a single 8-GPU node and does not scale efficiently across slower inter-node interconnects.
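The column-then-row FFN sharding can be verified numerically. In this sketch, NumPy matmuls stand in for the per-GPU compute and a Python sum for the all-reduce; it is not Megatron's actual code, only the arithmetic it implements.

```python
import numpy as np

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn_tensor_parallel(X, W1, W2, T):
    """Megatron-style FFN: column-split W1, row-split W2, one all-reduce."""
    W1_cols = np.array_split(W1, T, axis=1)  # GPU t: W1[:, t*d/T:(t+1)*d/T]
    W2_rows = np.array_split(W2, T, axis=0)  # GPU t: W2[t*d/T:(t+1)*d/T, :]
    partials = [gelu(X @ c) @ r for c, r in zip(W1_cols, W2_rows)]
    return sum(partials)                     # all-reduce sums partial outputs

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))
W1 = rng.normal(size=(8, 32))
W2 = rng.normal(size=(32, 8))
Y_tp = ffn_tensor_parallel(X, W1, W2, T=4)
Y_ref = gelu(X @ W1) @ W2                    # single-GPU reference
```

Because GELU is applied elementwise, each GPU can apply it to its column shard independently; that is precisely why the first matrix is split by columns and the second by rows.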

Pipeline Parallelism

Pipeline parallelism (PP) assigns different layers of the model to different GPUs (or nodes). GPU 0 runs layers 1–8, GPU 1 runs layers 9–16, and so on. Data flows sequentially through the pipeline, like an assembly line where each stage processes the data and passes intermediate activations to the next stage.

1. GPU 0: Layers 1–8. Receives input tokens; computes the embedding and the first 8 transformer blocks; passes the activation tensor to GPU 1.
2. GPU 1: Layers 9–16. Receives activations from GPU 0; computes transformer blocks 9–16; passes to GPU 2.
3. GPU 2: Layers 17–24. Computes blocks 17–24; in a 4-stage pipeline, passes to GPU 3.
4. GPU 3: Layers 25–32. Final transformer blocks plus the language model head; computes the loss and initiates the backward pass through the pipeline.

The fundamental challenge of pipeline parallelism is the pipeline bubble: when the forward pass starts, only GPU 0 is busy; when GPU 3 computes the loss and the backward pass starts, only GPU 3 is busy. Each GPU waits at the start and end of every batch. With a single batch, the bubble fraction is approximately (P−1)/P where P is the number of pipeline stages: with 4 stages, only one stage works at a time, so 75% of GPU time is idle.

Micro-batching (GPipe, Huang et al., 2019) reduces the bubble by splitting the batch into M micro-batches that flow through the pipeline in sequence. While micro-batch 1 is on GPU 2, micro-batch 2 is on GPU 1 and micro-batch 3 is on GPU 0, so the pipeline is much fuller. The bubble fraction with P stages and M micro-batches is (P−1)/(M+P−1). With M=8 and P=4, the bubble drops from 75% to 3/11 ≈ 27%; with M=32 it falls to 3/35 ≈ 9%.
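The bubble formula is easy to tabulate; the function below is a direct transcription of the expression above.

```python
def bubble_fraction(P: int, M: int) -> float:
    """Idle fraction of a GPipe-style schedule with P pipeline stages
    and M micro-batches: (P - 1) / (M + P - 1)."""
    return (P - 1) / (M + P - 1)

# P=4, M=1  -> 0.75 (no micro-batching)
# P=4, M=8  -> 3/11 ≈ 0.27
# P=4, M=32 -> 3/35 ≈ 0.09
```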

3D Parallelism

Production LLM training at scale combines all three strategies simultaneously; this is called 3D parallelism:

Data Parallelism (across nodes)

Multiple identical copies of the full (TP+PP-distributed) model train on different data shards. Gradients are synchronised across all DP replicas via all-reduce over the slower inter-node InfiniBand/RoCE interconnect. The DP degree is typically the largest dimension.

Tensor Parallelism (within a node)

Within each 8-GPU server, all layers are split across GPUs via tensor parallelism. Uses fast NVLink for the per-layer all-reduce. TP degree is typically 4 or 8 (the number of GPUs per server).

Pipeline Parallelism (across nodes)

Layers are divided across pipeline stages, each stage running on one server (8 GPUs). Inter-stage communication transfers activation tensors over InfiniBand. The PP degree determines how many servers form one pipeline.

For example, training a 405B-scale model with 8-way tensor parallelism, 8-way pipeline parallelism, and 16-way data parallelism uses 8 × 8 × 16 = 1,024 GPUs in total: each model copy spans 8 × 8 = 64 GPUs, replicated 16 ways over the data. Meta reported using 16,384 H100s for Llama 3 405B, implying a much larger data-parallel degree at full scale.
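The degree arithmetic as a sanity-check snippet (the helper name is illustrative):

```python
def cluster_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs in a 3D-parallel job = product of the three degrees."""
    return tp * pp * dp

gpus_per_model_copy = 8 * 8              # TP x PP: one model spans 64 GPUs
total = cluster_size(tp=8, pp=8, dp=16)  # 1024 GPUs for the whole job
```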

Communication Primitives

The collective communication operations underlying all three parallelism strategies are standard distributed computing primitives, implemented by NCCL (NVIDIA Collective Communications Library) for GPU clusters.

Operation | What it does | Used by | Data volume
All-Reduce | Sum (or average) a tensor across all ranks; every rank receives the result | DP gradient sync; TP layer outputs | ~2× tensor size (ring algorithm)
All-Gather | Each rank contributes its shard; every rank receives the full tensor | ZeRO-3 parameter reconstruction before each layer | (N−1)/N × tensor size sent; full tensor received
Reduce-Scatter | Sum tensors across all ranks; each rank receives only its shard of the result | ZeRO-3 gradient partitioning; can replace all-reduce (all-reduce = reduce-scatter + all-gather) | (N−1)/N × tensor size
Point-to-Point (P2P) | Send a tensor from one rank to a specific other rank | Pipeline parallelism stage-to-stage activation transfers | Activation tensor size per micro-batch
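The identity all-reduce = reduce-scatter + all-gather can be demonstrated with NumPy arrays standing in for per-rank tensors (a single-process sketch of the collective semantics, not NCCL usage):

```python
import numpy as np

def all_reduce(per_rank_tensors):
    """All-reduce decomposed as reduce-scatter followed by all-gather."""
    n = len(per_rank_tensors)
    # reduce-scatter: each rank ends up with its chunk of the element-wise sum
    chunks = [np.array_split(t, n) for t in per_rank_tensors]
    reduced = [sum(chunks[r][i] for r in range(n)) for i in range(n)]
    # all-gather: every rank reconstructs the full summed tensor
    return np.concatenate(reduced)

ranks = [np.full(8, float(r)) for r in range(4)]  # rank r holds a tensor of r's
out = all_reduce(ranks)                           # every element = 0+1+2+3 = 6
```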

Overlapping communication with computation is critical for training efficiency. Modern frameworks (Megatron-LM, PyTorch FSDP) schedule communication operations to run asynchronously while compute is ongoing: for example, prefetching the next layer's parameters via all-gather while the current layer's backward pass is running. Achieving high GPU utilisation on a large training cluster is primarily a problem of keeping communication hidden behind compute.

Practical Tooling

PyTorch FSDP

Fully Sharded Data Parallel: PyTorch's built-in ZeRO-3 implementation. Integrates with the standard PyTorch training loop and requires minimal code changes. Supports mixed precision and activation checkpointing. Default choice for training models up to ~70B on a single cluster.

Megatron-LM

NVIDIA's research framework implementing all three parallelism strategies (TP+PP+DP). Used to train Megatron-Turing NLG, Llama derivatives, and many NVIDIA-backed models. Higher performance than FSDP at large scale, but requires modifying model code to use Megatron's layer primitives.

DeepSpeed

Microsoft's training optimisation library. Implements ZeRO-1/2/3, pipeline parallelism, and a range of memory optimisations (CPU offloading, NVMe offloading). Integrates with Hugging Face Accelerate. Strong documentation and community support; widely used for academic and smaller-scale training runs.

NCCL

NVIDIA Collective Communications Library: the low-level communication backend used by all of the above frameworks. Implements all-reduce, all-gather, and reduce-scatter over NVLink, InfiniBand, and RoCE. Automatically selects the optimal communication algorithm (ring, tree) based on topology. Not used directly; called by FSDP, Megatron, or DeepSpeed.

Checklist: Do You Understand This?

  • Can you explain why a 70B model requires more than 140 GB of training memory when using the Adam optimiser, and state what the three memory components are?
  • Can you describe data parallelism: how the batch is split, how gradients are synchronised, and what all-reduce does?
  • Can you explain the three stages of ZeRO (ZeRO-1, 2, 3) and describe what additional memory is freed at each stage versus standard data parallelism?
  • Can you describe tensor parallelism and explain why it requires fast intra-node communication (NVLink) rather than slower inter-node links?
  • Can you explain what the "pipeline bubble" is in pipeline parallelism and how micro-batching reduces it?
  • Can you describe 3D parallelism and give a concrete example of how DP + TP + PP degrees multiply to determine total GPU count?
  • Can you distinguish between all-reduce, all-gather, and reduce-scatter and state which parallelism strategy uses each?