Data parallelism + ZeRO/FSDP

Section 24.1

Data parallelism + ZeRO/FSDP

Single-GPU training works up to about 1B parameters. For anything larger, you need to distribute the work across multiple GPUs. The simplest distribution — data parallelism (DP) — replicates the entire model on every GPU and gives each a different chunk of the training batch. Each GPU computes its own gradients; gradients are averaged via all-reduce; each GPU applies the same optimizer update. Beautifully simple, but it requires the model + optimizer state to fit on ONE GPU. ZeRO (Rajbhandari 2020) extends DP by sharding the bookkeeping across ranks, allowing each GPU to hold only 1/N of the optimizer state, gradients, and (optionally) parameters. FSDP (PyTorch’s Fully Sharded Data Parallel) is the production implementation. This section walks both.

Plain data parallelism

The baseline DP setup:

Data parallelism setup: N GPUs (data-parallel ranks). Each rank holds: full model weights + full optimizer state + full gradient buffers. Each rank processes its own MICRO-BATCH (different subset of the global batch). Per training step: 1. Each rank: forward + backward → local gradients. 2. ALL-REDUCE gradients across all N ranks: every rank ends up with the global average gradient. 3. Each rank: AdamW step (using the same gradient → same weight update). 4. Weights stay synchronized. Memory per rank: 4× model size (weights + grads + Adam m, v in fp32). For Llama 3 70B: ~1.1 TB. Doesn't fit on a single H100 (80 GB). This is why plain DP doesn't work for modern LLMs. The optimizer state alone (2× weights for Adam m and v) is too big.

The all-reduce step is the dominant communication. For a 70B model, ~280 GB of gradients per rank must be summed across N ranks every step.

The all-reduce cost

Ring all-reduce (the standard algorithm): N ranks in a logical ring. Total gradient tensor size G bytes. Algorithm: 1. Split G into N chunks of size G/N. 2. SCATTER-REDUCE phase: in N-1 rounds, each rank sends one chunk to its neighbor and receives one. After N-1 rounds, each rank has the fully-reduced version of one chunk. 3. ALL-GATHER phase: in N-1 rounds, ranks share their reduced chunks. After N-1 rounds, every rank has the complete reduced tensor. Per-step traffic per link: 2 · G · (N-1)/N ≈ 2·G bytes (large N). Wall-clock with bandwidth B per link: 2·G / B. Crucially: ring all-reduce is BANDWIDTH-OPTIMAL — every byte of the gradient travels exactly 2 link-hops (down and back up). Cannot do better with any algorithm. For Llama 3 70B (G = 140 GB gradients in fp16): On NVLink (900 GB/s per link): 2 · 140 / 900 = ~0.3 seconds per step (within node). On InfiniBand (50 GB/s per link): 2 · 140 / 50 = ~5.6 seconds (across nodes). The 18× factor is why multi-node DP gets expensive. Communication can dominate compute.

Ring all-reduce distributed primitive A bandwidth-optimal all-reduce algorithm that arranges N ranks in a logical ring. Each rank sends and receives one chunk per round; after N-1 rounds of scatter-reduce + N-1 rounds of all-gather, every rank has the reduced version of the full tensor. Per-step traffic per link: ~2·G/N bytes. Total traffic per rank: ~2·G·(N-1)/N. Optimal because every byte must traverse N-1 link-hops to be reduced and N-1 link-hops to be distributed back, totaling 2(N-1)/N → 2 hops per byte at large N. Used by NCCL, MPI, every modern distributed training stack. is the canonical algorithm. Modern libraries (NCCL for NVIDIA, RCCL for AMD) implement it directly in CUDA / ROCm kernels, taking advantage of NVLink and InfiniBand simultaneously where possible.

— think, then check —

Setup:

1024 GPUs total, organised as 128 nodes × 8 GPUs/node.

Intra-node: NVLink at ~900 GB/s per link.

Inter-node: InfiniBand HDR at ~50 GB/s per link.

Gradient size G = 140 GB.

Ring all-reduce cost:

If the ring traverses ONLY NVLink (impossible at this scale, since you only have 8 GPUs in a node):

Traffic per link: 2·G/N = 2·140/1024 = 273 MB. At 900 GB/s: 0.3 ms per round, × (N-1) rounds ≈ 300 ms per all-reduce.

If the ring traverses ONLY InfiniBand:

Per link: 273 MB. At 50 GB/s: 5.5 ms per round × 1023 rounds ≈ 5.6 seconds. Way too slow.

The hierarchical all-reduce solution:

1. Local all-reduce within each node (over NVLink): Each node’s 8 GPUs ring-reduce their gradients. Takes 2·G·7/8 / 900 GB/s ≈ 270 ms.

2. Cross-node all-reduce (over InfiniBand): One representative per node now holds the node-local sum. 128 representatives all-reduce across nodes. Total traffic per representative: 2·G·127/128 ≈ 2·G = 280 GB. At 50 GB/s: ~5.6 seconds per representative.

3. Local broadcast within node: The representative broadcasts the result back to its 7 peers via NVLink. Fast.

Total: ~270 ms (intra-node) + ~5.6 s (inter-node) + ~30 ms (broadcast) ≈ 6 seconds per step.

Why intra-node vs inter-node matters:

If everything were NVLink: ~300 ms per step. If everything were InfiniBand: ~6 s per step. 20× ratio.

Modern training runs are sized so the model and TP/PP groups fit within nodes (NVLink), and only DP all-reduce crosses node boundaries (InfiniBand). This puts the HIGH-bandwidth ops on the FAST network and the LOW-bandwidth ops on the SLOW one.

If you scale DP across more nodes without proportional InfiniBand upgrades, the cross-node all-reduce time grows quadratically (well, log-linearly with ring all-reduce, but still). Training throughput degrades.

This is why the LLama 3 70B training uses ~16K GPUs (not 64K): the cross-node DP all-reduce becomes the bottleneck above ~16K. Going higher requires hybrid-precision gradients, asynchronous DP, or other techniques.

↳ §24.1 + ring all-reduce

ZeRO — sharding the bookkeeping

Rajbhandari 2020 “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” introduced the structural fix. The insight: in plain DP, the SAME optimizer state and gradients are HELD ON EVERY RANK. This is redundancy — you don’t need N copies.

ZeRO Stage 1: shard the OPTIMIZER STATE. - Adam's m, v (in fp32) are partitioned across DP ranks. - Each rank holds 1/N of the optimizer state. - Each rank still holds the full weights and full gradients. - At step: all-reduce gradients (normal); then each rank updates ONLY its assigned 1/N of the parameters; then all-gather updated params across ranks. - Memory per rank: weights (1×) + grads (1×) + Adam (2× / N) ≈ 2 + 2/N copies. - For N = 16: 2.125× weight size per rank vs 4× for plain DP. ZeRO Stage 2: shard the GRADIENTS too. - Each rank holds 1/N of gradients (the ones for its assigned params). - All-reduce is replaced with REDUCE-SCATTER (more efficient per-rank work). - Memory: weights (1×) + grads (1×/N) + Adam (2×/N) ≈ 1 + 3/N copies. ZeRO Stage 3: shard the PARAMETERS too. - Each rank holds 1/N of the model weights. - Forward pass: ALL-GATHER all weights just-in-time for each layer; compute; release. - Backward: similar pattern in reverse. - Memory: (1× + 1× + 2×) / N = 4× / N copies per rank. - For N = 16: 0.25× weight size per rank. Llama 3 70B at 70 GB/16 ≈ 4.4 GB per rank (fits comfortably on any H100). Trade-off: Stage 1: minimal communication overhead, modest memory savings. Stage 3: max memory savings, but ALL-GATHER per layer adds communication.

ZeRO distributed training A family of memory optimization techniques for data-parallel training (Rajbhandari 2020). Stage 1 shards the optimizer state (Adam's m, v) across DP ranks. Stage 2 also shards gradients. Stage 3 also shards parameters. At Stage 3, each rank holds only 1/N of the model's weights, gradients, and optimizer state — but must ALL-GATHER weights just-in-time for each layer's forward pass. Trades increased communication for dramatically reduced per-rank memory. The PyTorch implementation is FSDP; the DeepSpeed implementation is the original ZeRO. ’s three stages are increasingly aggressive memory-saving techniques with increasingly higher communication overhead. The choice depends on the model-vs-memory ratio.

FSDP — PyTorch’s incarnation

FSDP PyTorch distributed PyTorch's implementation of ZeRO Stage 3 sharding (renamed Fully Sharded Data Parallel; introduced 2022). Each rank holds 1/N of the model's parameters, gradients, and optimizer state. Forward and backward pass require ALL-GATHER of full layer weights just-in-time. Now the standard PyTorch tool for training models larger than a single GPU's memory. Supersedes the older DistributedDataParallel (DDP) for large-model training. is the PyTorch-native implementation of ZeRO Stage 3:

FSDP usage (simplified): import torch.distributed.fsdp as fsdp model = MyLargeModel() model = fsdp.FullyShardedDataParallel( model, sharding_strategy=fsdp.ShardingStrategy.FULL_SHARD, # ZeRO-3 mixed_precision=fsdp.MixedPrecision( param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16, ), cpu_offload=False, # could offload to CPU RAM for even bigger models ) # Now training as normal: for batch in dataloader: loss = model(batch) loss.backward() optimizer.step() # only updates its 1/N of params # Behind the scenes: # forward: all-gather params for each layer, compute, free params. # backward: all-gather params, compute grads, reduce-scatter grads, free. Memory: each rank holds 1/N of model state, plus a small all-gather buffer. Communication: increased by 1× weights per training step (the all-gathers). Throughput: ~10-20% lower than plain DP because of communication, but enables models that wouldn't fit otherwise.

— think, then check —

Setup:

70B model, mixed-precision training:

bf16 weights: 140 GB
fp32 master weights: 280 GB
fp32 gradients: 280 GB
fp32 Adam m, v: 560 GB
Total: 1260 GB

8× H100 = 8 × 80 GB = 640 GB total HBM.

Without sharding: doesn’t fit (1.26 TB > 640 GB).

Stage 1 (shard Adam m, v):

Per rank: weights (140 GB) + master (280 GB) + grads (280 GB) + Adam/8 (70 GB) = 770 GB / 8 = 96 GB. Doesn’t fit.

Stage 2 (shard grads + Adam):

Per rank: weights (140 GB) + master (280 GB) + (grads + Adam)/8 = 140 + 280 + 105 = 525 GB / 8 = 66 GB. Fits! Tight though.

Stage 3 (shard everything):

Per rank: (weights + master + grads + Adam) / 8 = 1260 / 8 = 158 GB. Fits with lots of room for activations.

Wait, that’s wrong. Let me recompute:

Stage 3 per rank: all four shards = 1260 / 8 = 158 GB. Hmm, that’s more than 80 GB.

Need to be more precise: at Stage 3, the PER-RANK memory is 1/N of EVERYTHING:

1260 / 8 = 157.5 GB. Doesn’t fit on 80 GB H100.

So even Stage 3 needs more than 8 H100s for a 70B model. In practice:

- For 70B on 8 H100s: also use bf16 master (saves 1× weights of memory). Plus CPU offload of less-frequently-used state. Hard.

- For 70B on 16 H100s: ZeRO Stage 3 gives ~79 GB/rank, fits.

- For 70B on 32 H100s: ~40 GB/rank, comfortable.

Typical: 70B fine-tuning uses 16-32 H100s with Stage 3.

Communication cost comparison:

Stage 1: only optimizer step needs comm (all-gather updated params, ~14 GB / step). Minor.

Stage 2: reduce-scatter on gradients (~140 GB / step traffic). About same as plain DP all-reduce.

Stage 3: full all-gather + reduce-scatter per LAYER per forward + backward. For 80 layers, this is significantly more communication. Throughput typically 15-25% lower than Stage 2.

So Stage 2 is optimal when memory permits (no over-communication); Stage 3 is required when you must shrink per-rank memory further (at communication cost).

For 8 H100s training 70B: not really feasible. You’d need either to (a) use 16+ GPUs, (b) reduce model size, or (c) use aggressive offloading (CPU/NVMe). Production training of frontier 70B is on 1024+ GPUs anyway.

↳ §24.1 + ZeRO paper

When to use what

Choosing your strategy (rough guide): Model fits on 1 GPU (≤ 7B): Plain DP. Maximum throughput. Model fits on 1 GPU with offloading or smaller batch (~13-30B): Plain DP + activation checkpointing + bf16. Often called "DDP" mode. Or ZeRO Stage 1 for slight memory help. Model doesn't fit on 1 GPU but fits on a node (70B on 8× H100): ZeRO Stage 3 / FSDP within the node. DP across nodes if you have multiple nodes for batch parallelism. Model doesn't fit on a node (~200B+): Combine FSDP within nodes + tensor/pipeline parallelism across nodes (§24.2). This is the 3D parallelism setup used at frontier labs.

— think, then check —

Why FSDP dominates:

1. Memory is the binding constraint. For 70B+ models, getting them to fit at all is hard. FSDP makes it possible on commodity 8-16 GPU clusters. The communication overhead (15-25% throughput loss) is acceptable when the alternative is “can’t train at all.”

2. Communication overhead has been optimised. Modern FSDP supports:

OVERLAP of all-gather with compute (next layer’s all-gather starts during current layer’s compute).
REDUCE_OP fusion (combining multiple reduce ops into one).
FORWARD PREFETCH (all-gather params for upcoming layers).

These cut effective overhead to ~10-15% in practice.

3. FSDP is simpler than TP+PP+DP. Tensor parallelism requires modifying model code; pipeline parallelism requires custom scheduling. FSDP is “wrap the model, train as normal.” For most teams (non-frontier labs), the simpler API matters more than the extra ~10% throughput.

CPU offload — the next dimension:

FSDP supports OFFLOADING param shards (or optimizer state) to CPU RAM instead of GPU HBM. The full picture:

HBM: full state for the layer being computed RIGHT NOW (after just-in-time all-gather).
System RAM: optimizer state for layers not being updated. CPU RAM is cheap (TB scale).
NVMe SSD (ZeRO-Infinity): for truly huge state, offload further to disk.

This lets you train EVEN BIGGER models on the same hardware:

1× H100 + 256 GB RAM + FSDP CPU offload: can fine-tune ~40B model. Slow but works.
8× H100 + 2 TB RAM + ZeRO-Infinity: can fine-tune 200B+ models. Very slow but possible.

The trade-off: CPU offload makes the optimizer step slow (move state to GPU for update, then back). For inference, no problem (forward only). For training: ~2-5× slowdown per step.

When CPU offload is worth it:

Research / experimentation on a single workstation.
Fine-tuning very large models without a giant cluster.
QLoRA-style PEFT: 4-bit base + small adapter; FSDP shards the adapter. Memory is way smaller, but offload still helps with optimizer state for the adapter.

What this enables:

The “fine-tune 70B Llama on a single H100” tutorials use a combination of:

QLoRA (4-bit base, small fp16 adapter).
FSDP-style sharding (across multiple GPUs if available).
CPU offload (for Adam state when GPU memory is tight).
Gradient checkpointing (recompute activations during backward).

Together: the entire training fits in ~40 GB GPU memory + 100+ GB CPU RAM. Production-grade fine-tuning becomes accessible.

↳ §24.1 + FSDP docs + production training

Next: §24.2 — Tensor, pipeline, and expert parallelism. The other parallelism dimensions that combine with DP/FSDP to make trillion-parameter training feasible.