TRAINING AT SCALE
Section 24.1
01

Data parallelism + ZeRO/FSDP

Single-GPU training works up to about 1B parameters. For anything larger, you need to distribute the work across multiple GPUs. The simplest distribution — data parallelism (DP) — replicates the entire model on every GPU and gives each a different chunk of the training batch. Each GPU computes its own gradients; gradients are averaged via all-reduce; each GPU applies the same optimizer update. Beautifully simple, but it requires the model + optimizer state to fit on ONE GPU. ZeRO (Rajbhandari 2020) extends DP by sharding the bookkeeping across ranks, allowing each GPU to hold only 1/N of the optimizer state, gradients, and (optionally) parameters. FSDP (PyTorch’s Fully Sharded Data Parallel) is the production implementation. This section walks both.

Plain data parallelism

The baseline DP setup:

Data parallelism setup: N GPUs (data-parallel ranks). Each rank holds: full model weights + full optimizer state + full gradient buffers. Each rank processes its own MICRO-BATCH (different subset of the global batch). Per training step: 1. Each rank: forward + backward → local gradients. 2. ALL-REDUCE gradients across all N ranks: every rank ends up with the global average gradient. 3. Each rank: AdamW step (using the same gradient → same weight update). 4. Weights stay synchronized. Memory per rank: 4× model size (weights + grads + Adam m, v in fp32). For Llama 3 70B: ~1.1 TB. Doesn't fit on a single H100 (80 GB). This is why plain DP doesn't work for modern LLMs. The optimizer state alone (2× weights for Adam m and v) is too big.

The all-reduce step is the dominant communication. For a 70B model, ~280 GB of gradients per rank must be summed across N ranks every step.

The all-reduce cost

Ring all-reduce (the standard algorithm): N ranks in a logical ring. Total gradient tensor size G bytes. Algorithm: 1. Split G into N chunks of size G/N. 2. SCATTER-REDUCE phase: in N-1 rounds, each rank sends one chunk to its neighbor and receives one. After N-1 rounds, each rank has the fully-reduced version of one chunk. 3. ALL-GATHER phase: in N-1 rounds, ranks share their reduced chunks. After N-1 rounds, every rank has the complete reduced tensor. Per-step traffic per link: 2 · G · (N-1)/N ≈ 2·G bytes (large N). Wall-clock with bandwidth B per link: 2·G / B. Crucially: ring all-reduce is BANDWIDTH-OPTIMAL — every byte of the gradient travels exactly 2 link-hops (down and back up). Cannot do better with any algorithm. For Llama 3 70B (G = 140 GB gradients in fp16): On NVLink (900 GB/s per link): 2 · 140 / 900 = ~0.3 seconds per step (within node). On InfiniBand (50 GB/s per link): 2 · 140 / 50 = ~5.6 seconds (across nodes). The 18× factor is why multi-node DP gets expensive. Communication can dominate compute.

Ring all-reduce is the canonical algorithm. Modern libraries (NCCL for NVIDIA, RCCL for AMD) implement it directly in CUDA / ROCm kernels, taking advantage of NVLink and InfiniBand simultaneously where possible.

— think, then check —

Setup:

1024 GPUs total, organised as 128 nodes × 8 GPUs/node.

Intra-node: NVLink at ~900 GB/s per link.

Inter-node: InfiniBand HDR at ~50 GB/s per link.

Gradient size G = 140 GB.

Ring all-reduce cost:

If the ring traverses ONLY NVLink (impossible at this scale, since you only have 8 GPUs in a node):

Traffic per link: 2·G/N = 2·140/1024 = 273 MB. At 900 GB/s: 0.3 ms per round, × (N-1) rounds ≈ 300 ms per all-reduce.

If the ring traverses ONLY InfiniBand:

Per link: 273 MB. At 50 GB/s: 5.5 ms per round × 1023 rounds ≈ 5.6 seconds. Way too slow.

The hierarchical all-reduce solution:

1. Local all-reduce within each node (over NVLink): Each node’s 8 GPUs ring-reduce their gradients. Takes 2·G·7/8 / 900 GB/s ≈ 270 ms.

2. Cross-node all-reduce (over InfiniBand): One representative per node now holds the node-local sum. 128 representatives all-reduce across nodes. Total traffic per representative: 2·G·127/128 ≈ 2·G = 280 GB. At 50 GB/s: ~5.6 seconds per representative.

3. Local broadcast within node: The representative broadcasts the result back to its 7 peers via NVLink. Fast.

Total: ~270 ms (intra-node) + ~5.6 s (inter-node) + ~30 ms (broadcast) ≈ 6 seconds per step.

Why intra-node vs inter-node matters:

If everything were NVLink: ~300 ms per step. If everything were InfiniBand: ~6 s per step. 20× ratio.

Modern training runs are sized so the model and TP/PP groups fit within nodes (NVLink), and only DP all-reduce crosses node boundaries (InfiniBand). This puts the HIGH-bandwidth ops on the FAST network and the LOW-bandwidth ops on the SLOW one.

If you scale DP across more nodes without proportional InfiniBand upgrades, the cross-node all-reduce time grows quadratically (well, log-linearly with ring all-reduce, but still). Training throughput degrades.

This is why the LLama 3 70B training uses ~16K GPUs (not 64K): the cross-node DP all-reduce becomes the bottleneck above ~16K. Going higher requires hybrid-precision gradients, asynchronous DP, or other techniques.

ZeRO — sharding the bookkeeping

Rajbhandari 2020 “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” introduced the structural fix. The insight: in plain DP, the SAME optimizer state and gradients are HELD ON EVERY RANK. This is redundancy — you don’t need N copies.

ZeRO Stage 1: shard the OPTIMIZER STATE. - Adam's m, v (in fp32) are partitioned across DP ranks. - Each rank holds 1/N of the optimizer state. - Each rank still holds the full weights and full gradients. - At step: all-reduce gradients (normal); then each rank updates ONLY its assigned 1/N of the parameters; then all-gather updated params across ranks. - Memory per rank: weights (1×) + grads (1×) + Adam (2× / N) ≈ 2 + 2/N copies. - For N = 16: 2.125× weight size per rank vs 4× for plain DP. ZeRO Stage 2: shard the GRADIENTS too. - Each rank holds 1/N of gradients (the ones for its assigned params). - All-reduce is replaced with REDUCE-SCATTER (more efficient per-rank work). - Memory: weights (1×) + grads (1×/N) + Adam (2×/N) ≈ 1 + 3/N copies. ZeRO Stage 3: shard the PARAMETERS too. - Each rank holds 1/N of the model weights. - Forward pass: ALL-GATHER all weights just-in-time for each layer; compute; release. - Backward: similar pattern in reverse. - Memory: (1× + 1× + 2×) / N = 4× / N copies per rank. - For N = 16: 0.25× weight size per rank. Llama 3 70B at 70 GB/16 ≈ 4.4 GB per rank (fits comfortably on any H100). Trade-off: Stage 1: minimal communication overhead, modest memory savings. Stage 3: max memory savings, but ALL-GATHER per layer adds communication.

ZeRO’s three stages are increasingly aggressive memory-saving techniques with increasingly higher communication overhead. The choice depends on the model-vs-memory ratio.

FSDP — PyTorch’s incarnation

FSDP is the PyTorch-native implementation of ZeRO Stage 3:

FSDP usage (simplified): import torch.distributed.fsdp as fsdp model = MyLargeModel() model = fsdp.FullyShardedDataParallel( model, sharding_strategy=fsdp.ShardingStrategy.FULL_SHARD, # ZeRO-3 mixed_precision=fsdp.MixedPrecision( param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16, ), cpu_offload=False, # could offload to CPU RAM for even bigger models ) # Now training as normal: for batch in dataloader: loss = model(batch) loss.backward() optimizer.step() # only updates its 1/N of params # Behind the scenes: # forward: all-gather params for each layer, compute, free params. # backward: all-gather params, compute grads, reduce-scatter grads, free. Memory: each rank holds 1/N of model state, plus a small all-gather buffer. Communication: increased by 1× weights per training step (the all-gathers). Throughput: ~10-20% lower than plain DP because of communication, but enables models that wouldn't fit otherwise.
— think, then check —

Setup:

70B model, mixed-precision training:

  • bf16 weights: 140 GB
  • fp32 master weights: 280 GB
  • fp32 gradients: 280 GB
  • fp32 Adam m, v: 560 GB
  • Total: 1260 GB

8× H100 = 8 × 80 GB = 640 GB total HBM.

Without sharding: doesn’t fit (1.26 TB > 640 GB).

Stage 1 (shard Adam m, v):

Per rank: weights (140 GB) + master (280 GB) + grads (280 GB) + Adam/8 (70 GB) = 770 GB / 8 = 96 GB. Doesn’t fit.

Stage 2 (shard grads + Adam):

Per rank: weights (140 GB) + master (280 GB) + (grads + Adam)/8 = 140 + 280 + 105 = 525 GB / 8 = 66 GB. Fits! Tight though.

Stage 3 (shard everything):

Per rank: (weights + master + grads + Adam) / 8 = 1260 / 8 = 158 GB. Fits with lots of room for activations.

Wait, that’s wrong. Let me recompute:

Stage 3 per rank: all four shards = 1260 / 8 = 158 GB. Hmm, that’s more than 80 GB.

Need to be more precise: at Stage 3, the PER-RANK memory is 1/N of EVERYTHING:

1260 / 8 = 157.5 GB. Doesn’t fit on 80 GB H100.

So even Stage 3 needs more than 8 H100s for a 70B model. In practice:

- For 70B on 8 H100s: also use bf16 master (saves 1× weights of memory). Plus CPU offload of less-frequently-used state. Hard.

- For 70B on 16 H100s: ZeRO Stage 3 gives ~79 GB/rank, fits.

- For 70B on 32 H100s: ~40 GB/rank, comfortable.

Typical: 70B fine-tuning uses 16-32 H100s with Stage 3.

Communication cost comparison:

Stage 1: only optimizer step needs comm (all-gather updated params, ~14 GB / step). Minor.

Stage 2: reduce-scatter on gradients (~140 GB / step traffic). About same as plain DP all-reduce.

Stage 3: full all-gather + reduce-scatter per LAYER per forward + backward. For 80 layers, this is significantly more communication. Throughput typically 15-25% lower than Stage 2.

So Stage 2 is optimal when memory permits (no over-communication); Stage 3 is required when you must shrink per-rank memory further (at communication cost).

For 8 H100s training 70B: not really feasible. You’d need either to (a) use 16+ GPUs, (b) reduce model size, or (c) use aggressive offloading (CPU/NVMe). Production training of frontier 70B is on 1024+ GPUs anyway.

When to use what

Choosing your strategy (rough guide): Model fits on 1 GPU (≤ 7B): Plain DP. Maximum throughput. Model fits on 1 GPU with offloading or smaller batch (~13-30B): Plain DP + activation checkpointing + bf16. Often called "DDP" mode. Or ZeRO Stage 1 for slight memory help. Model doesn't fit on 1 GPU but fits on a node (70B on 8× H100): ZeRO Stage 3 / FSDP within the node. DP across nodes if you have multiple nodes for batch parallelism. Model doesn't fit on a node (~200B+): Combine FSDP within nodes + tensor/pipeline parallelism across nodes (§24.2). This is the 3D parallelism setup used at frontier labs.
— think, then check —

Why FSDP dominates:

1. Memory is the binding constraint. For 70B+ models, getting them to fit at all is hard. FSDP makes it possible on commodity 8-16 GPU clusters. The communication overhead (15-25% throughput loss) is acceptable when the alternative is “can’t train at all.”

2. Communication overhead has been optimised. Modern FSDP supports:

  • OVERLAP of all-gather with compute (next layer’s all-gather starts during current layer’s compute).
  • REDUCE_OP fusion (combining multiple reduce ops into one).
  • FORWARD PREFETCH (all-gather params for upcoming layers).

These cut effective overhead to ~10-15% in practice.

3. FSDP is simpler than TP+PP+DP. Tensor parallelism requires modifying model code; pipeline parallelism requires custom scheduling. FSDP is “wrap the model, train as normal.” For most teams (non-frontier labs), the simpler API matters more than the extra ~10% throughput.

CPU offload — the next dimension:

FSDP supports OFFLOADING param shards (or optimizer state) to CPU RAM instead of GPU HBM. The full picture:

  • HBM: full state for the layer being computed RIGHT NOW (after just-in-time all-gather).
  • System RAM: optimizer state for layers not being updated. CPU RAM is cheap (TB scale).
  • NVMe SSD (ZeRO-Infinity): for truly huge state, offload further to disk.

This lets you train EVEN BIGGER models on the same hardware:

  • 1× H100 + 256 GB RAM + FSDP CPU offload: can fine-tune ~40B model. Slow but works.
  • 8× H100 + 2 TB RAM + ZeRO-Infinity: can fine-tune 200B+ models. Very slow but possible.

The trade-off: CPU offload makes the optimizer step slow (move state to GPU for update, then back). For inference, no problem (forward only). For training: ~2-5× slowdown per step.

When CPU offload is worth it:

  • Research / experimentation on a single workstation.
  • Fine-tuning very large models without a giant cluster.
  • QLoRA-style PEFT: 4-bit base + small adapter; FSDP shards the adapter. Memory is way smaller, but offload still helps with optimizer state for the adapter.

What this enables:

The “fine-tune 70B Llama on a single H100” tutorials use a combination of:

  • QLoRA (4-bit base, small fp16 adapter).
  • FSDP-style sharding (across multiple GPUs if available).
  • CPU offload (for Adam state when GPU memory is tight).
  • Gradient checkpointing (recompute activations during backward).

Together: the entire training fits in ~40 GB GPU memory + 100+ GB CPU RAM. Production-grade fine-tuning becomes accessible.

Next: §24.2 — Tensor, pipeline, and expert parallelism. The other parallelism dimensions that combine with DP/FSDP to make trillion-parameter training feasible.