Training at scale

§1 Data parallelism + ZeRO/FSDP
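Data parallelism (DP) is the simplest scaling strategy: replicate the model on N GPUs, give each replica a different micro-batch, and all-reduce the gradients after the backward pass so every replica applies the same update. It works until the model itself doesn't fit on one GPU. ZeRO (Rajbhandari 2020) extends DP by SHARDING the optimizer state, gradients, and parameters across the DP ranks instead of replicating them; FSDP is PyTorch's native implementation of the same idea. Together they let a 70B model train on 8× H100 without tensor parallelism by carefully partitioning the per-rank memory footprint. One caveat: with mixed-precision Adam at roughly 16 bytes per parameter, the fully sharded model state still comes to about 140 GB per rank, so on 80 GB cards this also relies on offloading or reduced-precision optimizer state.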
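
The per-rank footprint can be estimated with a few lines of arithmetic. The sketch below assumes the mixed-precision Adam accounting used in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state); the function name `zero_bytes_per_rank` is illustrative, and activations, communication buffers, and fragmentation are deliberately left out.

```python
def zero_bytes_per_rank(num_params, dp_ranks, stage):
    """Model-state bytes per rank for mixed-precision Adam under ZeRO.

    Assumed accounting (per parameter): 2 bytes fp16 weights, 2 bytes fp16
    gradients, 12 bytes fp32 optimizer state (master weights, momentum,
    variance). Activations and temporary buffers are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:      # stage 1: shard optimizer state
        optim /= dp_ranks
    if stage >= 2:      # stage 2: also shard gradients
        grads /= dp_ranks
    if stage >= 3:      # stage 3 (FSDP full sharding): also shard parameters
        params /= dp_ranks
    return params + grads + optim


# 70B parameters sharded over 8 data-parallel ranks.
for stage in range(4):
    gb = zero_bytes_per_rank(70e9, 8, stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:,.1f} GB of model state per rank")
```

Under these assumptions, stage 3 on 8 ranks leaves 16 × 70e9 / 8 = 140 GB of model state per rank, which is why the byte count per parameter (optimizer precision, offloading) matters as much as the sharding itself.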
§2 Tensor + pipeline + expert parallelism
Beyond data parallelism there are three further, largely orthogonal parallelism dimensions: TENSOR parallel (split each matmul across GPUs), PIPELINE parallel (split the model's layers into sequential stages), and EXPERT parallel (place different MoE experts on different GPUs). At frontier scale these combine with DP: DP × TP × PP forms the classic "3D parallelism" grid, with expert parallelism as an additional axis for MoE models, and together they map the model onto thousands of GPUs.
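
To make the grid concrete, here is a minimal sketch of how a flat GPU rank might be mapped to (DP, PP, TP) coordinates and how the communication groups along each axis fall out. The rank ordering (TP varying fastest, so tensor-parallel peers are adjacent ranks) is an assumption chosen for illustration; real frameworks define their own ordering, and the expert-parallel axis is omitted. The names `rank_to_coords` and `groups` are hypothetical.

```python
from collections import defaultdict


def rank_to_coords(rank, dp, pp, tp):
    """Map a flat rank to (dp_idx, pp_idx, tp_idx), with tp varying fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range for this grid")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def groups(dp, pp, tp, axis):
    """Return the rank groups that communicate along one axis.

    Ranks belong to the same group when they agree on every coordinate
    except the one named by `axis` ('dp', 'pp', or 'tp').
    """
    pos = {"dp": 0, "pp": 1, "tp": 2}[axis]
    buckets = defaultdict(list)
    for rank in range(dp * pp * tp):
        coords = rank_to_coords(rank, dp, pp, tp)
        key = coords[:pos] + coords[pos + 1:]
        buckets[key].append(rank)
    return sorted(buckets.values())


# A toy 2 x 2 x 2 grid over 8 GPUs.
for axis in ("tp", "pp", "dp"):
    print(axis, groups(2, 2, 2, axis))
```

With this ordering, tensor-parallel groups are contiguous rank ranges, which is convenient when the most communication-heavy axis should stay inside a single node.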
§3 Context parallelism + the multi-node systems problem
For ultra-long contexts (1M+ tokens), even attention with FlashAttention runs out of memory and compute on a single GPU. Context parallelism (CP) splits the SEQUENCE dimension across GPUs. Ring Attention (Liu 2023) combines this with FlashAttention's tile-and-stream pattern, passing key/value blocks around a ring of devices while each device keeps its own queries, to make 1M-token training viable. This section also closes Part V by walking through the broader systems-engineering picture of LLM training at scale.
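
The core of the ring pattern can be simulated on one machine. The sketch below is a simplified, single-process illustration, not the Ring Attention implementation: it assumes unmasked (non-causal) single-head attention, updates the running softmax one key at a time rather than one tile at a time, and replaces real device-to-device transfers with list indexing. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V for one query, no masking."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / z
            for c in range(len(values[0]))]


def ring_attention(q_shards, k_shards, v_shards):
    """Simulate P devices: device i keeps q_shards[i]; K/V shards rotate.

    At step t, device i processes the K/V shard that started on device
    (i - t) mod P. Each query keeps a running max, normalizer, and weighted
    sum (online softmax), so no device ever holds the full sequence.
    """
    num_dev = len(q_shards)
    d_v = len(v_shards[0][0])
    state = [[[-math.inf, 0.0, [0.0] * d_v] for _ in shard]
             for shard in q_shards]
    for step in range(num_dev):
        for dev in range(num_dev):
            src = (dev - step) % num_dev  # shard "received" at this step
            for st, q in zip(state[dev], q_shards[dev]):
                scale = 1.0 / math.sqrt(len(q))
                for k, v in zip(k_shards[src], v_shards[src]):
                    s = dot(q, k) * scale
                    m_old, z_old, acc_old = st
                    m_new = max(m_old, s)
                    corr = math.exp(m_old - m_new)  # rescale old partial sums
                    p = math.exp(s - m_new)
                    st[0] = m_new
                    st[1] = z_old * corr + p
                    st[2] = [a * corr + p * vc for a, vc in zip(acc_old, v)]
    return [[[a / z for a in acc] for _, z, acc in dev_state]
            for dev_state in state]


# Toy example: 2 devices, 2 tokens each, head dimension 2.
q_shards = [[[0.1, 0.2], [0.3, -0.1]], [[-0.5, 0.4], [0.0, 1.0]]]
k_shards = [[[1.0, 0.0], [0.5, 0.5]], [[-1.0, 2.0], [0.3, 0.3]]]
v_shards = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
print(ring_attention(q_shards, k_shards, v_shards))
```

Because the online-softmax update is exact, the sharded result matches ordinary attention over the concatenated sequence up to floating-point rounding; in a real system the K/V transfer for the next step would overlap with the computation for the current one.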