Tensor + pipeline + expert parallelism

Section 24.2

Tensor + pipeline + expert parallelism

Data parallelism scales batch size but doesn’t address per-rank model size. ZeRO/FSDP shard the optimizer/gradient/parameter state but require ALL-GATHER per layer, capping throughput. For truly large models (200B+) trained at scale, three additional parallelism axes come into play: tensor parallelism (TP) splits each matmul across GPUs; pipeline parallelism (PP) splits the model’s layers into sequential stages; expert parallelism (EP) for MoE places different experts on different GPUs. Combined with DP, this forms a 3D (or 4D with EP) parallelism grid that maps the model onto a cluster. This section walks each axis and their interactions.

Tensor parallelism — splitting matmuls

Megatron-LM (Shoeybi 2019) introduced the canonical tensor parallelism design. The idea: split each large matmul across TP_size GPUs, so each GPU only holds a SLICE of each weight matrix and computes a PARTIAL output.

Tensor parallelism for attention QKV projection: Standard: Y = X · W where W ∈ ℝ^{d × d}, all on one GPU. TP=4: W is split into 4 column slices W_1, W_2, W_3, W_4 each ∈ ℝ^{d × d/4}. Each GPU i computes: Y_i = X · W_i ∈ ℝ^{N × d/4} No communication for this step — X is REPLICATED across the TP group. For FFN's down projection (different split direction): Standard: Y = X · W where W ∈ ℝ^{F × d}. TP=4: X is split into 4 row slices X_1, X_2, ..., X_4 each ∈ ℝ^{N × F/4}. Each GPU i computes: Y_i = X_i · W_i (each GPU has its own slice of W: F/4 × d). Each GPU's Y_i is a PARTIAL SUM. ALL-REDUCE Y across the TP group → final Y. The cleverness: alternate column-split and row-split so that ONE all-reduce per "block of matmuls" produces the correct output. Within a transformer block, typically 2 all-reduces (one after attention, one after FFN).

Tensor parallelism distributed training A parallelism strategy that splits each linear layer's weight matrix across multiple GPUs. Each GPU holds a slice of the weight; computes a partial output; ALL-REDUCE combines partial outputs. Introduced by Megatron-LM (Shoeybi 2019). Per transformer block: 2 all-reduces (after attention, after FFN). Requires HIGH-BANDWIDTH inter-GPU communication — usually within a single node (NVLink) to keep throughput acceptable. Typical TP size: 4 or 8 (limited by node size). is COMMUNICATION-INTENSIVE — each transformer block does 2 all-reduces of size ~N×d per pass. For high throughput, TP must use NVLink-class bandwidth. Going across nodes (InfiniBand) destroys throughput.

TP per-step communication for Llama 3 70B at TP=8, batch=4M tokens: Per block: 2 all-reduces × N × d × 2 bytes = 2 · 8192 · 4096 · 2 = 134 MB per all-reduce. With ring all-reduce over 8 GPUs: 2 · 134 / 8 · 7 = ~234 MB per link per all-reduce. At NVLink 900 GB/s: 0.26 ms per all-reduce. Per block: 2 all-reduces × 0.26 ms = 0.52 ms. Total: 80 blocks × 0.52 ms × 2 (forward + backward) = 83 ms of TP communication per step. vs total step time of ~3-5 seconds: TP comm is ~2-3%. Acceptable on NVLink. If TP were across nodes (InfiniBand): Per all-reduce: 234 MB / 50 GB/s = 4.7 ms. Per step: 80 · 2 · 4.7 · 2 = 1500 ms. That's 30-50% of step time. Unacceptable. This is why TP groups are within a single node.

Pipeline parallelism — splitting layers

While TP splits each layer, pipeline parallelism distributed training A parallelism strategy that splits the model into sequential STAGES, each on a different GPU. A microbatch flows from stage to stage, getting processed at each. Multiple microbatches are pipelined so all stages stay busy. Per-step communication: only the activation tensor between stages (~MB), not the full model. Cheaper per-step communication than TP, but introduces a 'pipeline bubble' — startup/teardown overhead. PP size: typically 4-16. Common combinations: PP × TP × DP for 3D parallelism on frontier-scale training. splits the model into sequential layers groups:

Pipeline parallelism (PP=4 example for an 80-layer model): Stage 0: layers 0-19 on GPUs in pipeline rank 0 Stage 1: layers 20-39 on GPUs in pipeline rank 1 Stage 2: layers 40-59 on GPUs in pipeline rank 2 Stage 3: layers 60-79 on GPUs in pipeline rank 3 Forward pass for one microbatch: GPU 0 processes layers 0-19, sends activations to GPU 1. GPU 1 processes layers 20-39, sends to GPU 2. ... and so on. Naively: GPU 0 finishes in time T/4, then waits while GPUs 1-3 run. Result: each GPU is utilised only 1/4 of the time. Wasted compute. The fix: MICROBATCHING. Split the global batch into K microbatches. Pipeline them: as soon as microbatch 1 leaves stage 0, microbatch 2 enters. At steady state: all stages compute simultaneously on different microbatches. Throughput: still some wasted time at startup (filling pipeline) and teardown (draining). This is the "pipeline bubble."

The pipeline bubble

Pipeline bubble math (1F1B schedule): K microbatches, P pipeline stages, T = time per stage per microbatch. Naive sequential: K · P · T. 1F1B (one forward, one backward) pipeline: Startup: (P-1) · T to fill the pipeline. Steady state: K · T for K forward and backward passes. Teardown: (P-1) · T to drain. Total: (2(P-1) + K) · T. Efficiency = K · T · P / [(2(P-1) + K) · T · P] = K / (2(P-1) + K). For K = 4, P = 4: efficiency = 4 / 10 = 40%. Half the work is bubble. For K = 16, P = 4: efficiency = 16 / 22 = 73%. For K = 32, P = 4: efficiency = 32 / 38 = 84%. Rule: K >> P for efficient pipelining. At Llama 3 70B scale: PP = 4, microbatches = 128. Efficiency = 128 / (6 + 128) = 95%. Bubble is small.

Pipeline bubble PP inefficiency The startup and teardown overhead in pipeline parallelism, where some pipeline stages sit idle while others are still working through their assigned microbatches. The bubble's relative cost is (P-1)/(K + P-1) where P is pipeline stages and K is microbatches. Mitigated by using many microbatches (K >> P), interleaved schedules (1F1B, where forward and backward steps alternate), and zero-bubble pipelines (the most recent technique — overlap forward and backward on different microbatches to eliminate the bubble entirely). is the main efficiency loss in PP. More microbatches = less bubble, but at the cost of activation memory (need to keep K microbatches’ activations alive in memory).

Modern improvements (Qi 2024 “Zero Bubble Pipeline”) split the backward pass into smaller stages and reorder operations to eliminate the bubble almost entirely. At frontier scale (Llama 3, DeepSeek V3), bubble overhead is reduced to ~5% or less.

— think, then check —

Setup:

70B model. TP=8 (within node, NVLink), DP=128 (across nodes, InfiniBand). Total 1024 GPUs.

Each TP group has its own copy of the model (8 GPUs share one model). Each DP rank has a TP group.

Effective per-GPU model size: 70B / 8 = 8.75B params.

TP communication (within each node):

Per transformer block: 2 all-reduces of size ~N · d · 2 bytes = 8K · 4K · 2 = 64 MB each.

Per layer: 2 all-reduces. Per forward + backward: 4 all-reduces per layer.

80 layers × 4 all-reduces × 64 MB = 20 GB of TP traffic per microbatch per pipeline stage.

With ring all-reduce among 8 GPUs (factor 7/8 efficiency): per-GPU per-link traffic = 20 GB · 2 · 7/8 = 35 GB per pass.

At NVLink 900 GB/s: 35 GB / 900 GB/s = 40 ms per pass.

Per training step with 128 microbatches: 128 · 40 ms = 5 seconds of TP comm? That’s a lot.

Actually: TP all-reduces happen WITHIN each microbatch. Per microbatch: 80 layers × 2 all-reduces × 64 MB = 10 GB → 10 ms. Per step with 128 microbatches: 128 · 10 = 1280 ms.

Still significant but ~30% of step time. Modern impl overlaps TP comm with compute, reducing this further.

DP communication (across nodes):

Per step: ONE all-reduce of full gradient. Gradient size: 70B / 8 (TP shard) = 8.75B params → 17.5 GB in bf16.

Ring all-reduce among 128 DP ranks: per-link traffic = 17.5 GB · 2 · 127/128 = 35 GB per link.

At InfiniBand 50 GB/s: 35 GB / 50 GB/s = 700 ms per step.

This is the dominant communication cost. At frontier scale, this is often the bottleneck.

Why this split:

TP within node: needs NVLink bandwidth (900 GB/s) to be tractable; per-block frequent.
DP across nodes: only one all-reduce per STEP (not per block); InfiniBand bandwidth (50 GB/s) is adequate because of the lower frequency.

Putting TP across nodes would multiply traffic on the slow InfiniBand link by ~80 (number of layers) — completely infeasible.

Putting DP within node would force you to either have very large microbatches (memory limit) or under-utilise the NVLink bandwidth for the rare DP all-reduce.

The hierarchical mapping matches the communication hierarchy to the bandwidth hierarchy. This is THE central insight of distributed LLM training infrastructure.

↳ §24.2 + Megatron-LM

Expert parallelism — for MoE

For MoE models (Ch.17), the experts are an additional parallelism axis. Each rank holds a subset of the experts.

Expert parallelism for Mixtral 8x7B (8 experts, top-2 routing): EP=8: each rank holds 1 expert. At each MoE layer: Each token's router picks 2 experts. Tokens are sent (ALL-TO-ALL) to the GPUs holding their assigned experts. Each rank computes its expert on the tokens routed to it. Outputs are returned (ALL-TO-ALL) to their originating GPUs. Communication: 2 all-to-all per MoE layer, each tensor of size ~ (k/N) · token_count. For DeepSeek V3 (256 experts, top-8): all-to-all is heavier but still tractable. Combine with TP, PP, DP: - Attention layers: TP (each GPU holds a slice of attn weights). - MoE layers: EP (each GPU holds a few experts). - Both within a pipeline stage (PP). - DP across nodes/clusters. This is the 4D parallelism used by DeepSeek V3 et al.

The 3D / 4D parallelism grid

For Llama 3 70B at 16K GPUs:

Llama 3 70B parallelism grid (rough configuration): DP = 64 (across batch dimension) PP = 4 (4 pipeline stages) TP = 8 (8-way tensor parallel within each pipeline stage) Total = 64 · 4 · 8 = 2048 GPUs per microbatch instance. For full 16K GPUs: DP = 8 replicas of the above grid = 16384 GPUs. Microbatch count per step: K = 128 (gives ~95% PP efficiency). Effective global batch: 128 microbatches · ~32K tokens/microbatch = 4M tokens. Memory per GPU: Model: 70B / (TP · PP) = 70B / 32 ≈ 2.2B params per GPU = ~4.4 GB. Activations: ~10-15 GB depending on context length and microbatch size. Adam state for assigned params: ~9 GB. Total: ~25-30 GB per GPU. Fits comfortably in 80 GB H100.

— think, then check —

Bubble math (1F1B):

Efficiency = K / (K + 2(P-1)).

P = 8 stages.

For K = 16: efficiency = 16 / (16 + 14) = 16 / 30 = 53%. Half the time is bubble.

For K = 64: efficiency = 64 / (64 + 14) = 64 / 78 = 82%. Bubble is much smaller.

For K = 128: efficiency = 128 / 142 = 90%.

Why more K is better:

More microbatches → less proportional time in the startup/teardown bubble → higher utilisation.

Why you can’t just use K = infinity:

Each microbatch’s activations must be kept ALIVE in memory until that microbatch’s backward pass completes. With K microbatches in flight, you have K × activation_memory bytes pinned.

For a 70B model with batch size 8K tokens per microbatch and ~32 transformer layers’ worth of activations: each microbatch’s activations ≈ 8K · 4096 · 2 · 32 ≈ 2 GB.

With K = 128: 128 · 2 = 256 GB of activation memory per pipeline stage. Too much for a single GPU.

With K = 16: 16 · 2 = 32 GB. Fits in 80 GB H100 with model + Adam state.

The trade-off:

Few K (e.g., 16): less activation memory, low PP efficiency (40-60%).
Many K (e.g., 128): high PP efficiency (90%+), but doesn’t fit in memory.

Optimal K depends on how much activation memory each microbatch consumes vs how much GPU memory is available.

Activation checkpointing helps:

Recompute activations during backward instead of storing them. Reduces activation memory ~4-8× at cost of 25-30% more compute.

Allows higher K with the same memory budget. Llama 3 uses K = 128 with checkpointing for ~95% PP efficiency.

Zero Bubble pipeline (Qi 2024):

Recent technique that splits the backward pass into two stages: gradient w.r.t. inputs (small) and gradient w.r.t. weights (large). Overlap them to eliminate the bubble entirely. Used in some 2024+ training runs.

Even with the bubble: PP is still chosen because it enables training models that wouldn’t fit otherwise. 5-10% efficiency loss is acceptable when the alternative is “can’t train at all.”

↳ §24.2 + pipeline schedules

— think, then check —

Setup:

671B params total, 37B active per token. 256 experts, top-8 routing. 60+ transformer layers.

2K H100s (256 nodes × 8 GPUs/node).

Per-rank memory needs:

Without sharding: 671B × 2 bytes = 1.34 TB of weights. Plus Adam, gradients, activations. Far too much.

Parallelism strategy (one likely configuration):

Expert parallelism (EP): 8-16. Each rank holds 16-32 experts (out of 256). Within a node, 8 GPUs share the 256 experts. EP is constrained to within-node because all-to-all communication is bandwidth-intensive.

Tensor parallelism (TP): 2-4. The attention and per-expert FFN matrices are TP-sharded. Within node alongside EP.

Pipeline parallelism (PP): 4-8. Layers split across pipeline stages. Each stage is a node-group (PP between nodes).

Data parallelism (DP): Whatever fits: DP = 2048 / (PP · TP · EP) = 2048 / 32 ≈ 64.

Effective batch: 64 (DP) × N (microbatches) per step.

Memory check (per rank):

Attention weights (TP=4): per-layer attention params / 4. Small.
FFN / Expert weights (EP=16, 16 experts × ~2.3B / 16 = 2.3B per rank for experts). Plus shared expert.
Total model: ~30-40B params per rank.
In fp16: ~60-80 GB. Tight on 80 GB H100, needs aggressive ZeRO.

Actual config: likely combines all of the above + ZeRO Stage 1 (shard Adam) + activation checkpointing.

Communication hierarchy:

EP all-to-all: intra-node NVLink (per-layer, frequent).
TP all-reduce: intra-node NVLink (per-layer).
PP point-to-point: inter-node InfiniBand (per-microbatch).
DP all-reduce: inter-node InfiniBand (per-step).

The bandwidth-mapping rule:

Higher-frequency communications go on faster links. Per-step DP all-reduce can tolerate InfiniBand; per-layer TP and EP cannot.

DeepSeek’s published configuration (from their paper):

They actually run with EP=8 (intra-node), PP=8 (pipeline across nodes), DP across the remaining. Plus selective activation checkpointing. Their throughput is reportedly ~300 TFLOPs/GPU sustained — quite good for an MoE.

The 3D/4D parallelism grid is a multi-dimensional optimisation problem. Each frontier lab has internal tools to search the configuration space; the answer depends on model architecture, cluster topology, and target latency/throughput.

↳ §24.2 + production training

Next: §24.3 — Context parallelism and the multi-node systems problem. The newest parallelism axis (for very long sequences), and the broader systems picture.