The backprop algorithm — tape, topo, memory

Section 9.2

The backprop algorithm — tape, topo, memory

§9.1 gave you the VJP for each op type. §9.2 is the framework that chains them together — the algorithm that every autograd library implements. It’s surprisingly short. Forward pass: run the model, recording each operation onto a tape. Backward pass: walk the tape in reverse, calling each op’s VJP function with the upstream gradient and accumulating the downstream gradient. That’s it. Once you have the tape data structure and the VJPs from §9.1, the rest is bookkeeping. The interesting part is the memory cost: activation memory for a transformer training run typically dwarfs parameter memory by 10–100×, and the standard fix (gradient checkpointing) is the engineering trade that lets you fit 70B parameter models on 80GB GPUs.

The tape and the topological walk

Backprop’s central data structure is the tape — a linear list of operations recorded during the forward pass. Each tape entry is roughly:

struct TapeEntry: op_type : enum { MATMUL, RELU, SOFTMAX_CE, ADD_BIAS, ... } inputs : indices of earlier tape entries this op consumed output_value : the tensor produced (saved for the backward pass) parameters : pointer to params (W, b, …) the op references

When you write y = matmul(W, x), the framework records (MATMUL, [W_idx, x_idx], y, …) onto the tape. When you write h = relu(y), it records (RELU, [y_idx], h, …). By the end of the forward pass, the tape contains the entire computation as a flat sequence.

Computing gradients is then trivial:

backward(loss): grad[loss] = 1 // seed for entry in tape, REVERSED: upstream_grad = grad[entry.output] for each input of entry: downstream_grad = entry.op.vjp(upstream_grad, saved_values) grad[input] += downstream_grad // accumulate (a node may // be consumed by multiple ops)

The reverse iteration over the tape is automatically a reverse topological ordering — every gradient that flows into a node has been accumulated before we use that node’s gradient to compute what flows out. We don’t have to construct a graph algorithm; the tape’s linearity handles it.

— think, then check —

The tape is a linear list of operation records. Each entry stores: the op type (matmul, relu, etc.), references to the input tensors (other tape entries or leaf parameters), and the output tensor produced by the op. It’s built up sequentially during the forward pass as each op executes.

Reverse iteration works because by construction, an op’s output is added to the tape AFTER its inputs were added. So when we walk the tape backwards, we visit every op AFTER we’ve visited all the ops that consume its output. Equivalent to reverse-topological-order on the computation DAG, but trivially so — no graph algorithm needed.

This is why torch.autograd works on dynamically-built computation graphs: every forward pass appends to a tape, and the backward pass runs the tape in reverse, regardless of what the model code did (loops, conditionals, recursion). Pythonic flexibility for free.

↳ §9.2 tape

Where the memory goes

Each tape entry has to save the output value (and sometimes other intermediates) so the backward VJPs can use them. For a 32-layer transformer with hidden dim 4096 and batch size 32 sequence length 2048:

per-layer activation memory ≈ batch × seq_len × hidden_dim × bytes = 32 × 2048 × 4096 × 4 = 1 GB per layer in float32 (= 500 MB in bfloat16) 32 layers × 500 MB = 16 GB of activation memory Parameter memory (for the model itself): ~30B params × 2 bytes (bfloat16) = 60 GB + gradients (60 GB) + optimiser state (120 GB for AdamW) = 240 GB

For a 30B model: parameter + gradient + optimiser memory totals ~240 GB. Activation memory at our sample batch size adds another ~16 GB — significant but not dominant. Bump the batch size or sequence length and activation memory grows linearly. For full-attention sequence lengths > 8K, activation memory becomes the binding constraint before parameter memory does.

This is what people mean when they say “training a 70B model requires 8 GPUs” — it’s not just parameters; it’s parameters + gradients + optimiser state + activations, with activations often forcing the tighter constraint at long sequence length.

— think, then check —

For a 30B param model in bfloat16:

Parameters: 30B × 2 bytes = 60 GB.
Gradients: same shape as params in bfloat16 = 60 GB. (Usually kept in fp32 for accumulation: 120 GB. Modern practice mixes bf16 grads with fp32 master copies.)
Optimiser state (AdamW): 2 × param size in fp32 = 240 GB.
Activations: ~B × seq × hidden × num_layers × 2 bytes. At B=32, seq=2048, hidden=4096, 32 layers: ~16 GB.

Totals: ~440 GB at our sample shapes. Hence ‘training 30B requires 6× 80 GB GPUs.‘

As sequence length grows, activations grow LINEARLY (or quadratically if you count attention’s QKᵀ matrix at the naïve O(N²) cost without FlashAttention — Ch.13). Past seq=16K or so, activations become the binding constraint over parameters/optimizer. This is exactly the problem FlashAttention solves (Ch.13): O(N²) attention memory → O(N) by streaming.

↳ §9.2 activation memory

Gradient checkpointing — the memory/compute trade

If activation memory is the binding constraint, you have two levers: smaller batches (hurts compute utilisation) or gradient checkpointing standard technique Save activations only at certain checkpoint layers, not at every layer. During the backward pass, recompute the missing intermediate activations by re-running the forward pass between checkpoints. Trades extra compute (~33% more forward FLOPs) for dramatic reduction in activation memory (typically 4-10x). Standard in every large-scale training stack. Then → now: 'Checkpointing' or 're-materialization' has been a numerical-method idea since the 1990s. Re-introduced for deep learning by Chen et al. 2016 ('Training Deep Nets with Sublinear Memory Cost'). By 2020 it was standard in every distributed training framework. PyTorch exposes it as torch.utils.checkpoint, JAX as jax.checkpoint, etc. .

The idea: don’t save every layer’s activations. Save only at certain “checkpoint” layers. During the backward pass, when you need an intermediate activation that wasn’t saved, re-run the forward pass between the nearest two checkpoints to recompute it.

Memory cost without checkpointing: O(num_layers × per_layer_activation) Memory cost WITH checkpointing: O(√num_layers × per_layer_activation) Compute cost without checkpointing: O(num_layers × per_layer_forward) Compute cost WITH checkpointing: O(num_layers × per_layer_forward) × (1 + 1/3) (an extra ~33% from recomputed intermediates)

In practice this saves 5–10× on activation memory for ~33% additional compute — a huge win for fitting larger models. (Chen et al. 2016, “Training Deep Nets with Sublinear Memory Cost,” arXiv:1604.06174.)

The pattern of “trade extra compute for less memory” runs through every large-scale training stack. Gradient checkpointing (this section). FlashAttention (Ch.13, recompute attention rather than materialise it). ZeRO / FSDP partitioning (Ch.23, parameter copies sharded across ranks; recomputed/re-fetched as needed). Every modern training framework — PyTorch FSDP, Megatron, DeepSpeed — implements all three. Memory is the scarce resource at the frontier scale; compute is comparatively cheap. The trades are everywhere.

Saving activations — what each op actually keeps

Each op type holds different things on the tape:

Op	Saves for backward	Why
`matmul(W, x)`	both `W` and `x`	VJP needs `x` for `dW = v · xᵀ` and `W` for `dx = Wᵀ · v`
`relu(x)`	the mask (`x > 0`) — 1 bit per element	VJP zeros out gradient where input was non-positive
`softmax_ce(z, y)`	the softmax output `p`	VJP is `p − y` (no z needed!)
`layernorm(x)`	normalised output + mean/variance scalars	VJP needs normalisation statistics
`attention(Q, K, V)`	output, attention weights (or just `O` for FlashAttention)	VJP through softmax + matmul chain

The activation memory cost depends on the op. Matmul’s saved x is the big one for transformer trunks. ReLU is cheap (one bit per element if you pack the mask). Softmax+CE saves the probability distribution, ~vocab-size per token.

This is why selective activation saving — checkpointing only specific layers — is so effective. The matmul activations are the expensive ones; ReLU activations are essentially free. Targeted checkpointing can save 80% of memory at ~10% compute cost by only re-materialising the matmul outputs.

— think, then check —

At frontier scale, memory is binding and compute is comparatively abundant. A 70B model in bfloat16 needs ~140 GB just for parameters and gradients. Optimiser state (AdamW) adds another ~280 GB in fp32. Activations at long sequence length can easily add another ~100+ GB. We can’t easily make GPUs bigger, but we can spend more FLOPs.

33% extra forward compute means 33% slower steps. 10× less activation memory means we can fit batches 10× bigger, OR longer sequences, OR train a 70B model on 8 GPUs instead of 16. The ratio of memory savings to compute cost is asymmetric — and at frontier scale the memory side is the binding constraint, so the trade is almost always worth it.

Interaction with FlashAttention: attention’s memory cost is O(N²) naïvely (the full QKᵀ score matrix has N² entries). FlashAttention computes attention block-by-block, never materialising the full score matrix, dropping memory cost to O(N). This is EXACTLY the same “recompute instead of store” trade — at the level of one op rather than across layers. So FlashAttention + gradient checkpointing combine multiplicatively: both reduce activation memory; both pay extra FLOPs; both are essential for long-context training. Modern training stacks (Megatron-LM, FSDP) use both by default.

↳ §9.2 checkpointing + Ch.1 §1.2 dot product

END OF CH.9 §2 — The backprop algorithm.
Three recall items: easy (tape structure), medium (memory budget for a 30B training run), hard (gradient checkpointing trade and its synergy with FlashAttention).
Coming next: §9.3 — Vector autograd in 250 lines. Real C code; train a 2-layer MLP from scratch using everything from Ch.4 §3 + Ch.8 §3 + this chapter’s VJPs.