The residual stream — why deep networks train
Before 2015, plain deep networks much past about 20 layers were very hard to train. Activations would either explode or vanish as they passed through layer after layer of nonlinear transformations; gradients would do the same on the way back. He et al. (2015), “Deep Residual Learning for Image Recognition,” fixed this with a single architectural change: add the input back to the output of each block. The skip connection seemed almost too simple to matter; it ended up being the change that made very deep networks, and with them most of modern deep learning, practical. In transformers this skip path becomes a central highway, the residual stream, that every attention block and every FFN block reads from and writes back to. This section covers (1) why residuals fix the vanishing-gradient problem, (2) the residual-stream framing that powers mechanistic interpretability, and (3) the closing arithmetic on what makes Part III’s full transformer block train.
The residual fix — a skip path through every block
For a block computing some nonlinear transformation F, the residual block computes x_out = x_in + F(x_in).
The two key things to internalise:
- The skip connection is the identity function — it adds the unmodified input. Not “approximately the input through a small layer,” literally x_in. This is what the gradient analysis below depends on.
- The sublayer output is added, not multiplied. If the sublayer learns to output zero, the block becomes the identity. This means the network can “skip” blocks it doesn’t need — and at initialisation, sublayer outputs start small, so the network is initially close to the identity and can learn to use blocks gradually.
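The two points above can be checked with a minimal sketch in plain Python. The names `residual_block`, `zero_sublayer`, and `small_sublayer` are hypothetical, and `small_sublayer` is only a stand-in for a freshly initialised sublayer, not a real attention or FFN layer.

```python
def residual_block(x_in, sublayer):
    """Return x_in + sublayer(x_in), elementwise, for a vector given as a list."""
    delta = sublayer(x_in)
    return [x + d for x, d in zip(x_in, delta)]


def zero_sublayer(x):
    # A sublayer that outputs zero: the block reduces to the identity.
    return [0.0 for _ in x]


def small_sublayer(x, scale=0.01):
    # Hypothetical stand-in for a freshly initialised sublayer: small outputs.
    return [scale * (v * v - 1.0) for v in x]


x_in = [0.5, -1.0, 2.0]
same = residual_block(x_in, zero_sublayer)    # exactly the input
near = residual_block(x_in, small_sublayer)   # close to the input
max_shift = max(abs(a - b) for a, b in zip(near, x_in))
print(same, near, max_shift)
```

With a zero sublayer the block returns its input unchanged, and with a small sublayer the output stays within a small distance of the input, which is the "close to the identity at initialisation" behaviour described above.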
Residual connections are the load-bearing architectural choice that makes everything else in modern deep learning possible. Worth understanding precisely why.
The gradient argument
For a stack of L residual blocks, each block computes y_l = y_(l−1) + F_l(y_(l−1)) with y_0 = x_0, so the network output is y = y_L = x_0 + Σ_l F_l(y_(l−1)).
The residual identity path is the structural fix. Whatever the sublayers do, the gradient ALWAYS has a direct path back to x_0: the identity matrix from the chain of skip connections. The product structure (which causes vanishing/exploding) becomes a SUM structure whose identity term cannot vanish.
Setup: non-residual: y = F_L(F_(L−1)(…F_1(x_0)…)). Residual: y = x_0 + F_1(x_0) + F_2(…) + … + F_L(…).
Non-residual gradient: ∂y/∂x_0 = product of L Jacobians ∂F_l/∂y_(l−1).
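If each Jacobian has spectral norm σ ≠ 1, the norm of the product is at most σ^L and, when the Jacobians act along similar directions, scales like σ^L. For σ = 0.9: at L = 50, σ^L ≈ 0.005, so the gradient is down by about 200×. For σ = 1.1: at L = 50, σ^L ≈ 117, so the gradient blows up. In either direction, deep networks become very hard to train.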
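The arithmetic is easy to reproduce. The helper name `compounded_norm` is hypothetical; it simply evaluates σ^L, the worst-case scale of a product of L Jacobians that each have spectral norm σ.

```python
def compounded_norm(sigma, depth):
    """Worst-case scale of a product of `depth` Jacobians of spectral norm `sigma`."""
    return sigma ** depth


shrink = compounded_norm(0.9, 50)   # about 0.005
grow = compounded_norm(1.1, 50)     # about 117

# How quickly the two regimes diverge as depth increases.
for depth in (10, 20, 50, 100):
    print(depth, compounded_norm(0.9, depth), compounded_norm(1.1, depth))
```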
Residual gradient: ∂y/∂x_0 = I + (sum of Jacobian-product terms).
That identity matrix is a constant — it doesn’t depend on the F_l’s. Even if every sublayer Jacobian product collapses to zero, the gradient is still I (bounded, full-rank). Even if some Jacobian products explode, they’re ADDED to I rather than multiplied — they can be regularised; they can’t kill the gradient signal.
The structural reason: the math went from a product (which has multiplicative cumulative error) to a sum (which has only additive cumulative error). For depths > 12-20, the product structure was fatal. The sum structure scales to arbitrary depth.
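A small numerical sketch of the product-versus-sum argument, in plain Python. The 2×2 matrix `J` is a made-up sublayer Jacobian with spectral norm 0.01, reused for every layer purely for illustration. The non-residual gradient is the product of the J's; the residual gradient is the product of the (I + J)'s, which expands to I plus sums of Jacobian products.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def frobenius(m):
    """Frobenius norm of a matrix given as nested lists."""
    return sum(v * v for row in m for v in row) ** 0.5


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical sublayer Jacobian with a small spectral norm (0.01).
J = [[0.0, 0.01], [-0.01, 0.0]]
I_PLUS_J = [[IDENTITY[i][j] + J[i][j] for j in range(2)] for i in range(2)]

depth = 50
plain = IDENTITY
residual = IDENTITY
for _ in range(depth):
    plain = matmul(J, plain)                # non-residual: product of Jacobians
    residual = matmul(I_PLUS_J, residual)   # residual: product of (I + J)

print(frobenius(plain), frobenius(residual))
```

In this toy setting the plain product collapses to a vanishingly small matrix, while the residual product keeps a norm of order one because the identity term survives at every layer.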
This is also why “pre-norm” matters (§14.2): in post-norm, the LN wraps the residual addition, so the gradient on the skip path passes through the LN Jacobian at every block, which rescales it by roughly γ/σ and projects out the mean direction. The pure identity is broken. In pre-norm, the LN is INSIDE the sublayer path; the residual remains a pure identity. Gradient flow is preserved.
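The difference between the two placements can be seen with a sketch that zeroes out the sublayer, so only the skip path remains. This assumes a LayerNorm without learned gain or bias; all function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def zero_sublayer(x):
    # Sublayer that contributes nothing, to isolate the skip path.
    return [0.0 for _ in x]


def pre_norm_block(x, sublayer):
    # x + F(LN(x)): normalisation sits inside the sublayer branch.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x, sublayer):
    # LN(x + F(x)): normalisation wraps the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


x = [1.0, 2.0, 4.0]
x_scaled = [10.0 * v for v in x]

pre_out = pre_norm_block(x_scaled, zero_sublayer)   # x_scaled passes through unchanged
post_a = post_norm_block(x, zero_sublayer)
post_b = post_norm_block(x_scaled, zero_sublayer)   # nearly identical to post_a
print(pre_out, post_a, post_b)
```

In pre-norm the skip path returns its input exactly. In post-norm, scaling the input by 10 leaves the output essentially unchanged, so the map along the skip path is not the identity and its Jacobian is not I.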
The residual stream as a first-class object
A more recent way of thinking about the residual stream, popularised by Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits” (Anthropic), is to treat it as the central object of the transformer rather than as a side effect of residual connections.
The residual stream is the natural object to reason about in two contexts:
- Training stability. As above — the residual path is what gives gradients a clean identity flow.
- Mechanistic interpretability. Each block’s contribution is additive, so you can decompose the residual stream into “what each block wrote into it.” This decomposition is the basis for circuit discovery (which heads attend to what; which FFN neurons fire for which patterns; how information flows across layers).
The viz shows the residual stream’s norm growing across depth. With pre-norm and random sublayer outputs (each contributing a δ of small norm σ in roughly a random direction), the running norm satisfies ‖r_L‖² ≈ ‖r_0‖² + L · σ², a random walk in d-dimensional space. For typical L = 32, σ = 0.2, and taking ‖r_0‖ = 1 for concreteness: ‖r_L‖² ≈ 1 + 32 · 0.04 = 2.28, so the norm grows by ~1.51×, which is corrected by a final LN before the output projection.
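A quick simulation of the random-walk estimate, in plain Python. It assumes ‖r_0‖ = 1, d = 1024, and independent Gaussian write directions; none of these come from the text, and the function names are hypothetical. The formula gives ‖r_L‖ ≈ sqrt(1 + 32 · 0.2²) = sqrt(2.28) ≈ 1.51 for L = 32 and σ = 0.2.

```python
import math
import random


def random_direction(dim, norm, rng):
    """Gaussian vector rescaled to the requested Euclidean norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = norm / math.sqrt(sum(c * c for c in v))
    return [c * scale for c in v]


def simulate_stream_norm(dim=1024, depth=32, sigma=0.2, seed=0):
    """Norm of the stream after `depth` additive writes of norm `sigma`."""
    rng = random.Random(seed)
    r = random_direction(dim, 1.0, rng)          # assumed ||r_0|| = 1
    for _ in range(depth):
        delta = random_direction(dim, sigma, rng)
        r = [a + b for a, b in zip(r, delta)]    # additive write into the stream
    return math.sqrt(sum(c * c for c in r))


predicted = math.sqrt(1.0 + 32 * 0.2 ** 2)   # sqrt(2.28), about 1.51
simulated = simulate_stream_norm()
print(predicted, simulated)
```

In high dimensions random directions are nearly orthogonal, so the cross terms are small and the simulated norm should land close to the prediction. Trained sublayers do not write in random directions, so this is only a baseline.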
The “residual stream as a communication channel” framing is doing real work in mechanistic interpretability research. Sublayers that learn similar features write similar things into the stream; sublayers that learn complementary features write orthogonal things. The dimensionality of the residual stream (d_model = 8192 in Llama 2 70B) limits how many linearly independent directions the network can carry at once, yet recent work (Bricken et al. 2023, Anthropic) suggests the network encodes vastly more features than it has dimensions via “superposition” (dictionary-style sparse codes). This is the cutting edge of “what is the network actually doing”; Ch.26 will cover this when we get to current research.
The framing: instead of “input → block1 → block2 → … → block_L → output,” think of the transformer as a single shared vector r ∈ ℝ^d that persists from the embedding to the final layer norm. Every attention head and every FFN layer reads r (via LayerNorm), computes a δ, and adds it back to r. Nothing is replaced; everything is additive.
What this framing makes clear:
- All blocks “see” each other. What block 3 writes into the stream is visible to blocks 4, 5, …, L. The transformer is best understood as a system of cooperating writers/readers, not a feedforward chain.
- Capacity is the residual stream dimension. Only d_model linearly independent directions can co-exist at any depth. This is a hard architectural constraint; carrying more features than that requires them to share directions (see superposition below).
- Decomposability. Since contributions are additive, r_L = r_0 + (sum of all block outputs). You can attribute the final r_L to specific blocks (“this output was caused 23% by attention head 5 of layer 7”) in a way that’s impossible in a feedforward chain.
- Sparsity / superposition. Recent interpretability work (Anthropic 2023+) suggests the residual stream uses a “sparse code”: far more than d_model semantic features, each assigned its own direction in the d_model-dimensional space, with the directions only approximately orthogonal and only a few features active at a time. The residual-stream view is how this decomposition is formulated mathematically.
- Gradient flow. Pre-norm ensures the residual stream has identity gradient paths back through every block. That’s why deep networks train.
What it doesn’t make clear: the framing emphasises additive structure but downplays the fact that LN normalises before each read, which is a non-additive operation. And it treats blocks as independent contributors, which is technically wrong (block N’s contribution depends on what block N-1 wrote, since it reads from the current stream). Both are minor; the framing is mostly a useful lens, not a replacement for the math.
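The additive decomposition can be sketched in a few lines of plain Python. Everything here is made up for illustration: a 3-dimensional stream, three block writes, and a `readout` direction standing in for one row of an unembedding matrix. The sketch also ignores the final LayerNorm, which is exactly the non-additive step the caveat above points at.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 3-dimensional stream: the embedding plus three block writes.
r0 = [1.0, 0.0, 0.0]
writes = {
    "attn_1": [0.0, 0.5, 0.0],
    "ffn_1": [0.25, 0.0, 0.0],
    "attn_2": [0.0, 0.25, 1.0],
}

# r_L is the embedding plus the sum of everything written.
r_final = list(r0)
for delta in writes.values():
    r_final = [a + b for a, b in zip(r_final, delta)]

# Readout direction, e.g. one row of an unembedding matrix (made up here).
readout = [1.0, 2.0, 0.0]

# Because the stream is a sum, the readout splits into per-writer terms.
contributions = {"embedding": dot(readout, r0)}
for name, delta in writes.items():
    contributions[name] = dot(readout, delta)

total = dot(readout, r_final)
for name, value in contributions.items():
    print(f"{name}: {value:.2f} ({100 * value / total:.0f}% of {total:.2f})")
```

The per-writer terms sum exactly to the total readout. This is attribution of what was written, not of causes: each write was itself computed from the stream left by earlier blocks.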
Closing the loop on Part III
We now have every piece of a transformer block: r ← r + Attention(LN(r)), followed by r ← r + FFN(LN(r)).
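A minimal sketch of the pre-norm wiring in plain Python follows. The sublayers are deliberately toy placeholders: `toy_attention` is a uniform causal average rather than learned attention, and `toy_ffn` is a scaled ReLU rather than a two-layer MLP. `rms_norm` omits the learned gain. All names are hypothetical. What the sketch does show faithfully is the structure: normalise, compute a δ, add it back to the stream, twice per block.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def toy_attention(tokens):
    # Placeholder for causal attention: each position averages itself
    # and all earlier positions with uniform weights.
    out = []
    for t in range(len(tokens)):
        window = tokens[: t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def toy_ffn(x, scale=0.1):
    # Placeholder for the position-wise FFN: a scaled ReLU.
    return [scale * max(v, 0.0) for v in x]


def transformer_block(stream):
    """Pre-norm block over a list of token vectors: two residual writes."""
    normed = [rms_norm(x) for x in stream]
    attn_out = toy_attention(normed)
    stream = [[a + b for a, b in zip(x, d)] for x, d in zip(stream, attn_out)]
    ffn_out = [toy_ffn(rms_norm(x)) for x in stream]
    return [[a + b for a, b in zip(x, d)] for x, d in zip(stream, ffn_out)]


stream = [[1.0, -2.0, 0.5, 0.0], [0.0, 1.0, 1.0, -1.0]]
out = stream
for _ in range(4):   # stack four blocks
    out = transformer_block(out)
print(out)
```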
That’s a complete transformer. Every part is now grounded in math (Ch.4-7), in code (Ch.8-10 backprop and optimisation, Ch.11-14’s specific layers), and in systems (Ch.2-3 SIMD, Ch.12-13 tiling and FlashAttention). The math half of the book is done.
Four orthogonal mechanisms:
- Residual connections (skip paths). Fix vanishing/exploding gradient by replacing the pure product structure ∂y/∂x_0 = ∏ ∂F_l with ∏ (I + ∂F_l) = I + Σ ∂F_l + (higher-order products), which always contains the identity term. Without residuals: networks past ~20 layers can’t train at all because of multiplicative gradient collapse. Composes: provides the identity path through which gradients flow regardless of what individual layers do.
- LayerNorm/RMSNorm (per-token normalisation). Fix activation magnitude drift — without normalisation, activation norms tend to grow exponentially through layers (each linear layer multiplies by some W; if ‖W‖ > 1, activations explode). LN keeps each block’s input in a bounded range. Without LN: residual connections alone allow gradient to flow but the activations themselves still drift, eventually saturating the softmax in attention or pushing the FFN into dead-ReLU territory. Composes: works with residuals by being placed BEFORE the sublayer (pre-norm), preserving the identity path.
- Adam/AdamW (per-parameter adaptive learning rate). Fix per-parameter scale heterogeneity. Different parameters in a transformer have very different “natural scales”: attention W_O wants smaller updates than embedding tables. Adam keeps running estimates of each parameter’s gradient mean and second moment (uncentred variance) and rescales each update individually. Without Adam (with vanilla SGD): different parameter groups train at very different effective rates; some saturate while others underfit. Composes: applies after the residual + LN forward pass to use the resulting gradients.
- Learning rate warmup + cosine decay. Fix the transient dynamics at the start of training. Without warmup, the initial updates (when Adam’s second-moment estimates are built from only a handful of gradients, and when residual streams have no learned structure) can be huge and destabilise training. LayerNorm itself keeps no running statistics, so the quantities that need time to settle live in the optimiser. Warmup ramps lr from 0 to peak over the first few thousand steps; cosine decay anneals it down over training. Without it: training often diverges within the first thousand or so steps. Composes: scales Adam’s update so it’s safe in the warmup phase and refined in the decay phase.
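The warmup-plus-cosine schedule in the last item can be sketched as a pure function of the step. The function name `lr_at` and every default value (peak 3e-4, 2,000 warmup steps, 100,000 total steps, floor 3e-5) are illustrative assumptions, not values from the text.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=3e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped so late steps stay at min_lr.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, lr_at(step))
```

The learning rate is zero at step 0, reaches the peak exactly when warmup ends, and lands on the floor at the final step.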
How they compose:
(1) Residuals give the gradient a path back through all layers. (2) LN keeps activations bounded so the sublayers operate in a learnable regime. (3) Adam scales the per-parameter updates to whatever scale that parameter naturally wants. (4) Warmup ensures the early steps are gentle enough to let Adam’s moment estimates in (3) settle.
Remove ANY one of these and large transformer training tends to break: vanishing gradients (no residuals), exploding activations (no LN), divergent or underfit parameters (no Adam), or early divergence (no warmup). The four together make 80-layer 70B-parameter training routine.
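Point (3) can be illustrated with one Adam step on two scalar parameters whose gradients differ by six orders of magnitude. This is a sketch of the standard Adam update with bias correction and common default hyperparameters; `adam_step` is a hypothetical helper, and weight decay (the "W" in AdamW) is left out.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)             # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two parameters whose gradients differ by six orders of magnitude.
small_grad, large_grad = 1e-3, 1e3
p_small, _, _ = adam_step(0.0, small_grad, 0.0, 0.0, step=1)
p_large, _, _ = adam_step(0.0, large_grad, 0.0, 0.0, step=1)
print(p_small, p_large)
```

Both parameters move by almost exactly lr, because the update divides the gradient by an estimate of its own magnitude. Plain SGD with the same lr would move them by 1e-6 and 1.0 respectively.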
This is why modern LLM training “just works” with these four ingredients across architecture variants — they each address a fundamental failure mode that any deep network encounters.
END OF CH.14 — Normalization & residuals.
§1 (LayerNorm: per-token whitening, gradient analysis) ·
§2 (RMSNorm: drop the mean, pre-norm placement) ·
§3 (residual stream: gradient flow, additive contributions, why deep nets train).
END OF PART III — The neural network is fully assembled.
Every piece — embeddings, softmax, attention (with FlashAttention), normalization, residuals —
is in place. Math + kernels + viz, end-to-end. From here we shift gears: Part IV–VI is about
what makes an LLM specifically (architecture choices, pretraining, alignment, fine-tuning) and
the systems engineering that runs it at scale (hardware, runtimes, inference, training).
Coming next: Ch.15 — The GPT architecture, end-to-end. Putting all of Part III together into a real,
identifiable LLM forward pass.