The residual stream — why deep networks train

Section 14.3

Before 2015, plain deep networks became very hard to train beyond roughly 20 layers. Activations would either explode or vanish as they passed through layer after layer of nonlinear transformations; gradients would do the same on the way back. He et al. 2015, “Deep Residual Learning for Image Recognition”, addressed this with a single architectural change: add the input back to the output of each block. The skip connection seemed almost too simple to matter; it ended up being one of the changes that made very deep networks, and with them modern deep learning, practical. In transformers this skip path becomes a central highway, the residual stream, that every attention block and every FFN block reads from and writes back to. This section covers (1) why residuals fix the vanishing-gradient problem, (2) the residual-stream framing that powers mechanistic interpretability, and (3) the closing arithmetic on what makes Part III’s full transformer block train.

The residual fix — a skip path through every block

For a block computing some nonlinear transformation F, the residual block computes:

Residual block:
    x_out = x_in + F(x_in)

Specifically for a pre-norm transformer block:
    x_out = x_in + Sublayer( LayerNorm(x_in) )
where Sublayer is either attention or FFN.

Full block:
    y = x + Attn( LN(x) )    (after attention sublayer)
    z = y + FFN( LN(y) )     (after FFN sublayer)

Each block ADDS a contribution to the running vector; nothing is replaced.

The two key things to internalise:

The skip connection is the identity function — it adds the unmodified input. Not “approximately the input through a small layer,” literally x_in. This is what the gradient analysis below depends on.
The sublayer output is added, not multiplied. If the sublayer learns to output zero, the block becomes the identity. This means the network can “skip” blocks it doesn’t need — and at initialisation, sublayer outputs start small, so the network is initially close to the identity and can learn to use blocks gradually.
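The two points above can be checked numerically. What follows is a minimal NumPy sketch, not the book's implementation: `layer_norm` omits the learned gain and bias, and the linear sublayer with small random weights `W_small` is a hypothetical stand-in for attention or an FFN at initialisation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each vector over its last axis (no learned gain/bias in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x_in, sublayer):
    # Pre-norm residual block: x_out = x_in + Sublayer(LayerNorm(x_in)).
    return x_in + sublayer(layer_norm(x_in))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))

# Hypothetical linear sublayer with small weights, as at initialisation.
W_small = 0.01 * rng.normal(size=(d, d))
small_out = residual_block(x, lambda h: h @ W_small)

# A sublayer that outputs zero turns the block into the exact identity.
zero_out = residual_block(x, lambda h: np.zeros_like(h))

print(np.abs(small_out - x).max())   # small: the block is close to the identity
print(np.array_equal(zero_out, x))   # True
```

Because the skip path adds `x_in` itself, the zero-output case reproduces the input exactly, not approximately.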

Residual connections (architecture): a skip path that adds the unmodified input of a block to the block's output: x_out = x_in + F(x_in). Introduced by He et al. 2015 for ResNets, now universal in transformers. Two purposes: (1) the gradient has a clean identity path back through every block, fixing vanishing gradients; (2) the network can learn to leave blocks 'inactive' by driving F's output to zero, making depth a soft choice instead of a hard one.

Residual connections are the load-bearing architectural choice that makes much of modern deep learning possible. Worth understanding precisely why.

The gradient argument

For a stack of L residual blocks, the network output is:

After L blocks:
    x_L = x_0 + F_1(x_0) + F_2(x_1) + ... + F_L(x_{L-1})

The Jacobian of x_L with respect to x_0:
    ∂x_L/∂x_0 = I + ∂F_1/∂x_0 + ∂F_2/∂x_1 · ∂x_1/∂x_0 + ...

That I is the identity matrix. Even if all the ∂F_l/∂x_{l-1} terms shrink to zero (vanishing gradient case), the identity term GUARANTEES the gradient passes through unimpaired.

Compare to a NON-residual network:
    y_L = F_L( F_{L-1}( ... F_1(x_0) ... ) )
    ∂y_L/∂x_0 = ∂F_L/∂y_{L-1} · ∂F_{L-1}/∂y_{L-2} · ... · ∂F_1/∂x_0

A product of L Jacobians. If each has spectral norm < 1, the product shrinks exponentially with L, and gradients vanish past ~20 layers. If each consistently stretches the gradient (gain > 1), the product explodes and gradients NaN out. This is why pre-2015 deep networks were so hard to train past ~20 layers.

The residual identity path is the structural fix. Whatever the sublayers do, the gradient ALWAYS has a direct path back to x_0: the identity matrix from the chain of skip connections. The product structure (which causes vanishing/exploding) becomes a SUM structure whose identity term cannot vanish, however small the sublayer Jacobians become.
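A small numerical experiment makes the contrast concrete. This is a sketch under stated assumptions: each sublayer Jacobian is replaced by a hypothetical random matrix rescaled to spectral norm 0.1 (the "weak sublayer" case), so the end-to-end Jacobian is a product of J_l for the plain stack and a product of (I + J_l) for the residual stack.

```python
import numpy as np

def random_jacobian(rng, d, spectral_norm):
    # Random matrix rescaled so its largest singular value equals spectral_norm.
    J = rng.normal(size=(d, d))
    return J * (spectral_norm / np.linalg.norm(J, 2))

def end_to_end_jacobians(depth, d=16, spectral_norm=0.1, seed=0):
    rng = np.random.default_rng(seed)
    plain = np.eye(d)      # product of J_l        (non-residual stack)
    residual = np.eye(d)   # product of (I + J_l)  (residual stack)
    for _ in range(depth):
        J = random_jacobian(rng, d, spectral_norm)
        plain = J @ plain
        residual = (np.eye(d) + J) @ residual
    return plain, residual

plain, residual = end_to_end_jacobians(depth=50)
singular_values = np.linalg.svd(residual, compute_uv=False)

print("plain stack, largest singular value:", np.linalg.norm(plain, 2))
print("residual stack, singular value range:", singular_values.min(), singular_values.max())
```

With these settings the plain product is bounded above by 0.1^50, while every singular value of the residual product stays between 0.9^50 and 1.1^50. The identity term keeps the gradient alive; it does not by itself stop growth when the sublayer Jacobians are large, which is where normalisation and initialisation scaling come in.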

— think, then check —

Setup: non-residual: y = F_L(F_(L−1)(…F_1(x_0)…)). Residual: y = x_0 + F_1(x_0) + F_2(…) + … + F_L(…).

Non-residual gradient: ∂y/∂x_0 = product of L Jacobians ∂F_l/∂y_(l−1).

If each Jacobian scales the gradient by a factor of about σ ≠ 1, the product scales it by about σ^L. For σ = 0.9: at L = 50, σ^L ≈ 0.005, so the gradient is down by roughly 200×. For σ = 1.1: at L = 50, σ^L ≈ 117, so the gradient blows up. Either direction, deep networks can’t train.
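The arithmetic is easy to reproduce. A short sketch, where `gradient_scale` is a hypothetical helper name and the extra depths 20 and 100 are added only for comparison:

```python
def gradient_scale(sigma, depth):
    # Overall scaling when every layer multiplies the gradient by sigma.
    return sigma ** depth

for sigma in (0.9, 1.1):
    for depth in (20, 50, 100):
        print(f"sigma={sigma}, L={depth}: sigma^L = {gradient_scale(sigma, depth):.3g}")
```

At L = 50 this gives about 0.00515 for σ = 0.9 and about 117.4 for σ = 1.1, matching the figures quoted above.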

Residual gradient: ∂y/∂x_0 = I + (sum of Jacobian-product terms).

That identity matrix is a constant — it doesn’t depend on the F_l’s. Even if every sublayer Jacobian product collapses to zero, the gradient is still I (bounded, full-rank). Even if some Jacobian products explode, they’re ADDED to I rather than multiplied — they can be regularised; they can’t kill the gradient signal.

The structural reason: the math went from a product (which has multiplicative cumulative error) to a sum (which has only additive cumulative error). For depths beyond roughly 20 layers, the product structure was fatal. The sum structure scales to much greater depth.

This is also why “pre-norm” matters (§14.2): in post-norm, the LN wraps the residual addition, multiplying the gradient on the skip path by γ/σ. The pure identity is broken. In pre-norm, the LN is INSIDE the sublayer path; the residual remains a pure identity. Gradient flow is preserved.
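The pre-norm versus post-norm difference on the skip path can be measured directly. The sketch below makes several simplifying assumptions: the sublayer is switched off (it returns zero) to isolate the skip path, LayerNorm has γ = 1 and no bias, and the Jacobian is estimated by central finite differences with a hypothetical helper `numerical_jacobian`.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm over a single vector, gain fixed at 1.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def numerical_jacobian(f, x, h=1e-6):
    # Central finite differences, one input coordinate at a time.
    d = x.size
    J = np.zeros((d, d))
    for i in range(d):
        step = np.zeros(d)
        step[i] = h
        J[:, i] = (f(x + step) - f(x - step)) / (2 * h)
    return J

def sublayer(v):
    # Sublayer switched off, so only the skip path contributes.
    return np.zeros_like(v)

def pre_norm_block(x):
    return x + sublayer(layer_norm(x))      # LN sits inside the sublayer path

def post_norm_block(x):
    return layer_norm(x + sublayer(x))      # LN wraps the residual addition

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=8)

J_pre = numerical_jacobian(pre_norm_block, x)
J_post = numerical_jacobian(post_norm_block, x)
sigma = np.sqrt(x.var() + 1e-5)

print(np.allclose(J_pre, np.eye(8), atol=1e-6))   # True: pure identity
print(np.linalg.norm(J_post, 2), 1.0 / sigma)     # post-norm skip path is scaled by 1/sigma
```

In the post-norm case the skip-path Jacobian is the LayerNorm Jacobian: its largest singular value is 1/σ (γ/σ with a learned gain), and it also removes the mean and radial directions, so it is no longer the identity.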

↳ §14.3 + Ch.9 backprop

The residual stream as a first-class object

A more recent way of thinking about the residual stream, popularised by Elhage et al. 2021, “A Mathematical Framework for Transformer Circuits” (Anthropic), is to treat it as the central object of the transformer rather than as a side effect of residual connections.

The residual stream is a vector r ∈ ℝ^d that persists through the entire forward pass. Each block READS from r (via LN), produces a δ, and WRITES the δ back by adding it to r:

    r_0  = embed(tokens) + positional_encoding
    r_1  = r_0 + attn_1( LN(r_0) )
    r_2  = r_1 + ffn_1( LN(r_1) )
    r_3  = r_2 + attn_2( LN(r_2) )
    ...
    r_2L = r_{2L-1} + ffn_L( LN(r_{2L-1}) )
    logits = unembed( LN(r_2L) )

The residual stream is a SHARED communication channel. Every block sees what every prior block wrote; every block contributes additively to what every subsequent block will see.

Residual stream (transformer framing): the d-dimensional vector that persists from the embedding layer through every transformer block to the unembedding. Each attention head and each FFN layer reads from the residual stream (via LayerNorm) and writes back by adding its output to the stream. The residual stream is the transformer's only working memory; every block both reads from and contributes to it. Elhage et al. 2021, Anthropic.

The residual stream is the natural object to reason about in two contexts:

Training stability. As above — the residual path is what gives gradients a clean identity flow.
Mechanistic interpretability. Each block’s contribution is additive, so you can decompose the residual stream into “what each block wrote into it.” This decomposition is the basis for circuit discovery (which heads attend to what; which FFN neurons fire for which patterns; how information flows across layers).
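The additive decomposition can be sketched in a few lines. Everything here is a toy assumption: the sublayers are small random tanh layers standing in for attention and FFN blocks, `r0` stands in for the embedding, and `readout` is a hypothetical direction such as one row of an unembedding matrix.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
d, n_sublayers = 16, 6

# Hypothetical stand-ins for attention/FFN sublayers.
weights = [0.1 * rng.normal(size=(d, d)) for _ in range(n_sublayers)]

r0 = rng.normal(size=d)   # stands in for embed(tokens) + positional_encoding
r = r0.copy()
writes = []               # what each sublayer added to the stream
for W in weights:
    delta = np.tanh(W @ layer_norm(r))   # READ via LN, compute a delta
    writes.append(delta)
    r = r + delta                        # WRITE by adding

# The final stream is exactly the embedding plus every write.
reconstructed = r0 + np.sum(writes, axis=0)
print(np.allclose(r, reconstructed))     # True

# Attribution along a readout direction: each write contributes its projection.
readout = rng.normal(size=d)
contributions = [float(readout @ w) for w in writes]
total = float(readout @ r0) + sum(contributions)
print(float(readout @ r), total)         # the two numbers agree
```

The attribution is exact only for a linear readout of the stream. A final LayerNorm before the unembedding is not linear, so in a real model this kind of decomposition is usually applied with the normalisation scale held fixed.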

[Interactive figure: residual-stream norm per layer. Controls: depth L = 12, init ||x|| = 1.00, sublayer scale = 0.20. Sampled run: per-layer δ between 0.12 and 0.24, ||x|| rising from 1.00 to a final 1.18 (growth factor 1.18×); expected √(init² + L · δ²) = 1.22.]

Pre-norm transformer: each block adds δ to the residual stream. With random δ-direction, ||x|| grows like √(init² + L · σ²), a random walk in d-dim space. For L = 12 and σ = 0.2 the estimate is ≈ 1.22× init (the sampled run above reached 1.18×); for L = 32 it is ≈ 1.51× init. A final LayerNorm before the output projection brings everything back to unit scale.

The residual stream's norm grows ~√L with depth in pre-norm transformers. Mitigated by careful sublayer initialisation (1/√L scaling) and a final LayerNorm before the output projection.

The viz shows the residual stream’s norm growing across depth. With pre-norm and random sublayer outputs (each contributing a δ of small norm σ in roughly a random direction, hence nearly orthogonal to r in high dimension), the running norm satisfies ‖r_L‖² ≈ ‖r_0‖² + L · σ², a random walk in d-dim space. For a typical L = 32 with σ = 0.2 and ‖r_0‖ = 1, that gives √(1 + 32 · 0.04) ≈ 1.51×; the growth is corrected by a final LN before the output projection.
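The random-walk estimate can be evaluated and compared against a simulation. This is a sketch under the same idealisation as the text: every write has fixed norm σ and an independent random direction. The dimension d = 4096 and the helper names are assumptions made for the example.

```python
import numpy as np

def expected_norm(init, depth, sigma):
    # Random-walk estimate: ||r_L||^2 ≈ init^2 + L * sigma^2.
    return np.sqrt(init**2 + depth * sigma**2)

def simulate_norm(init, depth, sigma, d=4096, seed=0):
    rng = np.random.default_rng(seed)
    r = np.zeros(d)
    r[0] = init
    for _ in range(depth):
        delta = rng.normal(size=d)
        delta *= sigma / np.linalg.norm(delta)   # fixed norm sigma, random direction
        r = r + delta
    return np.linalg.norm(r)

for depth in (12, 32):
    est = expected_norm(1.0, depth, 0.2)   # 1.22 for L = 12, 1.51 for L = 32
    sim = simulate_norm(1.0, depth, 0.2)
    print(f"L={depth}: estimate {est:.2f}, simulated {sim:.2f}")
```

In high dimension random directions are nearly orthogonal, so the simulated norm tracks the estimate closely. Real sublayer outputs are correlated with the stream, so this is a baseline for intuition, not a prediction for a trained model.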

The “residual stream as a communication channel” framing is doing real work in mechanistic interpretability research. Sublayers that learn similar features write similar things into the stream; sublayers that learn complementary features write orthogonal things. The dimensionality of the residual stream (d_model = 8192 in Llama 2 70B) limits how many linearly independent directions the network can carry at once, yet recent work (Bricken et al. 2023, Anthropic) suggests the residual stream encodes vastly more features than d_model via “superposition” (dictionary-style sparse codes). This is the cutting edge of “what is the network actually doing”; Ch.26 will cover this when we get to current research.

— think, then check —

The framing: instead of “input → block1 → block2 → … → block_L → output,” think of the transformer as a single shared vector r ∈ ℝ^d that persists from the embedding to the final layer norm. Every attention head and every FFN layer reads r (via LayerNorm), computes a δ, and adds it back to r. Nothing is replaced; everything is additive.

What this framing makes clear:

All blocks “see” each other. What block 3 writes into the stream is visible to blocks 4, 5, …, L. The transformer is best understood as a system of cooperating writers/readers, not a feedforward chain.
Capacity is the residual stream dimension. Only d_model linearly independent directions can co-exist at any depth. This is a hard architectural constraint on exactly orthogonal features; superposition (below) is how networks appear to work around it.
Decomposability. Since contributions are additive, r_L = r_0 + (sum of all block outputs). You can attribute the final r_L to specific blocks (“this output was caused 23% by attention head 5 of layer 7”) in a way that’s impossible in a feedforward chain.
Sparsity / superposition. Recent interpretability work (Anthropic 2023+) suggests the residual stream uses a “sparse code” — far more than d_model semantic features, each represented by a small linear combination of the d_model basis directions. The residual-stream view is how this decomposition is formulated mathematically.
Gradient flow. Pre-norm ensures the residual stream has identity gradient paths back through every block. That’s why deep networks train.

What it doesn’t make clear: the framing emphasises additive structure but downplays the fact that LN normalises before each read, which is a non-additive operation. And it treats blocks as independent contributors, which is technically wrong (block N’s contribution depends on what block N-1 wrote, since it reads from the current stream). Both are minor; the framing is mostly a useful lens, not a replacement for the math.

↳ §14.3 residual stream

Closing the loop on Part III

We now have every piece of a transformer block:

PreNorm Transformer Block (Llama-style, the pattern most modern LLMs use):

    # x: residual stream, shape (batch, seq, d_model)
    y = x + Attention( RMSNorm(x) )    ← attention sublayer
    z = y + FFN( RMSNorm(y) )          ← FFN sublayer (GELU/SwiGLU)
    # repeat L times.

    # at the end:
    final  = RMSNorm(z)                ← final norm before unembed
    logits = Unembed(final)            ← projection back to vocab (tied to embedding in some models)

Inside each Attention:
    Q, K, V = X · W_Q, X · W_K, X · W_V
    scores  = Q · K^T / √d_k
    weights = softmax(scores)
    out     = weights · V
    (All inside FlashAttention's tile-and-stream.)

Inside each FFN:
    h   = GELU(X · W_in)
    out = h · W_out
    (Modern: SwiGLU instead of GELU; Ch.15 covers the exact FFN.)

The whole forward pass is:

    tokens → embed → L × (attention block + FFN block) → norm → unembed → logits → softmax
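The block structure above can be turned into a small runnable forward pass. This is an illustrative NumPy sketch under several assumptions that are not from the text: a single attention head with a causal mask, an output projection `W_O`, RMSNorm without a learned gain, GELU in its tanh approximation, no positional encoding, an unembedding tied to the embedding for brevity, and toy sizes chosen arbitrarily. It does not model FlashAttention's tiling.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned gain, to keep the sketch short.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention(x, p):
    # Single-head causal self-attention; x has shape (seq, d_model).
    q, k, v = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # no attention to later tokens
    return softmax(scores) @ v @ p["W_O"]

def ffn(x, p):
    return gelu(x @ p["W_in"]) @ p["W_out"]

def forward(tokens, params):
    x = params["embed"][tokens]               # residual stream, (seq, d_model)
    for blk in params["blocks"]:
        x = x + attention(rms_norm(x), blk)   # y = x + Attention(RMSNorm(x))
        x = x + ffn(rms_norm(x), blk)         # z = y + FFN(RMSNorm(y))
    final = rms_norm(x)                       # final norm before unembed
    return final @ params["embed"].T          # tied unembedding (assumption)

def init_params(vocab=50, d_model=16, d_ff=64, n_blocks=2, seed=0):
    rng = np.random.default_rng(seed)
    def w(*shape):
        return 0.02 * rng.normal(size=shape)
    blocks = []
    for _ in range(n_blocks):
        blocks.append({
            "W_Q": w(d_model, d_model), "W_K": w(d_model, d_model),
            "W_V": w(d_model, d_model), "W_O": w(d_model, d_model),
            "W_in": w(d_model, d_ff), "W_out": w(d_ff, d_model),
        })
    return {"embed": w(vocab, d_model), "blocks": blocks}

params = init_params()
logits = forward(np.array([3, 1, 4, 1, 5]), params)
probs = softmax(logits)
print(logits.shape)        # (5, 50)
print(probs.sum(axis=-1))  # each row sums to 1
```

Every update to `x` inside `forward` is an addition, so the variable `x` is the residual stream from the embedding lookup to the final norm.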

That’s a complete transformer. Every part is now grounded in math (Ch.4-7), in code (Ch.8-10 backprop and optimisation, Ch.11-14’s specific layers), and in systems (Ch.2-3 SIMD, Ch.12-13 tiling and FlashAttention). The math half of the book is done.

— think, then check —

Four orthogonal mechanisms:

Residual connections (skip paths). Fix vanishing/exploding gradient by replacing the product structure ∂y/∂x_0 = ∏ ∂F_l with an additive structure I + Σ ∂F_l. Without residuals: networks past ~20 layers can’t train at all because of multiplicative gradient collapse. Composes: provides the identity path through which gradients flow regardless of what individual layers do.
LayerNorm/RMSNorm (per-token normalisation). Fix activation magnitude drift — without normalisation, activation norms tend to grow exponentially through layers (each linear layer multiplies by some W; if ‖W‖ > 1, activations explode). LN keeps each block’s input in a bounded range. Without LN: residual connections alone allow gradient to flow but the activations themselves still drift, eventually saturating the softmax in attention or pushing the FFN into dead-ReLU territory. Composes: works with residuals by being placed BEFORE the sublayer (pre-norm), preserving the identity path.
Adam/AdamW (per-parameter adaptive learning rate). Fix per-parameter scale heterogeneity. Different parameters in a transformer have very different “natural scales” — attention W_O wants smaller updates than embedding tables. Adam estimates per-parameter variance and rescales gradients individually. Without Adam (with vanilla SGD): different parameter groups train at very different effective rates; some saturate while others underfit. Composes: applies after the residual + LN forward pass to use the resulting gradients.
Learning rate warmup + cosine decay. Fix the transient dynamics at the start of training. Without warmup, the initial gradients (when Adam’s variance estimates are still unreliable, when residual streams have no learned structure) can be huge and destabilise training. Note that LayerNorm, unlike BatchNorm, keeps no running statistics, so there is nothing on that side to warm up. Warmup ramps lr from 0 to peak over the first few thousand steps; cosine decay anneals it down over training. Without it: training can diverge within the first thousand or so steps. Composes: scales Adam’s update so it’s safe in the warmup phase and refined in the decay phase.
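The schedule in the last item can be written as a single function. A minimal sketch: the function name `lr_at` and all default values (peak rate, warmup length, total steps, floor) are hypothetical choices for illustration, not settings taken from the text.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=3e-5):
    # Linear warmup from 0 to peak_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
```

The rate starts at zero, reaches the peak at the end of warmup, and falls to the floor at the final step. Decaying to a nonzero floor is one common variant; decaying all the way to zero is another.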

How they compose:

(1) residuals give the gradient a path back through all layers. (2) LN keeps activations bounded so the sublayers operate in a learnable regime. (3) Adam scales the per-parameter updates to whatever scale that parameter naturally wants. (4) Warmup ensures the first few thousand steps are gentle enough to let Adam’s running statistics in (3) settle.

Remove ANY one of these and large transformer training breaks: vanishing gradients (no residuals), exploding activations (no LN), divergent or underfit parameters (no Adam), or first-step divergence (no warmup). The four together make 80-layer 70B-parameter training routine.

This is why modern LLM training “just works” with these four ingredients across architecture variants — they each address a fundamental failure mode that any deep network encounters.

↳ §14.1 + §14.2 + §14.3 + Ch.8 Adam

END OF CH.14 — Normalization & residuals.
§1 (LayerNorm: per-token whitening, gradient analysis) · §2 (RMSNorm: drop the mean, pre-norm placement) · §3 (residual stream: gradient flow, additive contributions, why deep nets train).

END OF PART III — The neural network is fully assembled.
Every piece is in place: embeddings, softmax, attention (with FlashAttention), normalization, residuals. Math + kernels + viz, end-to-end. From here we shift gears: Parts IV–VI are about what makes an LLM specifically (architecture choices, pretraining, alignment, fine-tuning) and the systems engineering that runs it at scale (hardware, runtimes, inference, training).

Coming next: Ch.15 — The GPT architecture, end-to-end. Putting all of Part III together into a real, identifiable LLM forward pass.