The decoder-only stack — Llama 2 7B, end to end

Section 15.1

Ch.11-14 assembled every piece of a transformer block: embeddings, attention with FlashAttention, normalisation, residual stream. Time to put them together. This section walks the forward pass of a real decoder-only LLM — Llama 2 7B, the canonical reference of the post-Chinchilla era — token by token, layer by layer. We’ll trace shapes, count parameters, and end with logits ready for sampling. Everything is shape-checked against the public weights.

Llama 2 7B at a glance

Llama 2 7B configuration (from the model card):

  vocab_size   V = 32000       BPE tokenizer trained on an English-heavy corpus
  d_model      d = 4096        residual stream dimension
  n_layers     L = 32          transformer blocks stacked
  n_heads      H = 32          attention heads per block
  n_kv_heads   G = 32          GQA group count; equal to H for 7B (only the 70B uses GQA)
  d_head       d / H = 128
  d_ffn        F = 11008       hidden dim of FFN; ~2.7 · d for SwiGLU
  context_len  4096            max sequence length
  norm         RMSNorm
  positional   RoPE            base = 10000
  activation   SwiGLU

Total parameters: 6.7B (just under the 7B nominal name)

SwiGLU is the FFN activation Llama uses instead of GELU. It’s gated: one branch goes through Swish, another is a pure linear projection, and the two are multiplied elementwise. Shazeer 2020, “GLU Variants Improve Transformer”, found that gated variants, with SwiGLU among the best of them, outperform ungated ReLU and GELU FFNs; most recent open LLMs use SwiGLU or a close relative.
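The gating described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Llama's actual code: the function names `swish` and `swiglu_ffn`, the toy sizes, and the random weights are all hypothetical.

```python
import numpy as np

def swish(x):
    # Swish (also called SiLU): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # x: (N, d); w_gate and w_up: (d, F); w_down: (F, d)
    gate = swish(x @ w_gate)       # (N, F), nonlinear branch
    up = x @ w_up                  # (N, F), purely linear branch
    return (gate * up) @ w_down    # elementwise gate, then project back to (N, d)

rng = np.random.default_rng(0)
N, d, F = 5, 8, 22                 # toy sizes; F/d = 2.75 loosely mirrors 11008/4096
x = rng.standard_normal((N, d))
w_gate = rng.standard_normal((d, F)) * 0.1
w_up = rng.standard_normal((d, F)) * 0.1
w_down = rng.standard_normal((F, d)) * 0.1

out = swiglu_ffn(x, w_gate, w_up, w_down)
print(out.shape)                   # (5, 8)
```

Note that a zero input gives a zero output: Swish(0) = 0, so the gate closes completely.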

The full forward pass

For a single sequence of tokens t_1, t_2, …, t_N, the model produces logits over the vocabulary for the next token at every position. Here’s the complete shape-traced forward pass:

    # Input: token IDs of shape (N,); N is the sequence length, up to 4096

    # Step 0: embedding
    X = E[t_1..N]                          # E is the (V × d) embedding table
                                           # X has shape (N, d) = (N, 4096)

    # Steps 1..L: 32 transformer blocks (pre-norm)
    for layer ℓ = 1..32:
        # attention sublayer
        X_norm = RMSNorm(X)                # per-token, shape (N, d)
        Q = X_norm · W_Q                   # shape (N, d)
        K = X_norm · W_K                   # shape (N, d)
        V = X_norm · W_V                   # shape (N, d)
        Q, K = apply_rope(Q, K)            # RoPE positional encoding (Ch.11 §3)
        attn_out = FlashAttention(Q, K, V) # tile + online softmax + running O
        attn_out = attn_out · W_O          # output projection, shape (N, d)
        X = X + attn_out                   # residual add (pre-norm: no LN after)

        # FFN sublayer (SwiGLU)
        X_norm = RMSNorm(X)                # shape (N, d)
        gate = Swish(X_norm · W_gate)      # shape (N, F), F = 11008
        up = X_norm · W_up                 # shape (N, F)
        ffn_out = (gate ⊙ up) · W_down     # elementwise gate; project back, shape (N, d)
        X = X + ffn_out                    # residual add

    # Step L+1: final norm + unembed
    X_final = RMSNorm(X)                   # shape (N, d)
    logits = X_final · W_out               # W_out is (d × V); in the released Llama 2
                                           # weights it is a separate matrix, not E^T
                                           # shape (N, V) = (N, 32000)

    # Step L+2: softmax (only at inference, for sampling)
    probs = softmax(logits[-1])            # only need the last token's distribution
                                           # for the NEXT token to generate

That’s the whole architecture. Every component on the right-hand side of every line is in Ch.11-14. The transformer is exactly this: an embedding lookup, 32 blocks that each pair an attention sublayer with an FFN sublayer (both with pre-RMSNorm and a residual addition), a final norm, and an output projection to vocabulary logits.
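The forward pass above can be run end to end at toy scale. The sketch below is illustrative only and rests on several assumptions: it uses plain causal softmax attention in place of FlashAttention (the result is the same, only the schedule differs), the "rotate-half" RoPE layout (one of two common conventions), random weights, and hypothetical names (`forward`, `init_params`, `W_out`). The output projection is held in its own matrix `W_out`, as in the released Llama 2 weights.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    # Scale each token vector to unit RMS, then apply the learned gain gamma.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def swish(x):
    return x / (1.0 + np.exp(-x))

def apply_rope(x, base=10000.0):
    # x: (N, H, d_head). Rotate channel pairs (i, i + d_head/2) by an angle
    # that grows with position ("rotate-half" layout, an assumption here).
    n, _, dh = x.shape
    half = dh // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(n)[:, None] * freqs[None, :]     # (N, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def causal_attention(q, k, v):
    # q, k, v: (N, H, d_head). Plain softmax attention with a causal mask.
    n, _, dh = q.shape
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(dh)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", w, v)

def forward(tokens, p, n_heads):
    x = p["E"][tokens]                                # (N, d)
    n, d = x.shape
    for blk in p["blocks"]:
        h = rms_norm(x, blk["norm1"])
        q = (h @ blk["wq"]).reshape(n, n_heads, -1)
        k = (h @ blk["wk"]).reshape(n, n_heads, -1)
        v = (h @ blk["wv"]).reshape(n, n_heads, -1)
        q, k = apply_rope(q), apply_rope(k)
        a = causal_attention(q, k, v).reshape(n, d)
        x = x + a @ blk["wo"]                         # residual add
        h = rms_norm(x, blk["norm2"])
        x = x + (swish(h @ blk["w_gate"]) * (h @ blk["w_up"])) @ blk["w_down"]
    return rms_norm(x, p["norm_f"]) @ p["W_out"]      # (N, V)

def init_params(rng, V, d, F, L):
    w = lambda *s: rng.standard_normal(s) * 0.02
    blocks = [dict(norm1=np.ones(d), norm2=np.ones(d), wq=w(d, d), wk=w(d, d),
                   wv=w(d, d), wo=w(d, d), w_gate=w(d, F), w_up=w(d, F),
                   w_down=w(F, d)) for _ in range(L)]
    return dict(E=w(V, d), blocks=blocks, norm_f=np.ones(d), W_out=w(d, V))

rng = np.random.default_rng(0)
V, d, F, L, H = 50, 16, 44, 2, 4      # toy sizes, not Llama 2's
params = init_params(rng, V, d, F, L)
tokens = np.array([3, 17, 42, 8, 8])
logits = forward(tokens, params, H)
print(logits.shape)                   # (5, 50)
```

A useful sanity check on any such implementation is causality: changing the last token should leave the logits at all earlier positions unchanged.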

Tied embeddings (parameter sharing): reusing the input embedding table E ∈ ℝ^{V × d}, transposed, as the output projection, so that logits = X · Eᵀ. Tying saves V · d parameters (~131M at V = 32K, d = 4096) and makes the vocab projection parameter-free, since it becomes just a matmul against the existing embedding. Press & Wolf 2017 showed tied embeddings match or beat untied ones on perplexity. The intuition: an embedding maps token → vector, so the inverse map (vector → per-token logits) should naturally be its transpose. GPT-2 and many smaller models tie their embeddings. The released Llama 2 weights do not: the 7B checkpoint keeps a separate (d × V) output matrix, which adds V · d ≈ 131M parameters to the count.
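Operationally, the difference between a tied and an untied output head is small. The sketch below uses toy sizes and random matrices; `output_head_params` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, N = 100, 16, 4                      # toy sizes
E = rng.standard_normal((V, d)) * 0.02    # input embedding table, (V, d)
X = rng.standard_normal((N, d))           # final hidden states, (N, d)

# Tied: the output projection is the transpose of the embedding table.
logits_tied = X @ E.T                     # (N, V), no extra parameters

# Untied: a separate (d, V) matrix with its own V * d parameters.
W_out = rng.standard_normal((d, V)) * 0.02
logits_untied = X @ W_out                 # (N, V)

def output_head_params(V, d, tied):
    # Extra parameters spent on the vocabulary projection.
    return 0 if tied else V * d

print(logits_tied.shape, logits_untied.shape)
print(output_head_params(32000, 4096, tied=False))   # 131072000
```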

— think, then check —

Block input x. Block output z (passed to next block).

x_norm = RMSNorm(x) — pre-norm before attention
y = x + Attention(x_norm) — attention sublayer with residual add
y_norm = RMSNorm(y) — pre-norm before FFN
z = y + FFN(y_norm) — FFN sublayer with residual add

Two RMSNorms (one before each sublayer); two residual additions (one after each); no LayerNorm at the end of the block — the next block’s pre-norm handles it. The Attention sublayer is multi-head with RoPE applied to Q, K, run through FlashAttention. The FFN sublayer is SwiGLU.

Shape preserved throughout: x, x_norm, y, y_norm, z all have shape (N, d_model). The residual stream never changes shape; it just gets additive contributions from each sublayer.

↳ §15.1 forward pass

The parameter count

Let’s count Llama 2 7B’s parameters from scratch to verify the math:

Layer-by-layer parameter count for Llama 2 7B
(d = 4096, V = 32000, L = 32, F = 11008, H = 32, d_k = 128)

  Embedding
    E: V · d = 32000 · 4096                           = 131.07 M
  Per attention sublayer
    W_Q, W_K, W_V, W_O: 4 · (d · d) = 4 · 16.78 M     =  67.11 M
  Per FFN sublayer (SwiGLU)
    W_gate, W_up: 2 · (d · F) = 2 · 45.09 M           =  90.18 M
    W_down: F · d                                     =  45.09 M
    total per FFN                                     = 135.27 M
  Per RMSNorm
    γ: d = 4096 (no β in RMSNorm)
    total per block: 2 norms · 4096                   ≈ 8 K
  Per block total: 67.11 M + 135.27 M + 8 K           ≈ 202.38 M
  × 32 blocks                                         ≈ 6.476 B
  Final RMSNorm: d = 4096                             ≈ 4 K
  Output projection W_out: d · V (not tied to E)      = 131.07 M
  Grand total: 131.07 M + 6.476 B + 4 K + 131.07 M    ≈ 6.738 B

Official Llama 2 7B parameter count: 6.74 B. The count matches once the output projection is counted as its own matrix; treating it as tied to E would give about 6.61 B. RoPE contributes no learned parameters.
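The same count can be reproduced exactly in a few lines. This is a sketch: `llama_param_count` is a hypothetical helper, and it assumes bias-free linear layers, two RMSNorm gains per block, and (by default) an output projection that is not tied to the embedding, as in the released checkpoint.

```python
def llama_param_count(V, d, L, F, n_heads, n_kv_heads, tied=False):
    d_head = d // n_heads
    # W_Q and W_O are (d, d); W_K and W_V are (d, n_kv_heads * d_head).
    attn = 2 * d * d + 2 * d * (n_kv_heads * d_head)
    ffn = 3 * d * F                  # W_gate, W_up, W_down
    norms = 2 * d                    # one RMSNorm gain vector per sublayer
    per_block = attn + ffn + norms
    embed = V * d
    head = 0 if tied else V * d      # separate output projection unless tied
    total = embed + L * per_block + d + head   # "+ d" is the final RMSNorm
    return {"embed": embed, "attn": attn, "ffn": ffn,
            "per_block": per_block, "head": head, "total": total}

counts = llama_param_count(V=32000, d=4096, L=32, F=11008,
                           n_heads=32, n_kv_heads=32)
print(counts["per_block"])   # 202383360
print(counts["total"])       # 6738415616
```

Passing `tied=True` removes the 131,072,000 output-head parameters and gives 6,607,343,616, which is why the tied and untied assumptions are easy to tell apart against the published 6.74 B figure.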

The FFN is bigger than attention per block (135 M vs 67 M). For Llama 2 7B, FFN parameters are about 67% of the block; attention is about 33%. This is typical of dense transformers with standard multi-head attention: most parameters sit in the FFN, not attention.

— think, then check —

Per block, with d = 4096, F = 11008:

Attention: W_Q, W_K, W_V, W_O = 4 · d² = 4 · 16.78M = 67M
FFN: W_gate (d·F) + W_up (d·F) + W_down (F·d) = 3 · d·F = 3 · 45.1M = 135M

Per block: 67M + 135M = 202M. Ratio: FFN / total = 135/202 = 67%, attention = 33%.

Why FFN dominates:

Attention has 4 parameter matrices, each d × d. FFN has 3 parameter matrices (SwiGLU), each d × F, with F ≈ 2.7 · d. So FFN’s parameter count is 3 · d · 2.7d = 8.1 · d² vs attention’s 4 · d² — a factor of 2× more.

The structural reason: F is the widest dimension in the model. Each token’s residual-stream activation is “expanded” into a wider space (F = 2.7 · d) where the FFN does its nonlinear computation, then projected back. This “expand → nonlinearity → contract” pattern is where most of the model’s representational capacity lives. Attention is for moving information BETWEEN tokens; FFN is for computation WITHIN tokens. The “within-token computation” budget is bigger.

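This proportion holds wherever attention uses full multi-head projections (4 · d²) and the FFN budget is about 8 · d²: GPT-3 175B, with a two-matrix FFN at F = 4 · d, lands on the same 67/33 split. Models that shrink attention push the FFN share higher. Mistral 7B, for example, combines grouped-query attention (8 KV heads) with F = 14336, which puts roughly 81% of each block in the FFN. Only a small F relative to d pushes the ratio back toward attention; depth L does not affect the per-block split.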
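The per-block split is easy to check for any configuration. In the sketch below, `ffn_share` is a hypothetical helper that ignores biases and norm gains; the GPT-3 175B and Mistral 7B shapes are assumptions taken from their published configurations rather than from this text.

```python
def ffn_share(d, F, n_heads, n_kv_heads, ffn_matrices):
    # Fraction of a block's matrix parameters that sit in the FFN.
    d_head = d // n_heads
    attn = 2 * d * d + 2 * d * n_kv_heads * d_head   # W_Q, W_O + W_K, W_V
    ffn = ffn_matrices * d * F
    return ffn / (attn + ffn)

configs = {
    # SwiGLU, full multi-head attention (values from the table in this section)
    "Llama 2 7B": dict(d=4096, F=11008, n_heads=32, n_kv_heads=32, ffn_matrices=3),
    # Assumed: GELU FFN with F = 4d, full multi-head attention
    "GPT-3 175B": dict(d=12288, F=49152, n_heads=96, n_kv_heads=96, ffn_matrices=2),
    # Assumed: SwiGLU with grouped-query attention (8 KV heads)
    "Mistral 7B": dict(d=4096, F=14336, n_heads=32, n_kv_heads=8, ffn_matrices=3),
}

shares = {name: ffn_share(**cfg) for name, cfg in configs.items()}
for name, share in shares.items():
    print(f"{name}: FFN {share:.1%}, attention {1 - share:.1%}")
```

Grouped-query attention shrinks W_K and W_V, so a model that uses it moves parameters out of attention even when d is unchanged.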

↳ §15.1 parameter count

What you actually compute at inference (per generated token)

After the model is trained, generating a token from a prompt:

Prefill the prompt: run the forward pass through all N prompt tokens. KV cache fills with (K, V) at every position. Cost per layer: O(N² · d) for attention (O(N² · d_head) per head); O(N · d² + N · d · F) for the matmuls.
Decode one token: run the forward pass for ONLY the newest position. Q, K, V are computed for one new token; K and V are appended to the cache; attention is the new Q against ALL cached K. Cost per token, per layer: O(N · d) attention + O(d² + d · F) matmuls.
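The two phases can be sketched for a single attention head. This is a simplified illustration under stated assumptions: one head, one layer, no RoPE, random weights, and hypothetical function names `prefill` and `decode_step`. It checks that decoding one position against the cache reproduces the last row of a full causal pass.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(x, wq, wk, wv):
    # Full causal attention over all N positions: an (N, N) score matrix.
    q, k, v = x @ wq, x @ wk, x @ wv
    n, dh = q.shape
    scores = q @ k.T / np.sqrt(dh)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v, (k, v)               # output and the KV cache

def decode_step(x_new, cache, wq, wk, wv):
    # One new position: a (1, N+1) score row against the cached keys.
    k_cache, v_cache = cache
    q = x_new @ wq                                   # (1, dh)
    k_cache = np.concatenate([k_cache, x_new @ wk])  # append new K, (N+1, dh)
    v_cache = np.concatenate([v_cache, x_new @ wv])  # append new V, (N+1, dh)
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_cache, (k_cache, v_cache)

rng = np.random.default_rng(0)
N, d = 6, 8                                          # toy sizes
x = rng.standard_normal((N + 1, d))                  # N prompt rows + 1 new row
wq, wk, wv = (rng.standard_normal((d, d)) * 0.3 for _ in range(3))

_, cache = prefill(x[:N], wq, wk, wv)                # fill the cache from the prompt
out_cached, cache = decode_step(x[N:], cache, wq, wk, wv)
out_full, _ = prefill(x, wq, wk, wv)                 # recompute everything from scratch
print(np.allclose(out_cached[0], out_full[-1]))      # True
```

No causal mask is needed in `decode_step`: the newest position is allowed to see every cached key, so the whole score row is valid.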

This is why prefilling a long prompt costs far more than any single decode step, and why prefill dominates total cost when generations are short. Most inference engines (vLLM, TGI, TensorRT-LLM) optimise these two phases differently.

— think, then check —

Structural difference:

GELU FFN: h = GELU(x · W₁); out = h · W₂. Two matrices, ungated.

SwiGLU FFN: gate = Swish(x · W_gate); up = x · W_up; h = gate ⊙ up; out = h · W_down. Three matrices, gated.

The “gated” part is the difference. SwiGLU computes TWO separate projections of x (Swish-activated and linear), multiplies them elementwise, then projects back. This is a “multiplicative” interaction that GELU lacks.

Why multiplicative interactions matter:

In GELU, every output dimension is a weighted sum of GELU-activated inputs. The activation acts as a “soft switch” per dimension. In SwiGLU, the Swish gate ⊙ up multiplication means each output dimension gets a per-input-pair product. This represents conjunctive “feature A AND feature B” relationships in one layer, where GELU would need two layers.

Empirically (Shazeer 2020): SwiGLU at the same parameter count beats GELU by ~0.5 perplexity points on standard LM benchmarks. “Same parameter count” is the key qualifier: the third matrix is not free, so F (the FFN hidden dim) is adjusted downward to compensate for the 3 matrices vs 2.

For Llama 2 7B: F = 11008. If they’d used GELU at the same parameter count, they’d have set F ≈ 16512 (11008 · 3/2, since GELU has 2 matrices vs SwiGLU’s 3), close to the conventional 4 · d = 16384. The model would have been roughly the same parameter count either way; SwiGLU’s choice was “trade some FFN width for gated structure” and it paid off.
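The width trade-off is plain arithmetic. The sketch below uses hypothetical helper names; the rule of taking two-thirds of 4 · d and rounding up to a multiple of 256 is an assumption based on the reference Llama implementation, not something stated in this section.

```python
def gelu_ffn_params(d, F):
    return 2 * d * F      # W_1 (d, F) and W_2 (F, d)

def swiglu_ffn_params(d, F):
    return 3 * d * F      # W_gate, W_up (d, F) and W_down (F, d)

d, F_swiglu = 4096, 11008

# GELU width that spends the same parameter budget as SwiGLU at F = 11008.
budget = swiglu_ffn_params(d, F_swiglu)
F_gelu = budget // (2 * d)
print(F_gelu)             # 16512

# Going the other way: start from the classic 4d width, take 2/3 of it,
# and round up to a multiple of 256 (assumed rounding rule).
multiple_of = 256
F_from_4d = int(2 * (4 * d) / 3)                                     # 10922
F_rounded = multiple_of * ((F_from_4d + multiple_of - 1) // multiple_of)
print(F_rounded)          # 11008
```

The two directions do not round-trip exactly (16512 vs 16384) because of the rounding step, which is why 11008 is slightly more than two-thirds of 4 · d.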

Why a small win is worth the complexity:

At LLM scale, every 0.5 perplexity point of model quality compounds into noticeable downstream task improvement. The same model at 1B vs 1.05B parameters costs ~5% more to train and run; if SwiGLU gives ~0.5 perplexity at parameter parity, that’s like getting 1-2% more model “for free.” Worth it.

This is a recurring pattern in modern architecture: incremental gated/multiplicative variants of the FFN (SwiGLU, GeGLU, ReGLU) each give small but compounding wins; the field has converged on SwiGLU as the sweet spot of complexity vs gain.

↳ §15.1 + activation history

Next: §15.2 — Encoder vs decoder vs encoder-decoder. Why “predict next token” turned out to be a universal task, and why every modern frontier model is decoder-only.