THE GPT ARCHITECTURE, END TO END
Section 15.1

The decoder-only stack — Llama 2 7B, end to end

Ch.11-14 assembled every piece of a transformer block: embeddings, attention with FlashAttention, normalisation, residual stream. Time to put them together. This section walks the forward pass of a real decoder-only LLM — Llama 2 7B, the canonical reference of the post-Chinchilla era — token by token, layer by layer. We’ll trace shapes, count parameters, and end with logits ready for sampling. Everything is shape-checked against the public weights.

Llama 2 7B at a glance

Llama 2 7B configuration (from the model card):

    vocab_size    V = 32000     (BPE tokenizer trained on English-heavy corpus)
    d_model       d = 4096      (residual stream dimension)
    n_layers      L = 32        (transformer blocks stacked)
    n_heads       H = 32        (attention heads per block)
    n_kv_heads    G = 32        (GQA group count; equals H for 7B, only the 70B uses GQA)
    d_head        = d / H = 128
    d_ffn         F = 11008     (hidden dim of FFN; ~2.7 · d for SwiGLU)
    context_len   = 4096        (max sequence length)
    norm          = RMSNorm
    positional    = RoPE (base = 10000)
    activation    = SwiGLU

    Total parameters: 6.7B (just under the 7B nominal name)
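The configuration above can be held in a small container that derives d_head and checks the GQA claim. This is a minimal sketch: the class name `LlamaConfig` and its field names are hypothetical choices for illustration, not the names used by any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaConfig:
    """Hypothetical container for the hyperparameters listed above."""
    vocab_size: int = 32000
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 32      # equals n_heads for 7B: plain multi-head attention
    d_ffn: int = 11008
    context_len: int = 4096
    rope_base: float = 10000.0

    @property
    def d_head(self) -> int:
        # Heads split the residual stream evenly, so d_model must divide cleanly.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

    @property
    def uses_gqa(self) -> bool:
        # Grouped-query attention means fewer KV heads than query heads.
        return self.n_kv_heads < self.n_heads


cfg = LlamaConfig()
print(cfg.d_head, cfg.uses_gqa, cfg.d_ffn / cfg.d_model)  # 128 False 2.6875
```

The ratio d_ffn / d_model = 2.6875 is where the "~2.7 · d" in the table comes from.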

SwiGLU is the FFN activation Llama uses instead of GELU. It is gated: one branch goes through Swish, the other branch is a pure linear projection, and the two are multiplied elementwise. Shazeer 2020, “GLU Variants Improve Transformer”, found that gated variants (SwiGLU and GEGLU in particular) beat ungated ReLU and GELU FFNs at matched parameter count; most recent open LLMs, including Llama and Mistral, use SwiGLU.
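The gating itself is only a few lines. The sketch below works on plain Python lists and skips the weight matrices entirely: `gate_pre` and `up` stand for the two projections of x after their matmuls. The function names are illustrative.

```python
import math


def swish(x: float) -> float:
    """Swish (SiLU) with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_gate(gate_pre: list[float], up: list[float]) -> list[float]:
    """Elementwise product of the Swish branch and the linear branch."""
    assert len(gate_pre) == len(up)
    return [swish(g) * u for g, u in zip(gate_pre, up)]


# Made-up pre-activations: strongly negative, zero, strongly positive.
gate_pre = [-4.0, 0.0, 4.0]
up = [1.0, 1.0, 1.0]
hidden = swiglu_gate(gate_pre, up)
print(hidden)  # first entry slightly negative, second exactly 0.0, third close to 4
```

A strongly negative gate pre-activation nearly closes the gate, and a strongly positive one passes the linear branch through almost unchanged.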

The full forward pass

For a single sequence of tokens t_1, t_2, …, t_N, the model produces logits over the vocabulary for the next token at every position. Here’s the complete shape-traced forward pass:

    # Input: token IDs of shape (N,); N is the sequence length, up to 4096

    # Step 0: embedding
    X = E[t_1..N]                            # E is the (V × d) embedding table
                                             # X has shape (N, d) = (N, 4096)

    # Steps 1..L: 32 transformer blocks (pre-norm)
    for layer ℓ = 1..32:

        # attention sublayer
        X_norm = RMSNorm(X)                  # per-token, shape (N, d)
        Q = X_norm · W_Q                     # shape (N, d)
        K = X_norm · W_K                     # shape (N, d)
        V = X_norm · W_V                     # shape (N, d); values, not the vocab size
        Q, K = apply_rope(Q, K)              # RoPE positional encoding (Ch.11 §3)
        attn_out = FlashAttention(Q, K, V)   # causal; tile + online softmax + running O
        attn_out = attn_out · W_O            # output projection, shape (N, d)
        X = X + attn_out                     # residual add (pre-norm: no LN after)

        # FFN sublayer (SwiGLU)
        X_norm = RMSNorm(X)                  # shape (N, d)
        gate = Swish(X_norm · W_gate)        # shape (N, F), F = 11008
        up = X_norm · W_up                   # shape (N, F)
        ffn_out = (gate ⊙ up) · W_down       # elementwise gate; project back, shape (N, d)
        X = X + ffn_out                      # residual add

    # Step L+1: final norm + unembed
    X_final = RMSNorm(X)                     # shape (N, d)
    logits = X_final · W_out                 # W_out is a separate (d × V) matrix (lm_head);
                                             # Llama 2 7B does NOT tie it to E
                                             # shape (N, V) = (N, 32000)

    # Step L+2: softmax (only at inference, for sampling)
    probs = softmax(logits[-1])              # only the last position's distribution is needed:
                                             # it predicts the NEXT token to generate
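The last step of the trace, turning the final position's logits into a distribution, can be sketched on its own. The logits below are made-up values over a 4-token toy vocabulary, and greedy selection stands in for whatever sampling rule is actually used.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for N = 2 positions over a 4-token vocabulary (illustrative values).
logits = [
    [0.1, 0.2, 0.3, 0.4],    # position 1: not needed when generating
    [2.0, 1.0, 0.0, -1.0],   # position N: predicts the next token
]
probs = softmax(logits[-1])

# Greedy decoding: pick the index with the highest probability.
next_token = max(range(len(probs)), key=probs.__getitem__)
print(next_token)  # 0
```

Subtracting the maximum leaves the result unchanged mathematically but keeps `math.exp` from overflowing on large logits.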

That’s the whole architecture. Every component on the right-hand side of every line is in Ch.11-14. The transformer is exactly this: an embedding lookup, 32 blocks that each pair an attention sublayer with an FFN sublayer (pre-RMSNorm and a residual addition around each), a final norm, and an unembedding matrix.

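Untied embeddings: Llama 2 7B does NOT reuse the (V × d) embedding table E for the output projection. It learns a separate (d × V) matrix W_out, which costs another V · d ≈ 131 M parameters. Weight tying (logits = X · Eᵀ, as in GPT-2) would make the vocab projection parameter-free, because the logits are then just a matmul against the existing embedding.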

— think, then check —

Block input x. Block output z (passed to next block).

  1. x_norm = RMSNorm(x) — pre-norm before attention
  2. y = x + Attention(x_norm) — attention sublayer with residual add
  3. y_norm = RMSNorm(y) — pre-norm before FFN
  4. z = y + FFN(y_norm) — FFN sublayer with residual add

Two RMSNorms (one before each sublayer); two residual additions (one after each); no LayerNorm at the end of the block — the next block’s pre-norm handles it. The Attention sublayer is multi-head with RoPE applied to Q, K, run through FlashAttention. The FFN sublayer is SwiGLU.

Shape preserved throughout: x, x_norm, y, y_norm, z all have shape (N, d_model). The residual stream never changes shape; it just gets additive contributions from each sublayer.
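The four steps can be sketched in plain Python to make the residual structure and the shape invariant concrete. Everything here is a toy under stated assumptions: `toy_attention` (a causal running mean) and `toy_ffn` (an elementwise tanh) are stand-ins for the real sublayers, and eps = 1e-5 is an assumed value.

```python
import math


def rmsnorm(x, gamma, eps=1e-5):
    """RMSNorm for one token: divide by the root-mean-square, scale by gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]


def add(a, b):
    """Elementwise sum of two (N, d) nested lists: the residual add."""
    return [[p + q for p, q in zip(r, s)] for r, s in zip(a, b)]


def toy_attention(h):
    """Stand-in for attention: position i averages positions 0..i (causal)."""
    out = []
    for i in range(len(h)):
        prefix = h[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out


def toy_ffn(h):
    """Stand-in for the FFN: a per-token, elementwise nonlinearity."""
    return [[math.tanh(v) for v in tok] for tok in h]


def block(x, gamma_attn, gamma_ffn):
    x_norm = [rmsnorm(tok, gamma_attn) for tok in x]   # 1. pre-norm
    y = add(x, toy_attention(x_norm))                  # 2. residual add
    y_norm = [rmsnorm(tok, gamma_ffn) for tok in y]    # 3. pre-norm
    return add(y, toy_ffn(y_norm))                     # 4. residual add


x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5], [2.0, 0.0, -2.0, 1.0]]
gamma = [1.0] * 4
z = block(x, gamma, gamma)
print(len(z), len(z[0]))  # 3 4
```

Because the only cross-token operation is causal, changing a later token leaves the outputs at earlier positions untouched, which is the property the KV cache relies on.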

The parameter count

Let’s count Llama 2 7B’s parameters from scratch to verify the math:

Layer-by-layer parameter count for Llama 2 7B (d = 4096, V = 32000, L = 32, F = 11008, H = 32, d_k = 128)

    Embedding
      E: V · d = 32000 · 4096                          = 131 M

    Per attention sublayer
      W_Q, W_K, W_V, W_O: 4 · (d · d) = 4 · 16.78 M    = 67 M

    Per FFN sublayer (SwiGLU)
      W_gate, W_up: 2 · (d · F) = 2 · 45.09 M          = 90 M
      W_down: F · d = 45.09 M                          = 45 M
      total per FFN                                    = 135 M

    Per RMSNorm
      γ: d = 4096 (no β in RMSNorm)
      total per block: 2 norms · 4096                  = 8 K

    Per block total: 67 M + 135 M + 8 K                ≈ 202.4 M
    × 32 blocks                                        ≈ 6.48 B

    Final RMSNorm: d = 4096                            ≈ 4 K
    Output projection W_out (untied): V · d            = 131 M

    Grand total: 131 M + 6.48 B + 4 K + 131 M          ≈ 6.74 B
    Official Llama 2 7B parameter count:                 6.74 B

The count matches once the separate output projection is included. RoPE contributes no learned parameters.
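The arithmetic is easy to check exactly rather than in rounded millions. The sketch below computes the total both with and without a separate output matrix; the helper name `llama_param_count` is hypothetical. The untied total is the one that lands on the official 6.74 B figure.

```python
def llama_param_count(V: int, d: int, L: int, F: int, tied: bool) -> int:
    """Exact parameter count for a Llama-style stack with n_kv_heads == n_heads."""
    embed = V * d
    attn = 4 * d * d          # W_Q, W_K, W_V, W_O
    ffn = 3 * d * F           # W_gate, W_up, W_down
    norms = 2 * d             # one RMSNorm gain vector before each sublayer
    per_block = attn + ffn + norms
    final_norm = d
    unembed = 0 if tied else V * d
    return embed + L * per_block + final_norm + unembed


untied = llama_param_count(32000, 4096, 32, 11008, tied=False)
tied = llama_param_count(32000, 4096, 32, 11008, tied=True)
print(f"{untied:,}")  # 6,738,415,616
print(f"{tied:,}")    # 6,607,343,616
```

The two totals differ by exactly V · d = 131,072,000, the size of one vocabulary matrix.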

The FFN is bigger than attention per block (135 M vs 67 M). For Llama 2 7B, FFN parameters are about 67% of the block; attention is about 33%. This is typical: modern transformers spend most parameters in the FFN, not attention.

— think, then check —

Per block, with d = 4096, F = 11008:

  • Attention: W_Q, W_K, W_V, W_O = 4 · d² = 4 · 16.78M = 67M
  • FFN: W_gate (d·F) + W_up (d·F) + W_down (F·d) = 3 · d·F = 3 · 45.1M = 135M

Per block: 67M + 135M = 202M. Ratio: FFN / total = 135/202 = 67%, attention = 33%.

Why FFN dominates:

Attention has 4 parameter matrices, each d × d. FFN has 3 parameter matrices (SwiGLU), each d × F, with F ≈ 2.7 · d. So the FFN’s parameter count is 3 · d · 2.7d ≈ 8.1 · d² vs attention’s 4 · d², about twice as many.
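The ratio depends only on F/d and the number of FFN matrices, which a short helper makes explicit. This is a sketch assuming full multi-head attention (four d × d matrices); `ffn_share` is an illustrative name, and the "classic" configuration is a generic two-matrix FFN with F = 4 · d rather than any specific model.

```python
def ffn_share(d: int, F: int, ffn_matrices: int = 3) -> float:
    """Fraction of a block's matrix parameters that sit in the FFN."""
    attn = 4 * d * d
    ffn = ffn_matrices * d * F
    return ffn / (attn + ffn)


llama = ffn_share(4096, 11008)                        # SwiGLU, F = 2.6875 * d
classic = ffn_share(4096, 4 * 4096, ffn_matrices=2)   # two-matrix FFN, F = 4 * d
print(round(llama, 4), round(classic, 4))  # 0.6684 0.6667
```

Both land at roughly two thirds: 3 · 2.6875 ≈ 8.06 and 2 · 4 = 8 are nearly the same multiple of d².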

The structural reason: F is the WIDEST dimension in the model, the opposite of a bottleneck. Each token’s residual-stream activation is “expanded” into a wider space (F ≈ 2.7 · d) where the FFN does its nonlinear computation, then projected back. This “expand → nonlinearity → contract” pattern is where most of the model’s parameters, and plausibly much of its representational capacity, live. Attention moves information BETWEEN tokens; the FFN computes WITHIN each token. The “within-token computation” budget is bigger.

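This proportion is common but not universal. GPT-3 175B (4 · d² in attention, two d × 4d FFN matrices) has the same 67/33 split. Mistral 7B tilts further toward the FFN, roughly 80/20, because grouped-query attention shrinks W_K and W_V and its F = 14336 is larger. In general the per-block ratio is set by F/d and the KV-head count, so models with an unusual F/d or aggressive GQA are the ones that move it.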

What you actually compute at inference (per generated token)

After the model is trained, generating a token from a prompt:

  1. Prefill the prompt: run the forward pass over all N prompt tokens in parallel. The KV cache fills with (K, V) at every position, in every layer. Cost per layer: O(N² · d) for attention (O(N² · d_head) per head) plus O(N · (d² + d · F)) for the projection and FFN matmuls.
  2. Decode one token: run the forward pass for ONLY the newest position. Q, K, V are computed for the one new token; K and V are appended to the cache; attention is the new Q against ALL cached K and V. Cost per token, per layer: O(N · d) for attention plus O(d² + d · F) for the matmuls.
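The two phases can be sketched with a single attention head in plain Python. This is a toy under stated assumptions: Q, K, V are supplied directly (the projections and RoPE are omitted), the cache is a plain dict, and the function names are illustrative. The point is that a cached decode step reproduces exactly what a full recomputation would give at the last position.

```python
import math


def attend(q, keys, values):
    """Single-head attention of one query against lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def prefill(qs, ks, vs):
    """Process all prompt positions; position i sees keys 0..i (causal)."""
    outputs = [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]
    cache = {"k": list(ks), "v": list(vs)}
    return outputs, cache


def decode_step(q_new, k_new, v_new, cache):
    """Append the new K, V to the cache, then attend with the one new query."""
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    return attend(q_new, cache["k"], cache["v"])


# Made-up per-position vectors for a 4-token sequence, d_head = 2.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
ks = [[1.0, 0.5], [0.2, 0.1], [-0.3, 0.8], [0.7, 0.7]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full, _ = prefill(qs, ks, vs)                       # recompute everything
_, cache = prefill(qs[:3], ks[:3], vs[:3])          # prefill a 3-token prompt
cached = decode_step(qs[3], ks[3], vs[3], cache)    # decode the 4th position
```

`prefill` does work quadratic in the prompt length, while `decode_step` touches each cached key and value once.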

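This is why the prefill pass costs far more compute than any single decode step, and why it dominates total cost when generations are short. The two phases also stress the hardware differently: prefill is typically compute-bound, because all N tokens share each weight read, while decode is typically memory-bandwidth-bound, because every step re-reads the weights and the KV cache to produce one token. Most inference engines (vLLM, TGI, TensorRT-LLM) optimise the two phases differently.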
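A rough count of multiply-accumulates (MACs) per layer puts numbers on the gap. This is a back-of-envelope sketch, not a profiler: it ignores norms, softmax, RoPE, and the vocabulary projection, assumes n_kv_heads == n_heads, and assumes 2 bytes per cached value (fp16 or bf16). All function names are illustrative.

```python
def layer_macs_prefill(N: int, d: int, F: int) -> int:
    """Approximate MACs for one layer over an N-token prompt."""
    matmuls = N * (4 * d * d + 3 * d * F)   # projections + SwiGLU FFN, per token
    attention = d * N * (N + 1)             # position i: 2*i*d for QK^T and weights @ V
    return matmuls + attention


def layer_macs_decode(N: int, d: int, F: int) -> int:
    """Approximate MACs for one layer, one new token, cache length N."""
    matmuls = 4 * d * d + 3 * d * F
    attention = 2 * N * d
    return matmuls + attention


def kv_cache_bytes(N: int, d: int, L: int, bytes_per_value: int = 2) -> int:
    """K and V, for every layer and position."""
    return 2 * L * N * d * bytes_per_value


d, F, L = 4096, 11008, 32
ratio = layer_macs_prefill(512, d, F) / layer_macs_decode(512, d, F)
print(round(ratio, 1))                            # roughly 507
print(kv_cache_bytes(4096, d, L) / 2**30, "GiB")  # 2.0 GiB
```

At N = 512 the matmul term still dominates both phases, so prefill costs roughly N decode steps; the quadratic attention term only takes over at much longer contexts.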

— think, then check —

Structural difference:

GELU FFN: h = GELU(x · W₁); out = h · W₂. Two matrices, ungated.

SwiGLU FFN: gate = Swish(x · W_gate); up = x · W_up; h = gate ⊙ up; out = h · W_down. Three matrices, gated.
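Both variants can be written side by side for a single token. The sketch uses plain nested lists and made-up toy weights (d = 2, F = 3); the helper names are illustrative and the exact erf-based GELU is an assumption, since some implementations use a tanh approximation.

```python
import math


def matvec(x, W):
    """Row vector x (length m) times matrix W (m x n nested lists)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def gelu(x):
    """Exact GELU via the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x):
    return x / (1.0 + math.exp(-x))


def gelu_ffn(x, W1, W2):
    h = [gelu(v) for v in matvec(x, W1)]           # two matrices, ungated
    return matvec(h, W2)


def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [swish(v) for v in matvec(x, W_gate)]   # Swish-activated branch
    up = matvec(x, W_up)                           # linear branch
    h = [g * u for g, u in zip(gate, up)]          # elementwise gate
    return matvec(h, W_down)                       # three matrices, gated


# Toy weights: d = 2, F = 3 (values are arbitrary).
W_in = [[0.5, -1.0, 0.25], [1.0, 0.5, -0.5]]       # (d x F)
W_in2 = [[-0.5, 0.5, 1.0], [0.25, 1.0, -1.0]]      # (d x F)
W_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]       # (F x d)

x = [1.0, 2.0]
out_gelu = gelu_ffn(x, W_in, W_out)
out_swiglu = swiglu_ffn(x, W_in, W_in2, W_out)
```

Both functions map a length-d vector to a length-d vector; the only structural difference is the extra projection and the elementwise product.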

The “gated” part is the difference. SwiGLU computes TWO separate projections of x (Swish-activated and linear), multiplies them elementwise, then projects back. This is a “multiplicative” interaction that GELU lacks.

Why multiplicative interactions matter:

In a GELU FFN, every hidden unit is a nonlinearity applied to a single weighted sum of the inputs; the activation acts as a “soft switch” per hidden dimension. In SwiGLU, each hidden unit is the product of two different projections of x, Swish(x · W_gate) ⊙ (x · W_up), so it contains products of pairs of input features. That lets a single layer express conjunctive “feature A AND feature B” relationships directly, where an ungated FFN has to approximate them.

Empirically (Shazeer 2020): at matched parameter count, SwiGLU reaches lower held-out perplexity than a GELU FFN by a small but consistent margin. “Matched” is the important word: SwiGLU has 3 matrices where GELU has 2, so F (the FFN hidden dim) is scaled down by 2/3 to keep the comparison fair.

For Llama 2 7B: F = 11008, which is 2/3 of the conventional 4 · d = 16384 (≈ 10923) rounded up to a multiple of 256. If they had used GELU at the same parameter count, they would have set F ≈ 16384 (since GELU has 2 matrices vs SwiGLU’s 3). The parameter count is roughly the same either way; choosing SwiGLU means trading some FFN width for gated structure, and it paid off.
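The value 11008 falls out of a two-step rule: take 2/3 of the usual 4 · d width, then round up to a hardware-friendly multiple. The sketch below assumes the multiple is 256; the function name is illustrative.

```python
def swiglu_hidden_dim(d: int, multiple_of: int = 256) -> int:
    """2/3 of the conventional 4*d width, rounded up to a multiple of `multiple_of`."""
    target = (2 * 4 * d) // 3                       # floor(8d / 3)
    return multiple_of * ((target + multiple_of - 1) // multiple_of)


d = 4096
F_swiglu = swiglu_hidden_dim(d)       # 11008
F_gelu = 4 * d                        # 16384

swiglu_params = 3 * d * F_swiglu      # three matrices
gelu_params = 2 * d * F_gelu          # two matrices
print(F_swiglu, swiglu_params, gelu_params)  # 11008 135266304 134217728
```

The two FFN designs differ by under 1% in parameter count, which is what "parameter parity" means here.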

Why a small win is worth the complexity:

At LLM scale, small perplexity gains compound into noticeable downstream task improvement, and the other way to buy them is more parameters: a 1.05B model costs ~5% more to train and run than a 1B model. If SwiGLU delivers, at parameter parity, the quality of a model 1-2% larger, that gain is effectively free. Worth it.

This is a recurring pattern in modern architecture: incremental gated/multiplicative variants of the FFN (SwiGLU, GeGLU, ReGLU) each give small but compounding wins; the field has converged on SwiGLU as the sweet spot of complexity vs gain.

Next: §15.2 — Encoder vs decoder vs encoder-decoder. Why “predict next token” turned out to be a universal task, and why every modern frontier model is decoder-only.