The GPT architecture, end to end

§1 The decoder-only stack — Llama 2 7B, end to end
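Walk a single token from input ID to output logits through a real decoder-only transformer. Embedding (32K × 4096) → 32 × (RMSNorm + multi-head attention + RMSNorm + SwiGLU-FFN) → final RMSNorm → unembed (4096 × 32K, a separate matrix, not tied to the embedding) → softmax. Real shapes from Llama 2 7B (about 6.7 billion parameters, 4096-dim, 32 layers, 32 heads). The 7B model uses standard multi-head attention with 32 key/value heads; in the Llama 2 family, grouped-query attention (GQA) appears only in the larger variants. The forward pass is the assembly of everything from Part III.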
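The shapes in this walk are enough to reproduce the parameter count by hand. The sketch below does that arithmetic in C. The struct and function names are illustrative, and two inputs are assumptions not stated in the text above: an FFN hidden width of 11008 (the published Llama 2 7B configuration) and an output matrix that is stored separately from the embedding. No bias terms are counted.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative shape description; field names are hypothetical. */
typedef struct {
    uint64_t vocab;        /* 32000: the "32K" in the text */
    uint64_t dim;          /* 4096 */
    uint64_t layers;       /* 32 */
    uint64_t ffn_dim;      /* 11008: assumed, not stated in the text */
    int      tied_unembed; /* 0: assumed separate output matrix */
} ModelShape;

/* Count weights for embedding + N decoder blocks + final norm + unembed. */
uint64_t param_count(const ModelShape *m) {
    assert(m->dim > 0 && m->layers > 0);
    uint64_t embed   = m->vocab * m->dim;
    uint64_t attn    = 4 * m->dim * m->dim;      /* Wq, Wk, Wv, Wo with full multi-head K/V */
    uint64_t ffn     = 3 * m->dim * m->ffn_dim;  /* SwiGLU: gate, up, down projections */
    uint64_t norms   = 2 * m->dim;               /* two RMSNorm gain vectors per block */
    uint64_t block   = attn + ffn + norms;
    uint64_t unembed = m->tied_unembed ? 0 : m->vocab * m->dim;
    return embed + m->layers * block + m->dim /* final RMSNorm */ + unembed;
}
```

Under these assumptions, `param_count` returns 6,738,415,616 for `{32000, 4096, 32, 11008, 0}`, which is where the rounded "7B" comes from. The 32 blocks account for about 6.48 billion of that; the two vocabulary matrices contribute about 131 million each.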
§2 Encoder vs decoder vs encoder-decoder
Three transformer architectures competed from 2018 to 2022: BERT (encoder-only, bidirectional, masked-LM), T5 (encoder-decoder, span corruption), GPT (decoder-only, causal next-token). Decoder-only won, not because it was theoretically better but for three practical reasons: next-token prediction turned out to be a universal training task, the causal mask lets one model serve any prefix length, and the autoregressive structure matches how text is actually generated, one token at a time, left to right.
§3 Sampling — from logits to tokens
The model produces a logit vector over the vocabulary at every step, and softmax turns it into a probability distribution. How do you pick a token? Greedy (argmax) is deterministic but produces low-diversity output. Temperature divides the logits by T before softmax: T < 1 sharpens the distribution, T > 1 flattens it. Top-k truncates to the k most likely tokens and renormalizes. Top-p (nucleus) keeps the smallest set with cumulative probability ≥ p. Beam search keeps multiple candidate sequences alive at once. A real C kernel runs all five on a realistic logit distribution and compares the empirical token distributions.