Decoder-only stack, the residual stream as a highway, logits → sampling. Encoders vs decoders vs encoder-decoder — and why decoder-only won.
Walk a single token from input ID to output logits through a real decoder-only transformer: embedding (32K × 4096) → 32 × (RMSNorm + attention + RMSNorm + SwiGLU-FFN) → final RMSNorm → unembed (4096 × 32K) → softmax. The shapes are those of Llama 2 7B (about 6.7 billion parameters, 4096-dim, 32 layers, 32 heads). At this size Llama 2 uses standard multi-head attention and a separate, untied unembedding matrix; grouped-query attention (GQA) appears only in the larger Llama 2 models. The forward pass is the assembly of everything from Part III.
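The shapes above are enough to estimate the parameter count by hand. The following is a minimal sketch, not the chapter's own code: `ModelConfig`, `block_params`, and `total_params` are hypothetical names, and it assumes bias-free linear layers, a vocabulary of exactly 32,000 tokens, and a SwiGLU hidden width of 11,008 (the value in the published Llama 2 7B configuration, which the text above does not state). The `n_kv_heads` field covers both GQA and plain multi-head attention (the two coincide when it equals `n_heads`), and the `tied_unembed` flag covers both a shared and a separate output matrix.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical config struct; field names are illustrative only. */
typedef struct {
    uint64_t vocab;      /* vocabulary size */
    uint64_t d_model;    /* residual-stream width */
    uint64_t n_layers;   /* number of decoder blocks */
    uint64_t n_heads;    /* query heads */
    uint64_t n_kv_heads; /* key/value heads; equal to n_heads means plain MHA */
    uint64_t d_ffn;      /* SwiGLU hidden width (assumed, see lead-in) */
} ModelConfig;

/* Weights in one decoder block, assuming no bias terms:
   Q/K/V/O projections, three SwiGLU matrices, two RMSNorm gain vectors. */
uint64_t block_params(const ModelConfig *c) {
    assert(c->d_model % c->n_heads == 0);
    assert(c->n_heads % c->n_kv_heads == 0);
    uint64_t head_dim = c->d_model / c->n_heads;
    uint64_t kv_dim   = head_dim * c->n_kv_heads;
    uint64_t attn  = 2 * c->d_model * c->d_model  /* W_q and W_o */
                   + 2 * c->d_model * kv_dim;     /* W_k and W_v */
    uint64_t ffn   = 3 * c->d_model * c->d_ffn;   /* gate, up, down */
    uint64_t norms = 2 * c->d_model;              /* pre-attention, pre-FFN */
    return attn + ffn + norms;
}

/* Whole model: embedding + blocks + final RMSNorm (+ unembed if not tied). */
uint64_t total_params(const ModelConfig *c, int tied_unembed) {
    uint64_t embed   = c->vocab * c->d_model;
    uint64_t unembed = tied_unembed ? 0 : embed;
    return embed + c->n_layers * block_params(c) + c->d_model + unembed;
}
```

Under these assumptions, a config of `{32000, 4096, 32, 32, 32, 11008}` gives 202,383,360 weights per block. The total is 6,738,415,616 with a separate unembedding matrix and 6,607,343,616 with a shared one, so either way the "7B" label is a rounded figure.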
Three transformer architectures fought it out from 2018 to 2022: BERT (encoder-only, bidirectional, masked-LM), T5 (encoder-decoder, span corruption), and GPT (decoder-only, causal next-token). Decoder-only won, not because it was theoretically better, but because next-token prediction turned out to be a universal training task, the causal mask let one model serve any prefix length, and the autoregressive structure matches how text is actually generated: one token at a time, left to right.
The model produces a probability distribution over the vocabulary at every step. How do you pick a token? Greedy decoding (argmax) is deterministic but produces low-diversity output. Temperature scales the logits before softmax. Top-k truncates the distribution to the k most likely tokens. Top-p (nucleus) keeps the smallest set of tokens whose cumulative probability is ≥ p. Beam search keeps multiple candidate sequences. A real C kernel runs all five strategies on a realistic logit distribution and compares the empirical token distributions.