THE GPT ARCHITECTURE, END TO END
Section 15.2

Encoder vs decoder vs encoder-decoder

2018–2022 was a three-way race for the dominant transformer architecture. BERT (encoder-only, bidirectional attention, trained with masked language modelling) was the giant of 2018–2020 for classification, QA, and feature extraction. T5 (encoder-decoder, with a bidirectional encoder and an autoregressive decoder, trained with span corruption) was Google’s bet on a unified text-to-text framing. GPT (decoder-only with causal attention, trained on next-token prediction) was OpenAI’s bet, and the one most of the field underestimated. By 2023, the race was over: nearly every frontier model is decoder-only. This section explains why.

The three architectures

BERT (encoder-only):
  tokens → embed → bidirectional attention × L → [CLS] vector + per-token outputs
  Attention mask: full; every position attends to every other position.
  Training: 15% of input tokens replaced with [MASK]; predict the masked tokens.
  Use: extract features; fine-tune for classification, NER, QA.
  Cannot generate text autoregressively (the bidirectional attention sees the future).

T5 (encoder-decoder):
  source → encoder (bidirectional) → encoder states K, V
  target → decoder (causal self-attn + cross-attn to encoder states) → autoregressive output
  Training: corrupt spans in source, predict spans in target.
  Use: any "text in, text out" task can be cast as: source → target.
  Generates well, but requires running both encoder and decoder at inference.

GPT (decoder-only):
  tokens → embed → causal attention × L → per-position next-token logits
  Attention mask: lower-triangular; position i only attends to 1..i (NOT future).
  Training: predict the next token at every position, given prior tokens.
  Use: generation; classification by sampling; in-context learning; everything.
  Single forward pass produces N next-token predictions in parallel during training.

The causal mask is the load-bearing detail. Without it, “predict next token” is trivial: the model sees the next token in the input and outputs it. With the lower-triangular mask, each position is forced to predict based only on what came before. This single mask turns one forward pass on a length-N sequence into N independent next-token-prediction tasks, all learned in parallel.

Attention with causal mask, position i sees only positions 1..i:

  S_raw[i, j] = Q_i · K_jᵀ / √d_k              (all i, j)
  S[i, j]     = S_raw[i, j]    if j ≤ i
              = −∞             if j > i         ← the mask
  softmax(S) has zeros for all j > i            ← future positions get zero attention
  attn(i)     = Σ_{j ≤ i} weight[i, j] · V_j    ← sum only over past positions

The mask is implemented as a constant matrix with zeros on and below the diagonal (j ≤ i) and −∞ above it (j > i), added to the score matrix before softmax. It costs almost nothing to apply. Roughly half of the score matrix (the upper triangle) is masked out, so computing those entries is wasted work, which optimised kernels can skip.
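The masked-attention recipe above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not an optimised kernel: the function name `causal_attention` and the array shapes are assumptions made for the example, and a real implementation would skip the upper triangle rather than compute and discard it.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head attention with an additive causal mask.

    Q, K have shape (N, d_k); V has shape (N, d_v).
    Returns the attended outputs (N, d_v) and the weights (N, N).
    """
    N, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # S_raw[i, j]
    allowed = np.tril(np.ones((N, N), dtype=bool))       # True where j <= i
    mask = np.where(allowed, 0.0, -np.inf)               # 0 for past/current, -inf for future
    scores = scores + mask                               # mask is added BEFORE softmax
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) is exactly 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights
```

Because row i has non-zero weight only on columns j ≤ i, changing a later token leaves every earlier output unchanged, which is the property next-token training relies on.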

— think, then check —

The mask: a matrix M with zeros on and below the diagonal and −∞ above it. For position i attending to position j, M[i,j] = 0 if j ≤ i (past or current position), or −∞ if j > i (future position). M is added to the attention scores BEFORE the softmax, so allowed scores are unchanged and future scores become −∞.

Effect: softmax applied to a row with some −∞ entries produces zeros at those positions. So position i’s attention weight on position j is 0 whenever j > i — position i cannot “see” any future token.

Why needed for next-token prediction:

The training objective is “given tokens 1..i, predict token i+1.” If position i could see token i+1 in its attention, the task is trivial (the answer is in the input). The causal mask FORCES position i to predict from positions 1..i only, making the task non-trivial and forcing the model to learn meaningful representations.

N training signals in one forward pass:

At position 1, the model predicts token 2 from token 1’s embedding.

At position 2, the model predicts token 3 from tokens 1..2.

At position i, the model predicts token i+1 from tokens 1..i.

One forward pass through the decoder produces N hidden states (one per position); each is followed by an unembedding to logits and a cross-entropy loss against the next token. Loss = mean of N per-position losses. The causal mask ensures these N predictions are independent (each only depends on the past), so they can all be computed in parallel within one forward pass.
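The per-position loss described above can be sketched as follows. This is an illustrative NumPy version under assumptions: `logits` stands in for the unembedded hidden states of a real model, and the function name `next_token_loss` is hypothetical. Note that a sequence of N tokens carries N − 1 targets inside itself, because the last position has no next token to predict.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Mean cross-entropy over all next-token predictions in one sequence.

    logits: (N, vocab) array, one row of scores per position.
    tokens: (N,) integer token ids for the same sequence.
    """
    pred_logits = logits[:-1]        # positions that have a next token
    targets = tokens[1:]             # the token that follows each of them
    # Numerically stable log-softmax over the vocabulary.
    shifted = pred_logits - pred_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to each true next token.
    per_position = -log_probs[np.arange(len(targets)), targets]
    return per_position.mean(), per_position
```

With uniform logits over a vocabulary of size V, every per-position loss equals ln V, which is a handy sanity check for an untrained model.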

This is why decoder-only transformers are so training-efficient: a length-N sequence gives N supervised examples for free. BERT only gets ~0.15·N (the 15% masking rate). T5 gets a similar fraction. Decoder-only models therefore have roughly 6-7× (1/0.15 ≈ 6.7) higher effective training-signal density per token of compute.

The training objectives, compared

BERT, masked language modelling (MLM):
  "The [MASK] sat on the mat." → predict "cat"
  - random 15% of tokens replaced with [MASK]
  - bidirectional context (can see both sides of [MASK])
  - 0.15 supervised tokens per input token = 15% training signal density

T5, span corruption:
  "The <X> sat on the <Y>." → output: "<X> cat <Y> mat"
  - random spans masked with sentinel tokens
  - encoder sees the corrupted source; decoder generates the missing spans
  - typically 15% corruption rate = similar training signal density to BERT

GPT, autoregressive language modelling:
  "The cat sat on the mat." → predict each next token
  - every position predicts the next position's token
  - causal mask prevents looking ahead
  - 100% supervised tokens per input token (minus the first one) = 6-7× the density
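To make the three objectives concrete, here is a small sketch that builds one training example of each kind from the same sentence. All three function names are hypothetical, the masked positions and spans are chosen by hand rather than sampled, and the `<X>`, `<Y>` sentinels follow the notation used above rather than any particular tokenizer's vocabulary.

```python
def mlm_example(tokens, masked_positions):
    """BERT-style: replace the chosen positions with [MASK]; targets are the originals."""
    source = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return source, targets

def span_corruption_example(tokens, spans):
    """T5-style: replace each (start, end) span with a sentinel token.

    spans must be sorted and non-overlapping; this toy supports up to three.
    """
    sentinels = ["<X>", "<Y>", "<Z>"]
    source, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        source += tokens[cursor:start] + [sentinel]   # keep text before the span
        target += [sentinel] + tokens[start:end]      # decoder must emit the span
        cursor = end
    source += tokens[cursor:]                         # keep text after the last span
    return source, target

def causal_lm_example(tokens):
    """GPT-style: every position's target is simply the following token."""
    return tokens[:-1], tokens[1:]

sentence = ["The", "cat", "sat", "on", "the", "mat", "."]
```

Calling `span_corruption_example(sentence, [(1, 2), (5, 6)])` reproduces the example in the text: the source becomes "The <X> sat on the <Y> ." and the target "<X> cat <Y> mat". The causal version yields six targets from seven tokens, against one target for the single-mask MLM example.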

This is the most underappreciated reason decoder-only won: training-signal density. For the same training data and compute, a decoder-only model gets roughly 6-7× more supervised signal per token than BERT or T5. The empirical consequence: at any compute budget, decoder-only models reach lower perplexity faster.

— think, then check —

BERT’s training signal:

BERT randomly masks 15% of input tokens. For each masked position, the model produces a prediction and computes a cross-entropy loss. Tokens that aren’t masked produce no loss (their representations are computed, but no supervision).

So per N input tokens, BERT gets ~0.15 · N supervised predictions.

GPT’s training signal:

GPT predicts the next token at EVERY position. The causal mask ensures each position only sees prior tokens, so the prediction at each position is non-trivial. Per N input tokens, GPT gets N − 1 ≈ N supervised predictions (every position except the last has a “next token” target).

The ratio: N / (0.15·N) = 1/0.15 ≈ 6.7×. GPT gets ~6.7× more loss-contributing predictions per token of input data.

Consequence for compute efficiency:

To learn a useful representation, the model needs a certain TOTAL number of supervised predictions. BERT needs 6.7× more input tokens to get the same total signal that GPT extracts from 1× input tokens. Equivalently: for the same data budget, GPT receives 6.7× more supervised predictions to learn from.

At Chinchilla-scale (Ch.16 §3), this matters a lot. A 70B-parameter model needs ~1.4T training tokens to be compute-optimal as a decoder-only model. The same 70B as BERT-style would need ~9T tokens of equivalent signal — much more data, much more compute.
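The arithmetic behind these figures is short enough to check directly. The sketch below only restates the section's own back-of-envelope accounting, under the assumption that training signal scales linearly with the number of supervised predictions; the function names are hypothetical.

```python
MASK_RATE = 0.15  # fraction of tokens supervised per sequence under MLM (from the text)

def supervised_predictions(n_tokens, objective):
    """Count loss-contributing predictions for a sequence of n_tokens."""
    if objective == "mlm":
        return MASK_RATE * n_tokens      # only masked positions produce a loss
    if objective == "causal":
        return n_tokens - 1              # every position except the last
    raise ValueError(f"unknown objective: {objective}")

def density_ratio(n_tokens):
    """How many times more supervised predictions causal LM gets than MLM."""
    return supervised_predictions(n_tokens, "causal") / supervised_predictions(n_tokens, "mlm")

def equivalent_mlm_tokens(causal_tokens):
    """Input tokens MLM would need to match the signal of a causal-LM token budget."""
    return causal_tokens / MASK_RATE
```

For a 2048-token sequence `density_ratio` gives about 6.66, and `equivalent_mlm_tokens(1.4e12)` gives about 9.3e12, which is where the ~9T figure comes from.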

This isn’t the only reason decoder-only won (generation, in-context learning, simpler serving all matter too), but it’s the underrated one. Per dollar of training compute, you get a stronger model by giving it dense supervised signal.

Why decoder-only won

Three structural reasons:

  1. Training-signal density (above): roughly 6-7× more loss per token = much more efficient training.

  2. Unified prefix — a decoder-only model is structurally just “given a prefix, predict the suffix.” Any task can be cast in this form: classification (“Question: X. Answer:”) → next-token; QA (“Context: … Question: … Answer:”) → next-token; translation (“English: X. French:”) → next-token. In-context learning — the ability to learn a task from a few examples in the prompt — is most natural in a decoder-only architecture.

  3. Simpler serving — one model, one forward pass, one KV cache. T5’s encoder-decoder requires running the encoder once per request then the decoder per token; the KV cache is more complex; the deployment overhead is higher. Decoder-only is operationally simpler at every level.
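The "one KV cache" point in the list above can be illustrated with a toy single-head example: at each decoding step the new token's key and value are appended to the cache, and only the new query is attended. This is a sketch under assumptions (no batching, no multi-head, no learned projections; `KVCache` and `full_causal_attention` are hypothetical names), intended only to show that cached step-by-step decoding matches full causal attention.

```python
import numpy as np

def full_causal_attention(Q, K, V):
    """Reference: causal attention over the whole sequence at once."""
    N, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    allowed = np.tril(np.ones((N, N), dtype=bool))
    scores = np.where(allowed, scores, -np.inf)          # hide future positions
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=1, keepdims=True)) @ V

class KVCache:
    """Stores keys and values of all tokens seen so far in one sequence."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Append the new token's key/value, then attend with its query only."""
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)                          # (t, d_k), t = tokens so far
        V = np.stack(self.values)                        # (t, d_v)
        scores = K @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        return weights @ V
```

No mask is needed in `step`, because the cache only ever contains past and current tokens. An encoder-decoder model would need this cache for decoder self-attention plus a second set of encoder keys and values for cross-attention.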

— think, then check —

The strongest reason: emergence of in-context learning and task unification.

GPT-3 (Brown et al., 2020) demonstrated that a sufficiently large decoder-only model could perform tasks given only examples in its prompt, with no fine-tuning and no task-specific architecture. This was unique to autoregressive LM: the model “reads” the prefix as if it were just more training data, and the autoregressive structure naturally conditions on it.

BERT can’t do this: its bidirectional attention assumes a fixed format (with [MASK]s in specific places). T5 can do it weakly but the encoder-decoder split makes the prefix conditioning awkward.

Once in-context learning worked, the case for fine-tuning a separate model per task evaporated. One big decoder-only model + prompt engineering replaced thousands of fine-tuned BERTs.
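In practice, "examples in the prompt" means nothing more than string formatting: demonstrations are laid out as a prefix and the model is asked to continue it. The helper below is a hypothetical sketch of that layout; the labels and the blank-line separator are illustrative choices, not a format prescribed by any model.

```python
def few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Build a prefix of worked examples followed by an unanswered query.

    examples: list of (input_text, output_text) pairs shown as demonstrations.
    The prompt ends right after the output label, so the model's next tokens
    are its answer.
    """
    lines = []
    for x, y in examples:
        lines.append(f"{input_label}: {x}")
        lines.append(f"{output_label}: {y}")
        lines.append("")                      # blank line between demonstrations
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("cheese", "fromage"), ("house", "maison")],
    "bread",
    input_label="English",
    output_label="French",
)
```

The same function covers classification, QA, or translation by changing only the labels and examples, which is the task unification the text describes.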

Supporting reasons:

  1. Training signal density. Decoder-only gets roughly 6-7× more loss signal per input token (every position predicts the next). BERT/T5 get ~15% of tokens supervised. Same data budget → 6-7× more supervised predictions per token.
  2. Generation is what users want. Chat and content creation are autoregressive by nature. BERT can’t generate; T5 can but requires more infrastructure.
  3. Simpler serving. One forward pass, one KV cache, no encoder/decoder split. Operationally clean. Cheaper to deploy at scale.
  4. Scaling laws favoured decoder-only. Chinchilla and follow-up scaling-law work, carried out on decoder-only models, found that the architecture continues improving smoothly as you scale; BERT-style scaling is argued to plateau earlier (less to learn from each token).

What BERT/T5 had that didn’t matter:

  • Bidirectional attention (BERT) — better for fixed-format classification, but classification with a decoder-only LLM via prompting matches BERT performance on standard benchmarks at sufficient scale.
  • Cleaner classification head (BERT) — at small scale, BERT was better. At large scale, the gap closes.
  • Unified text-to-text framing (T5) — turned out to be redundant; decoder-only does this naturally.
  • Encoder-side bidirectional context (T5) — useful for source comprehension in translation, but in-context learning provides this through prefix conditioning at scale.

The lesson: architecture differences that mattered at small scale (where data is limited) stopped mattering at large scale (where compute and data are abundant). The simpler architecture (decoder-only) won because it was simpler, more efficient, and reached the same capabilities at scale.

Next: §15.3 — Sampling. The model produces logits; the inference engine has to turn them into tokens. Greedy, temperature, top-k, top-p (nucleus), beam search, and speculative decoding (already covered in Ch.S).