Pretraining

§1 Pretraining — next-token prediction at trillion-token scale
Pretraining is conceptually a single line: minimise the cross-entropy of next-token prediction over a trillion-token corpus. In practice it is among the most expensive software-engineering projects ever undertaken: a 70B model trained on ~1.4T tokens takes ~6×10²³ FLOPs (by the usual 6 · N · D estimate), on the order of 30 days on ~2000 H100s and ~$30M of compute. This section walks the loop: tokenize a batch, run the forward pass in bf16, accumulate gradients across micro-batches, all-reduce them across workers, and take an AdamW step on the fp32 master copy of the weights. The architecture is fixed by Ch.11-15. The data and the budget are what matter.
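The loop described above can be sketched on a toy problem. This is a minimal, standard-library-only illustration, not a real training stack: a two-weight linear model stands in for the transformer, Python lists stand in for tensors, Python floats stand in for the fp32 master copy, the "all-reduce" is a plain sum over simulated workers, and `to_bf16` truncates rather than rounds. All names (`to_bf16`, `grad_sum`, `train_step`, `TOY_DATA`) are hypothetical.

```python
import math
import struct

# Toy regression data generated from y = 2*x0 - 1*x1 (illustrative only).
TOY_DATA = [((1.0, 2.0), 0.0), ((2.0, 1.0), 3.0),
            ((3.0, -1.0), 7.0), ((-1.0, 1.0), -3.0)]


def to_bf16(x):
    """Simulate a bf16 cast by keeping the top 16 bits of the float32 encoding.
    Assumption: truncation; real hardware usually rounds to nearest."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def grad_sum(weights, examples):
    """Sum (not mean) of per-example gradients of 0.5 * (w.x - y)^2."""
    grad = [0.0] * len(weights)
    for x, y in examples:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad


def train_step(master, m, v, step, worker_shards, micro_batch_size,
               lr=0.1, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    """One optimizer step; updates master, m and v in place, returns the gradient."""
    # Forward/backward use a low-precision copy of the master weights.
    compute_w = [to_bf16(w) for w in master]
    n_examples = sum(len(shard) for shard in worker_shards)

    # Each worker accumulates gradients over its micro-batches.
    per_worker = []
    for shard in worker_shards:
        acc = [0.0] * len(master)
        for start in range(0, len(shard), micro_batch_size):
            micro = shard[start:start + micro_batch_size]
            acc = [a + g for a, g in zip(acc, grad_sum(compute_w, micro))]
        per_worker.append(acc)

    # Simulated all-reduce: sum across workers, then average over the batch.
    grad = [sum(col) / n_examples for col in zip(*per_worker)]

    # AdamW with decoupled weight decay, applied to the master copy.
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        master[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps)
                           + weight_decay * master[i])
    return grad
```

Calling `train_step(master, m, v, 1, [TOY_DATA[:2], TOY_DATA[2:]], micro_batch_size=1)` with zero-initialised `master`, `m` and `v` splits the batch over two simulated workers with two micro-batches each. The point of the sketch is that the accumulated, reduced gradient matches the gradient of the full batch, so micro-batching changes memory use rather than the update.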
§2 The data pipeline — Common Crawl to training tokens
Pretraining data is the biggest knob in LLM quality. The pipeline starts with Common Crawl (~10 PB of raw HTML, ~100B web pages) and produces ~1-15T high-quality tokens through five stages: URL filtering, language ID, content extraction, quality classification, and deduplication. C4 → RefinedWeb → FineWeb is the evolution of this pipeline; each successive corpus refines the filtering and deduplication, and trains a better model than its predecessor at a matched token budget. The lesson: more data isn't the answer; better data is.
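The five stages can be sketched as a chain of filters over (URL, HTML) pairs. This is a toy under loud assumptions: the blocklist, the stopword-based language check, the word-count quality rule and the exact-hash deduplication are hypothetical stand-ins for the trained classifiers and fuzzy deduplication that production pipelines use, and every name below is invented for illustration. The sketch runs extraction before language ID so that the language heuristic sees text rather than markup; that ordering is a choice of this sketch.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example"}  # hypothetical blocklist
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "a"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def url_ok(url):
    """Stage 1: drop pages from blocklisted hosts."""
    return urlparse(url).hostname not in BLOCKED_DOMAINS


def extract_text(html):
    """Stage 3: strip markup and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def looks_english(text, min_ratio=0.15):
    """Stage 2: crude language ID via the share of English stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    return sum(w in ENGLISH_STOPWORDS for w in words) / len(words) >= min_ratio


def quality_ok(text, min_words=8, min_unique_ratio=0.5):
    """Stage 4: heuristic quality rule (length and repetition)."""
    words = text.split()
    return len(words) >= min_words and len(set(words)) / len(words) >= min_unique_ratio


def run_pipeline(pages):
    """Apply all stages; stage 5 is exact deduplication on a text hash."""
    seen, kept = set(), []
    for url, html in pages:
        if not url_ok(url):
            continue
        text = extract_text(html)
        if not looks_english(text) or not quality_ok(text):
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

`run_pipeline` takes an iterable of `(url, html)` tuples and returns the surviving documents as plain text. Each stage only ever removes documents, which is why the output of a pipeline like this is a small fraction of the raw crawl.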
§3 Chinchilla scaling laws — the compute-optimal frontier
For a fixed compute budget C, the model that minimises test loss has a specific trade-off between parameter count N and training tokens D. Kaplan et al. (2020) favoured large N and comparatively little D. Chinchilla (Hoffmann et al., 2022) favoured a smaller N and a much larger D, roughly D ≈ 20 · N. GPT-3 (175B parameters, ~300B tokens, under 2 tokens per parameter) was radically under-trained by Chinchilla standards. Llama 2/3 were intentionally over-trained (more D than compute-optimal) because inference cost compounds over a model's lifetime, while training compute is paid once.
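The D ≈ 20 · N rule turns a compute budget into a concrete model size once a cost model is fixed. The sketch below assumes the common approximation C ≈ 6 · N · D for dense transformers (an assumption of this example, not a statement from the text above), which gives N = sqrt(C / 120) and D = 20 · N. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ~ 6 * N * D for a dense transformer


def training_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params


def token_ratio(n_params, n_tokens):
    """Tokens seen per parameter; about 20 is the Chinchilla rule of thumb."""
    return n_tokens / n_params
```

For example, `compute_optimal(training_flops(70e9, 1.4e12))` recovers N = 70B and D = 1.4T by construction, while `token_ratio(175e9, 300e9)` for GPT-3 comes out below 2, far under the ratio of 20. Over-training in the Llama sense corresponds to deliberately choosing a `token_ratio` well above 20 for a smaller N.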
