IV — What Makes an LLM → Chapter 20
FROM SYSTEMS TO FRONTIER ML

Reasoning models

The 2024–2025 shift to test-time compute scaling. o1, R1, Claude extended thinking: train a base model, SFT on long chain-of-thought, then RL with verifiable rewards (GRPO, scored by rule-based checkers or by process reward models). Inference cost rises roughly 10–100× to buy quality. Arguably the biggest shift of the post-Chinchilla era, and a shift in where compute is spent rather than in the model architecture itself.
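To make "RL with verifiable rewards" concrete, here is a minimal sketch of the two pieces the summary names: a rule-based reward and GRPO's group-relative advantage. It is an illustration under stated assumptions, not the recipe of any particular lab. The `\boxed{...}` answer format, the function names, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
import re
import statistics


def rule_based_reward(completion: str, expected: str) -> float:
    # Hypothetical verifier: reward 1.0 if the last \boxed{...} answer in the
    # completion matches the expected string exactly, otherwise 0.0.
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if answers and answers[-1].strip() == expected.strip():
        return 1.0
    return 0.0


def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: normalize each reward against the other samples
    # drawn for the same prompt, so no learned value network is needed.
    # Assumption: population std; eps avoids division by zero when every
    # sample in the group receives the same reward.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, a group of four sampled completions (made-up strings).
completions = [
    "... so the total is \\boxed{42}",
    "... therefore \\boxed{41}",
    "... giving \\boxed{42}",
    "I am not sure.",
]
rewards = [rule_based_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason prompt difficulty matters in this style of training.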

§1 The shift: sample more vs think more
§2 Training with verifiable rewards: the DeepSeek R1 recipe
§3 What works, what doesn't, and the open seams
