Reasoning models

The 2024-2025 shift to test-time compute scaling. o1, R1, Claude extended thinking: train a base model, SFT on long chain-of-thought, then RL with verifiable rewards (GRPO + rule-based or process reward models). Inference cost goes up 10-100× to buy quality. The biggest shift in deployment economics of the post-Chinchilla era.

§1 The shift — sample more vs think more
§2 Training with verifiable rewards — the DeepSeek R1 recipe
§3 What works, what doesn't, and the open seams

§1 The shift — sample more vs think more
Pre-2024 LLM scaling was almost entirely about TRAINING compute — bigger model, more tokens, fixed cost per inference. o1 (Sep 2024) and DeepSeek R1 (Jan 2025) cracked open a second axis: TEST-TIME compute. Spend 10-100× more inference per question — long chain-of-thought, best-of-N, verifier-checked attempts — and quality goes up reliably for tasks with checkable answers. This is the biggest economic shift in LLM deployment since Chinchilla.
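
To make the "sample more" half concrete, here is a minimal sketch of verifier-checked best-of-N. Everything in it is a toy assumption, not any lab's implementation: `make_toy_sampler` stands in for a model that answers correctly with probability `p_correct`, and the verifier is an exact-match check against a known answer. With independent samples, the chance that at least one of N attempts passes is 1 - (1 - p)^N, which is why extra inference compute buys accuracy when answers are checkable.

```python
import random
from typing import Callable, Optional


def best_of_n(sample: Callable[[], str],
              verify: Callable[[str], bool],
              n: int) -> Optional[str]:
    """Draw up to n candidates and return the first one the verifier accepts."""
    for _ in range(n):
        candidate = sample()
        if verify(candidate):
            return candidate
    return None  # no attempt passed the check


def make_toy_sampler(p_correct: float, rng: random.Random) -> Callable[[], str]:
    """Hypothetical stand-in for a model: right answer ("42") with prob. p_correct."""
    def sample() -> str:
        if rng.random() < p_correct:
            return "42"
        return str(rng.randint(0, 41))  # always a wrong answer
    return sample


def solve_rate(n: int, p_correct: float = 0.2,
               trials: int = 2000, seed: int = 0) -> float:
    """Fraction of toy questions solved when each gets up to n verified attempts."""
    rng = random.Random(seed)
    sampler = make_toy_sampler(p_correct, rng)

    def verify(answer: str) -> bool:
        return answer == "42"

    solved = sum(best_of_n(sampler, verify, n) is not None for _ in range(trials))
    return solved / trials
```

Comparing `solve_rate(1)` with `solve_rate(16)` shows the effect: the per-attempt success rate is unchanged, but the solved fraction climbs toward 1 as N grows. The cost grows too, since up to N samples are paid for per question.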
§2 Training with verifiable rewards — the DeepSeek R1 recipe
Reasoning models are trained in roughly four stages: a pretrained base, SFT on long chain-of-thought data, RL with verifiable rewards (GRPO), and a final distillation step. DeepSeek R1 made the recipe public in January 2025, including the surprising R1-Zero result, where pure RL on a base model (no SFT) bootstrapped reasoning capability on its own. This section walks the full pipeline, the verifier-design problem, and why GRPO turned out to fit this regime so well.
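
The core of the RL stage can be sketched in a few lines. The example below is an illustrative assumption, not DeepSeek's code: `rule_based_reward` is a hypothetical exact-match verifier that expects completions to end with `Answer: <value>`, and `group_relative_advantages` shows the group-relative normalization that gives GRPO its name. Each sampled completion for a prompt is scored against the mean and standard deviation of its own group, so no learned value network is needed. Using the population standard deviation and an `eps` term for stability is a choice made here.

```python
import re
import statistics
from typing import List


def rule_based_reward(completion: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 if the text after 'Answer:' equals the reference."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match and match.group(1) == reference:
        return 1.0
    return 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, a group of four sampled completions, reference answer "12".
group = [
    "3 * 4 = 12. Answer: 12",
    "3 + 4 = 7. Answer: 7",
    "I am not sure.",
    "4 * 3 is twelve. Answer: 12",
]
rewards = [rule_based_reward(c, "12") for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and are reinforced; those below it are pushed down. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason verifier design and prompt difficulty matter so much in this regime.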
§3 What works, what doesn't, and the open seams
Reasoning models are SOTA on math, code, and formal logic as of mid-2026, but show clear failure modes on subjective tasks, long-horizon planning, and tasks where the verifier signal is noisy or missing. This section maps where reasoning models genuinely advance the frontier, where they regress, and the open research seams worth watching. It closes the chapter with the inference-economics view.
