The 2024-2025 shift to test-time compute scaling. o1, R1, Claude extended thinking: train a base model, SFT on long chain-of-thought, then RL with verifiable rewards (GRPO + rule-based or process reward models). Inference cost goes 10-100× to buy quality. The biggest architectural shift of the post-Chinchilla era.
§1 The shift — sample more vs think more
§2 Training with verifiable rewards — the DeepSeek R1 recipe
§3 What works, what doesn't, and the open seams