What works, what doesn’t, and the open seams
Eighteen months in, the picture is clearer. Reasoning models are SOTA on math, competition coding, formal theorem proving — domains where verifiers exist and reward signals are clean. They are NOT clearly better on subjective tasks (writing, judgment, social reasoning), and they introduce new failure modes that didn’t exist with standard LLMs (over-thinking simple questions, “reasoning hallucination”, expensive failures on under-specified problems). This section maps the real terrain — what’s genuinely better, what’s a wash, what got worse — and the open research seams worth tracking.
Where reasoning models genuinely win
These are the headline numbers. They are real. Math, code, and formal reasoning have undergone a step-change in the past eighteen months that has no precedent in the field’s prior decade.
Where reasoning models are a wash (or worse)
The reasoning vs non-reasoning choice isn’t “always one or always the other” — it’s task-dependent. Production deployment often routes queries: simple/subjective to standard, hard/verifiable to reasoning. ChatGPT’s interface defaulting to o4-mini and offering “Think” as an opt-in reflects this.
New failure modes
Over-thinking: a reasoning model can spend 5000 tokens of thought on a question whose answer is “Paris.” The RL trained it to “always think hard”; this generalises poorly to easy questions. Modern reasoning models (o3, Claude 3.7) include budget-controllers that route simple questions away from the reasoning path.
Reasoning hallucination: the model produces a confident, well-formatted CoT that rests on a wrong premise or a faulty step. The chain LOOKS careful: bullet points, sub-conclusions, even a sanity check. But step 3 made an arithmetic error and step 6 cited a non-existent paper, and the final answer is wrong. This is MORE dangerous than standard hallucination because the chain makes the model’s confidence look justified.
Likely causes when a reasoning model underperforms in customer support:
1. Over-thinking on simple questions. “What’s your return policy?” doesn’t need 8 seconds of thinking; users expect ~1 second response. The reasoning model’s CoT delay degrades perceived quality regardless of answer quality.
2. Reasoning hallucination in policy questions. The model “carefully reasons” through policy decisions that should just look up the company’s documented policy. CoT structure makes wrong answers LOOK more authoritative.
3. Format mismatch. Reasoning models output CoT visible to users by default. Customer support expects direct answers, not a chain showing “let me think about this…”.
4. Cost. 10-100× per-query cost on customer support volume can be prohibitive. The team may have to throttle the model to keep budget under control, which degrades availability.
5. Tone mismatch. Reasoning models can sound cold and analytical. Customer support benefits from warmth and empathy — qualities a math-RL’d model isn’t trained on.
6. No verifier signal for the actual task. Customer support has no clean verifier. The model’s RL training doesn’t help; you’re just paying for slower inference.
What they should do differently:
- Use a standard non-reasoning model as the primary handler.
- Reserve the reasoning model for cases that genuinely benefit: math/billing calculations, technical troubleshooting with clear right answers.
- Route based on query type. Easy questions → standard. Hard technical → reasoning.
- Hide the reasoning chain from the customer; just show the answer.
- Consider fine-tuning a smaller model on customer-support-specific data instead — likely better fit than a generic reasoning model.
The general lesson: reasoning models are a tool for hard verifiable problems. Deploying them for everything is over-engineering and counterproductive.
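The routing and chain-hiding recommendations above can be sketched in a few lines. This is a minimal illustration, not a production design: the keyword patterns in `HARD_PATTERNS`, the `route` and `handle` helpers, and the assumption that the reasoning chain arrives wrapped in `<think>...</think>` tags are all hypothetical. A real router would more likely use a small trained classifier.

```python
import re
from typing import Callable

# Hypothetical keyword heuristics standing in for a real query classifier.
HARD_PATTERNS = [
    r"\b(refund|prorat\w*|invoice|charge[ds]?|billing)\b.*\d",  # billing question with numbers
    r"\b(error code|stack trace|crash\w*|timeout)\b",           # technical troubleshooting
]

def route(query: str) -> str:
    """Return 'reasoning' for hard, checkable queries and 'standard' otherwise."""
    q = query.lower()
    if any(re.search(pattern, q) for pattern in HARD_PATTERNS):
        return "reasoning"
    return "standard"

def strip_reasoning(raw: str) -> str:
    """Drop any <think>...</think> span (assumed chain format) and keep the answer."""
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

def handle(query: str, models: dict[str, Callable[[str], str]]) -> str:
    """Send the query to the routed model and hide the chain from the customer."""
    return strip_reasoning(models[route(query)](query))
```

Here `models` maps the two route names to whatever callables wrap the actual model endpoints, so the router stays independent of any particular provider.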
The inference economics of reasoning models
Tax-form preparation profile:
- Task is computational (numbers, rules, look-up).
- Errors have HIGH cost — wrong tax form → user audited → product reputation destroyed.
- Verifier exists partially: arithmetic can be checked; rule application can be cross-referenced with tax code.
- Per-form margin is probably $5-50 depending on tier.
- User-facing latency: ~30 seconds is acceptable for a tax preparation tool.
Reasoning model fit:
Strong fit. The combination of (a) verifiable computations, (b) catastrophic cost of errors, (c) tolerable latency, (d) high margin all favor reasoning.
Cost-per-correct math:
- Standard model (gpt-4o): $0.10/form ÷ ~80% accuracy = ~$0.13 per correct form.
- Reasoning model (o3): $2.00/form ÷ ~98% accuracy = ~$2.04 per correct form.
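The per-correct figures come from dividing the per-form price by the accuracy. A small sketch of that arithmetic follows; the function name `cost_per_correct` is illustrative, and the interpretation (each form is an independent attempt that succeeds with probability equal to the accuracy) is an assumption rather than something the chapter states.

```python
def cost_per_correct(cost_per_form: float, accuracy: float) -> float:
    """Expected spend per correct form: price divided by success probability."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_form / accuracy

# Figures from the two bullets above.
standard = cost_per_correct(0.10, 0.80)   # 0.125, which rounds to $0.13
reasoning = cost_per_correct(2.00, 0.98)  # about 2.041, which rounds to $2.04
```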
Raw cost-per-correct favors standard. But:
- Errors in tax filing CAUSE $1000s in penalties (user-perceived cost).
- Each error damages brand trust and triggers customer support cost.
- Paying roughly $1.90 more per form to cut the error rate from ~20% to ~2% is overwhelmingly worth it for a product where errors are catastrophic.
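One way to make this trade-off concrete is a break-even calculation. The sketch below assumes a simple model that the chapter does not spell out: each form costs its price $c$ plus an expected error cost $(1 - a)\,L$, where $a$ is accuracy and $L$ is a hypothetical dollar cost per wrong form. Subscripts $s$ and $r$ denote the standard and reasoning models.

```latex
\begin{aligned}
\mathbb{E}[C] &= c + (1 - a)\,L \\
c_r + (1 - a_r)\,L &< c_s + (1 - a_s)\,L
\;\Longleftrightarrow\;
L > \frac{c_r - c_s}{a_r - a_s}
= \frac{2.00 - 0.10}{0.98 - 0.80}
= \frac{1.90}{0.18} \approx 10.56
\end{aligned}
```

Under these assumptions the reasoning model pays for itself once a wrong form costs more than about $10.56, far below the penalty figures cited above.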
Recommended architecture:
- Standard model for routine extraction (most fields are clear).
- Reasoning model for “should this deduction apply?” judgment calls and arithmetic verification.
- Reasoning model for the FINAL CHECK: re-derive the totals and flag any inconsistency.
- Human review for any flagged inconsistency or edge case.
This hybrid uses the cheaper standard model for the 80% of work that’s mechanical, and the expensive reasoning model where it matters most.
Cost: ~$0.50/form. Effective accuracy: 99% or higher. Catastrophic-error rate: very low. Margin: viable.
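The final-check step of the architecture can be sketched as a comparison between reported totals and independently re-derived ones. Everything named here is hypothetical: `final_check`, `CheckResult`, the field names, and the half-cent tolerance are illustrative. The `rederive` callable stands in for the reasoning-model call and is replaced by a plain lookup in the demo.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CheckResult:
    ok: bool
    issues: list[str] = field(default_factory=list)

def final_check(reported: dict[str, float],
                rederive: Callable[[str], float],
                tolerance: float = 0.005) -> CheckResult:
    """Flag every total whose re-derived value differs from the reported one."""
    issues = []
    for name, value in reported.items():
        expected = rederive(name)
        if abs(expected - value) > tolerance:
            issues.append(f"{name}: reported {value:.2f}, re-derived {expected:.2f}")
    return CheckResult(ok=not issues, issues=issues)

# Demo with made-up numbers; a dict lookup plays the role of the reasoning model.
reported_totals = {"total_income": 52310.50, "taxable_income": 38000.00}
rederived_totals = {"total_income": 52000.00 + 310.50, "taxable_income": 38460.50}
result = final_check(reported_totals, rederived_totals.__getitem__)
needs_human_review = not result.ok  # flagged forms go to the human-review step
```

Keeping the check as a pure function over numbers means the expensive model is only invoked through `rederive`, and any mismatch is routed to a person rather than silently corrected.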
Open research seams
Most impactful AND accessible to a small team:
(3) Test-time compute (TTC) scaling beyond pass@N.
This is doable on rented compute. Take an open reasoning model (R1, Qwen 3.5 thinking), implement different inference strategies (tree search, self-consistency variants, learned aggregators over multiple CoTs), measure scaling curves. The compute requirement is modest — you’re not training, just inferring.
The impact is high because better TTC scaling translates directly to lower cost per correct answer in production. A 2× efficiency improvement on TTC is worth tens of millions to inference providers.
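As a starting point for this kind of experiment, here is a minimal sketch of plain self-consistency (majority vote over sampled final answers) together with a toy scaling measurement. The sampler is simulated: `toy_sampler`, its 60% per-sample accuracy, and the answer strings are invented for illustration. In a real study `sample` would call an open reasoning model and extract its final answer.

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(sample: Callable[[], str], n: int) -> tuple[str, float]:
    """Draw n final answers and return (majority answer, its vote share)."""
    votes = Counter(sample() for _ in range(n))
    # On ties, most_common keeps the answer that was seen first.
    answer, count = votes.most_common(1)[0]
    return answer, count / n

def toy_sampler(rng: random.Random, p_correct: float = 0.6) -> Callable[[], str]:
    """Simulated model: answers '42' with probability p_correct, else a wrong value."""
    def sample() -> str:
        if rng.random() < p_correct:
            return "42"
        return rng.choice(["41", "43", "44"])
    return sample

def accuracy_at(n: int, trials: int = 200, seed: int = 0) -> float:
    """Fraction of trials in which the n-sample majority vote is correct."""
    sample = toy_sampler(random.Random(seed))
    hits = sum(self_consistency(sample, n)[0] == "42" for _ in range(trials))
    return hits / trials

# Toy scaling curve: accuracy as a function of samples per question.
curve = {n: accuracy_at(n) for n in (1, 5, 25)}
```

Swapping `self_consistency` for a tree search or a learned aggregator while keeping `accuracy_at` fixed gives directly comparable curves for the same sample budget.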
Related: (6) cost reduction for reasoning inference. Same skill set; similarly accessible.
Most impactful but frontier-scale only:
(1) Verifiers for subjective domains.
This requires both massive labeling investment (PRMs need expert human labels for thousands of complex tasks) and large-scale RL training to actually evaluate whether the verifier improves model behavior. Frontier labs are the only ones with both the labeling budget and the compute.
If solved, this UNLOCKS reasoning model improvements for subjective tasks (writing, judgment, social reasoning) — which is most of what humans value in LLMs. This is the most impactful seam by far, but it’s not accessible to a small team.
Adjacent: (4) reasoning model alignment. Frontier labs are best positioned here too. Requires understanding the model deeply, having broad red-teaming infrastructure, and the ability to iterate on alignment techniques at scale.
For a small team, focus on:
- TTC scaling research (problem 3).
- Inference cost reduction (problem 6).
- Open evaluation benchmarks for reasoning (related to problem 2).
- Reasoning model applications in specific domains (medicine, law) where domain knowledge is the bottleneck.
- Tool integration for reasoning models — making them better at incorporating tool outputs back into their reasoning.
Avoid trying to compete with frontier labs on:
- Training larger reasoning models from scratch.
- Building reward models at scale.
- Alignment for reasoning models broadly.
These need frontier infrastructure. Better to focus where small-team productivity is competitive.
The bigger trajectory
This is the most significant economic shift in LLM deployment since the original transformer release. The book’s pretraining chapter (Ch.16) was written from the Chinchilla perspective: compute-optimal training, fixed inference cost. That perspective remains correct for standard models. For reasoning models, the economics flip: per-query cost varies 100×, training matters less, inference matters more.
For a senior engineer transitioning into the field in 2026: this is the chapter to internalise. Reasoning models are the new substrate of premium LLM applications. Understanding HOW they’re trained (verifiable rewards), WHY they work (test-time compute scaling), and WHERE they fail (subjective tasks, over-thinking) is the foundation for shipping anything more sophisticated than chat in 2026 and beyond.
END OF CH.20 — Reasoning models.
§1 (test-time compute scaling: log-linear in B, verifier vs no-verifier asymmetry) ·
§2 (the DeepSeek R1 recipe: SFT → GRPO with verifiable rewards → distillation; R1-Zero emergence) ·
§3 (what works, what doesn’t, the open seams).
Next: Ch.21 — Beyond transformers. SSMs, Mamba, hybrid architectures, honest assessment of where alternative architectures stand. The book’s other “what comes next” chapter.