Training with verifiable rewards — the DeepSeek R1 recipe
OpenAI’s o1 (Sep 2024) shipped without explaining how it was trained. DeepSeek R1 (Jan 2025) published its recipe in detail: pretrained base → cold-start SFT on long chain-of-thought → RL with verifiable rewards via GRPO → rejection sampling and SFT to recover generality → distillation into smaller models. That recipe has become the open template that most labs now iterate on. The two unusual choices: rule-based rewards instead of learned reward models, and GRPO (Ch.18 §3) instead of PPO. Both turn out to be load-bearing. This section walks the pipeline stage by stage, then unpacks the R1-Zero result, where pure RL on a base model (no SFT cold-start at all) bootstrapped reasoning on its own.
The full R1 pipeline
Cold-start SFT isn’t strictly necessary: DeepSeek’s R1-Zero variant skipped it. But it makes the RL stage faster and produces a cleaner final format, which is why the production R1 recipe keeps it. Closed commercial reasoning models do not publish their recipes, so whether they do the same is not known.
Verifiable rewards — rule-based, not learned
The Ch.18 alignment recipe used a LEARNED reward model: train a Bradley-Terry reward on human preferences, optimize the policy to maximize predicted reward. That works for chat (“which response do humans prefer”). For reasoning it doesn’t, because reasoning has a different failure mode: the model can produce a confidently-wrong CoT that “sounds right” — and a learned reward model would reward it.
The R1 fix: use a HARD verifier as the reward.
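Verifiable rewards are simple, fast, and much harder to game than a learned reward model: convincing-sounding nonsense earns nothing if the final answer fails the check. They are not perfectly tamper-proof, since a sloppy answer parser or a thin test suite can still be exploited. The cost: you’re limited to problems with objective answers. That’s why R1’s gains concentrate in math and code rather than chat: those are the domains where the rule-based RL signal existed.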
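To make the idea of a hard verifier concrete, here is a minimal sketch of a rule-based reward for math problems. It is not DeepSeek’s code. The `\boxed{...}` answer convention, the function names, and the normalisation rules are all assumptions made for illustration. The `<think>` tag check mirrors the kind of format reward the R1 paper describes alongside its accuracy reward.

```python
import re
from fractions import Fraction
from typing import Optional, Union

# Assumed convention: the final answer is written inside \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the completion, if any."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> Union[Fraction, str]:
    """Read the answer as an exact rational if possible, else as a cleaned string."""
    cleaned = answer.replace(",", "").replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the boxed answer matches the ground truth, else 0.0."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0  # no parseable answer means no reward, however fluent the text
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the completion opens with a <think>...</think> block, else 0.0."""
    pattern = r"\s*<think>.*?</think>.*"
    return 1.0 if re.fullmatch(pattern, completion, re.DOTALL) else 0.0
```

A real verifier has to handle far more answer formats (units, intervals, symbolic expressions), and each gap in the parser is a potential exploit, which is why verifier engineering matters so much in practice.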
Stage 0 — Pretrained base:
Input: DeepSeek-V3 base (or any frontier base model).
Output: Same model, ready for SFT.
Skip → no shortcut here; you need a strong base. The reasoning capability ceiling is set by the base’s pretraining.
Stage 1 — Cold-start SFT:
Input: thousands of (problem → long CoT → answer) examples.
Output: model that produces CoT in a consistent format before answering.
Skip → RL still works but converges slower and produces “R1-Zero style” output (inconsistent format, sometimes mixes languages, uses unusual symbols). Useful for research; not production-ready.
Stage 2 — Reasoning RL with verifiable rewards (GRPO):
Input: cold-started model + problems with ground-truth answers.
Output: model with dramatically improved per-attempt accuracy on math/code.
Skip → no reasoning capability improvement. The cold-start gave you the format; this gave you the reasoning quality.
Stage 3 — Rejection sampling + supervised refinement:
Input: RL’d model + broader prompts (chat, writing, general questions).
Output: model that retains reasoning capability AND handles non-reasoning tasks gracefully.
Skip → model is great at math but bad at chat. Heavy RL on math damages other capabilities. This stage REPAIRS the damage.
Stage 4 — Distillation:
Input: large R1 + smaller base models (7B, 14B, 32B).
Output: smaller models with R1-like behavior but proportionally lower capability ceiling.
Skip → only the large model gets the benefit. For commercial inference economics (serving 7B is much cheaper than serving 671B), distillation is essential.
The cascade: Stage 0 provides the capability ceiling, Stages 1-2 unlock reasoning, Stage 3 recovers generality, Stage 4 makes it economical to deploy.
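The rejection-sampling step in Stage 3 can be sketched as a filter: sample several completions per prompt, keep one that passes a check, and use the survivors as SFT data. The code below is a minimal illustration under stated assumptions. `generate` and `is_correct` are hypothetical callables standing in for the RL-trained model and the verifier, and keeping only the first passing sample is a simplification.

```python
from typing import Callable, Dict, List

def rejection_sample(
    problems: Dict[str, str],
    generate: Callable[[str, int], List[str]],
    is_correct: Callable[[str, str], bool],
    samples_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """Keep at most one verified completion per prompt as an SFT example."""
    sft_data = []
    for prompt, truth in problems.items():
        for completion in generate(prompt, samples_per_prompt):
            if is_correct(completion, truth):
                sft_data.append({"prompt": prompt, "completion": completion})
                break  # first passing sample is enough for this sketch
    return sft_data

# Toy stand-ins so the sketch runs without a model (both hypothetical).
def fake_generate(prompt: str, k: int) -> List[str]:
    return [f"{prompt} = {guess}" for guess in range(3, 3 + k)]

def fake_is_correct(completion: str, truth: str) -> bool:
    return completion.endswith(f"= {truth}")

sft_data = rejection_sample(
    {"2 + 2": "4", "10 * 10": "100"}, fake_generate, fake_is_correct
)
```

Prompts where no sample passes simply drop out of the dataset, so the resulting SFT set is biased toward problems the model can already solve at least occasionally.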
Why GRPO
Recall from Ch.18 §3 that GRPO eliminates PPO’s value head by using group-relative advantages from K sampled outputs per prompt. That fits reasoning RL well: the reward arrives only at the end of a long trajectory, so a per-token value estimate is hard to learn, and dropping the critic removes a second model that is typically as large as the policy.
The R1 paper motivates GRPO mainly on cost: it forgoes the critic model and estimates the baseline from group scores instead. It does not report a failed PPO run; the unsuccessful attempts it does describe are process reward models and MCTS. Many subsequent open reasoning models have adopted GRPO or a close variant (Qwen 3.5 thinking, Mistral Magistral, etc.).
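The group-relative advantage at the heart of GRPO is short enough to write out. The sketch below assumes binary outcome rewards and normalises each reward by the mean and standard deviation of its group. The `eps` term and the choice of population standard deviation are assumptions added for numerical safety, not details taken from the source.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sample relative to the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# K = 4 samples for one prompt; only the first one was verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# A group where every sample gets the same reward carries no learning signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The correct sample gets a positive advantage and the others a negative one, with no value network involved. The second call shows a practical consequence: prompts that are too hard (all wrong) or too easy (all right) contribute zero gradient, so problem difficulty has to be matched to the policy.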
Outcome rewards (what R1 used):
Score only the final answer. Reward = 1 if correct, 0 if wrong.
Pros:
- Cheap (one check per trajectory).
- Unbiased (no learned reward model that can be wrong).
- Hard to game (there is no learned reward model to fool, though a weak checker or thin test suite can still be exploited).
Cons:
- Credit assignment is hard. If the trajectory was 3000 tokens long and the final answer was wrong, the policy update penalises ALL 3000 tokens equally. But maybe the first 2900 were correct reasoning and just the last 100 tokens were a transcription error.
- Sparse signal. One bit (correct/wrong) per ~5000-token trajectory means a lot of compute for very little gradient signal.
- No partial credit. A trajectory that’s “almost right but goes wrong on step 3” gets the same reward as one that’s nonsense throughout.
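The credit-assignment problem above is easy to see once the trajectory-level advantage is spread over tokens. This is a deliberately tiny sketch, assuming outcome supervision where every token inherits the same scalar. The function name and the example advantage value are hypothetical.

```python
from typing import List

def per_token_advantages(trajectory_advantage: float, num_tokens: int) -> List[float]:
    """Outcome supervision: copy one trajectory-level scalar onto every token."""
    return [trajectory_advantage] * num_tokens

# Two hypothetical wrong answers from the same group, both with reward 0.
late_slip = per_token_advantages(-0.58, 3000)  # sound until the last 100 tokens
nonsense = per_token_advantages(-0.58, 3000)   # wrong throughout
```

From the optimiser’s point of view the two trajectories are indistinguishable: the 2900 good tokens in the first are pushed down exactly as hard as the nonsense in the second. Averaged over many samples the signal still points the right way, but it takes many samples.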
Process rewards (PRMs — Lightman 2023):
Train a separate reward model that scores EACH STEP of the CoT. Reward at the end = sum/average of per-step rewards.
Pros:
- Dense signal. Each step gets feedback.
- Better credit assignment. The PRM can flag which step went wrong.
- Partial credit possible. A trajectory that solves the first 4 of 5 steps gets some reward.
Cons:
- PRM is learned. It’s a neural network trained on human-labeled step-by-step traces. Limited by quality and quantity of labels.
- PRM can be gamed. RL will exploit any quirks in the PRM’s scoring.
- Expensive to label. Annotating ~100K step-by-step reasoning traces requires expert human labelers.
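As a rough illustration of how per-step scores become a trajectory reward, the sketch below aggregates a list of step scores and locates the first weak step. The scores are assumed to come from some PRM (not implemented here). The `mean` option follows the sum/average description above, while `min` and `prod` are included as common alternatives. All names and the 0.5 threshold are hypothetical.

```python
from math import prod
from typing import List, Optional

def process_reward(step_scores: List[float], how: str = "mean") -> float:
    """Collapse per-step PRM scores (each in [0, 1]) into one trajectory reward."""
    if not step_scores:
        return 0.0
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    if how == "min":
        return min(step_scores)   # a chain is only as good as its weakest step
    if how == "prod":
        return prod(step_scores)  # treats steps as independent probabilities
    raise ValueError(f"unknown aggregation: {how}")

def first_weak_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below the threshold, or None if all pass."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None

# Four good steps followed by one bad one: partial credit under "mean".
scores = [0.9, 0.9, 0.9, 0.9, 0.1]
```

The choice of aggregation changes what gets rewarded: `mean` gives generous partial credit, while `min` and `prod` punish a single bad step heavily. That choice is itself a surface the policy can learn to exploit.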
When process reward wins:
- Math competition problems: PRM800K (Lightman 2023’s dataset) is publicly available; combining outcome + process rewards gives ~3-5 point gain on AIME.
- Multi-step coding: each function/test gets individual reward; helps with very long solutions.
- Formal proofs: Lean tactic-level rewards work much better than “proof accepted/rejected” alone.
When outcome reward wins:
- Resource-constrained training (no time to build PRM): outcome-only is simpler and works at scale.
- Novel domains: outcome is generic; PRMs require domain-specific labels.
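- R1’s actual recipe: the reasoning RL stage used rule-based outcome rewards only. The R1 paper lists PRMs among its unsuccessful attempts, citing the difficulty of defining and labelling individual steps and the reward hacking that a model-based PRM invites.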
The 2026 state: both are used. Outcome rewards are the default; PRMs are added selectively where the gain is worth the engineering cost.
R1-Zero — pure RL from base, no SFT
The most surprising result in the R1 paper: R1-Zero, trained with pure RL on the base model (no cold-start SFT), worked. The training started with a base that had no special instruction tuning, applied verifiable rewards directly via GRPO, and… reasoning behaviour EMERGED.
R1-Zero is significant because it demonstrates that long-form reasoning is a learnable behavior, not something the model must be taught by imitating human examples. Just having a verifier and a base capable enough to occasionally produce correct answers is sufficient for RL to discover reasoning as an emergent strategy.
This was unexpected. Pre-R1, the common assumption was that long CoT had to be bootstrapped via SFT on human-written or distilled examples. R1-Zero showed it can emerge from RL alone. The implication for future training: the SFT cold-start may be unnecessary if you’re willing to accept the format quirks.
What R1-Zero demonstrates:
Long-form reasoning is a LEARNABLE BEHAVIOR that emerges from RL with verifiable rewards on a sufficiently capable base. It does NOT require imitating human reasoning examples (SFT).
The base model already “knows” the relevant math facts and operations from pretraining. What it lacks is the BEHAVIOR of stringing them together coherently across a long generation. The verifier provides the signal for which strings of tokens lead to correct answers.
RL discovers that emitting reasoning steps is rewarded (because reasoning leads to correct final answers more often). The model develops this behavior without ever being shown an example of it.
The emergent CoT has unusual surface features (mixed languages, weird formatting) precisely BECAUSE no SFT enforced a specific format. The model converges on whatever surface form happens to work for the verifier.
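One way to push back on the language mixing described above is to add a small reward for staying in the target language. The R1 paper describes a language consistency reward based on the proportion of target-language words in the CoT. The sketch below is a crude stand-in, not that implementation: it assumes the target language is English and treats any all-ASCII whitespace-separated token as an English word.

```python
def language_consistency_reward(cot: str) -> float:
    """Fraction of whitespace-separated tokens that are pure ASCII.

    Crude proxy for "fraction of target-language words", assuming English.
    Scripts written without spaces (e.g. Chinese) count as one token per
    unbroken run, so this undercounts them; a real version would need a
    proper language identifier.
    """
    words = cot.split()
    if not words:
        return 0.0
    target_words = sum(1 for word in words if word.isascii())
    return target_words / len(words)

clean = language_consistency_reward("First add 2 and 3")
mixed = language_consistency_reward("First 我们 add")
```

A shaping term like this trades a little raw accuracy for readability, and it is a soft preference layered on top of the hard verifier rather than a replacement for it.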
What this means for training pipelines:
1. SFT cold-start is mostly for FORMAT, not CAPABILITY. The SFT step gives the model a consistent surface form (English-only CoT, clean step markers, etc.) and speeds up RL, but it does not appear to be the main source of reasoning capability. You can skip it if you don’t care about format.
2. Pure RL is feasible at scale. If you have a strong base and good verifiers, you can train reasoning models without expensive SFT data collection. In principle this opens the door to smaller labs and academic groups, though the RL compute itself is substantial, and the R1 paper found that for smaller models distillation from a large reasoning model beat running RL on them directly.
3. Reasoning is “in the weights” of a sufficiently strong base model. The capability exists; RL just unlocks it. This explains why models that fail at pre-RL CoT prompting can succeed after RL: the capability was always there, just not the behavior.
4. Verifier quality is the binding constraint. The whole pipeline depends on having a reliable verifier. Better verifiers = better reasoning models. This is where the engineering effort now concentrates.
5. Transferability is uncertain. R1-Zero’s RL was on math/code verifiers. Whether the emerged reasoning transfers to non-verifier domains (philosophy, social judgment) is an open question. Some transfer is observed; full transfer is not.
The deeper question:
If reasoning emerges from RL on verifiable rewards, are LLMs “thinking” or are they “finding policies that maximise verifier scores”? Empirically the two are hard to distinguish — the policy that maximises a math verifier IS one that produces valid reasoning. But this raises philosophical questions about the relationship between capability and behavior that aren’t fully resolved.
Next: §20.3 — What works, what doesn’t, and the open problems. The trade-off curves, the failure modes, and the 2026 state of reasoning model deployment. Where the genuine progress is happening and where the hype runs ahead of the science.