REASONING MODELS
Section 20.2

Training with verifiable rewards — the DeepSeek R1 recipe

OpenAI’s o1 (Sep 2024) shipped without explaining how it was trained. DeepSeek R1 (Jan 2025) published the recipe end-to-end. The recipe — pretrained base → SFT on long chain-of-thought → RL with verifiable rewards via GRPO → distillation back into smaller models — is now the open template every lab is iterating on. The two unusual choices: rule-based rewards instead of learned reward models, and GRPO (Ch.18 §3) instead of PPO. Both turn out to be load-bearing. This section walks the pipeline with the actual training numbers, then unpacks the R1-Zero result — where pure RL on a base model (no SFT cold-start at all) bootstrapped reasoning on its own.

The full R1 pipeline

DeepSeek R1 training pipeline (Stages 0-4):

Stage 0: Pretrained base. Start from DeepSeek-V3 (671B MoE / 37B active, see Ch.17). The base has been trained on ~15T tokens of standard pretraining. No reasoning behaviour yet: it just predicts the next token.

Stage 1: Cold-start SFT. Collect thousands of long chain-of-thought examples by:

  • Prompting a strong model (such as R1-Zero) to solve problems "step by step."
  • Filtering for correctness + clean format.
  • Optional human cleanup.

SFT the base on (problem → long CoT → final answer) pairs. Result: the model "knows" to produce CoT before answering. Format settled.

Stage 2: Reasoning RL with verifiable rewards. Generate K = 64 candidate solutions per problem (Ch.18 §3 GRPO). Score each:

  • +1 if the final answer matches ground truth (math)
  • +1 if all unit tests pass (code)
  • +1 if the formal proof checker accepts (theorem proving)
  • 0 otherwise
  • Plus a small format reward: +0.1 if the CoT is well-formed.

GRPO advantage: A_i = (r_i − mean(r)) / std(r) over the K samples. PPO-style clipped policy update; KL penalty to a reference model. Run for thousands of steps. Math/code accuracy climbs from ~60% to ~95%.

Stage 3: Rejection sampling + supervised refinement. Sample many outputs from the RL'd model on broader tasks. Keep only the high-quality ones (those that pass a quality classifier; the rest are rejected). SFT on this filtered dataset to recover non-reasoning capabilities (chat, formatting, instruction-following) that pure-math RL damaged.

Stage 4: Distillation (optional). Distil R1's behavior into smaller dense models (7B, 14B, 32B). Result: R1-Distill-Qwen-32B has o1-mini-level math/code on its own. Distillation transfers the reasoning BEHAVIOR (writing long CoT) but with the smaller model's CAPABILITY ceiling.
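Written out, the Stage 2 update looks roughly as follows. This is a sketch of the GRPO objective in sequence-level form, using the same K samples and rewards r_i as above. The clip range ε and KL weight β are hyperparameters whose values this section does not give, and the per-token averaging used in practice is left out for brevity.

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_K)}{\operatorname{std}(r_1,\dots,r_K)},
\qquad
\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}

J(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{K} \sim \pi_{\theta_{\text{old}}}}
\left[
  \frac{1}{K} \sum_{i=1}^{K}
  \min\!\Big( \rho_i(\theta)\, A_i,\ 
              \operatorname{clip}\big(\rho_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_i \Big)
  \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
\right]
```

Here q is the prompt, o_i is the i-th sampled solution, and π_ref is the frozen reference model. No value function appears anywhere: the group statistics play the role of the baseline.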

Cold-start SFT isn’t strictly necessary: DeepSeek’s R1-Zero variant skipped it. But it makes the RL stage faster and produces a cleaner final format, which is why R1’s own production pipeline keeps it.

Verifiable rewards — rule-based, not learned

The Ch.18 alignment recipe used a LEARNED reward model: train a Bradley-Terry reward on human preferences, optimize the policy to maximize predicted reward. That works for chat (“which response do humans prefer”). For reasoning it doesn’t, because reasoning has a different failure mode: the model can produce a confidently-wrong CoT that “sounds right” — and a learned reward model would reward it.

The R1 fix: use a HARD verifier as the reward.

Verifiable reward types:

  • Math: Extract the final answer (often boxed: \boxed{42}). Compare to ground truth via SymPy or string normalisation. Reward = 1.0 if match, 0.0 otherwise. No partial credit, no ambiguity.
  • Code: The generated solution is run against hidden unit tests. Reward = (tests passed) / (tests total), or the stricter 0/1 variant that requires every test to pass. Hidden tests prevent the model from memorizing test cases.
  • Formal proofs (Lean, Coq): The proof checker accepts or rejects. Reward = 1.0 if accepted, 0.0 otherwise. No bluffing past a proof assistant.
  • Multi-step reasoning (with intermediate checks): Each step is scored by a process reward model (Lightman 2023). Unlike the three checkers above, a PRM is a learned model rather than a hard verifier. More expensive, but it improves credit assignment.
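A minimal sketch of the code reward, covering both the fractional and the all-or-nothing variants. The function name `code_reward`, the `(args, expected)` test format, and the three toy candidates are assumptions made for illustration. A real pipeline would run model-written code in a sandbox with time and memory limits, which this sketch does not attempt.

```python
from typing import Any, Callable, Sequence, Tuple


def code_reward(candidate: Callable[..., Any],
                hidden_tests: Sequence[Tuple[tuple, Any]],
                all_or_nothing: bool = False) -> float:
    """Score a candidate function against hidden (args, expected) tests."""
    if not hidden_tests:
        return 0.0
    passed = 0
    for args, expected in hidden_tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    if all_or_nothing:
        return 1.0 if passed == len(hidden_tests) else 0.0
    return passed / len(hidden_tests)


# Hypothetical model outputs for the task "return the absolute value of x".
def good(x):
    return x if x >= 0 else -x


def buggy(x):
    return x  # wrong for negative inputs


def crashes(x):
    return x / 0  # always raises ZeroDivisionError


TESTS = [((3,), 3), ((-3,), 3), ((0,), 0), ((-7,), 7)]

print(code_reward(good, TESTS))
print(code_reward(buggy, TESTS))
print(code_reward(buggy, TESTS, all_or_nothing=True))
print(code_reward(crashes, TESTS))
```

The fractional variant gives the half-working candidate partial credit; the all-or-nothing variant treats it the same as a crash. Which one to use is a design choice about how much to reward near-misses.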

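Verifiable rewards are simple, fast, and much harder to game than a learned reward model: convincing-sounding nonsense earns nothing. They are only as strong as the checker, though, since weak unit tests or loose answer matching can still be exploited. The cost: you’re limited to problems with objective answers. That’s why R1 is dramatically better at math than at chat: its strongest RL signal existed only for verifiable domains such as math and code.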
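The math reward plus the small format bonus can be sketched in a few lines. This is an illustration under assumptions, not DeepSeek's checker: the helper names are hypothetical, the `<think>...</think>` wrapper is assumed as the "well-formed CoT" convention, and normalisation uses the standard library's `Fraction` instead of SymPy so the example stays dependency-free.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches \boxed{...} with no nested braces (so \boxed{\frac{1}{2}} is NOT handled).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
# Assumed format convention: reasoning wrapped in <think>...</think>.
THINK = re.compile(r"<think>.+?</think>", re.DOTALL)


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, if any."""
    matches = BOXED.findall(text)
    return matches[-1] if matches else None


def normalise(answer: str) -> str:
    """Canonicalise simple numeric answers so 0.5 and 1/2 compare equal."""
    s = answer.strip().replace(" ", "").rstrip(".")
    try:
        return str(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return s  # non-numeric answers fall back to string comparison


def math_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the boxed answer matches the ground truth, else 0.0."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0
    return 1.0 if normalise(answer) == normalise(ground_truth) else 0.0


def format_reward(completion: str, bonus: float = 0.1) -> float:
    """Small bonus when the completion contains a non-empty think block."""
    return bonus if THINK.search(completion) else 0.0


def total_reward(completion: str, ground_truth: str) -> float:
    return math_reward(completion, ground_truth) + format_reward(completion)


sample = r"<think>6 * 7 = 42</think> The answer is \boxed{42}."
print(total_reward(sample, "42"))
```

Note that an unboxed correct answer scores 0.0 here. Strict extraction is deliberate: loose matching is exactly the kind of gap RL learns to exploit.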


Stage 0 — Pretrained base:

Input: DeepSeek-V3 base (or any frontier base model).

Output: Same model, ready for SFT.

Skip → no shortcut here; you need a strong base. The reasoning capability ceiling is set by the base’s pretraining.

Stage 1 — Cold-start SFT:

Input: thousands of (problem → long CoT → answer) examples.

Output: model that produces CoT in a consistent format before answering.

Skip → RL still works but converges slower and produces “R1-Zero style” output (inconsistent format, sometimes mixes languages, uses unusual symbols). Useful for research; not production-ready.

Stage 2 — Reasoning RL with verifiable rewards (GRPO):

Input: cold-started model + problems with ground-truth answers.

Output: model with dramatically improved per-attempt accuracy on math/code.

Skip → no reasoning capability improvement. The cold-start gave you the format; this gave you the reasoning quality.

Stage 3 — Rejection sampling + supervised refinement:

Input: RL’d model + broader prompts (chat, writing, general questions).

Output: model that retains reasoning capability AND handles non-reasoning tasks gracefully.

Skip → model is great at math but bad at chat. Heavy RL on math damages other capabilities. This stage REPAIRS the damage.

Stage 4 — Distillation:

Input: large R1 + smaller base models (7B, 14B, 32B).

Output: smaller models with R1-like behavior but proportionally lower capability ceiling.

Skip → only the large model gets the benefit. For commercial inference economics (serving 7B is much cheaper than serving 671B), distillation is essential.

The cascade: Stage 0 provides the capability ceiling, Stages 1-2 unlock reasoning, Stage 3 recovers generality, Stage 4 makes it economical to deploy.
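Stage 3's filter is the least familiar step of the cascade, so here is a minimal sketch of it. Everything here is a stand-in: `generate`, `is_correct`, and `quality` are hypothetical callables for the RL'd model, a verifier, and a quality classifier, and the toy quality rule (ASCII-only text, as a crude proxy for "no language mixing") is an assumption for the demo only.

```python
from typing import Callable, Dict, List


def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     quality: Callable[[str], float],
                     n_samples: int = 8,
                     min_quality: float = 0.7) -> List[Dict[str, str]]:
    """Sample n completions and keep only correct, high-quality ones as SFT pairs."""
    kept = []
    for _ in range(n_samples):
        completion = generate(prompt)
        if not is_correct(prompt, completion):
            continue  # rejected: wrong answer
        if quality(completion) < min_quality:
            continue  # rejected: correct but poorly written
        kept.append({"prompt": prompt, "completion": completion})
    return kept


# Toy stand-ins so the sketch runs end to end.
canned = iter([
    "The answer is 4.",   # correct and clean: kept
    "答案是 4.",           # correct but not in the target language: rejected
    "The answer is 5.",   # wrong: rejected
    "The answer is 4.",   # correct and clean: kept
])
kept = rejection_sample(
    "What is 2 + 2?",
    generate=lambda prompt: next(canned),
    is_correct=lambda prompt, c: c.rstrip(".").endswith("4"),
    quality=lambda c: 1.0 if c.isascii() else 0.0,
    n_samples=4,
)
print(len(kept))
```

The surviving pairs become the SFT dataset for the refinement step; the model is effectively trained on its own best outputs.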

Why GRPO

Recall from Ch.18 §3 that GRPO eliminates PPO’s value head by using group-relative advantages from K sampled outputs per prompt. For reasoning specifically, this is the right algorithm:

Why GRPO fits reasoning training:

1. Long trajectories. Reasoning CoTs are 500-5000 tokens. PPO's value head must predict the value of EACH STATE along the trajectory. For long trajectories with sparse rewards (binary correct/wrong only at the end), value learning is high-variance. GRPO sidesteps this: just compare to the group mean. No value learning needed.

2. Binary rewards. Verifiable rewards are 0/1. PPO's advantage A = r + γ·V(t+1) − V(t) is essentially (r − V) at the end. With binary r and learned V, the advantage signal is noisy. GRPO normalises: A_i = (r_i − mean(r)) / std(r) across K samples. With K = 64 samples per prompt this gives a clean per-prompt baseline. If a fraction p of the group is correct, correct samples get +sqrt((1−p)/p) and wrong samples get −sqrt(p/(1−p)). That stays within about ±1.5 when p is between 0.3 and 0.7 and grows larger for rare successes or rare failures. A group in which every sample gets the same reward has std(r) = 0 and contributes no signal.

3. Compute efficiency. K = 64 samples per prompt costs K× inference. But the K samples ARE the diverse exploration that RL needs, and there is no separate value network to train. No critic. Simpler.

4. Easy parallelism. The K samples per prompt are embarrassingly parallel. DeepSeek runs this at scale: thousands of GPUs sampling, scoring, updating.

The R1 paper motivates GRPO on cost grounds: it drops the critic, which is typically as large as the policy itself, and estimates the baseline from group scores instead. The instability of value learning on long, sparsely rewarded CoT trajectories (point 1 above) is a further practical argument. Many subsequent open reasoning models have made the same choice (Qwen 3.5 thinking, Mistral Magistral, etc.).
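The group-relative advantage for binary rewards is small enough to compute by hand, and a sketch makes its behaviour at the extremes visible. The function name is hypothetical, and using the population standard deviation (rather than the sample one) is an assumption; implementations differ on this detail.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """A_i = (r_i - mean(r)) / std(r) over one prompt's K samples."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    if sigma < eps:
        # Every sample got the same reward: nothing to compare, no gradient.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


balanced = grpo_advantages([1.0] * 32 + [0.0] * 32)  # half the group correct
rare = grpo_advantages([1.0] + [0.0] * 63)           # one success in 64
hopeless = grpo_advantages([0.0] * 64)               # no success at all

print(balanced[0], balanced[-1])
print(rare[0], rare[-1])
print(set(hopeless))
```

With half the group correct the advantages are exactly +1 and −1. With a single success in 64, that one sample gets an advantage of sqrt(63), about 7.9, while each failure is barely penalised: rare successes are amplified, which is what lets training get started on hard prompts. A prompt that is never solved (or always solved) produces all-zero advantages and is effectively skipped.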


Outcome rewards (what R1 used):

Score only the final answer. Reward = 1 if correct, 0 if wrong.

Pros:

  • Cheap (one check per trajectory).
  • Unbiased (no learned reward model that can be wrong).
  • Hard to game (the model has to produce the right answer, provided the verifier itself is sound).

Cons:

  • Credit assignment is hard. If the trajectory was 3000 tokens long and the final answer was wrong, the policy update penalises ALL 3000 tokens equally. But maybe the first 2900 were correct reasoning and just the last 100 tokens were a transcription error.
  • Sparse signal. One bit (correct/wrong) per ~5000-token trajectory means a lot of compute for very little gradient signal.
  • No partial credit. A trajectory that’s “almost right but goes wrong on step 3” gets the same reward as one that’s nonsense throughout.

Process rewards (PRMs — Lightman 2023):

Train a separate reward model that scores EACH STEP of the CoT. Reward at the end = sum/average of per-step rewards.

Pros:

  • Dense signal. Each step gets feedback.
  • Better credit assignment. The PRM can flag which step went wrong.
  • Partial credit possible. A trajectory that solves the first 4 of 5 steps gets some reward.

Cons:

  • PRM is learned. It’s a neural network trained on human-labeled step-by-step traces. Limited by quality and quantity of labels.
  • PRM can be gamed. RL will exploit any quirks in the PRM’s scoring.
  • Expensive to label. Annotating ~100K step-by-step reasoning traces requires expert human labelers.

When process reward wins:

  • Math competition problems: PRM800K (Lightman 2023’s dataset) is publicly available; combining outcome + process rewards gives ~3-5 point gain on AIME.
  • Multi-step coding: each function/test gets individual reward; helps with very long solutions.
  • Formal proofs: Lean tactic-level rewards work much better than “proof accepted/rejected” alone.

When outcome reward wins:

  • Resource-constrained training (no time to build PRM): outcome-only is simpler and works at scale.
  • Novel domains: outcome is generic; PRMs require domain-specific labels.
  • R1’s published recipe: outcome-only, rule-based rewards. The R1 paper lists PRMs among its unsuccessful attempts, citing the difficulty of defining a step, the cost of step-level labels, and reward hacking.

The 2026 state: both are used. Outcome rewards are the default; PRMs are added selectively where the gain is worth the engineering cost.
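The credit-assignment difference between the two schemes can be shown on a toy trajectory. This is a sketch with hypothetical function names and made-up step scores; centring step scores with a fixed baseline of 0.5 is an assumption chosen to keep the arithmetic simple, not a description of any particular PRM setup.

```python
from typing import List, Sequence


def outcome_token_credit(step_lengths: Sequence[int], advantage: float) -> List[float]:
    """Outcome reward: every token inherits the same trajectory-level advantage."""
    return [advantage] * sum(step_lengths)


def process_token_credit(step_lengths: Sequence[int],
                         step_scores: Sequence[float],
                         baseline: float = 0.5) -> List[float]:
    """Process reward: each token inherits the centred score of its own step."""
    credit: List[float] = []
    for n_tokens, score in zip(step_lengths, step_scores):
        credit.extend([score - baseline] * n_tokens)
    return credit


# A three-step solution: two sound steps, then a slip in the last one.
step_lengths = [4, 3, 2]       # tokens per step (toy sizes)
step_scores = [1.0, 1.0, 0.0]  # hypothetical PRM scores per step

outcome = outcome_token_credit(step_lengths, advantage=-1.0)  # final answer wrong
process = process_token_credit(step_lengths, step_scores)

print(outcome)
print(process)
```

Under the outcome reward all nine tokens are pushed down equally, including the seven that belonged to correct steps. Under the process reward only the last two tokens are penalised. That is the benefit; the price is that the step scores come from a learned model that can itself be wrong or exploited.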

R1-Zero — pure RL from base, no SFT

The most surprising result in the R1 paper: R1-Zero, trained with pure RL on the base model (no cold-start SFT), worked. The training started with a base that had no special instruction tuning, applied verifiable rewards directly via GRPO, and… reasoning behaviour EMERGED.

R1-Zero training:

Input: DeepSeek-V3 base. No SFT. Run GRPO directly with math/code verifiable rewards.

Approximate trajectory:

  • Day 0: Base sometimes guesses right by accident. ~10% accuracy.
  • Day 1: Model starts emitting some reasoning steps. ~20% accuracy.
  • Day 5: Long CoT emerges spontaneously. ~50%.
  • Day 14: The model is generating sophisticated multi-step reasoning. ~85%.

Quirky output characteristics (no SFT to enforce format):

  • Sometimes mixes Chinese + English mid-trajectory.
  • Uses interjections such as "wait..." or "actually..." inconsistently.
  • Format jumps between bullets, prose, and equations without convention.
  • Occasional gibberish at the end of long generations.

But the REASONING content was sound. On AIME 2024, R1-Zero's pass@1 rose from 15.6% to 71.0% over RL training, and reached 86.7% with majority voting over 64 samples. For comparison, the R1 paper reports OpenAI-o1-0912 at 74.4% pass@1 and 83.3% with majority voting.

R1-Zero is significant because it demonstrates that long-form reasoning is a learnable behavior, not something the model must be taught by imitating human examples. Just having a verifier and a base capable enough to occasionally produce correct answers is sufficient for RL to discover reasoning as an emergent strategy.

This was unexpected. Pre-R1, the consensus was that long CoT had to be bootstrapped via SFT on human-written or distilled examples. R1-Zero showed it can emerge from RL alone. The implication for future training: the SFT cold-start may be unnecessary if you’re willing to accept the format quirks.
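The bootstrapping argument can be reproduced in miniature. The sketch below is a toy, not a model of R1-Zero: the "policy" is a single logit choosing between two strategies, and the success rates (0.8 for "reason", 0.1 for "guess"), the group size, and the learning rate are all assumed values. It only illustrates the mechanism: a strategy that starts out rare gets amplified because it is correlated with a binary verifier reward.

```python
import math
import random
from statistics import fmean, pstdev

# Toy assumption: reasoning is right 80% of the time, guessing 10%.
P_CORRECT = {"reason": 0.8, "guess": 0.1}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train(steps: int = 300, k: int = 16, lr: float = 0.5,
          theta: float = -3.0, seed: int = 0) -> float:
    """Group-relative policy gradient on one logit; returns final P(reason)."""
    rng = random.Random(seed)
    for _ in range(steps):
        p_reason = sigmoid(theta)
        actions = ["reason" if rng.random() < p_reason else "guess" for _ in range(k)]
        rewards = [1.0 if rng.random() < P_CORRECT[a] else 0.0 for a in actions]
        sigma = pstdev(rewards)
        if sigma == 0.0:
            continue  # all samples agree: no learning signal this step
        mu = fmean(rewards)
        # d/dtheta log pi(action) is (1 - p) for "reason" and -p for "guess".
        grads = [
            ((r - mu) / sigma) * ((1.0 - p_reason) if a == "reason" else -p_reason)
            for a, r in zip(actions, rewards)
        ]
        theta += lr * fmean(grads)
    return sigmoid(theta)


p_start = sigmoid(-3.0)  # the policy initially "reasons" under 5% of the time
p_end = train()
print(round(p_start, 3), p_end > p_start)
```

Nothing in the loop tells the policy that "reason" is the desired behaviour. The only supervision is the 0/1 reward, and the group baseline turns the occasional rewarded "reason" sample into a consistent push toward it.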


What R1-Zero demonstrates:

Long-form reasoning is a LEARNABLE BEHAVIOR that emerges from RL with verifiable rewards on a sufficiently capable base. It does NOT require imitating human reasoning examples (SFT).

The base model already “knows” the relevant math facts and operations from pretraining. What it lacks is the BEHAVIOR of stringing them together coherently across a long generation. The verifier provides the signal for which strings of tokens lead to correct answers.

RL discovers that emitting reasoning steps is rewarded (because reasoning leads to correct final answers more often). The model develops this behavior without ever being shown an example of it.

The emergent CoT has unusual surface features (mixed languages, weird formatting) precisely BECAUSE no SFT enforced a specific format. The model converges on whatever surface form happens to work for the verifier.

What this means for training pipelines:

1. SFT cold-start is for FORMAT, not CAPABILITY. The SFT step gives the model a consistent surface form (English-only CoT, clean step markers, etc.) but doesn’t fundamentally improve reasoning capability. You can skip it if you don’t care about format.

2. Pure RL is feasible at scale. If you have a strong base and good verifiers, you can train reasoning models without expensive SFT data collection. Smaller labs and academic research can do this.

3. Reasoning is “in the weights” of any good base model. The capability exists; RL just unlocks it. This explains why models that fail at pre-RL CoT prompting can succeed after RL: the capability was always there, just not the behavior.

4. Verifier quality is the binding constraint. The whole pipeline depends on having a reliable verifier. Better verifiers = better reasoning models. This is where the engineering effort now concentrates.

5. Transferability is uncertain. R1-Zero’s RL was on math/code verifiers. Whether the emerged reasoning transfers to non-verifier domains (philosophy, social judgment) is an open question. Some transfer is observed; full transfer is not.

The deeper question:

If reasoning emerges from RL on verifiable rewards, are LLMs “thinking” or are they “finding policies that maximise verifier scores”? Empirically the two are hard to distinguish — the policy that maximises a math verifier IS one that produces valid reasoning. But this raises philosophical questions about the relationship between capability and behavior that aren’t fully resolved.

Next: §20.3 — What works, what doesn’t, and the open problems. The trade-off curves, the failure modes, and the 2026 state of reasoning model deployment. Where the genuine progress is happening and where the hype runs ahead of the science.