GRPO + RLAIF + the modern simplifications
DPO showed that the entire PPO RLHF pipeline could collapse to a supervised loss. But PPO had genuine virtues — on-policy sampling, fine-grained per-token credit assignment, multi-component rewards — that DPO doesn’t have. The post-DPO landscape is a cluster of methods that take DPO’s simplicity and add back specific PPO virtues. GRPO (DeepSeek 2024) keeps on-policy sampling but drops the value head. RLAIF / Constitutional AI replaces human raters with strong LLMs. SimPO, IPO, KTO, ORPO tweak the DPO loss for specific behaviours (length bias, paired vs unpaired data, regret minimisation). This section maps the zoo and explains where each method lives.
GRPO — group-relative policy optimisation
DeepSeek 2024 (the “DeepSeekMath” and DeepSeek V3 papers) introduced GRPO as a middle ground between PPO and DPO.
The key insight: the value network’s job in PPO is to provide a baseline for advantage computation (so the policy gradient has low variance). With K sampled responses per prompt, the group’s mean reward IS a baseline — no separate value network needed.
GRPO’s advantages over PPO:
- One fewer network (no value head): ~25% less memory.
- No value loss to balance: one fewer hyperparameter.
- Group-relative advantages are naturally normalised: less reward-scale tuning needed.
- Still on-policy (samples fresh trajectories every step), unlike DPO.
GRPO’s advantages over DPO:
- On-policy: can sample new data continuously, doesn’t need pre-collected preference dataset.
- Supports continuous reward signals (not just preferences).
- Per-token credit assignment via the PPO clipping mechanism.
RLAIF — replacing humans with LLMs
The other major axis of simplification: RLAIF (RL from AI Feedback). Bai 2022 “Constitutional AI” (Anthropic) replaced human preference annotators with a strong LLM acting as a critic.
The pipeline:
RLAIF’s economics are profound. Human preference annotation costs ~$2-10 per comparison for high-quality work. A 1M-comparison dataset is $2-10M and takes months. An AI critic generates the same dataset for ~$0.01 per comparison (LLM API cost) in hours. The quality is comparable or better for many tasks — strong LLMs have implicit “preferences” that match human values closely enough.
What V_ψ does in PPO:
The value network V_ψ(x, y_(<=t)) estimates the expected return from each state — what reward the policy will accumulate from this point forward. It’s used to compute the advantage:
A_t = r_t + γ · V_ψ(t+1) - V_ψ(t).
The advantage tells the policy “did this action perform better or worse than the value-network baseline expected?” Subtracting a baseline reduces gradient variance — critical for PPO to learn at all.
What GRPO does instead:
Sample K responses (y_1, …, y_K) per prompt x. Compute their rewards r_1, …, r_K. The group baseline is the mean reward of the group; the advantage of response i is:
A_i = (r_i - mean(r_1..K)) / std(r_1..K).
This is a NORMALISED group-relative score. Response i with above-average reward has A_i > 0 (push policy toward it); response i with below-average reward has A_i < 0 (push policy away from it).
The same advantage is used for all tokens in response i (no per-token credit).
Trade-off:
Wins:
- ~25% less memory (no value network).
- ~25% less compute (no value-network forward + backward).
- Fewer hyperparameters (no value loss weight c_v, no value learning rate).
- Naturally normalised advantages: less reward-scale tuning.
- K samples per prompt give richer signal than PPO’s single rollout.
Losses:
- No per-token credit assignment. Whole response gets the same advantage.
- K-fold compute overhead per prompt (K forward passes per training example).
- Variance reduction from K samples is less than V_ψ’s variance reduction (which captures structure across many trajectories).
When GRPO wins: long-form responses where the WHOLE response’s quality matters (math reasoning, code generation, multi-step tasks). DeepSeek-R1’s training used GRPO for exactly this reason — reasoning trajectories are evaluated holistically, not per-token.
When PPO still wins: short responses where per-token credit matters (chat completion, classification). Also when sample efficiency per prompt is critical and you can’t afford K samples per prompt.
The 2024+ alignment zoo
The post-DPO landscape splintered into many methods, each addressing specific shortcomings of DPO. A partial map:
A non-exhaustive tour of the headline methods:
- IPO: replace σ with identity. Fewer length-bias issues.
- SimPO: doesn’t need a reference policy. Uses length-normalised log-probs.
- KTO: works with thumbs-up/down ratings instead of pair comparisons. Practical for production telemetry.
- ORPO: combines SFT and DPO into a single training stage.
The failure mode:
The judge LLM has its own biases, learned during ITS pretraining and alignment. If the judge prefers verbose responses (which most aligned LLMs do — they were RLHF’d to seem helpful and detailed), then RLAIF-aligned models will become MORE verbose. If the judge prefers a specific style or worldview, the trained model will pick that up.
This is “preference distillation” — the trained model inherits the judge’s biases, including pathological ones, without the original human signal to anchor it.
Concretely: GPT-4 as a judge tends to prefer responses written in ChatGPT’s style. A model trained via RLAIF with GPT-4 as judge ends up sounding like ChatGPT, even if its base model was different.
How Constitutional AI mitigates this:
The judge prompt includes an explicit “constitution” — a set of principles the judge should apply (e.g., “responses should be helpful, honest, and harmless; should not contain medical advice unless the user is a professional; should not encourage illegal activity”).
The judge applies these PRINCIPLES rather than its raw preferences. The result is a “principle-aligned” preference signal, not just “what the judge model finds nice.”
Operationally: the constitution is a few-paragraph system prompt fed to the judge before each comparison. The constitution defines the alignment target; the judge applies it consistently across the dataset.
Why this is still imperfect:
- The judge has to interpret the constitution. Different LLMs interpret the same principles differently.
- The constitution is written by humans, with their own value bias.
- Edge cases where principles conflict (helpful but also unsafe) get resolved by the judge’s implicit preferences.
The empirical reality: RLAIF + constitution dramatically reduces the “judge style leakage” problem vs naive RLAIF, but doesn’t eliminate it. Frontier labs (Anthropic, Meta) use this as ONE signal among several, not as a sole alignment source.
The economics still win: even with quality caveats, RLAIF is 100-1000× cheaper than human annotation. For data scaling beyond what humans can produce, it’s the only option.
What’s actually used in production
A rough survey of what production alignment looks like in 2024-2025:
- OpenAI GPT-4 / 4.5 / 5: PPO RLHF with extensive in-house tooling. Multi-component reward (helpful, harmless, factual). Heavy use of RLAIF for preference scaling. Human feedback for high-quality “gold” preferences.
- Anthropic Claude 3/4: Constitutional AI + PPO. The constitution provides the principle layer; RLAIF generates preference data; PPO does the optimisation. Smaller human annotation budget vs OpenAI; more weight on AI-generated signal.
- Meta Llama 3/4: SFT + DPO at the open-source releases; internal versions use rejection sampling + PPO + DPO in some combination. Llama 3.5 added GRPO for math/code training.
- DeepSeek V3 / R1: GRPO. Heavy use of rule-based rewards (math correctness, code correctness via execution) instead of preference models. RLAIF for chat alignment.
- Mistral, Qwen, Gemma: SFT + DPO with various tweaks. Most open-weight models in 2024-2025.
The pattern: frontier labs use whatever combination works best for their specific deployment; everyone else uses SFT + DPO as the default.
Customer support chatbot path:
Step 1: SFT on a curated instruction dataset (Open Hermes, UltraChat, or domain-specific instructions). This gets the model into the right format and basic helpfulness. ~3-5 epochs, ~1 day on 8× H100.
Step 2: DPO on the 20K human preferences plus expanded preferences via RLAIF.
- Why DPO: simple, fast (hours not days), well-understood, no rollouts.
- Why expand with RLAIF: 20K human preferences is enough to get most of the alignment signal, but expanding to ~100K via GPT-4 judging additional response pairs adds diversity and robustness.
- For RLAIF: write a constitutional system prompt aligned with the customer-support persona (helpful, concise, factual about products, escalate when unsure).
- Optional Step 2.5: KTO if you also have thumbs-up/down telemetry from existing chat logs. Cheap to add.
Why not PPO/GRPO? For a chat use case with no rule-based rewards and small annotation budget, the operational overhead of PPO/GRPO isn’t worth it. DPO matches PPO on chat benchmarks in published comparisons.
Reasoning model path:
Step 1: SFT on math/code instruction data (MetaMath, MathInstruct, CodeAlpaca). Gets the model into the right reasoning format.
Step 2: GRPO with rule-based rewards.
- For math: sample K responses; check final answer against ground truth; reward = 1 if correct, 0 if not.
- For code: sample K responses; execute against unit tests; reward = pass rate.
- No reward model needed — the rule-based reward is exact and tamper-proof.
- GRPO’s group-relative advantage is perfect here: among K sampled solutions to the same problem, the correct ones get positive advantage, incorrect ones negative.
Why GRPO over DPO for reasoning:
- Reasoning quality is binary (correct vs incorrect), not pairwise preferences. DPO doesn’t fit; GRPO does.
- On-policy sampling is necessary for chain-of-thought exploration. The model has to TRY long reasoning traces and learn from which traces succeed.
- K=8 or K=16 rollouts per prompt is feasible at training time and gives strong learning signal.
Why not PPO for reasoning:
Value networks are hard to train for long reasoning traces — the value function has to predict eventual success from intermediate states, which is inherently noisy. GRPO’s group-baseline approach sidesteps this entirely.
Why not just SFT for reasoning:
SFT on correct reasoning traces (chain-of-thought distillation) works for narrow domains but doesn’t induce exploration. GRPO trains the model to GENERATE chains that are MORE LIKELY TO BE CORRECT — a fundamentally different objective.
DeepSeek-R1 and OpenAI’s o1/o3 reasoning models all use GRPO-like training. SFT alone wouldn’t have produced their capabilities.
END OF CH.18 — Alignment.
§1 (classical RLHF: SFT, Bradley-Terry reward model, PPO with KL constraint) ·
§2 (DPO derivation: closed-form constrained optimum, BT loss collapses to log-ratio sigmoid) ·
§3 (GRPO + RLAIF + the modern zoo: IPO, SimPO, KTO, ORPO, Constitutional AI).
Next: Ch.19 — Fine-tuning, LoRA, PEFT. The chapter you asked us to add. The full math behind why a rank-8 update matrix is enough to fine-tune a 70B model.