GRPO + RLAIF + the modern simplifications

Section 18.3

DPO showed that the entire PPO RLHF pipeline could collapse to a supervised loss. But PPO had genuine virtues (on-policy sampling, fine-grained per-token credit assignment, multi-component rewards) that DPO lacks. The post-DPO landscape is a cluster of methods that take DPO’s simplicity and add back specific PPO virtues. GRPO (DeepSeek 2024) keeps on-policy sampling but drops the value head. RLAIF / Constitutional AI replaces human raters with strong LLMs. SimPO, IPO, KTO, and ORPO tweak the DPO loss for specific behaviours (length bias, overfitting, paired vs unpaired data, a separate SFT stage). This section maps the zoo and explains where each method lives.

GRPO — group-relative policy optimisation

DeepSeek 2024 (described in the “DeepSeekMath” and DeepSeek V3 papers) introduced GRPO as a middle ground between PPO and DPO. GRPO is a PPO-style RL method that eliminates the value network by computing advantages from a group of K sampled responses per prompt. The advantage of each response is its reward minus the group's mean reward, divided by the group's standard deviation. PPO clipping and KL regularisation are kept, but the value head is gone. It is operationally simpler than PPO, supports on-policy training (unlike DPO), and is the alignment method used to train DeepSeek-R1 and other reasoning-focused models.

GRPO setup:

For each prompt x:
  1. Sample K responses y_1, ..., y_K ∼ π_θ(· | x)
  2. Score each: r_i = reward(x, y_i)
  3. Compute group-relative advantage:
       A_i = (r_i - mean(r_1..K)) / std(r_1..K)
     (A_i tells you how much better/worse y_i was than the group average)

PPO-style loss (clipped), no value network needed:

  L = - Σ_i (1/T_i) Σ_t min(ρ_t · A_i, clip(ρ_t, 1-ε, 1+ε) · A_i) + β · KL(π_θ ‖ π_ref)

Where ρ_t = π_θ(y_t)/π_old(y_t) is the per-token importance ratio, T_i is the length of response i, and the advantage A_i is shared across all tokens in response i.
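The setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not DeepSeek's implementation: it covers a single prompt's group, omits the β · KL term, uses the population standard deviation, and adds a small `eps` guard against a zero std. The function names and the `eps` and `clip_eps` defaults are assumptions made for the example.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / (std + eps); eps is an added guard, not part of the formula."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_policy_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped surrogate loss for one prompt's group of K responses (KL term omitted).

    logp_new, logp_old: lists of 1-D arrays, one array per response, holding
    per-token log-probs under pi_theta and pi_old respectively.
    """
    adv = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        rho = np.exp(np.asarray(lp_new) - np.asarray(lp_old))   # per-token ratio rho_t
        clipped = np.clip(rho, 1.0 - clip_eps, 1.0 + clip_eps)
        per_token = np.minimum(rho * a, clipped * a)             # same A_i for every token
        total += per_token.mean()                                # the (1/T_i) * sum_t term
    return -total
```

When `logp_new` equals `logp_old` every ratio is 1, so the loss reduces to minus the sum of the advantages, which is zero by construction.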

The key insight: the value network’s job in PPO is to provide a baseline for advantage computation (so the policy gradient has low variance). With K sampled responses per prompt, the group’s mean reward IS a baseline — no separate value network needed.

GRPO’s advantages over PPO:

One fewer network (no value head): ~25% less memory.
No value loss to balance: one fewer hyperparameter.
Group-relative advantages are naturally normalised: less reward-scale tuning needed.
Still on-policy (samples fresh trajectories every step), unlike DPO.

GRPO’s advantages over DPO:

On-policy: can sample new data continuously, doesn’t need pre-collected preference dataset.
Supports continuous reward signals (not just preferences).
Per-token importance ratios and clipping, although the advantage itself is shared by every token in a response.

RLAIF — replacing humans with LLMs

The other major axis of simplification is RLAIF (RL from AI Feedback): replacing human raters with a strong LLM acting as judge for generating preference data. The judge LLM compares pairs of responses and provides the preference label (usually with chain-of-thought reasoning); these AI-generated preferences are used in place of human preferences for downstream alignment (PPO, DPO, GRPO). The approach was introduced by Anthropic's Constitutional AI (Bai 2022), which replaced human preference annotators with a strong LLM acting as a critic. It cuts annotation cost by orders of magnitude, but the quality of the data depends on the judge LLM's alignment.

The pipeline:

Constitutional AI (Bai 2022):

  1. SFT on demonstration data (same as classical pipeline).
  2. GENERATE preference data with a "critic" LLM:
     - Sample pairs of responses to prompts.
     - Use a critic LLM with a constitutional system prompt: "Compare these two responses. Which is more helpful, honest, and harmless? Reasoning step-by-step..."
     - The critic produces a preference label.
     - These AI-generated preferences are the training data.
  3. RLHF (or DPO/GRPO) on the AI-generated preference dataset.

Result: human annotation work goes from ~100K-1M comparisons to ~10K "constitution" prompts that define the desired behaviour. The bulk of preference data is generated by the AI critic.
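Step 2 of the pipeline amounts to a loop that turns judge verdicts into (prompt, chosen, rejected) records. The sketch below assumes two hypothetical callables, `sample_pair` (which returns two candidate responses for a prompt) and `judge` (which returns "A" or "B"); in a real system both would wrap LLM calls, and here they are replaced by toy stand-ins so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

def build_preference_dataset(
    prompts: List[str],
    sample_pair: Callable[[str], Tuple[str, str]],
    judge: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Turn judge verdicts into preference records usable by DPO-style training."""
    dataset = []
    for x in prompts:
        a, b = sample_pair(x)
        verdict = judge(x, a, b)          # expected to be "A" or "B"
        if verdict not in ("A", "B"):
            continue                      # drop comparisons the judge did not resolve
        chosen, rejected = (a, b) if verdict == "A" else (b, a)
        dataset.append({"prompt": x, "chosen": chosen, "rejected": rejected})
    return dataset

# Toy stand-ins (hypothetical): a real pipeline would call the policy and the critic LLM.
def toy_sample_pair(x: str) -> Tuple[str, str]:
    return (x + " -> short answer", x + " -> a much longer answer")

def toy_judge(x: str, a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"

pairs = build_preference_dataset(["q1", "q2"], toy_sample_pair, toy_judge)
```

The output records use the prompt/chosen/rejected layout that preference-optimisation trainers commonly expect, which is why the same dataset can feed RLHF reward-model training or DPO.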

RLAIF’s economics are profound. Human preference annotation costs ~$2-10 per comparison for high-quality work. A 1M-comparison dataset is $2-10M and takes months. An AI critic generates the same dataset for ~$0.01 per comparison (LLM API cost) in hours. The quality is comparable or better for many tasks — strong LLMs have implicit “preferences” that match human values closely enough.
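The arithmetic behind these figures is simple enough to check directly. The snippet below only restates the per-comparison prices quoted above; `annotation_cost` is a helper name introduced for the example.

```python
def annotation_cost(n_comparisons: int, cost_per_comparison: float) -> float:
    """Total labelling cost in dollars."""
    return n_comparisons * cost_per_comparison

n = 1_000_000
human_low = annotation_cost(n, 2.0)     # low end of the human price range
human_high = annotation_cost(n, 10.0)   # high end of the human price range
ai = annotation_cost(n, 0.01)           # LLM API cost per comparison

print(f"human: ${human_low:,.0f} to ${human_high:,.0f}")
print(f"AI critic: ${ai:,.0f}")
print(f"ratio: {human_low / ai:.0f}x to {human_high / ai:.0f}x")
```

At these prices the AI critic is roughly 200 to 1000 times cheaper per comparison.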

— think, then check —

What V_ψ does in PPO:

The value network V_ψ(x, y_(<=t)) estimates the expected return from each state — what reward the policy will accumulate from this point forward. It’s used to compute the advantage:

A_t = r_t + γ · V_ψ(t+1) - V_ψ(t).

The advantage tells the policy “did this action perform better or worse than the value-network baseline expected?” Subtracting a baseline reduces gradient variance — critical for PPO to learn at all.
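As a concrete illustration of the formula above, the following sketch computes the one-step advantage for a single response from per-token rewards and value estimates. It assumes the value after the final token is zero, and the numbers are invented for the example. PPO implementations commonly use a smoothed variant (generalised advantage estimation) rather than this one-step form.

```python
import numpy as np

def td_advantages(rewards, values, gamma=1.0):
    """A_t = r_t + gamma * V(t+1) - V(t), with V after the last token taken as 0."""
    r = np.asarray(rewards, dtype=float)
    v = np.asarray(values, dtype=float)
    v_next = np.append(v[1:], 0.0)      # shift values left, terminal value is 0
    return r + gamma * v_next - v

# Reward arrives only at the final token, as with a sequence-level reward model.
adv = td_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.9])
```

Here `adv` is approximately [0.1, 0.3, 0.1]: each token is credited with how much the outcome exceeded what the value network predicted at that point.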

What GRPO does instead:

Sample K responses (y_1, …, y_K) per prompt x. Compute their rewards r_1, …, r_K. The group baseline is the mean reward of the group; the advantage of response i is:

A_i = (r_i - mean(r_1..K)) / std(r_1..K).

This is a NORMALISED group-relative score. Response i with above-average reward has A_i > 0 (push policy toward it); response i with below-average reward has A_i < 0 (push policy away from it).

The same advantage is used for all tokens in response i (no per-token credit).
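A small worked example makes the normalisation and the sharing across tokens concrete. The rewards and token counts below are made up, and the population standard deviation (NumPy's default) is an assumption about which std is meant.

```python
import numpy as np

rewards = np.array([2.0, 4.0, 4.0, 6.0])   # K = 4 sampled responses
lengths = [3, 5, 2, 4]                      # hypothetical token counts T_i

# mean = 4, population std = sqrt(2), so adv is about [-1.414, 0, 0, 1.414]
adv = (rewards - rewards.mean()) / rewards.std()

# Every token of response i receives the same scalar A_i.
token_adv = [np.full(T, a) for T, a in zip(lengths, adv)]
```

The two responses that scored exactly the group mean get zero advantage and therefore contribute no gradient through the policy term.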

Trade-off:

Wins:

~25% less memory (no value network).
~25% less compute (no value-network forward + backward).
Fewer hyperparameters (no value loss weight c_v, no value learning rate).
Naturally normalised advantages: less reward-scale tuning.
K samples per prompt give richer signal than PPO’s single rollout.

Losses:

No per-token credit assignment. Whole response gets the same advantage.
K-fold sampling overhead per prompt (K generated responses per training prompt, each of which must also be scored).
Variance reduction from K samples is less than V_ψ’s variance reduction (which captures structure across many trajectories).

When GRPO wins: long-form responses where the WHOLE response’s quality matters (math reasoning, code generation, multi-step tasks). DeepSeek-R1’s training used GRPO for exactly this reason — reasoning trajectories are evaluated holistically, not per-token.

When PPO still wins: short responses where per-token credit matters (chat completion, classification). Also when sample efficiency per prompt is critical and you can’t afford K samples per prompt.

↳ §18.3 + DeepSeek 2024

The 2024+ alignment zoo

The post-DPO landscape splintered into many methods, each addressing specific shortcomings of DPO. A partial map:

Alignment method comparison (the 2024-2025 zoo):

Method          Data      Reward Model  Value Net  Sampling   Key benefit
----------------------------------------------------------------------------------
PPO RLHF        prefs     yes           yes        on-policy  Full credit
DPO             prefs     no            no         offline    Simplicity
GRPO            rewards   yes or rules  NO         on-policy  No value net
IPO             prefs     no            no         offline    Less overfitting
SimPO           prefs     no            no         offline    No ref policy needed
KTO             ratings   no            no         offline    Single-side ratings
ORPO            prefs     no            no         offline    Combined SFT+pref
RLAIF           AI prefs  yes/no        depends    varies     No human labelers
Constitutional  rules+AI  no            no         varies     Norm-based safety

Key axes:
- Preference data (pairs) vs reward (scalar) vs ratings (single side)
- Need for reward model: more complex but enables multi-component
- Need for value network: PPO's expensive piece
- On-policy (sample new) vs offline (use static dataset)

A non-exhaustive tour of the headline methods:

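IPO: replaces DPO’s log-sigmoid with the identity mapping, which gives a squared loss on the log-ratio margin. Targets DPO’s tendency to overfit when preferences are nearly deterministic.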
SimPO: doesn’t need a reference policy. Uses length-normalised log-probs.
KTO: works with thumbs-up/down ratings instead of pair comparisons. Practical for production telemetry.
ORPO: combines SFT and DPO into a single training stage.
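To make the differences concrete, here is a sketch of three of these losses for a single preference pair, written from the published formulas as I understand them. Inputs are summed sequence log-probabilities of the chosen (`w`) and rejected (`l`) responses under the policy and, where needed, the reference model. The default values of `beta`, `tau`, and `gamma` are illustrative, not recommendations.

```python
import math

def _log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """DPO: logistic loss on the policy-vs-reference log-ratio margin."""
    margin = (lp_w - ref_w) - (lp_l - ref_l)
    return -_log_sigmoid(beta * margin)

def ipo_loss(lp_w, lp_l, ref_w, ref_l, tau=0.1):
    """IPO: squared loss pulling the same margin toward the target 1 / (2 * tau)."""
    margin = (lp_w - ref_w) - (lp_l - ref_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2

def simpo_loss(lp_w, lp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: length-normalised log-probs, a target margin gamma, no reference model."""
    margin = beta * (lp_w / len_w - lp_l / len_l) - gamma
    return -_log_sigmoid(margin)
```

Note the structural difference: DPO keeps rewarding an ever larger margin, IPO is minimised at a finite margin, and SimPO drops `ref_w` and `ref_l` entirely.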

— think, then check —

The failure mode:

The judge LLM has its own biases, learned during ITS pretraining and alignment. If the judge prefers verbose responses (which most aligned LLMs do — they were RLHF’d to seem helpful and detailed), then RLAIF-aligned models will become MORE verbose. If the judge prefers a specific style or worldview, the trained model will pick that up.

This is “preference distillation” — the trained model inherits the judge’s biases, including pathological ones, without the original human signal to anchor it.

Concretely: GPT-4 as a judge tends to prefer responses written in ChatGPT’s style. A model trained via RLAIF with GPT-4 as judge ends up sounding like ChatGPT, even if its base model was different.

How Constitutional AI mitigates this:

The judge prompt includes an explicit “constitution” — a set of principles the judge should apply (e.g., “responses should be helpful, honest, and harmless; should not contain medical advice unless the user is a professional; should not encourage illegal activity”).

The judge applies these PRINCIPLES rather than its raw preferences. The result is a “principle-aligned” preference signal, not just “what the judge model finds nice.”

Operationally: the constitution is a few-paragraph system prompt fed to the judge before each comparison. The constitution defines the alignment target; the judge applies it consistently across the dataset.
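A minimal sketch of that mechanism: render the constitution into the judge prompt, then parse the verdict from the judge's free-text output. The prompt wording, the two sample principles, and the "Verdict: A/B" convention are all assumptions made for illustration, not the format any particular lab uses.

```python
import re

CONSTITUTION = [
    "Responses should be helpful, honest, and harmless.",
    "Responses should not encourage illegal activity.",
]

def build_judge_prompt(constitution, user_prompt, response_a, response_b):
    """Prepend the numbered principles to every comparison the judge sees."""
    principles = "\n".join(f"{i}. {p}" for i, p in enumerate(constitution, 1))
    return (
        "You are comparing two responses. Apply these principles:\n"
        f"{principles}\n\n"
        f"Prompt: {user_prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Reason step by step, then end with 'Verdict: A' or 'Verdict: B'."
    )

def parse_verdict(judge_output):
    """Return 'A' or 'B' from the last verdict line, or None if none is found."""
    matches = re.findall(r"Verdict:\s*([AB])\b", judge_output)
    return matches[-1] if matches else None

prompt = build_judge_prompt(CONSTITUTION, "How do I reset my password?", "resp 1", "resp 2")
```

Returning `None` for unparseable output lets the caller drop that comparison instead of guessing a label.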

Why this is still imperfect:

The judge has to interpret the constitution. Different LLMs interpret the same principles differently.
The constitution is written by humans, with their own value bias.
Edge cases where principles conflict (helpful but also unsafe) get resolved by the judge’s implicit preferences.

The empirical reality: RLAIF + constitution dramatically reduces the “judge style leakage” problem vs naive RLAIF, but doesn’t eliminate it. Frontier labs (Anthropic, Meta) use this as ONE signal among several, not as a sole alignment source.

The economics still win: even with quality caveats, RLAIF is 100-1000× cheaper than human annotation. For data scaling beyond what humans can produce, it’s the only option.

↳ §18.3 + Constitutional AI

What’s actually used in production

A rough survey of what production alignment looks like in 2024-2025:

OpenAI GPT-4 / 4.5 / 5: PPO RLHF with extensive in-house tooling. Multi-component reward (helpful, harmless, factual). Heavy use of RLAIF for preference scaling. Human feedback for high-quality “gold” preferences.
Anthropic Claude 3/4: Constitutional AI + PPO. The constitution provides the principle layer; RLAIF generates preference data; PPO does the optimisation. Smaller human annotation budget vs OpenAI; more weight on AI-generated signal.
Meta Llama 3/4: the Llama 3 technical report describes iterated rounds of rejection sampling, SFT, and DPO; Llama 2 had used rejection sampling + PPO. Internal pipelines may differ from what is published.
DeepSeek V3 / R1: GRPO. Heavy use of rule-based rewards (math correctness, code correctness via execution) instead of preference models. RLAIF for chat alignment.
Mistral, Qwen, Gemma: SFT + DPO with various tweaks. This is the recipe behind most open-weight models in 2024-2025.

The pattern: frontier labs use whatever combination works best for their specific deployment; everyone else uses SFT + DPO as the default.

— think, then check —

Customer support chatbot path:

Step 1: SFT on a curated instruction dataset (OpenHermes, UltraChat, or domain-specific instructions). This gets the model into the right format and basic helpfulness. ~3-5 epochs, ~1 day on 8× H100.

Step 2: DPO on the 20K human preferences plus expanded preferences via RLAIF.

Why DPO: simple, fast (hours not days), well-understood, no rollouts.
Why expand with RLAIF: 20K human preferences is enough to get most of the alignment signal, but expanding to ~100K via GPT-4 judging additional response pairs adds diversity and robustness.
For RLAIF: write a constitutional system prompt aligned with the customer-support persona (helpful, concise, factual about products, escalate when unsure).
Optional Step 2.5: KTO if you also have thumbs-up/down telemetry from existing chat logs. Cheap to add.

Why not PPO/GRPO? For a chat use case with no rule-based rewards and small annotation budget, the operational overhead of PPO/GRPO isn’t worth it. DPO matches PPO on chat benchmarks in published comparisons.

Reasoning model path:

Step 1: SFT on math/code instruction data (MetaMath, MathInstruct, CodeAlpaca). Gets the model into the right reasoning format.

Step 2: GRPO with rule-based rewards.

For math: sample K responses; check final answer against ground truth; reward = 1 if correct, 0 if not.
For code: sample K responses; execute against unit tests; reward = pass rate.
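No reward model needed: the rule-based reward is exact and much harder to game than a learned reward model, though a weak answer checker or thin test suite can still be exploited.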
GRPO’s group-relative advantage is perfect here: among K sampled solutions to the same problem, the correct ones get positive advantage, incorrect ones negative.
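The two rule-based rewards above can be sketched as follows. The "Answer:" marker convention and the representation of a code sample as a Python callable are assumptions made to keep the example self-contained; a real pipeline would extract answers more robustly and run generated code in a sandbox.

```python
def math_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text after the last 'Answer:' marker matches the ground truth, else 0.0."""
    marker = "Answer:"
    if marker not in response:
        return 0.0                      # no final answer given
    final = response.rsplit(marker, 1)[1].strip()
    return 1.0 if final == ground_truth.strip() else 0.0

def code_reward(candidate, test_cases) -> float:
    """Fraction of (args, expected) test cases that the candidate callable passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass                        # a crashing test counts as a failure
    return passed / len(test_cases)

r_math = math_reward("2 * 21 = 42. Answer: 42", "42")
r_code = code_reward(lambda a, b: a + b, [((1, 2), 3), ((2, 2), 5)])
```

In this example `r_math` is 1.0 and `r_code` is 0.5, since the second test case expects an incorrect sum.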

Why GRPO over DPO for reasoning:

Reasoning quality is binary (correct vs incorrect), not pairwise preferences. DPO doesn’t fit; GRPO does.
On-policy sampling is necessary for chain-of-thought exploration. The model has to TRY long reasoning traces and learn from which traces succeed.
K=8 or K=16 rollouts per prompt is feasible at training time and gives strong learning signal.
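One way to see why larger K helps with binary rewards: a group where every sample is correct, or every sample is wrong, has identical rewards and therefore zero advantage for all responses. Under the simplifying assumption that the K samples are independent and each is correct with probability p, the chance that a group carries any signal is 1 - p^K - (1-p)^K. The function name below is introduced for this example.

```python
def prob_group_has_signal(p_correct: float, k: int) -> float:
    """Probability that K independent binary rewards are not all identical."""
    return 1.0 - p_correct ** k - (1.0 - p_correct) ** k

for k in (2, 8, 16):
    print(k, round(prob_group_has_signal(0.1, k), 3))
```

For a hard problem with p = 0.1, about 18% of groups are informative at K=2, about 57% at K=8, and about 81% at K=16.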

Why not PPO for reasoning:

Value networks are hard to train for long reasoning traces — the value function has to predict eventual success from intermediate states, which is inherently noisy. GRPO’s group-baseline approach sidesteps this entirely.

Why not just SFT for reasoning:

SFT on correct reasoning traces (chain-of-thought distillation) works for narrow domains but doesn’t induce exploration. GRPO trains the model to GENERATE chains that are MORE LIKELY TO BE CORRECT — a fundamentally different objective.

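DeepSeek-R1 was trained with GRPO, and OpenAI describes its o1/o3 reasoning models as trained with large-scale reinforcement learning, though the exact algorithm is not public. SFT alone wouldn’t have produced their capabilities.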

↳ §18.3 + production deployment 2024-2025

END OF CH.18 — Alignment.
§1 (classical RLHF: SFT, Bradley-Terry reward model, PPO with KL constraint) · §2 (DPO derivation: closed-form constrained optimum, BT loss collapses to log-ratio sigmoid) · §3 (GRPO + RLAIF + the modern zoo: IPO, SimPO, KTO, ORPO, Constitutional AI).

Next: Ch.19 — Fine-tuning, LoRA, PEFT. The chapter you asked us to add. The full math behind why a rank-8 update matrix is enough to fine-tune a 70B model.