ALIGNMENT: RLHF → DPO → GRPO
Section 18.2

DPO — the closed-form RLHF

PPO-based RLHF is operationally complex: up to four networks in memory (policy, reference policy, reward model, value network), fragile hyperparameters, expensive autoregressive sampling for rollouts, and a well-earned reputation for being finicky. Rafailov et al. (2023), “Direct Preference Optimization”, showed that much of this complexity is unnecessary, at least when the alignment signal is pairwise preference comparisons. The derivation is elegant: the optimal policy under KL-constrained reward maximisation has a closed-form expression in terms of the reward. Substitute that expression into the reward-model loss and the reward function drops out; only the policy and a reference policy remain. The result is a simple supervised cross-entropy loss on preference pairs, with no reward model, no rollouts, and no value head. In the paper’s experiments it matches or beats PPO, and it has since become a common default for open-source fine-tunes; frontier labs often still use online, PPO-style methods, for reasons covered later in this section.

The derivation

Start with the KL-constrained RLHF objective:

    maximise over π:  E_{x, y ∼ π} [ r(x, y) ] - β · KL(π(· | x) ‖ π_ref(· | x))

This is solved analytically. The constrained optimum (over all measurable π) is:

    π*(y | x) = (1 / Z(x)) · π_ref(y | x) · exp(r(x, y) / β)

where Z(x) = Σ_y' π_ref(y' | x) · exp(r(x, y') / β) is the partition function. This is a standard result: the constrained maximum-entropy distribution given a reference and a reward is an exponential tilt of the reference. Rearranging to express r in terms of π* and π_ref:

    r(x, y) = β · log [ π*(y | x) / π_ref(y | x) ] + β · log Z(x)

So far, no surprises. The closed form for the optimal policy in terms of the reward is classical. Rafailov’s insight is in the next step.

Recall the Bradley-Terry loss for reward modeling:

    L_RM = - E_{(x, y_w, y_l)} [ log σ(r(x, y_w) - r(x, y_l)) ]

Substitute the expression for r in terms of π*:

    r(x, y_w) - r(x, y_l) = β · log [π*(y_w|x) / π_ref(y_w|x)] + β · log Z(x)
                          - β · log [π*(y_l|x) / π_ref(y_l|x)] - β · log Z(x)

The Z(x) terms CANCEL: they depend only on x, not on y_w or y_l. So:

    r(x, y_w) - r(x, y_l) = β · log [ π*(y_w|x) / π_ref(y_w|x) ] - β · log [ π*(y_l|x) / π_ref(y_l|x) ]

Replacing the unknown π* with a trainable policy π_θ, the Bradley-Terry loss becomes:

    L_DPO = - E_{(x, y_w, y_l)} [ log σ( β · log[π_θ(y_w|x) / π_ref(y_w|x)] - β · log[π_θ(y_l|x) / π_ref(y_l|x)] ) ]

(Here π_θ is the policy being trained; it stands in for the optimal policy π*.)

This is the DPO loss — a direct supervised loss on preference triples (x, y_w, y_l). No reward model. No rollouts. No value network. No KL term computed explicitly. The KL constraint is BAKED INTO the loss via the log-ratio structure.

DPO reduces the alignment pipeline to: SFT, then DPO. Two stages, no reward model in between, no RL.

— think, then check —

Step 1 — The constrained optimum:

max over π of E_{y ∼ π}[r(x, y)] - β · KL(π‖π_ref) gives π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β), where Z(x) = Σ_y' π_ref(y'|x) · exp(r(x,y')/β).

(Standard Lagrangian / Boltzmann-distribution result.)

Step 2 — Express r in terms of π*:

Solving for r: r(x, y) = β · log[π*(y|x)/π_ref(y|x)] + β · log Z(x).

Step 3 — Substitute into Bradley-Terry:

L_RM = -E[log σ(r(y_w) - r(y_l))].

The difference r(y_w) - r(y_l) is:

= β log[π*(y_w)/π_ref(y_w)] + β log Z(x) - β log[π*(y_l)/π_ref(y_l)] - β log Z(x)

= β log[π*(y_w)/π_ref(y_w)] - β log[π*(y_l)/π_ref(y_l)].

Where the simplifications come from:

(1) Z(x) cancels. The partition function depends on x but is the same for y_w and y_l. When we take the difference r(y_w) - r(y_l), the β log Z(x) terms appear twice with opposite signs and cancel exactly. This is crucial — Z(x) is intractable to compute (it’s a sum over all y), so cancellation is what makes DPO usable.

(2) The reward function “disappears.” The right-hand side is purely in terms of π* and π_ref — the reward r is implicit in the optimal policy itself. The Bradley-Terry preference probability is now expressed entirely in policy-space, with no explicit reward.

Step 4 — Replace π* with trainable π_θ:

π* is the (unknown) optimal policy we want to recover. We parameterise a trainable π_θ that we hope converges to π*. The DPO loss is:

L_DPO(θ) = -E[log σ(β log[π_θ(y_w)/π_ref(y_w)] - β log[π_θ(y_l)/π_ref(y_l)])].

Minimising this maximises π_θ(y_w)/π_ref(y_w) relative to π_θ(y_l)/π_ref(y_l) — moving probability mass toward winners, away from losers, in a way that preserves the SFT reference structure (via the log-ratio normalisation).

The conceptual punchline: the optimal policy IS the reward function (up to a function of x alone, which Bradley-Terry cancels). So you can train the policy directly on preferences without ever instantiating a separate reward model.

Now make it run

The kernel implements DPO on a tiny 3-class policy. The setup: prompts where the chosen response should be “yes” or “no” depending on a feature, and the rejected is always “maybe.” SFT learns the chosen labels via cross-entropy; DPO further suppresses “maybe.”

dpo.c — dpo_step C · DPO loss with log-ratio sigmoid
    /* (excerpt starts inside the SFT step, after the logits for x have been computed) */
    float probs[V]; memcpy(probs, logits, sizeof(probs));
    softmax(probs, V);
    /* gradient: ∂L/∂W[j, v] = (probs[v] - y[v]) · x[j], y is one-hot at y_chosen */
    for (int j = 0; j < D; j++)
        for (int v = 0; v < V; v++) {
            float grad = (probs[v] - (v == y_chosen ? 1.0f : 0.0f)) * x[j];
            p->W[j * V + v] -= LR * grad;
        }
}

/* DPO step: minimise -log σ(β · (log π_θ(c|x) - log π_ref(c|x) - log π_θ(r|x) + log π_ref(r|x))). */
static void dpo_step(policy_t* p, const policy_t* ref, const float* x,
                     int y_chosen, int y_rejected, float beta)
{
    /* h(x) = β · ((log π_θ(c|x) − log π_ref(c|x)) − (log π_θ(r|x) − log π_ref(r|x))) */
    float log_pc_th = log_prob(p,   x, y_chosen);
    float log_pr_th = log_prob(p,   x, y_rejected);
    float log_pc_rf = log_prob(ref, x, y_chosen);
    float log_pr_rf = log_prob(ref, x, y_rejected);
    float h = beta * ((log_pc_th - log_pc_rf) - (log_pr_th - log_pr_rf));
    float sigma = sigmoidf(-h);   /* L = -log σ(h); dL/dh = -σ(-h) = -sigma */
    float dL_dh = -sigma;          /* the gradient w.r.t. h */

    /* ∂h/∂W comes from log_prob gradients. log π(y|x) wrt W[j,v]:
       (∂/∂W[j,v]) log π(y|x) = (1[v=y] - π_v) · x[j]
       So:
       ∂(log π(c) - log π(r))/∂W[j,v]  =  ((1[v=c] - π_v) - (1[v=r] - π_v)) · x[j]
                                       =  (1[v=c] - 1[v=r]) · x[j]    (π_v cancels!)
       times β. */
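    float dh_dWj[V];   /* per-class coefficient: β · (1[v=c] − 1[v=r]) */
    for (int v = 0; v < V; v++)
        dh_dWj[v] = beta * ((v == y_chosen ? 1.0f : 0.0f) - (v == y_rejected ? 1.0f : 0.0f));

    /* The source excerpt ends above; this update is reconstructed from the
       derivation in the preceding comment and mirrors the SFT step's rule. */
    for (int j = 0; j < D; j++)
        for (int v = 0; v < V; v++)
            p->W[j * V + v] -= LR * dL_dh * dh_dWj[v] * x[j];
}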

Output:

Direct Preference Optimization (DPO) — 3-class policy
Setup: chosen = (yes if x[0] > 0 else no); rejected = always maybe.
       SFT trains on chosen labels (CE loss).
       DPO further suppresses 'maybe' beyond what SFT learned.

Stage              accuracy   avg P(rejected='maybe')
Base (random)      9 / 20      0.297
After SFT          19 / 20      0.017
After DPO          18 / 20      0.000

Read: SFT brings chosen-label accuracy to 19/20, a strong baseline. SFT also reduces the “maybe” probability from 0.30 to 0.017, but only as a side effect: the CE loss sees the chosen label, never the rejected one. DPO drives “maybe” to effectively zero by EXPLICITLY using both chosen and rejected information, while accuracy stays essentially flat (18/20 versus 19/20 in this run). The mechanism matches the math: DPO’s gradient simultaneously pushes π up for chosen AND down for rejected, in a single sigmoid loss.

DPO vs PPO — operational comparison

PPO RLHF pipeline:
    Networks held in memory:  policy + ref policy + reward model + value network   ≈ 4×
    Hyperparameters:          lr, β, ε, c_v, c_h, reward scale, sampling T, batch  ≈ 8
    Stages:                   SFT → reward model training → PPO with rollouts
    Per-step work:            rollouts (slow, autoregressive) + value train + policy update

DPO pipeline:
    Networks held in memory:  policy + ref policy   ≈ 2×
    Hyperparameters:          lr, β                 ≈ 2
    Stages:                   SFT → DPO
    Per-step work:            forward pass on preference pairs + supervised loss + update

Roughly: DPO uses about half the memory, 4-8× less compute per step, and has far fewer knobs to tune. Quality: matched or slightly better on the benchmarks reported by Rafailov et al. (2023).
— think, then check —

Quantitative simplification:

  • Networks: PPO needs policy + ref + reward + value (4 networks loaded). DPO needs policy + ref (2 networks). 2× less memory.
  • Hyperparameters: PPO has ~8 (β, clip ε, value loss weight c_v, entropy coef c_h, reward scale, sampling T, batch size, lr). DPO has 2 (β, lr). 4× fewer knobs.
  • Per-step compute: PPO needs to sample full rollouts from the policy (expensive autoregressive decode) before each gradient step. DPO works on STATIC preference pairs — no sampling. 4-8× less compute per training step in typical settings.
  • Training time: PPO RLHF runs take days-weeks; DPO runs take hours-days for the same dataset.

Why frontier labs still use PPO-style (or descendants):

  1. Online learning. PPO can sample on-policy, score the samples with a reward model, and update — closing the loop in real-time as the policy changes. DPO requires the preference dataset to be pre-collected and FIXED. For continuous learning from fresh preferences (which frontier labs do), PPO’s online nature is valuable.
  2. Reward model ensembles. The reward model in PPO can be an ensemble of multiple RMs trained on different data slices, reducing reward hacking. DPO has no separate reward model to ensemble.
  3. Multi-step objectives. PPO supports per-token credit assignment via the value head. For long-form responses with mid-response quality variations, this matters. DPO assigns credit to whole responses uniformly.
  4. Constitutional / multi-objective reward. Pipelines in the style of Anthropic’s Constitutional AI combine several objectives (for example helpfulness and harmlessness) in the training signal. PPO naturally folds these into a scalar reward. DPO operates on a single preference ordering, which is hard to decompose.
  5. Sample efficiency at the frontier. When you have nearly-infinite compute and labour for preference annotation, the data efficiency advantage of DPO matters less than the online-learning advantage of PPO.

The current convergence: hybrid methods. Methods like Online DPO, Iterative DPO, and DPO + Reward Model (where DPO is regularly re-evaluated against a held-out reward model) bridge the gap. GRPO (next section) is essentially a stripped-down PPO that keeps the online property without the value network.

For most teams below frontier scale, DPO is the right default. For frontier labs with very large compute and annotation budgets, the operational complexity of PPO is acceptable in exchange for online learning and richer reward signals.

The role of β

β controls how much the policy can deviate from the reference:

    L_DPO = -E[ log σ( β · log[π_θ(y_w)/π_ref(y_w)] - β · log[π_θ(y_l)/π_ref(y_l)] ) ]

Large β:
  - The log-ratio is amplified before the sigmoid.
  - Small differences in policy from reference produce large preference signals, so the sigmoid saturates after a small move.
  - The policy stays CLOSE to π_ref (small updates).
  - Useful when π_ref is already good and you want minor refinements.

Small β:
  - The log-ratio is attenuated; the sigmoid stays far from saturation, so the loss keeps pushing.
  - The policy can move FURTHER from π_ref.
  - Useful when you want the preferences to reshape the policy more strongly, at the risk of drift.

Typical β: 0.1 to 0.5. Practitioners tune this; lower values for "more aggressive learning", higher for "stay safe."

— think, then check —

Setup: L_DPO = -E[log σ(β · (log[π_θ(y_w)/π_ref(y_w)] - log[π_θ(y_l)/π_ref(y_l)]))]

β = 0 limit:

The argument of σ becomes 0 · (log ratio) = 0. So L = -log σ(0) = -log(0.5) = log 2 — a constant.

The gradient is zero. The policy doesn’t update. So β = 0 is “no learning.”

(Inside the σ, the “preference signal” is zero — the loss doesn’t depend on the policy.)

β → ∞ limit:

The log-ratio margin, multiplied by β, becomes very large in magnitude. The sigmoid saturates: σ → 1 when the chosen log-ratio exceeds the rejected log-ratio (positive margin), and σ → 0 when it is smaller (negative margin).

The loss becomes nearly a step function of the margin: for correctly ordered pairs the gradient is essentially zero (already saturated), and for incorrectly ordered pairs it is huge (it scales with β).

Effectively: a very small move of π_θ away from π_ref in the right direction is enough to saturate the loss, after which learning stops. The policy ends up very close to π_ref.

Intermediate β (~0.1-0.5):

The sigmoid argument is moderate; the loss is well-behaved.

Each update moves the policy toward y_w and AWAY from y_l, but how far it ends up from π_ref is controlled by β.

Smaller β → larger per-update policy movement → more learning, but risk of drifting too far from π_ref.

Larger β → smaller per-update movement → more conservative, stays closer to π_ref.

Analogy to PPO’s KL coefficient:

In PPO: the objective is reward - β · KL(π ‖ π_ref). β controls the strength of “stay near reference.”

In DPO: β multiplies the log-ratio inside σ. It is the same β as in the KL-constrained objective, so it plays the same role: it controls how far the policy can move from π_ref.

The same intuition applies: too low → reward hacking / instability; too high → no learning.

Practical recipe: start at β = 0.1; if the model collapses, increase to 0.3-0.5; if learning plateaus too early, decrease to 0.05.

Next: §18.3 — GRPO and the modern simplifications. DeepSeek’s GRPO (2024) eliminates the value head from PPO; Constitutional AI uses RLAIF (AI feedback instead of human); recent variants (SimPO, IPO, KTO) push further. The trend: simpler is better.