IV — What Makes an LLM → Chapter 18
FROM SYSTEMS TO FRONTIER ML

Alignment: RLHF → DPO → GRPO

Turning a next-token predictor into an assistant. Reward models, preference optimization, the modern simplifications.

§1 SFT + RLHF: the classical alignment pipeline
§2 DPO: the closed-form RLHF
§3 GRPO + RLAIF + the modern simplifications
