01 The policies and the reward, on one axis
The reference $\pi_{\mathrm{ref}}$ is the distribution the base language model would have produced. The reward $r(a)$ scores how strongly humans prefer output $a$: it is high where we want the model to spend its probability mass. The KL-regularized optimum $\pi^\star$ answers the question "how should the policy reshape itself to chase $r$ without straying too far from $\pi_{\mathrm{ref}}$?" It maximizes $\mathbb{E}_{\pi}[r(a)] - \beta\,\mathrm{KL}(\pi\,\Vert\,\pi_{\mathrm{ref}})$ and has the closed form
$$\pi^\star(a) \;\propto\; \pi_{\mathrm{ref}}(a)\,\exp\!\left(r(a)/\beta\right).$$
As you shrink $\beta$, $\pi^\star$ sharpens onto the reward; as you grow $\beta$, it relaxes back to $\pi_{\mathrm{ref}}$.
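To make the role of $\beta$ concrete, here is a minimal sketch that evaluates $\pi^\star$ on a grid. The standard-normal reference and the single reward bump at $a = 2$ are assumptions chosen for illustration, not the page's actual presets, and `optimal_policy` and `mean_of` are hypothetical helper names.

```python
import numpy as np

def optimal_policy(a, pi_ref, r, beta):
    """Discretized pi*(a), proportional to pi_ref(a) * exp(r(a) / beta)."""
    log_w = np.log(pi_ref) + r / beta
    log_w -= log_w.max()              # subtract the max to keep exp() stable
    w = np.exp(log_w)
    da = a[1] - a[0]
    return w / (w.sum() * da)         # normalize so the density integrates to 1

# Assumed toy setup: standard-normal reference, reward bump centered at a = 2.
a = np.linspace(-6.0, 6.0, 2001)
pi_ref = np.exp(-0.5 * a**2) / np.sqrt(2.0 * np.pi)
r = np.exp(-0.5 * (a - 2.0) ** 2)

def mean_of(p):
    """Mean of a density tabulated on the grid `a`."""
    return float((a * p).sum() * (a[1] - a[0]))

mu_small_beta = mean_of(optimal_policy(a, pi_ref, r, beta=0.05))  # hugs the reward peak
mu_large_beta = mean_of(optimal_policy(a, pi_ref, r, beta=50.0))  # stays near pi_ref
```

With a small $\beta$ the mean of $\pi^\star$ moves close to the reward peak; with a large $\beta$ it barely leaves the reference mean of zero.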
02 The loss landscape
| Quantity | Meaning | Value |
|---|---|---|
| $\mathbb{E}_{\pi_\theta}\!\left[r(a)\right]$ | expected reward of $\pi_\theta$ | — |
| $\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)$ | distance from the base model | — |
| $J(\pi_\theta)$ | $\mathbb{E}[r] - \beta\,\mathrm{KL}$, what RLHF maximizes | — |
| $\left\Vert \nabla_{\!\theta}\, J \right\Vert$ | gradient magnitude | — |
| $J(\pi^\star)$ | objective at the closed-form optimum | — |
| $\mu^\star,\ \sigma^\star$ | moments of $\pi^\star$ (numerical) | — |
The red dot is the current $\pi_\theta$. The green star is the moment-matched $\pi^\star$. Press Play under Exact RLHF gradient to roll the policy uphill on this landscape — that's what an idealized, full-batch RLHF run looks like.
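The quantities in the table can be reproduced offline with a few lines of NumPy. The sketch below assumes a Gaussian policy $\pi_\theta = \mathcal{N}(\mu,\sigma^2)$, a Gaussian reference, and a hypothetical single-bump reward; the function names are illustrative. It uses the standard closed-form KL between two Gaussians and the identity $J(\pi^\star) = \beta\log Z$ with $Z = \mathbb{E}_{\pi_{\mathrm{ref}}}[\exp(r/\beta)]$.

```python
import numpy as np

A = np.linspace(-8.0, 8.0, 4001)      # integration grid (an assumption)
DA = A[1] - A[0]

def gauss(a, mu, sigma):
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def reward(a):
    # Hypothetical single-bump reward; the page's presets are not specified here.
    return np.exp(-0.5 * (a - 2.0) ** 2)

def kl_gauss(mu, sigma, mu0, sigma0):
    """Closed-form KL( N(mu, sigma^2) || N(mu0, sigma0^2) )."""
    return np.log(sigma0 / sigma) + (sigma**2 + (mu - mu0) ** 2) / (2.0 * sigma0**2) - 0.5

def objective(mu, sigma, beta, mu0=0.0, sigma0=1.0):
    """J = E_pi[r] - beta * KL(pi || pi_ref), with E[r] by grid quadrature."""
    expected_r = float((gauss(A, mu, sigma) * reward(A)).sum() * DA)
    return expected_r - beta * kl_gauss(mu, sigma, mu0, sigma0)

def objective_at_optimum(beta, mu0=0.0, sigma0=1.0):
    """J(pi*) = beta * log Z, where Z = E_ref[exp(r / beta)]."""
    z = float((gauss(A, mu0, sigma0) * np.exp(reward(A) / beta)).sum() * DA)
    return beta * np.log(z)

beta = 0.5
j_ref = objective(0.0, 1.0, beta)      # the KL term is zero at the reference
j_star = objective_at_optimum(beta)    # upper bound on J over all policies
```

Because $J(\pi) = \beta\log Z - \beta\,\mathrm{KL}(\pi\Vert\pi^\star)$, no Gaussian $\pi_\theta$ can score above `j_star`; the gap is exactly how far the policy still is from $\pi^\star$.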
Five algorithms, one landscape. Click any tab below to swap the bottom half of the page — including the sidebar's method hyperparameters — to that algorithm. Each tab tells the same story (the policy climbs toward $\pi^\star$) but with a different gradient estimator: an idealized analytic gradient, a vanilla sampled gradient, a clipped group-relative variant, a preference-only logistic loss, or a regression on a learned baseline. The last tab, Compare, runs all five from the same starting point and overlays their trajectories on a single map.
03 RLHF: climbing the exact gradient
In this idealized setting both $\mathbb{E}_{\pi_\theta}[r(a)]$ and $\mathrm{KL}(\pi_\theta\Vert\pi_{\mathrm{ref}})$ have closed forms, so the full gradient $\nabla_{\!\theta} J$ is computable. Each step rolls the policy uphill on the objective surface above — no sampling, no clipping. This is the limit that GRPO approximates as $G \to \infty$.
$$\theta \;\leftarrow\; \theta \,+\, \eta\,\nabla_{\!\theta}\!\left[\,\mathbb{E}_{\pi_\theta}[r(a)] \,-\, \beta\,\mathrm{KL}(\pi_\theta\,\Vert\,\pi_{\mathrm{ref}})\,\right]$$
Press Play in the sidebar to roll $\pi_\theta$ uphill on $J$. The blue path is the analytic field $\nabla_{\!\theta} J$ — what an infinite-batch RLHF run would do. Compare it to the GRPO and DPO tabs to see how stochastic estimators bend this idealized trajectory.
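The update rule above can be sketched in a few lines. This version assumes a Gaussian policy parameterized by $\theta = (\mu, \log\sigma)$, a standard-normal reference, and a Gaussian reward bump, for which $\mathbb{E}_{\pi_\theta}[r]$ has a closed form. For brevity the gradient is taken by central finite differences rather than by hand-derived formulas, so it is a stand-in for the analytic gradient, not the page's implementation.

```python
import numpy as np

C, S = 2.0, 1.0          # hypothetical reward bump: r(a) = exp(-(a - C)^2 / (2 S^2))
MU0, SIGMA0 = 0.0, 1.0   # assumed Gaussian reference policy

def J(theta, beta):
    """Closed-form objective for pi_theta = N(mu, sigma^2), theta = (mu, log sigma)."""
    mu, sigma = theta[0], np.exp(theta[1])
    var = S**2 + sigma**2
    expected_r = S / np.sqrt(var) * np.exp(-0.5 * (mu - C) ** 2 / var)
    kl = np.log(SIGMA0 / sigma) + (sigma**2 + (mu - MU0) ** 2) / (2 * SIGMA0**2) - 0.5
    return expected_r - beta * kl

def grad_J(theta, beta, h=1e-5):
    """Central finite differences; stands in for the analytic gradient."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = h
        g[k] = (J(theta + e, beta) - J(theta - e, beta)) / (2 * h)
    return g

def ascend(theta0, beta, eta=0.1, steps=200):
    """Plain gradient ascent; returns the trajectory of (mu, log sigma)."""
    path = [np.asarray(theta0, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] + eta * grad_J(path[-1], beta))
    return np.array(path)

path = ascend([0.0, 0.0], beta=0.2)   # start at the reference policy
```

Since nothing is sampled, re-running `ascend` from the same start always traces the same path, which is what makes it a useful baseline for the stochastic methods.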
03 Policy Gradient: the simplest sampled estimator
Before GRPO, there's just policy gradient (REINFORCE with a mean baseline). At each step, draw a fresh batch of $N$ actions from the current $\pi_\theta$ — no frozen $\pi_{\mathrm{old}}$, so no importance ratio and no clipping. Score them with $r$, subtract the in-batch mean as a control variate, and follow the resulting Monte-Carlo estimate of $\nabla_{\!\theta}J$.
$$\widehat{\nabla_{\!\theta} J} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(r_i \,-\, \bar r\big)\,\nabla_{\!\theta}\log\pi_\theta(a_i) \;-\; \beta\,\nabla_{\!\theta}\mathrm{KL}\!\left(\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right),$$
and then $\theta \leftarrow \theta + \eta\,\widehat{\nabla_{\!\theta} J}$. As $N \to \infty$ this estimator converges to the exact RLHF gradient from the previous tab; for finite $N$ the path is noisier and small-$N$ batches can overshoot. GRPO (next tab) is essentially this estimator, except it pins $\pi_{\mathrm{old}}$ across multiple updates (importance ratios), clips those ratios, and normalizes the advantage by its in-batch standard deviation.
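Here is a minimal sketch of that estimator for a Gaussian policy with $\theta = (\mu, \log\sigma)$. The reward bump and the standard-normal reference are assumptions, and `pg_estimate` is a hypothetical name. The score function of a Gaussian is $\big((a-\mu)/\sigma^2,\ ((a-\mu)/\sigma)^2 - 1\big)$, and the KL gradient is computed exactly, as in the formula above.

```python
import numpy as np

def reward(a):
    return np.exp(-0.5 * (a - 2.0) ** 2)   # hypothetical reward bump

def pg_estimate(mu, log_sigma, beta, n, rng, mu0=0.0, sigma0=1.0):
    """REINFORCE with an in-batch mean baseline, for theta = (mu, log sigma)."""
    sigma = np.exp(log_sigma)
    a = rng.normal(mu, sigma, size=n)          # fresh batch from the current policy
    r = reward(a)
    adv = r - r.mean()                         # mean baseline as control variate
    # Score function: gradient of log pi_theta(a) w.r.t. (mu, log sigma).
    score = np.stack([(a - mu) / sigma**2, ((a - mu) / sigma) ** 2 - 1.0])
    reward_grad = (score * adv).mean(axis=1)
    # Exact gradient of KL(pi_theta || pi_ref) for two Gaussians.
    kl_grad = np.array([(mu - mu0) / sigma0**2, sigma**2 / sigma0**2 - 1.0])
    return reward_grad - beta * kl_grad

rng = np.random.default_rng(0)
g_small = pg_estimate(0.0, 0.0, beta=0.2, n=8, rng=rng)        # noisy
g_large = pg_estimate(0.0, 0.0, beta=0.2, n=200_000, rng=rng)  # near the exact gradient
```

At this starting point the exact gradient is $(e^{-1}/\sqrt{2},\ e^{-1}/(2\sqrt{2}))$, so `g_large` gives a quick check that the estimator targets the right quantity.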
04 PG vs. exact gradient: Monte-Carlo noise around the analytic path
Run both Play buttons. The blue path is the analytic gradient field; the rose path is PG with the current batch size. Shrink $N$ — the rose trace gets visibly jittery; grow $N$ — it converges onto the blue. Compare to GRPO (next tab): GRPO adds three modifications (group normalization, importance ratios, ratio clipping) on top of this same idea.
03 GRPO: one step, in slow motion
Real RL on language models can't compute $\mathbb{E}_{\pi_\theta}[r(a)]$ in closed form. Instead, it samples a group of $G$ completions $a_1,\ldots,a_G \sim \pi_{\mathrm{old}}$, scores each, and uses the in-group mean as the baseline. GRPO then takes a clipped policy-gradient step that maximizes
$$\mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i\,\hat A_i,\ \mathrm{clip}(\rho_i,\,1{-}\varepsilon,\,1{+}\varepsilon)\,\hat A_i\Big) \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right),$$
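where $\rho_i = \pi_\theta(a_i)/\pi_{\mathrm{old}}(a_i)$ is the importance ratio and $\hat A_i = (r_i - \bar r)/\mathrm{std}(r_1,\ldots,r_G)$ is the group-normalized advantage: each reward minus the in-group mean $\bar r$, divided by the in-group standard deviation.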
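The surrogate is easy to evaluate for a single group. The sketch below assumes Gaussian policies given as `(mu, sigma)` tuples, a standard-normal reference, and a hypothetical reward bump; the small `1e-8` added to the standard deviation is a guard against division by zero, not part of the formula.

```python
import numpy as np

def log_gauss(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def grpo_surrogate(theta, theta_old, a, r, eps, beta, ref=(0.0, 1.0)):
    """Clipped surrogate for one group of samples a ~ pi_old with rewards r."""
    mu, sigma = theta
    adv = (r - r.mean()) / (r.std() + 1e-8)            # group-normalized advantage
    rho = np.exp(log_gauss(a, mu, sigma) - log_gauss(a, *theta_old))
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(rho * adv, clipped * adv).mean()
    mu0, sigma0 = ref
    kl = np.log(sigma0 / sigma) + (sigma**2 + (mu - mu0) ** 2) / (2 * sigma0**2) - 0.5
    return surrogate - beta * kl

rng = np.random.default_rng(0)
theta_old = (0.0, 1.0)                                 # frozen sampling policy
a = rng.normal(*theta_old, size=16)                    # one group, G = 16
r = np.exp(-0.5 * (a - 2.0) ** 2)                      # hypothetical reward
at_old = grpo_surrogate(theta_old, theta_old, a, r, eps=0.2, beta=0.1)
moved = grpo_surrogate((0.5, 1.0), theta_old, a, r, eps=0.2, beta=0.1)
```

At $\theta = \theta_{\mathrm{old}}$ every ratio is 1 and the normalized advantages sum to zero, so the surrogate starts at zero. Clipping then caps how much any single update can gain from the same group: the surrogate term can never exceed $\varepsilon$ times the mean absolute advantage.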
04 GRPO vs. exact gradient: same landscape, different paths
Run all three Play buttons. The blue path is the analytic limit; the rose path is plain PG (REINFORCE with the in-batch mean baseline); the red dotted path is GRPO with the current $G$, $\varepsilon$, and group-normalized advantages. PG and GRPO are both Monte-Carlo estimates of the same gradient — GRPO adds importance ratios, clipping, and std-normalization on top.
03 DPO: learning from preferences, without a reward model
DPO (Rafailov et al., 2023) sidesteps the reward model entirely. Given a dataset of preference pairs $(a_w \succ a_l)$, typically labeled by humans, it trains $\pi_\theta$ directly via a logistic loss whose implicit reward is $\beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a)}{\pi_{\mathrm{ref}}(a)}$. In expectation, when the preferences follow a Bradley-Terry model on $r$ and $\beta_{\mathrm{DPO}} = \beta$, the minimizer is the same KL-regularized optimum $\pi^\star$ as RLHF.
$$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(a_w,\,a_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\underbrace{\beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a_w)}{\pi_{\mathrm{ref}}(a_w)} \;-\; \beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a_l)}{\pi_{\mathrm{ref}}(a_l)}}_{h\;=\;\text{implicit margin}}\right)\right]$$
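One way to see why this loss targets $\pi^\star$ is to invert the closed-form optimum. The short derivation below assumes that preferences follow a Bradley-Terry model on $r$ and that $\beta_{\mathrm{DPO}}$ equals the RLHF $\beta$; the symbol $Z$ for the normalizer of $\pi^\star$ is introduced here for convenience.

$$r(a) \;=\; \beta\,\log\dfrac{\pi^\star(a)}{\pi_{\mathrm{ref}}(a)} \;+\; \beta\,\log Z, \qquad Z \;=\; \mathbb{E}_{a\sim\pi_{\mathrm{ref}}}\!\left[\exp\!\left(r(a)/\beta\right)\right],$$

$$P(a_w \succ a_l) \;=\; \sigma\!\big(r(a_w) - r(a_l)\big) \;=\; \sigma\!\left(\beta\,\log\dfrac{\pi^\star(a_w)}{\pi_{\mathrm{ref}}(a_w)} \;-\; \beta\,\log\dfrac{\pi^\star(a_l)}{\pi_{\mathrm{ref}}(a_l)}\right).$$

The $\beta\log Z$ terms cancel in the difference, so the margin $h$ evaluated at $\pi_\theta = \pi^\star$ reproduces the Bradley-Terry log-odds exactly, and the logistic loss is minimized in expectation there.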
Here we synthesize preferences: draw pairs from $\pi_{\mathrm{ref}}$ and label them with a noisy Bradley-Terry oracle on $r(a)$, so we know the ground truth and can compare DPO's trajectory to RLHF's.
The background is still $J(\theta) = \mathbb{E}[r] - \beta\,\mathrm{KL}$, the objective RLHF maximizes. DPO doesn't see $J$; it minimizes the logistic loss above. But at $\pi_\theta = \pi^\star$ the implicit reward $\beta\log\dfrac{\pi_\theta(a)}{\pi_{\mathrm{ref}}(a)}$ equals $r(a)$ up to an additive constant, so the purple path still climbs toward the green star, by a different route that often stays closer to $\pi_{\mathrm{ref}}$.
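The synthetic-preference setup and the DPO loss can be sketched as follows. The assumptions are a standard-normal reference, a hypothetical reward bump, Gaussian policies given as `(mu, sigma)` tuples, and a Bradley-Terry oracle with $P(a_1 \succ a_2) = \sigma(r(a_1) - r(a_2))$; the page's oracle may use a different noise level.

```python
import numpy as np

def log_gauss(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def reward(a):
    return np.exp(-0.5 * (a - 2.0) ** 2)   # hypothetical reward bump

def sample_preferences(n, rng, ref=(0.0, 1.0)):
    """Draw pairs from pi_ref and label them with a Bradley-Terry oracle on r."""
    a1 = rng.normal(*ref, size=n)
    a2 = rng.normal(*ref, size=n)
    p_first = 1.0 / (1.0 + np.exp(-(reward(a1) - reward(a2))))
    first_wins = rng.random(n) < p_first
    a_w = np.where(first_wins, a1, a2)     # preferred sample of each pair
    a_l = np.where(first_wins, a2, a1)     # rejected sample of each pair
    return a_w, a_l

def dpo_loss(theta, a_w, a_l, beta_dpo, ref=(0.0, 1.0)):
    """Mean of -log sigmoid(h), with h the implicit-reward margin."""
    log_ratio_w = log_gauss(a_w, *theta) - log_gauss(a_w, *ref)
    log_ratio_l = log_gauss(a_l, *theta) - log_gauss(a_l, *ref)
    h = beta_dpo * (log_ratio_w - log_ratio_l)
    return float(np.mean(np.logaddexp(0.0, -h)))   # stable -log sigmoid(h)

rng = np.random.default_rng(0)
a_w, a_l = sample_preferences(20_000, rng)
loss_ref = dpo_loss((0.0, 1.0), a_w, a_l, beta_dpo=1.0)      # equals log 2 at pi_ref
loss_shifted = dpo_loss((0.3, 1.0), a_w, a_l, beta_dpo=1.0)  # mean nudged toward the bump
```

At $\pi_\theta = \pi_{\mathrm{ref}}$ every margin is zero, so the loss starts at $\log 2$; nudging the mean toward the reward bump lowers it, even though the loss never evaluates $r$ directly.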
03 DRO: fitting an MSE loss to the optimal policy
Direct Reward Optimisation (Richemond et al., 2024) sidesteps preference pairs and the pairwise loss entirely. Given offline scalar-reward data $(a_i, r_i)$ (here, $a_i \sim \pi_{\mathrm{ref}}$ and $r_i = r(a_i)$), DRO jointly fits the policy $\pi_\theta$ and a scalar baseline $V$ by minimizing
$$\mathcal{L}_{\mathrm{DRO}}(\theta, V) \;=\; \tfrac{1}{2}\,\mathbb{E}_{a\sim\pi_{\mathrm{ref}}}\!\left[\,\Big(\underbrace{r(a) \,-\, V \,-\, \beta_{\mathrm{DRO}}\log\dfrac{\pi_\theta(a)}{\pi_{\mathrm{ref}}(a)}}_{\delta(a)\;=\;\text{residual}}\Big)^{2}\,\right].$$
$V$ is another learned parameter, not a hand-tuned constant: each Step is one gradient-descent update of the same MSE loss, taken with respect to $\theta$ and $V$ separately,
$$\theta \;\leftarrow\; \theta \,+\, \eta_\theta\;\beta_{\mathrm{DRO}}\,\mathbb{E}\!\left[\delta(a)\,\nabla_{\!\theta}\log\pi_\theta(a)\right], \qquad V \;\leftarrow\; V \,+\, \eta_V\,\mathbb{E}\!\left[\delta(a)\right].$$
So $V$ slides up or down with the running mean residual until that residual averages to zero. If the policy class can make the residual vanish pointwise, the minimizer satisfies $\pi_\theta = \pi^\star$ (for $\beta_{\mathrm{DRO}} = \beta$) and $V = \beta_{\mathrm{DRO}}\log Z$, where $Z = \mathbb{E}_{a\sim\pi_{\mathrm{ref}}}\!\left[\exp\!\left(r(a)/\beta_{\mathrm{DRO}}\right)\right]$ is the normalizer of $\pi^\star$. Watch the current value of $V$ in the sidebar. DRO is yet another route to the same green star: a regression loss instead of a logistic one. No pairs, no clipping, no importance ratios.
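The two updates can be sketched as a single step function. This assumes a Gaussian policy with $\theta = (\mu, \log\sigma)$, a standard-normal reference, a fixed offline batch drawn from $\pi_{\mathrm{ref}}$, and a hypothetical reward bump; `dro_step` and the step sizes are illustrative choices.

```python
import numpy as np

def log_gauss(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def dro_step(theta, V, a, r, beta_dro, eta_theta, eta_v, ref=(0.0, 1.0)):
    """One descent step on the DRO loss for theta = (mu, log sigma) and baseline V."""
    mu, sigma = theta[0], np.exp(theta[1])
    # Residual delta(a) = r(a) - V - beta * log(pi_theta(a) / pi_ref(a)).
    delta = r - V - beta_dro * (log_gauss(a, mu, sigma) - log_gauss(a, *ref))
    # Score function of the Gaussian policy w.r.t. (mu, log sigma).
    score = np.stack([(a - mu) / sigma**2, ((a - mu) / sigma) ** 2 - 1.0])
    theta_new = theta + eta_theta * beta_dro * (score * delta).mean(axis=1)
    V_new = V + eta_v * delta.mean()
    loss = 0.5 * float(np.mean(delta**2))      # loss before this update
    return theta_new, V_new, loss

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=2000)            # offline samples from pi_ref
r = np.exp(-0.5 * (a - 2.0) ** 2)              # hypothetical scalar rewards
theta, V = np.array([0.0, 0.0]), 0.0
losses = []
for _ in range(200):
    theta, V, loss = dro_step(theta, V, a, r, beta_dro=0.5, eta_theta=0.1, eta_v=0.1)
    losses.append(loss)
```

Note that a two-parameter Gaussian generally cannot represent $\pi^\star$ exactly, so in this sketch the residual shrinks but does not reach zero; $V$ settles where the mean residual is zero.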
Background is the RLHF objective $J(\theta) = \mathbb{E}[r] - \beta\,\mathrm{KL}$. DRO doesn't see $J$; it descends the squared residual above. Because the unique zero of that residual is the same exponential reweighting that defines $\pi^\star$, the teal path still climbs toward the green star — and it converges fastest when $\beta_{\mathrm{DRO}}$ matches the RLHF $\beta$.
03 Compare: five trajectories, one landscape
Everything we've seen so far, one map. Press Run all in the sidebar to launch each algorithm from the same initial point $(\mu_\theta,\sigma_\theta)$ — the current red dot — with hardcoded default hyperparameters (so the comparison is fair regardless of how you've tweaked the sliders in other tabs). Each algorithm runs for the chosen number of steps; the trajectories are then drawn on top of the shared objective surface.
A few things to look for: (1) the blue exact path is deterministic; everything else is a Monte-Carlo run, so re-running gives different but qualitatively similar traces. (2) PG and GRPO hug the analytic gradient field with sampling noise. (3) DPO and DRO minimize different losses — they don't see $J$ directly — yet their paths still climb toward the same $\pi^\star$, often through different intermediate $(\mu,\sigma)$ regions. (4) Switch the reward preset to Two bumps and re-run: the sampling-based methods are more sensitive to which basin they get pulled into early on.