01 The policies and the reward, on one axis

Action axis · densities and reward landscape

The reference $\pi_{\mathrm{ref}}$ is the distribution over actions that the base language model would produce on its own. The reward $r(a)$ encodes human preference: it is high where we want the model to spend its probability mass. The KL-regularized optimum $\pi^\star$ answers the question "how should the policy reshape itself to chase $r$ without straying too far from $\pi_{\mathrm{ref}}$?" Analytically,

$$\pi^\star(a) \;\propto\; \pi_{\mathrm{ref}}(a)\,\exp\!\left(r(a)/\beta\right).$$

As you shrink $\beta$, $\pi^\star$ sharpens onto the reward; as you grow $\beta$, it relaxes back to $\pi_{\mathrm{ref}}$.
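
The closed form can be checked numerically on a grid. The sketch below is only an illustration: it assumes a standard-normal reference and a hypothetical single-bump reward centered at $a = 2$ (neither is specified above), and it tracks how the mean of $\pi^\star$ moves as $\beta$ changes.

```python
import math

def gaussian_pdf(a, mu, sigma):
    return math.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def reward(a, center=2.0, width=0.7):
    # Hypothetical single-bump reward, peaked at `center` (an assumption of this sketch).
    return math.exp(-0.5 * ((a - center) / width) ** 2)

def optimal_policy(grid, beta, mu_ref=0.0, sigma_ref=1.0):
    """pi_star on `grid`: pi_ref(a) * exp(r(a) / beta), normalized by a Riemann sum."""
    da = grid[1] - grid[0]
    weights = [gaussian_pdf(a, mu_ref, sigma_ref) * math.exp(reward(a) / beta) for a in grid]
    z = sum(weights) * da
    return [w / z for w in weights]

def mean_of(density, grid):
    da = grid[1] - grid[0]
    return sum(a * p for a, p in zip(grid, density)) * da

grid = [-6.0 + 0.01 * i for i in range(1201)]
for beta in (10.0, 1.0, 0.1):
    pi_star = optimal_policy(grid, beta)
    print(f"beta={beta:5.1f}  mean of pi_star = {mean_of(pi_star, grid):.3f}")
```

Under these assumptions the mean stays near the reference mean for large $\beta$ and moves toward the reward bump as $\beta$ shrinks.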

02 The loss landscape

Objective surface · brighter = higher
$J(\mu_\theta,\, \sigma_\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[r(a)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)$
Current state
Quantity · Value
$\mathbb{E}_{\pi_\theta}\!\left[r(a)\right]$: expected reward of $\pi_\theta$
$\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)$: distance from the base model
$J(\pi_\theta)$: $\mathbb{E}[r] - \beta\,\mathrm{KL}$, the quantity RLHF maximizes
$\left\Vert \nabla_{\!\theta}\, J \right\Vert$: gradient magnitude
$J(\pi^\star)$: objective at the closed-form optimum
$\mu^\star,\ \sigma^\star$: moments of $\pi^\star$ (numerical)

The red dot is the current $\pi_\theta$. The green star is the moment-matched $\pi^\star$. Press Play under Exact RLHF gradient to roll the policy uphill on this landscape — that's what an idealized, full-batch RLHF run looks like.
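
For intuition about where the surface comes from, here is a minimal sketch of $J(\mu_\theta, \sigma_\theta)$ in closed form. It assumes a Gaussian policy, a Gaussian reference $\mathcal{N}(0, 1)$, and a hypothetical Gaussian-bump reward $r(a) = \exp(-(a - c)^2 / 2w^2)$; the page's actual reward presets may differ, and the names below are illustrative only.

```python
import math

def expected_reward(mu, sigma, center=2.0, width=0.7):
    # E_{a ~ N(mu, sigma^2)}[exp(-(a - center)^2 / (2 width^2))] in closed form.
    var = width ** 2 + sigma ** 2
    return width / math.sqrt(var) * math.exp(-0.5 * (mu - center) ** 2 / var)

def kl_to_reference(mu, sigma, mu_ref=0.0, sigma_ref=1.0):
    # KL( N(mu, sigma^2) || N(mu_ref, sigma_ref^2) )
    return (math.log(sigma_ref / sigma)
            + (sigma ** 2 + (mu - mu_ref) ** 2) / (2.0 * sigma_ref ** 2) - 0.5)

def objective(mu, sigma, beta=0.5):
    return expected_reward(mu, sigma) - beta * kl_to_reference(mu, sigma)

# Coarse grid search over (mu, sigma) for the brightest point of the surface.
candidates = [(-1.0 + 0.05 * i, 0.2 + 0.05 * j) for i in range(81) for j in range(37)]
best_mu, best_sigma = max(candidates, key=lambda p: objective(*p))
print(f"argmax on grid: mu={best_mu:.2f}, sigma={best_sigma:.2f}, "
      f"J={objective(best_mu, best_sigma):.4f}")
```

The grid maximizer is the best Gaussian policy under these assumptions; it need not coincide with $\pi^\star$ itself, which is generally not Gaussian.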

Five algorithms, one landscape. Click any tab below to swap the bottom half of the page — including the sidebar's method hyperparameters — to that algorithm. Each tab tells the same story (the policy climbs toward $\pi^\star$) but with a different gradient estimator: an idealized analytic gradient, a vanilla sampled gradient, a clipped group-relative variant, a preference-only logistic loss, or a regression on a learned baseline. The last tab, Compare, runs all five from the same starting point and overlays their trajectories on a single map.

03 RLHF: climbing the exact gradient

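In this idealized setting both $\mathbb{E}_{\pi_\theta}[r(a)]$ and $\mathrm{KL}(\pi_\theta\Vert\pi_{\mathrm{ref}})$ have closed forms, so the full gradient $\nabla_{\!\theta} J$ can be computed exactly. Each step rolls the policy uphill on the objective surface above, with no sampling and no clipping. This is the limit that the sampled estimators approach as their batch size grows: plain PG converges to it as $N \to \infty$, and GRPO approximates it as $G \to \infty$, up to the effects of clipping and std-normalization.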

$$\theta \;\leftarrow\; \theta \,+\, \eta\,\nabla_{\!\theta}\!\left[\,\mathbb{E}_{\pi_\theta}[r(a)] \,-\, \beta\,\mathrm{KL}(\pi_\theta\,\Vert\,\pi_{\mathrm{ref}})\,\right]$$

Exact-gradient trajectory · blue = $\nabla_{\!\theta} J$ path

Press Play in the sidebar to roll $\pi_\theta$ uphill on $J$. The blue path follows the analytic gradient field $\nabla_{\!\theta} J$, which is what an infinite-batch RLHF run would do. Compare it to the GRPO and DPO tabs to see how stochastic estimators bend this idealized trajectory.
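
The update rule above can be sketched in a few lines once a concrete reward is fixed. The example assumes a Gaussian policy $(\mu, \sigma)$, a Gaussian reference, and a hypothetical Gaussian-bump reward, so that both terms of $J$ and their derivatives have closed forms. The constants and step size are illustrative choices, not the page's defaults.

```python
import math

CENTER, WIDTH = 2.0, 0.7        # hypothetical reward r(a) = exp(-(a - CENTER)^2 / (2 WIDTH^2))
MU_REF, SIGMA_REF = 0.0, 1.0    # assumed Gaussian reference policy

def objective(mu, sigma, beta):
    var = WIDTH ** 2 + sigma ** 2
    exp_r = WIDTH / math.sqrt(var) * math.exp(-0.5 * (mu - CENTER) ** 2 / var)
    kl = (math.log(SIGMA_REF / sigma)
          + (sigma ** 2 + (mu - MU_REF) ** 2) / (2.0 * SIGMA_REF ** 2) - 0.5)
    return exp_r - beta * kl

def gradient(mu, sigma, beta):
    """Analytic (dJ/dmu, dJ/dsigma) for the closed-form objective above."""
    var = WIDTH ** 2 + sigma ** 2
    d = mu - CENTER
    exp_r = WIDTH / math.sqrt(var) * math.exp(-0.5 * d ** 2 / var)
    d_mu = -exp_r * d / var - beta * (mu - MU_REF) / SIGMA_REF ** 2
    d_sigma = (exp_r * sigma * (d ** 2 / var ** 2 - 1.0 / var)
               - beta * (sigma / SIGMA_REF ** 2 - 1.0 / sigma))
    return d_mu, d_sigma

def ascend(mu, sigma, beta=0.5, lr=0.1, steps=500):
    path = [(mu, sigma)]
    for _ in range(steps):
        g_mu, g_sigma = gradient(mu, sigma, beta)
        mu += lr * g_mu
        sigma = max(sigma + lr * g_sigma, 1e-3)  # keep the standard deviation positive
        path.append((mu, sigma))
    return path

path = ascend(mu=-0.5, sigma=1.5)
print("start:", path[0], "end:", path[-1])
```

Because nothing is sampled, re-running this loop always traces the same path, which is the property the blue trajectory illustrates.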

03 Policy Gradient: the simplest sampled estimator

Before GRPO, there's just policy gradient (REINFORCE with a mean baseline). At each step, draw a fresh batch of $N$ actions from the current $\pi_\theta$ — no frozen $\pi_{\mathrm{old}}$, so no importance ratio and no clipping. Score them with $r$, subtract the in-batch mean as a control variate, and follow the resulting Monte-Carlo estimate of $\nabla_{\!\theta}J$.

$$\widehat{\nabla_{\!\theta} J} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(r_i \,-\, \bar r\big)\,\nabla_{\!\theta}\log\pi_\theta(a_i) \;-\; \beta\,\nabla_{\!\theta}\mathrm{KL}\!\left(\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right),$$

and then $\theta \leftarrow \theta + \eta\,\widehat{\nabla_{\!\theta} J}$. As $N \to \infty$ this estimator converges to the exact RLHF gradient from the previous tab; for finite $N$ the path is noisier and small-$N$ batches can overshoot. GRPO (next tab) is essentially this estimator, except it pins $\pi_{\mathrm{old}}$ across multiple updates (importance ratios), clips those ratios, and normalizes the advantage by its in-batch standard deviation.

A fresh batch of $N$ samples
$a_1,\,\ldots,\,a_N \,\sim\, \pi_\theta$  ·  marker shade $\propto r(a_i)$  ·  dashed = current $\mu_\theta$
Per-sample advantage
$\hat A_i \;=\; r_i \,-\, \bar r$,   with $\bar r = \tfrac{1}{N}\sum_i r_i$
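
As a rough illustration of this estimator, the sketch below implements the mean-baseline REINFORCE gradient for a Gaussian policy, using the score functions $\partial_\mu \log\pi_\theta = (a-\mu)/\sigma^2$ and $\partial_\sigma \log\pi_\theta = (a-\mu)^2/\sigma^3 - 1/\sigma$. The reward shape, the reference $\mathcal{N}(0,1)$, the batch size, and the learning rate are assumptions made for the example; the KL gradient is taken in closed form, as in the formula above.

```python
import math
import random

def reward(a, center=2.0, width=0.7):
    # Hypothetical single-bump reward (an assumption of this sketch).
    return math.exp(-0.5 * ((a - center) / width) ** 2)

def pg_gradient(mu, sigma, n, beta, rng, mu_ref=0.0, sigma_ref=1.0):
    """Monte-Carlo estimate of (dJ/dmu, dJ/dsigma) with an in-batch mean baseline."""
    actions = [rng.gauss(mu, sigma) for _ in range(n)]   # fresh batch from the current policy
    rewards = [reward(a) for a in actions]
    r_bar = sum(rewards) / n
    g_mu = g_sigma = 0.0
    for a, r in zip(actions, rewards):
        advantage = r - r_bar
        g_mu += advantage * (a - mu) / sigma ** 2
        g_sigma += advantage * ((a - mu) ** 2 / sigma ** 3 - 1.0 / sigma)
    g_mu, g_sigma = g_mu / n, g_sigma / n
    # Exact gradient of the KL penalty between two Gaussians.
    g_mu -= beta * (mu - mu_ref) / sigma_ref ** 2
    g_sigma -= beta * (sigma / sigma_ref ** 2 - 1.0 / sigma)
    return g_mu, g_sigma

rng = random.Random(0)
mu, sigma = 0.0, 1.0
for _ in range(200):
    g_mu, g_sigma = pg_gradient(mu, sigma, n=64, beta=0.5, rng=rng)
    mu += 0.1 * g_mu
    sigma = max(sigma + 0.1 * g_sigma, 1e-3)  # keep the standard deviation positive
print(f"after 200 PG steps: mu={mu:.3f}, sigma={sigma:.3f}")
```

Changing the seed or shrinking `n` changes the trace from run to run, which is the Monte-Carlo noise described above.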

04 PG vs. exact gradient: Monte-Carlo noise around the analytic path

Trajectories on the objective surface · rose = PG path · blue = exact $\nabla_{\!\theta} J$

Run both Play buttons. The blue path follows the analytic gradient field; the rose path is PG with the current batch size. Shrink $N$ and the rose trace gets visibly jittery; grow $N$ and it converges onto the blue. Compare to GRPO (next tab): GRPO adds three modifications (group normalization, importance ratios, ratio clipping) on top of this same idea.

03 GRPO: one step, in slow motion

Real RL on language models can't compute $\mathbb{E}_{\pi_\theta}[r(a)]$ in closed form. Instead, it samples a group of $G$ completions $a_1,\ldots,a_G \sim \pi_{\mathrm{old}}$, scores each, and uses the in-group mean as the baseline. GRPO then takes a clipped policy-gradient step that maximizes

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i\,\hat A_i,\ \mathrm{clip}(\rho_i,\,1{-}\varepsilon,\,1{+}\varepsilon)\,\hat A_i\Big) \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right),$$

where $\rho_i = \pi_\theta(a_i)/\pi_{\mathrm{old}}(a_i)$ is the importance ratio and $\hat A_i$ is the group-normalized advantage.

A group of $G$ samples
$a_1,\,\ldots,\,a_G \,\sim\, \pi_{\mathrm{old}}$  ·  marker shade $\propto r(a_i)$  ·  dashed = current $\mu_\theta$
Per-sample advantage
$\hat A_i \;=\; \dfrac{r_i \,-\, \bar r}{\mathrm{std}(r)}$,   with $\bar r = \tfrac{1}{G}\sum_i r_i$
Importance ratio
$\rho_i \;=\; \dfrac{\pi_\theta(a_i)}{\pi_{\mathrm{old}}(a_i)}$  ·  dashed band $= [\,1{-}\varepsilon,\, 1{+}\varepsilon\,]$  ·  grey = clipped (zero gradient)
Per-sample surrogate contribution
$\min\!\big(\rho_i\,\hat A_i,\ \mathrm{clip}(\rho_i,\,1{-}\varepsilon,\,1{+}\varepsilon)\,\hat A_i\big)$
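
The four panels above map onto a few lines of code. This is a minimal sketch for a Gaussian policy, not the page's implementation: the population standard deviation, the small constant added to it, and the parameter tuples `theta`, `theta_old`, `theta_ref` are assumptions chosen for the example.

```python
import math
import statistics

def log_gaussian(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def clipped_term(rho, advantage, eps):
    # min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)
    clipped_rho = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * advantage, clipped_rho * advantage)

def grpo_surrogate(actions, rewards, theta, theta_old, theta_ref, beta=0.5, eps=0.2):
    """Clipped surrogate minus the KL penalty; each theta is a (mu, sigma) pair."""
    (mu, sigma), (mu_old, sigma_old), (mu_ref, sigma_ref) = theta, theta_old, theta_ref
    r_bar = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) + 1e-8   # assumed: population std, guarded against zero
    advantages = [(r - r_bar) / std for r in rewards]
    ratios = [math.exp(log_gaussian(a, mu, sigma) - log_gaussian(a, mu_old, sigma_old))
              for a in actions]
    clipped = statistics.fmean(
        [clipped_term(rho, adv, eps) for rho, adv in zip(ratios, advantages)])
    kl = (math.log(sigma_ref / sigma)
          + (sigma ** 2 + (mu - mu_ref) ** 2) / (2.0 * sigma_ref ** 2) - 0.5)
    return clipped - beta * kl

# A tiny hand-made group: actions drawn earlier from pi_old, with their rewards.
actions = [-0.4, 0.3, 1.1, 1.9]
rewards = [0.01, 0.05, 0.44, 0.99]
print(grpo_surrogate(actions, rewards, theta=(0.2, 1.0), theta_old=(0.0, 1.0), theta_ref=(0.0, 1.0)))
```

When $\rho_i$ leaves the band in the direction that would increase the surrogate, `clipped_term` no longer depends on $\rho_i$, which is why those samples contribute zero gradient.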

04 GRPO vs. exact gradient: same landscape, different paths

Trajectories on the objective surface · red dotted = GRPO · rose dashed = PG · blue = exact $\nabla_{\!\theta} J$

Run all three Play buttons. The blue path is the analytic limit; the rose path is plain PG (REINFORCE with the in-batch mean baseline); the red dotted path is GRPO with the current $G$, $\varepsilon$, and group-normalized advantages. PG and GRPO are both Monte-Carlo estimates of the same gradient — GRPO adds importance ratios, clipping, and std-normalization on top.

03 DPO: learning from preferences, without a reward model

DPO (Rafailov et al., 2023) sidesteps the reward model entirely. Given a dataset of preference pairs $(a_w \succ a_l)$, typically labeled by humans, it trains $\pi_\theta$ directly via a logistic loss whose implicit reward is $\beta_{\mathrm{DPO}}\,\log\big(\pi_\theta(a)/\pi_{\mathrm{ref}}(a)\big)$. When the preferences follow a Bradley-Terry model on $r$ and the temperatures are matched, the minimizer of the expected loss is the same KL-regularized optimum $\pi^\star$ as RLHF.

$$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(a_w,\,a_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\underbrace{\beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a_w)}{\pi_{\mathrm{ref}}(a_w)} \;-\; \beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a_l)}{\pi_{\mathrm{ref}}(a_l)}}_{h\;=\;\text{implicit margin}}\right)\right]$$

Here we synthesize preferences: draw pairs from $\pi_{\mathrm{ref}}$ and label them with a noisy Bradley-Terry oracle on $r(a)$, so we know the ground truth and can compare DPO's trajectory to RLHF's.
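
A minimal sketch of this synthetic setup, assuming a Gaussian policy, a reference $\mathcal{N}(0,1)$, a hypothetical single-bump reward, and an oracle of the form $P(a \succ b) = \sigma\big((r(a)-r(b))/\nu\big)$. The function names and default values are illustrative, not taken from the page's source.

```python
import math
import random

def reward(a, center=2.0, width=0.7):
    # Hypothetical single-bump reward (an assumption of this sketch).
    return math.exp(-0.5 * ((a - center) / width) ** 2)

def log_gaussian(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sample_pairs(n, nu, rng, mu_ref=0.0, sigma_ref=1.0):
    """Draw pairs from pi_ref and order each as (winner, loser) with a noisy oracle."""
    pairs = []
    for _ in range(n):
        a, b = rng.gauss(mu_ref, sigma_ref), rng.gauss(mu_ref, sigma_ref)
        a_wins = rng.random() < sigmoid((reward(a) - reward(b)) / nu)
        pairs.append((a, b) if a_wins else (b, a))
    return pairs

def dpo_loss(pairs, mu, sigma, beta_dpo=1.0, mu_ref=0.0, sigma_ref=1.0):
    def implicit_reward(a):
        return beta_dpo * (log_gaussian(a, mu, sigma) - log_gaussian(a, mu_ref, sigma_ref))
    total = 0.0
    for a_w, a_l in pairs:
        h = implicit_reward(a_w) - implicit_reward(a_l)          # implicit margin
        total += max(-h, 0.0) + math.log1p(math.exp(-abs(h)))    # -log(sigmoid(h)), computed stably
    return total / len(pairs)

rng = random.Random(0)
pairs = sample_pairs(4000, nu=0.1, rng=rng)
print("loss at pi_ref:", dpo_loss(pairs, 0.0, 1.0))
print("loss after nudging mu toward the reward:", dpo_loss(pairs, 0.2, 1.0))
```

At $\pi_\theta = \pi_{\mathrm{ref}}$ every margin is zero, so the loss equals $\log 2$; moving the policy toward the preferred actions lowers it.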

DPO trajectory on the RLHF landscape · purple = DPO · red dotted = GRPO · blue = exact $\nabla_{\!\theta} J$

Background is still $J(\theta) = \mathbb{E}[r] - \beta\,\mathrm{KL}$, the quantity RLHF maximizes. DPO doesn't see $J$; it minimizes the logistic loss above. But because the implicit reward $\beta_{\mathrm{DPO}}\log\big(\pi_\theta/\pi_{\mathrm{ref}}\big)$ is, up to a constant, the unique reward consistent with $\pi^\star$, the purple path still climbs toward the green star, though by a different route that often stays closer to $\pi_{\mathrm{ref}}$.

Preference pairs · color = reward margin $r(a_w)-r(a_l)$
$(a_w, a_l) \,\sim\, \pi_{\mathrm{ref}}$,  labeled by  $P(a \succ b) = \sigma\!\left((r(a)-r(b))/\nu\right)$
Implicit margin per pair · positive = pair ordered correctly by $\pi_\theta$
$h_i \;=\; \beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a_w^i)}{\pi_{\mathrm{ref}}(a_w^i)} \,-\, \beta_{\mathrm{DPO}}\,\log\dfrac{\pi_\theta(a_l^i)}{\pi_{\mathrm{ref}}(a_l^i)}$

03 DRO: fitting an MSE loss to the optimal policy

Direct Reward Optimisation (Richemond et al., 2024) sidesteps preference pairs and pairwise loss entirely. Given offline scalar-reward data $(a_i, r_i)$ — here, $a_i \sim \pi_{\mathrm{ref}}$ and $r_i = r(a_i)$ — DRO jointly fits the policy $\pi_\theta$ and a scalar baseline $V$ by minimizing

$$\mathcal{L}_{\mathrm{DRO}}(\theta, V) \;=\; \tfrac{1}{2}\,\mathbb{E}_{a\sim\pi_{\mathrm{ref}}}\!\left[\,\Big(\underbrace{r(a) \,-\, V \,-\, \beta_{\mathrm{DRO}}\log\dfrac{\pi_\theta(a)}{\pi_{\mathrm{ref}}(a)}}_{\delta(a)\;=\;\text{residual}}\Big)^{2}\,\right].$$

$V$ is another learned parameter, not a hand-tuned constant: each Step takes a gradient-descent step on the same MSE loss with respect to both $\theta$ and $V$,

$$\theta \;\leftarrow\; \theta \,+\, \eta_\theta\;\beta_{\mathrm{DRO}}\,\mathbb{E}\!\left[\delta(a)\,\nabla_{\!\theta}\log\pi_\theta(a)\right], \qquad V \;\leftarrow\; V \,+\, \eta_V\,\mathbb{E}\!\left[\delta(a)\right].$$

So $V$ slides up or down with the running mean residual until that residual averages to zero. At the minimizer the residual vanishes pointwise, which forces $\pi_\theta = \pi^\star$ and $V = \beta_{\mathrm{DRO}}\log Z$ — the log-partition of $\pi^\star$. Watch its current value in the sidebar. DRO is yet another route to the same green star: a regression loss instead of a logistic one. No pairs, no clipping, no importance ratios.
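
The step from a vanishing residual to $\pi^\star$ can be spelled out. Setting $\delta(a) = 0$ for every $a$ and solving for $\pi_\theta$ gives

$$\pi_\theta(a) \;=\; \pi_{\mathrm{ref}}(a)\,\exp\!\left(\frac{r(a) - V}{\beta_{\mathrm{DRO}}}\right).$$

Requiring $\pi_\theta$ to integrate to one then fixes $V$:

$$1 \;=\; e^{-V/\beta_{\mathrm{DRO}}}\,\underbrace{\mathbb{E}_{a\sim\pi_{\mathrm{ref}}}\!\left[\exp\!\left(r(a)/\beta_{\mathrm{DRO}}\right)\right]}_{Z} \quad\Longrightarrow\quad V \;=\; \beta_{\mathrm{DRO}}\log Z.$$

Substituting back yields $\pi_\theta(a) = \pi_{\mathrm{ref}}(a)\exp\!\left(r(a)/\beta_{\mathrm{DRO}}\right)/Z$, which is the expression for $\pi^\star$ from section 01 when $\beta_{\mathrm{DRO}} = \beta$.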

DRO trajectory on the RLHF landscape · teal = DRO · purple = DPO · blue = exact $\nabla_{\!\theta} J$

Background is the RLHF objective $J(\theta) = \mathbb{E}[r] - \beta\,\mathrm{KL}$. DRO doesn't see $J$; it descends the squared residual above. Because the unique zero of that residual is the same exponential reweighting that defines $\pi^\star$, the teal path still climbs toward the green star. The two targets coincide exactly when $\beta_{\mathrm{DRO}}$ equals the RLHF $\beta$; otherwise DRO aims at the normalized $\pi_{\mathrm{ref}}\exp(r/\beta_{\mathrm{DRO}})$ instead.

Samples $(a_i, r_i)$ · marker color = signed residual $\delta_i$
$a_i \,\sim\, \pi_{\mathrm{ref}}$,  $r_i = r(a_i)$
Per-sample residual · bars near zero ⇒ regression target is met
$\delta_i \;=\; r_i \,-\, V \,-\, \beta_{\mathrm{DRO}}\log\dfrac{\pi_\theta(a_i)}{\pi_{\mathrm{ref}}(a_i)}$
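
The joint update on $\theta$ and $V$ can be sketched as follows for a Gaussian policy. The reward shape, the reference $\mathcal{N}(0,1)$, the sample count, and the learning rates are assumptions made for the example rather than the page's settings.

```python
import math
import random

def reward(a, center=2.0, width=0.7):
    # Hypothetical single-bump reward (an assumption of this sketch).
    return math.exp(-0.5 * ((a - center) / width) ** 2)

def log_gaussian(a, mu, sigma):
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def residuals(samples, mu, sigma, v, beta_dro, mu_ref=0.0, sigma_ref=1.0):
    return [r - v - beta_dro * (log_gaussian(a, mu, sigma) - log_gaussian(a, mu_ref, sigma_ref))
            for a, r in samples]

def mse(samples, mu, sigma, v, beta_dro=1.0):
    deltas = residuals(samples, mu, sigma, v, beta_dro)
    return 0.5 * sum(d * d for d in deltas) / len(deltas)

def dro_step(samples, mu, sigma, v, beta_dro=1.0, lr_theta=0.02, lr_v=0.1):
    """One gradient-descent step on the MSE loss for (mu, sigma) and V."""
    deltas = residuals(samples, mu, sigma, v, beta_dro)
    n = len(samples)
    g_mu = sum(d * (a - mu) / sigma ** 2 for d, (a, _) in zip(deltas, samples)) / n
    g_sigma = sum(d * ((a - mu) ** 2 / sigma ** 3 - 1.0 / sigma)
                  for d, (a, _) in zip(deltas, samples)) / n
    new_mu = mu + lr_theta * beta_dro * g_mu
    new_sigma = max(sigma + lr_theta * beta_dro * g_sigma, 1e-3)  # keep sigma positive
    new_v = v + lr_v * sum(deltas) / n
    return new_mu, new_sigma, new_v

rng = random.Random(0)
samples = [(a, reward(a)) for a in (rng.gauss(0.0, 1.0) for _ in range(2000))]
mu, sigma, v = 0.5, 0.8, 0.0
start = mse(samples, mu, sigma, v)
for _ in range(300):
    mu, sigma, v = dro_step(samples, mu, sigma, v)
print(f"MSE {start:.4f} -> {mse(samples, mu, sigma, v):.4f}  (mu={mu:.3f}, sigma={sigma:.3f}, V={v:.3f})")
```

A Gaussian $\pi_\theta$ generally cannot drive every residual to zero, so in this sketch the loss settles at a positive value rather than vanishing.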

03 Compare: five trajectories, one landscape

Everything we've seen so far, one map. Press Run all in the sidebar to launch each algorithm from the same initial point $(\mu_\theta,\sigma_\theta)$ — the current red dot — with hardcoded default hyperparameters (so the comparison is fair regardless of how you've tweaked the sliders in other tabs). Each algorithm runs for the chosen number of steps; the trajectories are then drawn on top of the shared objective surface.

A few things to look for: (1) the blue exact path is deterministic; everything else is a Monte-Carlo run, so re-running gives different but qualitatively similar traces. (2) PG and GRPO hug the analytic gradient field with sampling noise. (3) DPO and DRO minimize different losses — they don't see $J$ directly — yet their paths still climb toward the same $\pi^\star$, often through different intermediate $(\mu,\sigma)$ regions. (4) Switch the reward preset to Two bumps and re-run: the sampling-based methods are more sensitive to which basin they get pulled into early on.

All five trajectories on the objective surface · blue = exact $\nabla_{\!\theta} J$ · rose = PG · red = GRPO · purple = DPO · teal = DRO
A pedagogical playground for the geometry of RLHF, PG, GRPO, DPO, and DRO — built with Claude. Open the source to tinker with the math.