feat(rl): add REINFORCE advantage estimator by EazyReal · Pull Request #2083 · THUDM/slime

EazyReal · 2026-06-15T20:44:22Z

What changed

Adds a reinforce option to --advantage-estimator. It reuses the GRPO group-normalized advantages and applies the plain additive surrogate -A * log pi_theta (new compute_reinforce_loss in slime/utils/ppo_utils.py). There is no importance-sampling ratio and no clipping, so gradient flows only through log_probs and clipfrac is identically zero.
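For readers who want the surrogate spelled out, here is a minimal pure-Python sketch of the per-token computation described above. It is not the code in slime/utils/ppo_utils.py: the function name reinforce_loss_sketch, the list-based signature, and the sample values are assumptions made for illustration, and the real compute_reinforce_loss presumably operates on tensors.

```python
from typing import List, Tuple


def reinforce_loss_sketch(
    log_probs: List[float], advantages: List[float]
) -> Tuple[List[float], List[float]]:
    """Per-token REINFORCE surrogate: loss_t = -A_t * log pi_theta(token_t).

    Hypothetical stand-in for illustration only; not the PR's implementation.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    # No importance-sampling ratio and no clipping: just -A * log pi.
    losses = [-adv * lp for lp, adv in zip(log_probs, advantages)]
    # Nothing is ever clipped, so the clip fraction is zero for every token.
    clipfrac = [0.0 for _ in log_probs]
    return losses, clipfrac


# Made-up values for three tokens.
log_probs = [-0.5, -1.2, -0.1]
advantages = [1.0, -0.5, 2.0]
losses, clipfrac = reinforce_loss_sketch(log_probs, advantages)
```

A token with positive advantage gets a positive loss that shrinks as its log-probability rises toward zero, so minimizing the loss pushes probability toward tokens with above-baseline reward.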

Wiring:

compute_advantages_and_returns (slime/backends/megatron_utils/loss.py): reinforce routed through the existing GRPO returns path (get_grpo_returns).
policy_loss_function (slime/backends/megatron_utils/loss.py): dispatches to compute_reinforce_loss for reinforce.
Reward group-normalization (slime/ray/rollout.py): reinforce added to the mean-centering and optional std-normalization sets, identical to GRPO.
--advantage-estimator choices/help (slime/utils/arguments.py).
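The reward group-normalization step in the list above can be sketched as follows. This is an illustrative stand-in, not the code in slime/ray/rollout.py: the function name group_normalize_rewards, the group_size and normalize_std parameters, the eps value, and the use of the population standard deviation are all assumptions of this sketch.

```python
from statistics import fmean, pstdev
from typing import List


def group_normalize_rewards(
    rewards: List[float],
    group_size: int,
    normalize_std: bool = False,
    eps: float = 1e-6,
) -> List[float]:
    """Mean-center rewards within each group of samples for the same prompt.

    Hypothetical helper for illustration; optionally divides by the group's
    standard deviation (population std here, which is an assumption).
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a positive multiple of group_size")
    normalized: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        centered = [r - mean for r in group]
        if normalize_std:
            std = pstdev(group)
            # eps keeps the division finite when all rewards in a group match.
            centered = [c / (std + eps) for c in centered]
        normalized.extend(centered)
    return normalized


# Two prompts with four sampled responses each (made-up rewards).
rewards = [1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0, 2.0]
centered = group_normalize_rewards(rewards, group_size=4)
scaled = group_normalize_rewards(rewards, group_size=4, normalize_std=True)
```

In the second group every response earns the same reward, so all centered values are zero and those samples contribute no gradient under either estimator.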

Why

REINFORCE with a group baseline is a useful low-overhead estimator: same group normalization as GRPO but without the PPO clip/IS machinery. It is the on-policy base that off-policy importance-sampling corrections can layer on top of.
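To make the "on-policy base" remark concrete, the two surrogates can be written side by side. The notation below is the usual PPO notation and is an assumption of this note, not taken from the PR: r_t is the probability ratio against the policy that generated the sample, and epsilon is the clip range.

```latex
\begin{aligned}
L_{\text{REINFORCE}}(\theta) &= -A_t \,\log \pi_\theta(a_t \mid s_t),
& \nabla_\theta L_{\text{REINFORCE}} &= -A_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t), \\
L_{\text{clip}}(\theta) &= -\min\bigl(r_t A_t,\ \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\, A_t\bigr),
& r_t &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}, \\
\nabla_\theta L_{\text{clip}}\big|_{r_t = 1} &= -A_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{aligned}
```

Because the gradient of r_t is r_t times the gradient of log pi_theta, the clipped surrogate has the same gradient as the REINFORCE surrogate when r_t = 1, that is, when the sample is exactly on-policy. The two differ only once the ratio moves away from 1.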

Validation

CPU unit test tests/test_reinforce.py (registered in the cpu-unittest matrix), run with pytest tests/test_reinforce.py:

compute_reinforce_loss matches the closed form -A * log_probs and returns all-zero clipfrac.
Backprop produces d/d log_probs = -A (gradient flows only through log_probs).
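The two checks above can be approximated without any tensor library. The sketch below swaps backprop for central finite differences, so it is not what tests/test_reinforce.py does; the helper names summed_reinforce_loss and finite_difference_grad and the sample values are hypothetical.

```python
from typing import Callable, List


def summed_reinforce_loss(log_probs: List[float], advantages: List[float]) -> float:
    """Sum over tokens of the closed form -A * log_prob."""
    return sum(-adv * lp for lp, adv in zip(log_probs, advantages))


def finite_difference_grad(
    f: Callable[[List[float]], float], x: List[float], h: float = 1e-6
) -> List[float]:
    """Central-difference estimate of the gradient of f at x."""
    grad: List[float] = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


# Made-up values for three tokens.
advantages = [1.0, -0.5, 2.0]
log_probs = [-0.5, -1.2, -0.1]

grad = finite_difference_grad(
    lambda lp: summed_reinforce_loss(lp, advantages), log_probs
)
# The loss is linear in log_probs, so the gradient should equal -A per token.
max_error = max(abs(g + a) for g, a in zip(grad, advantages))
```

Since the loss is linear in log_probs, the finite-difference estimate agrees with -A up to floating-point rounding, which is the same property the backprop assertion verifies.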

End-to-end: with the dispatch wired into compute_advantages_and_returns, --advantage-estimator reinforce runs through the GRPO returns + group-normalization path and no longer raises NotImplementedError.

EazyReal · 2026-06-25T08:45:51Z

@zhuzilin could you review this one? This adds the missing REINFORCE estimator as a narrow advantage path, keeping the estimator interface aligned with the existing GRPO/RLOO style instead of adding a separate training special case.

EazyReal mentioned this pull request Jun 15, 2026

feat(rl): add off-policy IS correction hook (current policy vs rollout) #2084

Open

EazyReal force-pushed the reinforce-estimator branch from 0a6ea75 to 8f1c408 on June 15, 2026 20:47

feat(rl): add REINFORCE advantage estimator

ea4859d

EazyReal force-pushed the reinforce-estimator branch from 8f1c408 to ea4859d on June 24, 2026 03:18

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:reinforce-estimator
