feat(rl): add REINFORCE advantage estimator #2083

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:reinforce-estimator

@EazyReal EazyReal commented Jun 15, 2026

Contributor

What changed

Adds a reinforce option to --advantage-estimator. It reuses the GRPO group-normalized advantages and applies the plain additive surrogate -A * log pi_theta (new compute_reinforce_loss in slime/utils/ppo_utils.py). There is no importance-sampling ratio and no clipping, so gradient flows only through log_probs and clipfrac is identically zero.
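For readers who want the surrogate spelled out, here is a minimal framework-free sketch. It is not the code from slime/utils/ppo_utils.py: the function name reinforce_loss_sketch, the list-based inputs, and the return shape are assumptions made for illustration, and the real compute_reinforce_loss presumably operates on tensors.

```python
from typing import List, Tuple


def reinforce_loss_sketch(
    log_probs: List[float], advantages: List[float]
) -> Tuple[List[float], List[float]]:
    """Per-token REINFORCE surrogate: loss_t = -A_t * log pi_theta(a_t).

    Hypothetical stand-in for compute_reinforce_loss: no ratio, no clipping.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    # The advantage is treated as a constant weight on each log-probability.
    losses = [-a * lp for lp, a in zip(log_probs, advantages)]
    # Nothing is ever clipped, so the clip fraction is zero for every token.
    clipfrac = [0.0] * len(log_probs)
    return losses, clipfrac


losses, clipfrac = reinforce_loss_sketch([-0.5, -2.0, -1.0], [1.0, -0.5, 0.0])
print(losses)    # [0.5, -1.0, 0.0]
print(clipfrac)  # [0.0, 0.0, 0.0]
```

Because the loss is linear in log_probs, a token with zero advantage contributes nothing, and a negative advantage pushes that token's log-probability down.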

Wiring:

  • compute_advantages_and_returns (slime/backends/megatron_utils/loss.py): reinforce routed through the existing GRPO returns path (get_grpo_returns).
  • policy_loss_function (slime/backends/megatron_utils/loss.py): dispatches to compute_reinforce_loss for reinforce.
  • Reward group-normalization (slime/ray/rollout.py): reinforce added to the mean-centering and optional std-normalization sets, identical to GRPO.
  • --advantage-estimator choices/help (slime/utils/arguments.py).
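The group normalization that reinforce shares with GRPO can be sketched as follows. This is an illustration only, not the code in slime/ray/rollout.py: the function name, the flat reward layout with contiguous per-prompt groups, the population standard deviation, and the eps value are all assumptions.

```python
from statistics import fmean, pstdev
from typing import List


def group_normalize_sketch(
    rewards: List[float], group_size: int, use_std: bool = False, eps: float = 1e-6
) -> List[float]:
    """Center rewards within each prompt group; optionally divide by the group std."""
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("rewards must split evenly into groups of group_size")
    out: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        centered = [r - mean for r in group]
        if use_std:
            # Population std and eps are assumptions; the real code may differ.
            scale = pstdev(group) + eps
            centered = [c / scale for c in centered]
        out.extend(centered)
    return out


# Two prompts with two sampled responses each.
print(group_normalize_sketch([1.0, 0.0, 3.0, 1.0], group_size=2))
# [0.5, -0.5, 1.0, -1.0]
```

Mean-centering alone gives the group baseline; the optional std step only rescales advantages within a group and does not change their sign.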

Why

REINFORCE with a group baseline is a useful low-overhead estimator: same group normalization as GRPO but without the PPO clip/IS machinery. It is the on-policy base that off-policy importance-sampling corrections can layer on top of.

Validation

CPU unit test tests/test_reinforce.py (registered in the cpu-unittest matrix), run with pytest tests/test_reinforce.py:

  • compute_reinforce_loss matches the closed form -A * log_probs and returns all-zero clipfrac.
  • Backprop produces d(loss)/d(log_probs) = -A, confirming that gradient flows only through log_probs.
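The gradient property checked above can be reproduced without any autograd framework. The sketch below uses central finite differences in place of backprop; it is not the contents of tests/test_reinforce.py, and the helper names total_loss and finite_diff_grad are hypothetical.

```python
from typing import Callable, List


def total_loss(log_probs: List[float], advantages: List[float]) -> float:
    """Sum of the per-token surrogate -A_t * log_prob_t."""
    return sum(-a * lp for lp, a in zip(log_probs, advantages))


def finite_diff_grad(
    f: Callable[[List[float]], float], x: List[float], h: float = 1e-6
) -> List[float]:
    """Central-difference estimate of df/dx_i for each coordinate."""
    grads = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grads.append((f(up) - f(down)) / (2 * h))
    return grads


advantages = [1.0, -0.5, 0.25]
log_probs = [-0.5, -2.0, -1.0]
grads = finite_diff_grad(lambda lp: total_loss(lp, advantages), log_probs)

# The surrogate is linear in log_probs, so each partial derivative is -A_t.
for g, a in zip(grads, advantages):
    assert abs(g - (-a)) < 1e-6
```

Since the advantages enter only as constant weights, the estimated gradient does not depend on the values chosen for log_probs.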

End-to-end: with the dispatch wired into compute_advantages_and_returns, --advantage-estimator reinforce runs through the GRPO returns + group-normalization path and no longer raises NotImplementedError.

@EazyReal EazyReal force-pushed the reinforce-estimator branch from 8f1c408 to ea4859d Compare June 24, 2026 03:18
@EazyReal

Contributor Author

@zhuzilin could you review this one? This adds the missing REINFORCE estimator as a narrow advantage path, keeping the estimator interface aligned with the existing GRPO/RLOO style instead of adding a separate training special case.
