feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator by EazyReal · Pull Request #2 · EazyReal/slime

EazyReal · 2026-06-21T17:58:23Z

What

--advantage-estimator currently decides three independent things: the advantage (credit assignment), the policy-loss surrogate (cispo), and the IS-ratio granularity (gspo). compute_advantages_and_returns (loss.py:716) maps grpo/gspo/cispo to the same get_grpo_returns, which shows gspo/cispo are not advantage choices at all — they are re-read as the surrogate (loss.py:974) and the granularity (loss.py:960). The consequence is that valid combinations (e.g. CISPO bounding on a GAE advantage) cannot be expressed.

This PR makes the three axes independent:

--policy-loss {ppo, cispo} (default ppo) — the surrogate / bounding rule.
--is-level {token, sequence} (default token) — the IS-ratio granularity; sequence is GSPO.
--advantage-estimator now means credit assignment only.

Behaviour preservation

Dispatch-only — compute_policy_loss / compute_cispo_loss / compute_gspo_kl are untouched, so every existing preset reproduces identically. Legacy --advantage-estimator values are remapped for one release with a FutureWarning:

legacy	maps to
`--advantage-estimator gspo`	`--advantage-estimator grpo --is-level sequence`
`--advantage-estimator cispo`	`--advantage-estimator grpo --policy-loss cispo`

The legacy values are kept in the --advantage-estimator choices (so old commands still parse) and remapped in slime_validate_args; a follow-up can drop them from the choices once the deprecation window closes.

Changes

slime/utils/arguments.py — add --policy-loss / --is-level; deprecation shim in slime_validate_args; the CISPO single-sided-clip warning now keys off --policy-loss.
slime/backends/megatron_utils/loss.py — surrogate dispatch keys on --policy-loss, granularity (incl. need_full_log_probs) on --is-level; drop the now-unreachable gspo/cispo literals from the advantage branch.
slime/backends/megatron_utils/actor.py — the log-prob-reuse optimisation gate keys on --is-level instead of == "gspo".
slime/ray/rollout.py — drop the dead gspo/cispo literals from the reward-normalisation groups (legacy values are remapped to grpo before this runs).
tests/test_megatron_argument_validation.py — unit tests for the shim (gspo/cispo remap + warning) and for orthogonal pass-through; the validate-args factory gains the new fields.

Testing

tests/test_megatron_argument_validation.py — 19 passed (16 existing + 3 new) under a CPU venv. No GPU required.

Scope / non-goals

No change to loss numerics, the Megatron/CP machinery, the reducer, or value/SFT losses.
REINFORCE (feat(rl): add REINFORCE advantage estimator THUDM/slime#2083) slots in as a --policy-loss reinforce value once it rebases onto this.
Decoupled reference structure, a surrogate registry, and additive regularizer assembly are separate follow-ups.

Implements increment #1 of the loss-axis RFC (#1).

…level) out of --advantage-estimator --advantage-estimator conflated three orthogonal axes: credit assignment, the policy-loss surrogate (cispo), and the IS ratio granularity (gspo). compute_advantages_and_returns maps grpo/gspo/cispo to identical returns, confirming gspo/cispo were never advantage choices — they only selected the surrogate and granularity, so valid combinations (e.g. CISPO on a GAE advantage) were inexpressible. Add two orthogonal flags: --policy-loss {ppo, cispo} bounding-rule surrogate --is-level {token, sequence} IS ratio granularity (sequence = GSPO) Dispatch-only: the loss math is unchanged. Legacy --advantage-estimator gspo/cispo are remapped for one release with a FutureWarning. Unit tests cover the shim and the orthogonal pass-through.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator#2

feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator#2
EazyReal wants to merge 1 commit into
mainfrom
upstream-pr/policy-loss-axis

EazyReal commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EazyReal commented Jun 21, 2026

What

Behaviour preservation

Changes

Testing

Scope / non-goals

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant