Skip to content

feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator#2

Open
EazyReal wants to merge 1 commit into
mainfrom
upstream-pr/policy-loss-axis
Open

feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator#2
EazyReal wants to merge 1 commit into
mainfrom
upstream-pr/policy-loss-axis

Conversation

@EazyReal

Copy link
Copy Markdown
Owner

What

--advantage-estimator currently decides three independent things: the advantage (credit assignment), the policy-loss surrogate (cispo), and the IS-ratio granularity (gspo). compute_advantages_and_returns (loss.py:716) maps grpo/gspo/cispo to the same get_grpo_returns, which shows gspo/cispo are not advantage choices at all — they are re-read as the surrogate (loss.py:974) and the granularity (loss.py:960). The consequence is that valid combinations (e.g. CISPO bounding on a GAE advantage) cannot be expressed.

This PR makes the three axes independent:

  • --policy-loss {ppo, cispo} (default ppo) — the surrogate / bounding rule.
  • --is-level {token, sequence} (default token) — the IS-ratio granularity; sequence is GSPO.
  • --advantage-estimator now means credit assignment only.

Behaviour preservation

Dispatch-only — compute_policy_loss / compute_cispo_loss / compute_gspo_kl are untouched, so every existing preset reproduces identically. Legacy --advantage-estimator values are remapped for one release with a FutureWarning:

legacy maps to
--advantage-estimator gspo --advantage-estimator grpo --is-level sequence
--advantage-estimator cispo --advantage-estimator grpo --policy-loss cispo

The legacy values are kept in the --advantage-estimator choices (so old commands still parse) and remapped in slime_validate_args; a follow-up can drop them from the choices once the deprecation window closes.

Changes

  • slime/utils/arguments.py — add --policy-loss / --is-level; deprecation shim in slime_validate_args; the CISPO single-sided-clip warning now keys off --policy-loss.
  • slime/backends/megatron_utils/loss.py — surrogate dispatch keys on --policy-loss, granularity (incl. need_full_log_probs) on --is-level; drop the now-unreachable gspo/cispo literals from the advantage branch.
  • slime/backends/megatron_utils/actor.py — the log-prob-reuse optimisation gate keys on --is-level instead of == "gspo".
  • slime/ray/rollout.py — drop the dead gspo/cispo literals from the reward-normalisation groups (legacy values are remapped to grpo before this runs).
  • tests/test_megatron_argument_validation.py — unit tests for the shim (gspo/cispo remap + warning) and for orthogonal pass-through; the validate-args factory gains the new fields.

Testing

tests/test_megatron_argument_validation.py — 19 passed (16 existing + 3 new) under a CPU venv. No GPU required.

Scope / non-goals

  • No change to loss numerics, the Megatron/CP machinery, the reducer, or value/SFT losses.
  • REINFORCE (feat(rl): add REINFORCE advantage estimator THUDM/slime#2083) slots in as a --policy-loss reinforce value once it rebases onto this.
  • Decoupled reference structure, a surrogate registry, and additive regularizer assembly are separate follow-ups.

Implements increment #1 of the loss-axis RFC (#1).

…level) out of --advantage-estimator

--advantage-estimator conflated three orthogonal axes: credit assignment,
the policy-loss surrogate (cispo), and the IS ratio granularity (gspo).
compute_advantages_and_returns maps grpo/gspo/cispo to identical returns,
confirming gspo/cispo were never advantage choices — they only selected the
surrogate and granularity, so valid combinations (e.g. CISPO on a GAE
advantage) were inexpressible.

Add two orthogonal flags:
  --policy-loss {ppo, cispo}    bounding-rule surrogate
  --is-level {token, sequence}  IS ratio granularity (sequence = GSPO)

Dispatch-only: the loss math is unchanged. Legacy --advantage-estimator
gspo/cispo are remapped for one release with a FutureWarning. Unit tests
cover the shim and the orthogonal pass-through.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant