feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator#2
Open
EazyReal wants to merge 1 commit into
Open
feat(loss): split surrogate (--policy-loss) and IS granularity (--is-level) out of --advantage-estimator#2EazyReal wants to merge 1 commit into
EazyReal wants to merge 1 commit into
Conversation
…level) out of --advantage-estimator
--advantage-estimator conflated three orthogonal axes: credit assignment,
the policy-loss surrogate (cispo), and the IS ratio granularity (gspo).
compute_advantages_and_returns maps grpo/gspo/cispo to identical returns,
confirming gspo/cispo were never advantage choices — they only selected the
surrogate and granularity, so valid combinations (e.g. CISPO on a GAE
advantage) were inexpressible.
Add two orthogonal flags:
--policy-loss {ppo, cispo} bounding-rule surrogate
--is-level {token, sequence} IS ratio granularity (sequence = GSPO)
Dispatch-only: the loss math is unchanged. Legacy --advantage-estimator
gspo/cispo are remapped for one release with a FutureWarning. Unit tests
cover the shim and the orthogonal pass-through.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
--advantage-estimatorcurrently decides three independent things: the advantage (credit assignment), the policy-loss surrogate (cispo), and the IS-ratio granularity (gspo).compute_advantages_and_returns(loss.py:716) mapsgrpo/gspo/cispoto the sameget_grpo_returns, which showsgspo/cispoare not advantage choices at all — they are re-read as the surrogate (loss.py:974) and the granularity (loss.py:960). The consequence is that valid combinations (e.g. CISPO bounding on a GAE advantage) cannot be expressed.This PR makes the three axes independent:
--policy-loss {ppo, cispo}(defaultppo) — the surrogate / bounding rule.--is-level {token, sequence}(defaulttoken) — the IS-ratio granularity;sequenceis GSPO.--advantage-estimatornow means credit assignment only.Behaviour preservation
Dispatch-only —
compute_policy_loss/compute_cispo_loss/compute_gspo_klare untouched, so every existing preset reproduces identically. Legacy--advantage-estimatorvalues are remapped for one release with aFutureWarning:--advantage-estimator gspo--advantage-estimator grpo --is-level sequence--advantage-estimator cispo--advantage-estimator grpo --policy-loss cispoThe legacy values are kept in the
--advantage-estimatorchoices (so old commands still parse) and remapped inslime_validate_args; a follow-up can drop them from the choices once the deprecation window closes.Changes
slime/utils/arguments.py— add--policy-loss/--is-level; deprecation shim inslime_validate_args; the CISPO single-sided-clip warning now keys off--policy-loss.slime/backends/megatron_utils/loss.py— surrogate dispatch keys on--policy-loss, granularity (incl.need_full_log_probs) on--is-level; drop the now-unreachablegspo/cispoliterals from the advantage branch.slime/backends/megatron_utils/actor.py— the log-prob-reuse optimisation gate keys on--is-levelinstead of== "gspo".slime/ray/rollout.py— drop the deadgspo/cispoliterals from the reward-normalisation groups (legacy values are remapped togrpobefore this runs).tests/test_megatron_argument_validation.py— unit tests for the shim (gspo/cispo remap + warning) and for orthogonal pass-through; the validate-args factory gains the new fields.Testing
tests/test_megatron_argument_validation.py— 19 passed (16 existing + 3 new) under a CPU venv. No GPU required.Scope / non-goals
--policy-loss reinforcevalue once it rebases onto this.Implements increment #1 of the loss-axis RFC (#1).