[feat:] Add CISPO loss #2531
Open
pengdurice wants to merge 12 commits into
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
…ove / check later Signed-off-by: pengdurice <pengduhit@gmail.com>
Signed-off-by: pengdurice <pengduhit@gmail.com>
What does this PR do ?
Continues the work started in the existing CISPO PR #2187.
Adds CISPO support to the GRPO loss path and provides matched GRPO/CISPO Qwen3-30B-A3B high-off-policy recipes to validate the clipped importance-sampling objective under repeated updates per rollout.
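To make the difference between the two objectives concrete, here is a minimal per-token sketch in plain Python. It is not the code added by this PR. It only illustrates the commonly stated forms: the GRPO/PPO-style surrogate `min(r*A, clip(r)*A)`, whose gradient vanishes once the clipped branch is active, and the CISPO objective `stop_grad(clip(r)) * A * logp_new`, which keeps a bounded gradient on every token. The function names and the default `eps_low`/`eps_high` values are hypothetical and are not taken from the recipes.

```python
import math


def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def grpo_grad_weight(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Gradient of min(r*A, clip(r)*A) with respect to logp_new for one token.

    r = exp(logp_new - logp_old). When the clipped branch is the active one,
    the surrogate is constant in logp_new and the token contributes no gradient.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_branch_active = (
        (advantage > 0 and ratio > 1 + eps_high)
        or (advantage < 0 and ratio < 1 - eps_low)
    )
    return 0.0 if clipped_branch_active else ratio * advantage


def cispo_grad_weight(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Gradient of stop_grad(clip(r)) * A * logp_new with respect to logp_new.

    The clipped ratio is treated as a constant weight, so every token keeps a
    gradient whose magnitude is bounded by the clip range.
    """
    ratio = math.exp(logp_new - logp_old)
    return clip(ratio, 1 - eps_low, 1 + eps_high) * advantage


# Token whose ratio has drifted above 1 + eps_high with a positive advantage:
# the GRPO-style surrogate drops it, CISPO keeps it with a capped weight.
drifted = dict(logp_new=-1.0, logp_old=-1.5, advantage=1.0)
print(grpo_grad_weight(**drifted), cispo_grad_weight(**drifted))
```

With many updates per rollout, more tokens drift outside the clip range, which is the regime the high-off-policy recipes are meant to exercise.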
Notes:
This PR currently includes the YAML and shell files needed to fully reproduce the results below; they can be removed later.
PR with test files cleaned.
Issues
N/A
Usage
The recipes use Qwen3-30B-A3B with Megatron policy training and colocated vLLM generation. They are matched except for the loss configuration:
Validation Results
High-off-policy validation was run with Qwen3-30B-A3B on 2 nodes x 8 GPUs. The setup uses 32 prompts x 16 generations = 512 trajectories per rollout and train_global_batch_size=32, giving 16 policy updates per rollout.
Mechanistic checks:
train/probs_ratio_clamped_max
train/cispo_diag/grpo_would_clip_frac average
train/cispo_diag/would_clip_and_low_prob_frac average
Async lag-1 high-off-policy validation was also run with the same model and rollout/update ratio, but with non-colocated async vLLM generation and max_trajectory_age_steps=1.
Async lag-1 mechanistic checks:
train/probs_ratio_clamped_max
train/reward average
train/policy_kl_error average
train/token_mult_prob_error average
Summary: Across the synchronous high-off-policy and async lag-1 validations, CISPO shows modest positive evidence: average train reward is roughly tied to slightly higher, average validation accuracy is slightly higher, and the async lag-1 run finishes with higher final validation accuracy. The effect is not a clean sweep at every checkpoint, but CISPO is at least competitive with GRPO and appears more stable on the ratio/KL outlier diagnostics in the async lag-1 setting.
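The rollout arithmetic quoted for the validation setup (32 prompts x 16 generations, train_global_batch_size=32) can be checked with a short sketch. It assumes that train_global_batch_size counts trajectories and that each trajectory is used exactly once per rollout. The helper name `updates_per_rollout` is hypothetical and not part of the recipes.

```python
def updates_per_rollout(num_prompts, generations_per_prompt, train_global_batch_size):
    """Return (trajectories, policy updates) for a single rollout.

    Assumes every trajectory is consumed exactly once and that the global
    batch size is measured in trajectories.
    """
    trajectories = num_prompts * generations_per_prompt
    updates, remainder = divmod(trajectories, train_global_batch_size)
    if remainder:
        raise ValueError("trajectories must divide evenly into global batches")
    return trajectories, updates


trajectories, updates = updates_per_rollout(
    num_prompts=32, generations_per_prompt=16, train_global_batch_size=32
)
print(trajectories, updates)  # 512 16
```

The later updates in each rollout run against a policy that has already moved away from the one that generated the data, which is why this ratio makes the setup strongly off-policy.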
Before your PR is "Ready for review"
Pre checks:
Additional Information
CISPO validation used the final high-off-policy recipes:
examples/configs/recipes/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.yaml
examples/configs/recipes/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.yaml
examples/configs/recipes/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.yaml
examples/configs/recipes/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.yaml
The corresponding launch scripts are:
tests/test_suites/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.sh
tests/test_suites/llm/cispo-mm1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.sh
tests/test_suites/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-grpo.sh
tests/test_suites/llm/cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-2n8g-megatron-cispo.sh