Commit 96f9b2e

slikhite-1 authored and pengdurice committed
assert removed
Signed-off-by: slikhite-1 <slikhite@nvidia.com>
1 parent 617bc93 commit 96f9b2e

3 files changed

Lines changed: 4 additions & 16 deletions

docs/guides/grpo.md

Lines changed: 0 additions & 14 deletions
@@ -456,20 +456,6 @@ grpo:
 
 Set `overlong_filtering` to true when training on tasks where truncation at the maximum sequence length is expected, such as long-form reasoning or mathematical proofs.
 
-#### CISPO (Clipped IS-weight Policy Optimization)
-
-CISPO introduced in [MiniMax-M1 paper](https://arxiv.org/abs/2506.13585) clips the importance sampling weight itself and applies stop-gradient.
-
-The loss is:
-
-$$
-L(\theta) = E_{x \sim \pi_{\theta_{\text{old}}}} \Big[ \text{sg}\big(\text{clip}(r(\theta), 1-\varepsilon_{\text{low}}, 1+\varepsilon_{\text{high}})\big) \cdot A_t \cdot \log \pi_\theta(x) \Big]
-$$
-
-where $r(\theta) = \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$, $\text{sg}$ denotes stop-gradient, and $\varepsilon_{\text{low}}$, $\varepsilon_{\text{high}}$ are the IS-weight clipping bounds. Dual-clipping (`ratio_clip_c`) is ignored when CISPO is enabled.
-
-To use CISPO, set `loss_fn.use_cispo: true` in your config. Tune `ratio_clip_min` and `ratio_clip_max` (mapping to $\varepsilon_{\text{low}}$ and $\varepsilon_{\text{high}}$). It is recommended to use a large `ratio_clip_min` (e.g. 1.0) and tune `ratio_clip_max` (e.g. 0.8). Example: [examples/configs/cispo_math_8B.yaml](../../examples/configs/cispo_math_8B.yaml).
-
 #### Top-p and top-k filtering
 
 The implementation aligns with vLLM’s top-p and top-k filtering by applying an equivalent process to the logits.
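
For reference, the CISPO objective described in the documentation section removed above can be sketched per token in plain Python. This is a minimal illustration under stated assumptions, not the code in `nemo_rl/algorithms/loss/loss_functions.py`: the function name `cispo_token_objective` and its arguments are hypothetical, `eps_low` and `eps_high` stand in for `ratio_clip_min` and `ratio_clip_max`, and the stop-gradient is modeled by treating the clipped weight as a constant when forming the gradient.

```python
import math


def cispo_token_objective(logprob_new, logprob_old, advantage, eps_low, eps_high):
    """Per-token CISPO objective term (hypothetical helper, for illustration).

    Returns (objective, d_objective / d_logprob_new). The clipped IS weight is
    under a stop-gradient, so it acts as a constant coefficient and the
    gradient with respect to logprob_new is weight * advantage.
    """
    # Importance sampling ratio r(theta) = pi_theta(x) / pi_theta_old(x).
    ratio = math.exp(logprob_new - logprob_old)
    # clip(r, 1 - eps_low, 1 + eps_high)
    weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # sg(weight) * A_t * log pi_theta(x), as written in the formula.
    objective = weight * advantage * logprob_new
    # Stop-gradient: weight is held constant when differentiating.
    grad = weight * advantage
    return objective, grad
```

With `eps_low = 1.0` the lower bound is 0, so in this sketch only the upper bound `1 + eps_high` ever clips the weight.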

examples/configs/cispo_math_8B.yaml

Lines changed: 1 addition & 2 deletions
@@ -1,8 +1,7 @@
-# CISPO Algorithm Configuration
 defaults: "grpo_math_1B.yaml"
 
 # ============================================================================
-# CISPO: Clipped IS-weight Policy Optimization
+# CISPO: Clipped Importance Sampling Policy Optimization
 # CISPO clips the IS weight itself and applies stop-gradient, then multiplies by
 # advantage and log-probability.
 # ratio_clip_min / ratio_clip_max control the IS weight clipping bounds (ε_IS_low / ε_IS_high).

nemo_rl/algorithms/loss/loss_functions.py

Lines changed: 3 additions & 0 deletions
@@ -239,6 +239,9 @@ def __init__(self, cfg: ClippedPGLossConfig):
 assert self.ratio_clip_c is None, (
 "use_cispo is incompatible with ratio_clip_c; "
 "ratio_clip_c is not supported when use_cispo=True"
+if self.truncated_importance_sampling_ratio is not None:
+assert self.use_importance_sampling_correction, (
+"truncated_importance_sampling_ratio is only supported when use_importance_sampling_correction is True"
 )
 if self.truncated_importance_sampling_type is not None:
 assert self.use_importance_sampling_correction, (
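
The configuration checks visible in this hunk can be sketched in isolation as follows. This is a hedged illustration, not the actual `__init__`: `LossOptions` and `validate_loss_options` are hypothetical names, the field names are taken from the diff, and the `if opts.use_cispo` guard is an assumption inferred from the assertion message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LossOptions:
    """Hypothetical stand-in for the relevant ClippedPGLossConfig fields."""

    use_cispo: bool = False
    ratio_clip_c: Optional[float] = None
    use_importance_sampling_correction: bool = False
    truncated_importance_sampling_ratio: Optional[float] = None


def validate_loss_options(opts: LossOptions) -> None:
    # Assumed guard: dual-clipping is rejected only when CISPO is enabled.
    if opts.use_cispo:
        assert opts.ratio_clip_c is None, (
            "use_cispo is incompatible with ratio_clip_c; "
            "ratio_clip_c is not supported when use_cispo=True"
        )
    # Check added in this commit: a truncation ratio requires IS correction.
    if opts.truncated_importance_sampling_ratio is not None:
        assert opts.use_importance_sampling_correction, (
            "truncated_importance_sampling_ratio is only supported when "
            "use_importance_sampling_correction is True"
        )
```

For example, `validate_loss_options(LossOptions(truncated_importance_sampling_ratio=2.0))` raises `AssertionError` in this sketch, because `use_importance_sampling_correction` defaults to `False`.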
