
Add Delightful Policy Gradient loss and Kondo Gate to GRPO #1628

Open
finbarrtimbers wants to merge 18 commits into main from finbarr/delight


@finbarrtimbers
Collaborator

@finbarrtimbers finbarrtimbers commented Apr 20, 2026

Summary

  • Delightful Policy Gradient (--use_delight, Osband 2026): gates per-token PG terms with sigmoid(advantage * (-new_logprobs.detach())), i.e. sigmoid(delight) where delight = advantage × action surprisal. Temperature η fixed to 1 as in the paper.
  • Kondo gate (--use_kondo_gate, arXiv:2603.20526): per-sample Bernoulli backward-skip. For each sample, draws G ~ Ber(σ((χ − λ)/η)), where χ is the token-weighted delight and λ = quantile_{1-ρ}(history) over a rolling buffer; when G=0, the backward pass is skipped entirely. The target is for a fraction ρ of samples to receive a backward pass.
    • DP-rank sync via all-reduce of (delight_sum, mask_sum) and an identically-seeded torch.Generator for the Bernoulli draw — DeepSpeed/FSDP collectives stay in sync.
    • Shared KondoGateState helper used by both grpo_fast.py (DeepSpeed) and olmo_core_train_modules.py (FSDP) paths.
    • Flags: --kondo_gate_rate (ρ, default 1.0 = always pass), --kondo_gate_temperature (η), --kondo_gate_history_size, --kondo_gate_warmup.
  • Both are enabled in scripts/train/debug/large_test_script.sh with --use_delight true --use_kondo_gate true --kondo_gate_rate 0.5 --kondo_gate_warmup 16.
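
The delight gate described in the summary can be sketched in a few lines. This is a minimal illustration in plain Python floats, not the PR's tensor code: the function names `delight_gates` and `apply_delight_gate` are hypothetical, and how the gate combines with GRPO's ratio and clipping terms is not shown here. In the PR the log-probabilities are detached, so the gate acts as a constant weight; plain floats carry no gradient, which plays the same role in this sketch.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def delight_gates(advantages, logprobs, eta=1.0):
    """Per-token gate sigmoid(delight / eta).

    delight = advantage * surprisal, with surprisal = -logprob.
    eta=1.0 mirrors the fixed temperature stated in the summary.
    """
    return [sigmoid(a * (-lp) / eta) for a, lp in zip(advantages, logprobs)]


def apply_delight_gate(pg_terms, advantages, logprobs, eta=1.0):
    """Scale each per-token policy-gradient term by its gate (hypothetical helper)."""
    gates = delight_gates(advantages, logprobs, eta)
    return [g * t for g, t in zip(gates, pg_terms)]
```

A zero advantage gives a gate of exactly 0.5, a positive advantage on a surprising token (very negative log-probability) pushes the gate toward 1, and a negative advantage on the same token pushes it toward 0.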

Test plan

  • Unit tests: uv run pytest open_instruct/test_olmo_core_train_modules.py — delight formula + TestKondoGateState (warmup, quantile, rate-in-expectation) pass.
  • Delight run: https://beaker.org/ex/01KPPGBGR97WNERSE83Q6H88H0
  • Kondo gate run: https://beaker.org/ex/01KPPZ7N0SDW1MM342QR8RRHMB — confirmed working:
    • kondo_gate_backward_frac drops from 1.0 (warmup) to 0.25–0.75, averaging ≈ρ=0.5 ✓
    • kondo_lambda finite and stable (e.g. -0.27)
    • Quantile probes frac_buf>lam ≈ 0.500 at q=0.5 ✓
    • Gate passes high-chi samples and skips low-chi samples, as expected from G ~ Ber(σ((χ − λ)/η))
    • Training loss continues to decrease; grad_norm finite
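
The gate behaviour checked above (always passing during warmup, then a backward fraction near ρ) can be reproduced with a small single-process sketch. This is not the PR's KondoGateState: the class name `KondoGateSketch`, the helper names, the linear-interpolation quantile, the choice to update the history after each draw, and the special case that a rate of 1.0 always passes are all assumptions made for illustration. The cross-rank all-reduce is omitted.

```python
import math
import random
from collections import deque


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def quantile(values, q: float) -> float:
    """q-quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1.0 - frac) + s[hi] * frac


def token_weighted_delight(delights, mask) -> float:
    """chi = sum(delight * mask) / sum(mask) over one sample's tokens."""
    total = sum(d * m for d, m in zip(delights, mask))
    return total / max(sum(mask), 1)


class KondoGateSketch:
    """Hypothetical stand-in for the gate: G ~ Ber(sigmoid((chi - lam) / eta))."""

    def __init__(self, rate=0.5, temperature=1.0, history_size=256, warmup=16, seed=0):
        self.rate = rate
        self.temperature = temperature
        self.warmup = warmup
        self.history = deque(maxlen=history_size)  # rolling buffer of chi values
        self.rng = random.Random(seed)  # same seed on every rank gives the same draw

    def threshold(self) -> float:
        """lam = quantile_{1 - rate} of the rolling history."""
        return quantile(self.history, 1.0 - self.rate)

    def should_backward(self, chi: float) -> bool:
        # Assumption: always pass when rate >= 1 or while the buffer is warming up.
        if self.rate >= 1.0 or len(self.history) < max(self.warmup, 1):
            passed = True
        else:
            p = sigmoid((chi - self.threshold()) / self.temperature)
            passed = self.rng.random() < p
        self.history.append(chi)
        return passed
```

With ρ = 0.5 and chi values drawn symmetrically around their median, the pass probability averages to about 0.5 after warmup, which matches the kondo_gate_backward_frac observation in the run above. In a multi-rank setting each rank would need the same χ and the same generator state before drawing, which is what the all-reduce and shared seed in the PR are for.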

🤖 Generated with Claude Code

…able it in large_test_script. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…O. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements the 'Delightful Policy Gradient' gating mechanism for GRPO loss, adding a use_delight configuration option and updating the loss computation logic. A correction was suggested for a typo in the arXiv reference link provided in the documentation.

"""Whether to use DAPO or CISPO loss function."""
use_delight: bool = False
"""Whether to gate per-token policy-gradient terms with the Delightful Policy Gradient sigmoid
of delight = advantage * surprisal (https://arxiv.org/abs/2603.14608)."""

medium

The arXiv link provided in the docstring contains a typo. The correct arXiv ID for the 'Delightful Policy Gradient' paper by Osband et al. (2024) is 2403.14608, not 2603.14608.

    of delight = advantage * surprisal (https://arxiv.org/abs/2403.14608)."

…n-weighted chi, cached quantile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… warmup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers changed the title Add Delightful Policy Gradient gate to GRPO loss Add Delightful Policy Gradient + Kondo gate to GRPO loss Apr 21, 2026
@finbarrtimbers finbarrtimbers changed the title Add Delightful Policy Gradient + Kondo gate to GRPO loss Add Delightful Policy Gradient loss and Kondo Gate to GRPO Apr 21, 2026
