Add Delightful Policy Gradient loss and Kondo Gate to GRPO#1628
Open
finbarrtimbers wants to merge 18 commits into
…able it in large_test_script. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…7 <noreply@anthropic.com>
…O. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Code Review
This pull request implements the 'Delightful Policy Gradient' gating mechanism for GRPO loss, adding a use_delight configuration option and updating the loss computation logic. A correction was suggested for a typo in the arXiv reference link provided in the documentation.
    """Whether to use DAPO or CISPO loss function."""
    use_delight: bool = False
    """Whether to gate per-token policy-gradient terms with the Delightful Policy Gradient sigmoid
    of delight = advantage * surprisal (https://arxiv.org/abs/2603.14608)."""
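As a rough numeric illustration of the gate described in this docstring, the sketch below computes `sigmoid(advantage * surprisal)` per token in plain Python, with surprisal taken as the negative log-probability and the temperature fixed at 1 as stated in the PR summary. The function names `sigmoid` and `delight_gates` are hypothetical and are not taken from the PR; the actual implementation operates on tensors and detaches the log-probabilities.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def delight_gates(advantages, logprobs):
    """Per-token gate sigmoid(delight), with delight = advantage * (-logprob).

    Hypothetical helper for illustration only. The returned values are plain
    numbers, i.e. constants with respect to any gradient.
    """
    return [sigmoid(a * (-lp)) for a, lp in zip(advantages, logprobs)]


# Three tokens: a likely action with positive advantage, a rare action with
# positive advantage, and a rare action with negative advantage.
advantages = [1.0, 1.0, -1.0]
logprobs = [math.log(0.9), math.log(0.1), math.log(0.1)]
gates = delight_gates(advantages, logprobs)

for adv, lp, gate in zip(advantages, logprobs, gates):
    print(f"advantage={adv:+.1f} surprisal={-lp:.3f} gate={gate:.3f}")
```

Under this reading, a surprising action with positive advantage gets a gate close to 1, an unsurprising one stays near 0.5, and a surprising action with negative advantage is pushed toward 0.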
…n-weighted chi, cached quantile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… warmup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…: Claude Opus 4.7 <noreply@anthropic.com>
…red-By: Claude Opus 4.7 <noreply@anthropic.com>
…: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	open_instruct/olmo_core_train_modules.py
…l. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- Delightful Policy Gradient (`--use_delight`, Osband 2026): gates per-token PG terms with `sigmoid(advantage * (-new_logprobs.detach()))`, i.e. `sigmoid(delight)` where delight = advantage × action surprisal. Temperature η fixed to 1 as in the paper.
- Kondo Gate (`--use_kondo_gate`, arXiv:2603.20526): per-sample Bernoulli backward-skip. For each sample, draws `G ~ Ber(σ((χ − λ)/η))` where χ is token-weighted delight and `λ = quantile_{1-ρ}(history)` over a rolling buffer; when `G=0`, skip the backward pass entirely. Targets a fraction ρ of samples receiving backward.
- `torch.Generator` for the Bernoulli draw; DeepSpeed/FSDP collectives stay in sync.
- `KondoGateState` helper used by both `grpo_fast.py` (DeepSpeed) and `olmo_core_train_modules.py` (FSDP) paths.
- `--kondo_gate_rate` (ρ, default 1.0 = always pass), `--kondo_gate_temperature` (η), `--kondo_gate_history_size`, `--kondo_gate_warmup`.
- `scripts/train/debug/large_test_script.sh` with `--use_delight true --use_kondo_gate true --kondo_gate_rate 0.5 --kondo_gate_warmup 16`.

Test plan
- `uv run pytest open_instruct/test_olmo_core_train_modules.py`: delight formula + `TestKondoGateState` (warmup, quantile, rate-in-expectation) pass.
- `kondo_gate_backward_frac` drops from 1.0 (warmup) to 0.25–0.75, averaging ≈ρ=0.5 ✓
- `kondo_lambda` finite and stable (e.g. -0.27)
- `frac_buf>lam ≈ 0.500` at q=0.5 ✓
- `grad_norm` finite

🤖 Generated with Claude Code
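The Kondo Gate logic in the summary (rolling history, quantile threshold λ, warmup, Bernoulli draw) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the PR's `KondoGateState`: the class name `KondoGateSketch`, the default history size of 256, the order of "record χ, then compute λ", and the use of `random.Random` in place of `torch.Generator` are all illustrative choices.

```python
import math
import random
from collections import deque


class KondoGateSketch:
    """Hypothetical stand-in for illustration; not the PR's KondoGateState."""

    def __init__(self, rate, temperature=1.0, history_size=256, warmup=16, seed=0):
        self.rate = rate                # rho: target fraction of samples that run backward
        self.temperature = temperature  # eta: softness of the gate around lambda
        self.history = deque(maxlen=history_size)  # rolling buffer of chi values
        self.warmup = warmup
        # Seeded generator so the draw sequence is reproducible (assumption:
        # stands in for the torch.Generator mentioned in the summary).
        self.rng = random.Random(seed)

    def threshold(self):
        """lambda = empirical (1 - rho) quantile of the chi history."""
        ordered = sorted(self.history)
        idx = min(len(ordered) - 1, int((1.0 - self.rate) * len(ordered)))
        return ordered[idx]

    def should_backward(self, chi):
        """Record chi, then draw G ~ Bernoulli(sigmoid((chi - lambda) / eta))."""
        self.history.append(chi)
        if self.rate >= 1.0 or len(self.history) < self.warmup:
            return True  # always pass when rho = 1.0 or while warming up
        z = (chi - self.threshold()) / self.temperature
        z = max(-60.0, min(60.0, z))  # clamp to keep exp() well-behaved
        prob = 1.0 / (1.0 + math.exp(-z))
        return self.rng.random() < prob


# Simulated chi values; real values would come from token-weighted delight.
gate = KondoGateSketch(rate=0.5, temperature=0.1, warmup=16, seed=0)
data_rng = random.Random(1)
decisions = [gate.should_backward(data_rng.gauss(0.0, 1.0)) for _ in range(2000)]
post_warmup = decisions[16:]
frac = sum(post_warmup) / len(post_warmup)
print(f"backward fraction after warmup: {frac:.3f}")
```

With ρ = 0.5 and a symmetric χ distribution, the post-warmup backward fraction should land near 0.5, which is the same kind of check the test plan reports for `kondo_gate_backward_frac`. For other values of ρ, a soft gate with nonzero η only approximates the target rate.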