fix: prevent buffer normal-pool drain when difficulty pools are enabled#2667
Open
tonywu71 wants to merge 4 commits into
Open
fix: prevent buffer normal-pool drain when difficulty pools are enabled#2667tonywu71 wants to merge 4 commits into
tonywu71 wants to merge 4 commits into
Conversation
- add max_easy_fraction / max_hard_fraction caps to BufferConfig (default 0.5) - recycle oldest sidelined task back to normal when a pool exceeds its cap - extract _classify() and _recycle_overflow() from update_pools for clarity - add regression tests: drain reproduces without cap, never occurs with cap - document pool caps in algorithms.md
- enforce max_easy_fraction + max_hard_fraction < 1.0 in BufferConfig validator - use math.floor in _recycle_overflow so the combined cap stays strictly below num_total - adjust max_hard_fraction default from 0.5 to 0.4 to satisfy the new constraint - replace behavioral drain test with config-level validator test
Disambiguates the pool-occupancy caps from the resume-time easy_fraction / hard_fraction recycle knobs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
The orchestrator buffer sorts each environment's tasks into three difficulty pools:
normal,easy, andhard. Onlynormalis ever sampled. Theeasyandhardpools are holding pens that sideline tasks the policy already aces or can never touch, so compute concentrates where the gradient signal is strongest.This bucketing is opt-in. Both
easy_thresholdandhard_thresholddefault toNone, and with neither set every task stays innormaland nothing is sidelined.Issue
The bug described below can only trigger once you set one or both
easy_thresholdandhard_threshold(which is why a default-config run never hits it). Before this PR, enabling the pools meant setting just the thresholds (there were no cap fields):The problem is that the task flow is one-way. After every rollout,
update_poolsevicts a task fromnormalintoeasy/hardwhen its recent reward crosses a threshold. But the only code that moves tasks back,convert_to_normal, is a closure insideBuffer.load, which runs only on checkpoint resume. Even then it early-returns, becauseeasy_fraction/hard_fractiondefault to0.0. So during a normal training loop,normalonly ever shrinks.The drain gets worse the better the policy does: eviction is triggered by high reward, so a stronger policy empties the buffer faster. On a small task set (tens of tasks)
normaldrains to empty andsample_examplesraisesValueError: No environments left with examples.mid-run. On large task sets (hundreds of tasks) the bug stays hidden, because enough tasks stay innormalfor the run's duration.Fix
Bound each sidelined pool at a fraction of the env's own task count. When a pool exceeds its cap, recycle its oldest member (FIFO) back into
normal. The cap rounds down (floor), and the config enforcesmax_easy_pool_fraction + max_hard_pool_fraction < 1. Together these give a hard floor onnormalthat holds regardless of reward trajectory:It also re-tests sidelined tasks under the current policy. A task parked as "easy" early eventually returns to be re-evaluated, instead of being exiled on one noisy measurement. The caps only bite when a pool grows large relative to the env's own size, so large task sets are barely touched.
Manual test
Two regression tests in
tests/unit/orchestrator/test_buffer.pypin the behavior:test_buffer_config_rejects_uncapped_poolschecks the guarantee at the config layer.BufferConfigrejects anymax_easy_pool_fraction + max_hard_pool_fraction >= 1.0, so the floor can never be configured away.test_buffer_cap_recycles_and_never_drainschecks the runtime behavior. With the default0.5 / 0.4caps, 20 consecutive "master everything" passes never drainnormal, and sampling keeps working.All 10 tests in the file pass.
Note
convert_to_normalinBuffer.loadis intentionally kept. The resume-time reshuffle offeasy_fraction/hard_fractionis orthogonal to the per-step caps.raiseinsample_examplesis intentionally kept as a canary. With the hard floor in place, a truly emptynormalwould point to a separate bug worth surfacing loudly.0.5 / 0.4. The config rejects any pair summing to1.0or more, so at least one task always stays innormal.Changes
Click here to expand
max_easy_pool_fraction/max_hard_pool_fractiononBufferConfig(defaults0.5/0.4): cap each sidelined pool as a share of the env's tasks; config rejectsmax_easy + max_hard >= 1.0_recycle_overflowin_EnvBuffer: recycles the oldest sidelined task back tonormalwhenever a pool exceeds its cap, called on every eviction inupdate_pools_classifyextracted fromupdate_pools: pulls the reward-to-pool decision into its own helper, with no behavior change>= 1.0; default caps recycle sonormalnever drainsalgorithms.mdNote
Medium Risk
Changes core orchestrator sampling/curriculum behavior when difficulty pools are enabled; mitigated by config validation and unit tests, but training mix may differ from pre-fix runs.
Overview
Fixes a one-way drain of the orchestrator difficulty buffer: with
easy_threshold/hard_thresholdenabled, tasks could leave thenormalpool forever andsample_examplescould raise No environments left with examples.Adds
max_easy_pool_fractionandmax_hard_pool_fractiononBufferConfig(defaults0.5/0.4), with validation that their sum is < 1.0. On each eviction inupdate_pools,_recycle_overflowFIFO-recycles the oldest sidelined task back intonormalwhen a pool exceedsfloor(num_total × fraction).Pool classification is moved to
_classify;docs/algorithms.mddocuments the caps. New unit tests cover invalid config and repeated “master all tasks” without drainingnormal.Reviewed by Cursor Bugbot for commit d376f2b. Bugbot is set up for automated code reviews on this repo. Configure here.