fix: prevent buffer normal-pool drain when difficulty pools are enabled #2667

Open
tonywu71 wants to merge 4 commits into PrimeIntellect-ai:main from tonywu71:fix-buffer-drain

@tonywu71 tonywu71 commented May 29, 2026

Context

The orchestrator buffer sorts each environment's tasks into three difficulty pools: normal, easy, and hard. Only normal is ever sampled. The easy and hard pools are holding pens that sideline tasks the policy already aces or can never touch, so compute concentrates where the gradient signal is strongest.

This bucketing is opt-in. Both easy_threshold and hard_threshold default to None, and with neither set every task stays in normal and nothing is sidelined.

Issue

The bug described below can only trigger once you set easy_threshold, hard_threshold, or both, which is why a default-config run never hits it. Before this PR, enabling the pools meant setting just the thresholds (there were no cap fields):

[orchestrator.buffer]
easy_threshold = 0.9
hard_threshold = 0.1

The problem is that the task flow is one-way. After every rollout, update_pools evicts a task from normal into easy/hard when its recent reward crosses a threshold. But the only code that moves tasks back, convert_to_normal, is a closure inside Buffer.load, which runs only on checkpoint resume. Even then it early-returns, because easy_fraction/hard_fraction default to 0.0. So during a normal training loop, normal only ever shrinks.

The drain accelerates as the policy improves: eviction into easy is triggered by high reward, so a stronger policy empties normal faster. On a small task set (tens of tasks), normal drains to empty and sample_examples raises "ValueError: No environments left with examples." mid-run. On large task sets (hundreds of tasks) the bug stays hidden, because enough tasks remain in normal for the run's duration.
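The one-way flow can be reproduced with a toy model. This is a minimal sketch, not the project's code: ToyEnvBuffer, its sample method, and the >= / <= comparisons are assumptions made for illustration; only the pool names, the thresholds, and the error message come from the description above.

```python
class ToyEnvBuffer:
    """Toy model of the pre-fix flow: tasks can leave `normal` but never return."""

    def __init__(self, task_ids, easy_threshold=None, hard_threshold=None):
        self.normal = list(task_ids)
        self.easy = []
        self.hard = []
        self.easy_threshold = easy_threshold
        self.hard_threshold = hard_threshold

    def update_pools(self, task_id, reward):
        # Only tasks currently in `normal` can be evicted.
        if task_id not in self.normal:
            return
        # Assumption: a threshold is "crossed" with >= (easy) or <= (hard).
        if self.easy_threshold is not None and reward >= self.easy_threshold:
            self.normal.remove(task_id)
            self.easy.append(task_id)
        elif self.hard_threshold is not None and reward <= self.hard_threshold:
            self.normal.remove(task_id)
            self.hard.append(task_id)
        # Note: there is no branch that moves a task back into `normal`.

    def sample(self):
        # Mirrors the failure mode: nothing left to sample once `normal` is empty.
        if not self.normal:
            raise ValueError("No environments left with examples.")
        return self.normal[0]
```

With five tasks and easy_threshold = 0.9, a single pass in which every task scores 1.0 empties normal, and the next sample call raises. With both thresholds left at None, the same pass moves nothing.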

Fix

Bound each sidelined pool at a fraction of the env's own task count. When a pool exceeds its cap, recycle its oldest member (FIFO) back into normal. The cap rounds down (floor), and the config enforces max_easy_pool_fraction + max_hard_pool_fraction < 1. Together these give a hard floor on normal that holds regardless of reward trajectory:

len(easy) ≤ ⌊num_total · max_easy_pool_fraction⌋
len(hard) ≤ ⌊num_total · max_hard_pool_fraction⌋
max_easy_pool_fraction + max_hard_pool_fraction < 1
⇒ ⌊N · e⌋ + ⌊N · h⌋ ≤ N · (e + h) < N  ⇒  normal always holds ≥ 1 task
(with N = num_total, e = max_easy_pool_fraction, h = max_hard_pool_fraction)
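The bound can be checked numerically. The sketch below is illustrative only: normal_floor is a hypothetical helper, not a function from this PR. It computes how many tasks are guaranteed to remain in normal when both sidelined pools sit at their caps, and shows why rounding down matters.

```python
import math


def normal_floor(num_total, max_easy_pool_fraction, max_hard_pool_fraction, rounding=math.floor):
    """Tasks left in `normal` when both sidelined pools are exactly at their caps."""
    easy_cap = rounding(num_total * max_easy_pool_fraction)
    hard_cap = rounding(num_total * max_hard_pool_fraction)
    return num_total - easy_cap - hard_cap


# With the 0.5 / 0.4 defaults, floor keeps at least one task in `normal`
# for every task count checked here.
floors = [normal_floor(n, 0.5, 0.4) for n in range(1, 201)]

# Rounding up instead would break the guarantee on a tiny env:
# ceil(3 * 0.5) + ceil(3 * 0.4) = 2 + 2 = 4 > 3.
broken = normal_floor(3, 0.5, 0.4, rounding=math.ceil)
```

For example, normal_floor(10, 0.5, 0.4) is 10 - 5 - 4 = 1, while the ceil variant on a three-task env yields -1, meaning the caps alone could no longer protect normal.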

It also re-tests sidelined tasks under the current policy. A task parked as "easy" early eventually returns to be re-evaluated, instead of being exiled on one noisy measurement. The caps only bite when a pool grows large relative to the env's own size, so large task sets are barely touched.
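The cap-and-recycle behavior described above can be sketched as follows. This is a simplified stand-in under stated assumptions, not the PR's implementation: CappedToyBuffer is hypothetical, the >= / <= comparisons are assumed, and the real _recycle_overflow in _EnvBuffer may differ in signature and details. Only the FIFO order, the floor-based caps, and the field names come from the description.

```python
import math
from collections import deque


class CappedToyBuffer:
    """Toy model: evict into easy/hard, then recycle the oldest member on overflow."""

    def __init__(self, task_ids, easy_threshold=0.95, hard_threshold=0.05,
                 max_easy_pool_fraction=0.5, max_hard_pool_fraction=0.4):
        task_ids = list(task_ids)
        self.num_total = len(task_ids)
        self.normal = deque(task_ids)
        self.easy = deque()
        self.hard = deque()
        self.easy_threshold = easy_threshold
        self.hard_threshold = hard_threshold
        # Caps round down, so their sum stays strictly below num_total.
        self.easy_cap = math.floor(self.num_total * max_easy_pool_fraction)
        self.hard_cap = math.floor(self.num_total * max_hard_pool_fraction)

    def _recycle_overflow(self, pool, cap):
        # FIFO: the task that has been sidelined longest returns first.
        while len(pool) > cap:
            self.normal.append(pool.popleft())

    def update_pools(self, task_id, reward):
        if task_id not in self.normal:
            return
        if reward >= self.easy_threshold:  # assumed comparison
            self.normal.remove(task_id)
            self.easy.append(task_id)
            self._recycle_overflow(self.easy, self.easy_cap)
        elif reward <= self.hard_threshold:  # assumed comparison
            self.normal.remove(task_id)
            self.hard.append(task_id)
            self._recycle_overflow(self.hard, self.hard_cap)
```

In this toy, ten tasks that all score 1.0 on every pass leave easy pinned at its cap of five while the other five tasks rotate back through normal, so each task is periodically re-evaluated.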

[orchestrator.buffer]
easy_threshold = 0.95
hard_threshold = 0.05
+ max_easy_pool_fraction = 0.5
+ max_hard_pool_fraction = 0.4
# config rejects max_easy_pool_fraction + max_hard_pool_fraction >= 1.0
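The config-level rejection could look roughly like the sketch below. ToyBufferConfig is a hypothetical plain-dataclass stand-in written for illustration; the real BufferConfig and its validator may be built differently, and only the field names, the defaults, and the "sum must be < 1.0" rule come from this PR's description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyBufferConfig:
    """Stand-in config that enforces the hard floor on `normal` at construction time."""

    easy_threshold: Optional[float] = None
    hard_threshold: Optional[float] = None
    max_easy_pool_fraction: float = 0.5
    max_hard_pool_fraction: float = 0.4

    def __post_init__(self):
        total = self.max_easy_pool_fraction + self.max_hard_pool_fraction
        # Reject any pair that could let easy + hard cover the whole env.
        if total >= 1.0:
            raise ValueError(
                "max_easy_pool_fraction + max_hard_pool_fraction must be < 1.0, "
                f"got {total}"
            )
```

Constructing ToyBufferConfig() with the defaults succeeds, while ToyBufferConfig(max_easy_pool_fraction=0.5, max_hard_pool_fraction=0.5) raises ValueError before any training step runs.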

Manual test

Two regression tests in tests/unit/orchestrator/test_buffer.py pin the behavior:

  • test_buffer_config_rejects_uncapped_pools checks the guarantee at the config layer. BufferConfig rejects any max_easy_pool_fraction + max_hard_pool_fraction >= 1.0, so the floor can never be configured away.
  • test_buffer_cap_recycles_and_never_drains checks the runtime behavior. With the default 0.5 / 0.4 caps, 20 consecutive "master everything" passes never drain normal, and sampling keeps working.
uv run pytest tests/unit/orchestrator/test_buffer.py -v

All 10 tests in the file pass.

Note

  • convert_to_normal in Buffer.load is intentionally kept. The resume-time reshuffle driven by easy_fraction/hard_fraction is orthogonal to the per-step caps.
  • The raise in sample_examples is intentionally kept as a canary. With the hard floor in place, a truly empty normal would point to a separate bug worth surfacing loudly.
  • Defaults are 0.5 / 0.4. The config rejects any pair summing to 1.0 or more, so at least one task always stays in normal.

Changes

  • max_easy_pool_fraction / max_hard_pool_fraction on BufferConfig (defaults 0.5 / 0.4): cap each sidelined pool as a share of the env's tasks; config rejects max_easy + max_hard >= 1.0
  • _recycle_overflow in _EnvBuffer: recycles the oldest sidelined task back to normal whenever a pool exceeds its cap, called on every eviction in update_pools
  • _classify extracted from update_pools: pulls the reward-to-pool decision into its own helper, with no behavior change
  • Regression tests: config rejects pools summing to >= 1.0; default caps recycle so normal never drains
  • Docs: document the pool caps in algorithms.md

Note

Medium Risk
Changes core orchestrator sampling/curriculum behavior when difficulty pools are enabled; mitigated by config validation and unit tests, but training mix may differ from pre-fix runs.

Overview
Fixes a one-way drain of the orchestrator difficulty buffer: with easy_threshold / hard_threshold enabled, tasks could leave the normal pool forever and sample_examples could raise No environments left with examples.

Adds max_easy_pool_fraction and max_hard_pool_fraction on BufferConfig (defaults 0.5 / 0.4), with validation that their sum is < 1.0. On each eviction in update_pools, _recycle_overflow FIFO-recycles the oldest sidelined task back into normal when a pool exceeds floor(num_total × fraction).

Pool classification is moved to _classify; docs/algorithms.md documents the caps. New unit tests cover invalid config and repeated “master all tasks” without draining normal.

Reviewed by Cursor Bugbot for commit d376f2b. Bugbot is set up for automated code reviews on this repo.

tonywu71 added 4 commits May 29, 2026 12:00
- add max_easy_fraction / max_hard_fraction caps to BufferConfig (default 0.5)
- recycle oldest sidelined task back to normal when a pool exceeds its cap
- extract _classify() and _recycle_overflow() from update_pools for clarity
- add regression tests: drain reproduces without cap, never occurs with cap
- document pool caps in algorithms.md
- enforce max_easy_fraction + max_hard_fraction < 1.0 in BufferConfig validator
- use math.floor in _recycle_overflow so the combined cap stays strictly below num_total
- adjust max_hard_fraction default from 0.5 to 0.4 to satisfy the new constraint
- replace behavioral drain test with config-level validator test
Disambiguates the pool-occupancy caps from the resume-time easy_fraction /
hard_fraction recycle knobs.