fix: prevent buffer normal-pool drain when difficulty pools are enabled by tonywu71 · Pull Request #2667 · PrimeIntellect-ai/prime-rl

tonywu71 · 2026-05-29T12:47:52Z

Context

The orchestrator buffer sorts each environment's tasks into three difficulty pools: normal, easy, and hard. Only normal is ever sampled. The easy and hard pools are holding pens that sideline tasks the policy already aces or can never touch, so compute concentrates where the gradient signal is strongest.

This bucketing is opt-in. Both easy_threshold and hard_threshold default to None, and with neither set every task stays in normal and nothing is sidelined.
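
The opt-in behaviour can be illustrated with a minimal sketch. The function name `classify`, the `avg_reward` argument, and the inclusive comparisons are assumptions for illustration only; the actual `_classify` helper in the PR may compare differently.

```python
from typing import Optional


def classify(
    avg_reward: float,
    easy_threshold: Optional[float] = None,
    hard_threshold: Optional[float] = None,
) -> str:
    """Return the pool a task would land in, given its recent average reward."""
    # Assumption: "easy" means at or above easy_threshold, "hard" means at or
    # below hard_threshold. The real comparison operators may differ.
    if easy_threshold is not None and avg_reward >= easy_threshold:
        return "easy"
    if hard_threshold is not None and avg_reward <= hard_threshold:
        return "hard"
    # With both thresholds left at None, every task ends up here.
    return "normal"


print(classify(1.0))                      # normal (thresholds unset)
print(classify(1.0, easy_threshold=0.9))  # easy
print(classify(0.0, hard_threshold=0.1))  # hard
print(classify(0.5, 0.9, 0.1))            # normal
```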

Issue

The bug described below can only trigger once you set easy_threshold, hard_threshold, or both, which is why a default-config run never hits it. Before this PR, enabling the pools meant setting just the thresholds, since there were no cap fields:

[orchestrator.buffer]
easy_threshold = 0.9
hard_threshold = 0.1

The problem is that the task flow is one-way. After every rollout, update_pools evicts a task from normal into easy/hard when its recent reward crosses a threshold. But the only code that moves tasks back, convert_to_normal, is a closure inside Buffer.load, which runs only on checkpoint resume. Even then it early-returns, because easy_fraction/hard_fraction default to 0.0. So during an ordinary training loop, the normal pool only ever shrinks.

The drain gets worse the better the policy does: eviction into easy is triggered by high reward, so a stronger policy empties the buffer faster. On a small task set (tens of tasks), normal drains to empty and sample_examples raises "ValueError: No environments left with examples." mid-run. On large task sets (hundreds of tasks) the bug stays hidden, because enough tasks stay in normal for the run's duration.
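
A toy simulation of the one-way flow makes the failure mode concrete. This is a sketch under stated assumptions, not the buffer's real code: `simulate_drain` and `solve_prob` are hypothetical names, one task is sampled per step, and a solved task is evicted immediately.

```python
import random


def simulate_drain(num_tasks: int, solve_prob: float, steps: int, seed: int = 0) -> int:
    """Return the step at which the normal pool empties, or -1 if it survives."""
    rng = random.Random(seed)
    normal = list(range(num_tasks))
    easy = []  # sidelined tasks; nothing ever moves back (the pre-fix behaviour)
    for step in range(1, steps + 1):
        task = rng.choice(normal)
        if rng.random() < solve_prob:
            # One-way eviction: normal -> easy, never the reverse.
            normal.remove(task)
            easy.append(task)
        if not normal:
            return step
    return -1


# A policy that solves everything drains 10 tasks in exactly 10 steps.
print(simulate_drain(num_tasks=10, solve_prob=1.0, steps=100))  # 10
# A policy that solves nothing never evicts, so normal survives.
print(simulate_drain(num_tasks=10, solve_prob=0.0, steps=100))  # -1
```

Raising `solve_prob` shortens the time to empty, which mirrors the observation that a stronger policy drains the buffer faster.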

Fix

Bound each sidelined pool at a fraction of the env's own task count. When a pool exceeds its cap, recycle its oldest member (FIFO) back into normal. The cap rounds down (floor), and the config enforces max_easy_pool_fraction + max_hard_pool_fraction < 1. Together these give a hard floor on normal that holds regardless of reward trajectory:

len(easy) ≤ ⌊num_total · max_easy_pool_fraction⌋
len(hard) ≤ ⌊num_total · max_hard_pool_fraction⌋
max_easy_pool_fraction + max_hard_pool_fraction < 1
⇒ ⌊N · e⌋ + ⌊N · h⌋ ≤ N · (e + h) < N  ⇒  normal always holds ≥ 1 task
(with N = num_total, e = max_easy_pool_fraction, h = max_hard_pool_fraction)
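
The floor bound above can be checked by brute force. The sketch below is only a sanity check: `min_normal_size` is a hypothetical helper, and exact `Fraction` arithmetic is used so floating-point rounding cannot affect the result.

```python
import math
from fractions import Fraction


def min_normal_size(num_total: int, easy_frac: Fraction, hard_frac: Fraction) -> int:
    """Smallest possible size of normal when both sidelined pools sit at their caps."""
    easy_cap = math.floor(num_total * easy_frac)
    hard_cap = math.floor(num_total * hard_frac)
    return num_total - easy_cap - hard_cap


# Exhaustively check every fraction pair on a 1/20 grid whose sum is below 1.
grid = [Fraction(k, 20) for k in range(20)]
for n in range(1, 201):
    for e in grid:
        for h in grid:
            if e + h < 1:
                assert min_normal_size(n, e, h) >= 1

# The defaults described in this PR (0.5 / 0.4) on a 10-task env leave one task.
print(min_normal_size(10, Fraction(1, 2), Fraction(2, 5)))  # 1
```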

Recycling also re-tests sidelined tasks under the current policy. A task parked as "easy" early eventually returns to be re-evaluated, instead of being exiled on one noisy measurement. The caps only bite when a pool grows large relative to the env's own size, so large task sets are barely touched.
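
The cap-and-recycle rule can be sketched with a `deque` per sidelined pool. `EnvPools` and `evict` are hypothetical names for illustration; the PR's `_EnvBuffer` and `_recycle_overflow` may be structured differently.

```python
import math
from collections import deque


class EnvPools:
    """Toy per-env pools with FIFO recycling once a sidelined pool exceeds its cap."""

    def __init__(self, task_ids, max_easy_pool_fraction=0.5, max_hard_pool_fraction=0.4):
        task_ids = list(task_ids)
        self.normal = set(task_ids)
        self.pools = {"easy": deque(), "hard": deque()}
        # Caps round down, so the two caps together stay strictly below the total.
        self.caps = {
            "easy": math.floor(len(task_ids) * max_easy_pool_fraction),
            "hard": math.floor(len(task_ids) * max_hard_pool_fraction),
        }

    def evict(self, task_id, pool_name):
        """Move a task out of normal, then recycle the oldest members over the cap."""
        pool = self.pools[pool_name]
        self.normal.discard(task_id)
        pool.append(task_id)
        while len(pool) > self.caps[pool_name]:
            self.normal.add(pool.popleft())  # oldest sidelined task returns first


pools = EnvPools(range(10))
for task in range(10):  # the policy "masters" every task in turn
    pools.evict(task, "easy")

print(list(pools.pools["easy"]))  # [5, 6, 7, 8, 9]
print(sorted(pools.normal))       # [0, 1, 2, 3, 4]
```

In this run the five earliest "easy" tasks have cycled back into normal, where they would be re-evaluated under the current policy.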

[orchestrator.buffer]
easy_threshold = 0.95
hard_threshold = 0.05
+ max_easy_pool_fraction = 0.5
+ max_hard_pool_fraction = 0.4
# config rejects max_easy_pool_fraction + max_hard_pool_fraction >= 1.0
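
The config-level check can be sketched with a plain dataclass. `BufferConfigSketch` is a hypothetical stand-in; the real BufferConfig and its validator may be implemented differently, and only the field names, defaults, and the `>= 1.0` rejection rule come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BufferConfigSketch:
    """Stand-in for the buffer config, reduced to the fields discussed here."""

    easy_threshold: Optional[float] = None
    hard_threshold: Optional[float] = None
    max_easy_pool_fraction: float = 0.5
    max_hard_pool_fraction: float = 0.4

    def __post_init__(self) -> None:
        total = self.max_easy_pool_fraction + self.max_hard_pool_fraction
        if total >= 1.0:
            # Rejecting the config up front keeps the floor on normal intact.
            raise ValueError(
                "max_easy_pool_fraction + max_hard_pool_fraction must be < 1.0, "
                f"got {total}"
            )


BufferConfigSketch(easy_threshold=0.95, hard_threshold=0.05)  # accepted

try:
    BufferConfigSketch(max_easy_pool_fraction=0.5, max_hard_pool_fraction=0.5)
except ValueError as err:
    print(f"rejected: {err}")
```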

Manual test

Two regression tests in tests/unit/orchestrator/test_buffer.py pin the behavior:

test_buffer_config_rejects_uncapped_pools checks the guarantee at the config layer. BufferConfig rejects any max_easy_pool_fraction + max_hard_pool_fraction >= 1.0, so the floor can never be configured away.
test_buffer_cap_recycles_and_never_drains checks the runtime behavior. With the default 0.5 / 0.4 caps, 20 consecutive "master everything" passes never drain normal, and sampling keeps working.

uv run pytest tests/unit/orchestrator/test_buffer.py -v

All 10 tests in the file pass.

Note

convert_to_normal in Buffer.load is intentionally kept. The resume-time reshuffle off easy_fraction/hard_fraction is orthogonal to the per-step caps.
The raise in sample_examples is intentionally kept as a canary. With the hard floor in place, a truly empty normal would point to a separate bug worth surfacing loudly.
Defaults are 0.5 / 0.4. The config rejects any pair summing to 1.0 or more, so at least one task always stays in normal.

Changes


max_easy_pool_fraction / max_hard_pool_fraction on BufferConfig (defaults 0.5 / 0.4): cap each sidelined pool as a share of the env's tasks; config rejects max_easy + max_hard >= 1.0
_recycle_overflow in _EnvBuffer: recycles the oldest sidelined task back to normal whenever a pool exceeds its cap, called on every eviction in update_pools
_classify extracted from update_pools: pulls the reward-to-pool decision into its own helper, with no behavior change
Regression tests: config rejects pools summing to >= 1.0; default caps recycle so normal never drains
Docs: document the pool caps in algorithms.md

Note

Medium Risk
Changes core orchestrator sampling/curriculum behavior when difficulty pools are enabled; mitigated by config validation and unit tests, but training mix may differ from pre-fix runs.

Overview
Fixes a one-way drain of the orchestrator difficulty buffer: with easy_threshold / hard_threshold enabled, tasks could leave the normal pool forever and sample_examples could raise No environments left with examples.

Adds max_easy_pool_fraction and max_hard_pool_fraction on BufferConfig (defaults 0.5 / 0.4), with validation that their sum is < 1.0. On each eviction in update_pools, _recycle_overflow FIFO-recycles the oldest sidelined task back into normal when a pool exceeds floor(num_total × fraction).

Pool classification is moved to _classify; docs/algorithms.md documents the caps. New unit tests cover invalid config and repeated “master all tasks” without draining normal.

Reviewed by Cursor Bugbot for commit d376f2b. Bugbot is set up for automated code reviews on this repo.

- add max_easy_fraction / max_hard_fraction caps to BufferConfig (default 0.5)
- recycle oldest sidelined task back to normal when a pool exceeds its cap
- extract _classify() and _recycle_overflow() from update_pools for clarity
- add regression tests: drain reproduces without cap, never occurs with cap
- document pool caps in algorithms.md

- enforce max_easy_fraction + max_hard_fraction < 1.0 in BufferConfig validator
- use math.floor in _recycle_overflow so the combined cap stays strictly below num_total
- adjust max_hard_fraction default from 0.5 to 0.4 to satisfy the new constraint
- replace behavioral drain test with config-level validator test


tonywu71 added 4 commits May 29, 2026 12:00

chore: trim verbose comments in buffer and docs

dffbd0b

refactor: rename max_easy/hard_fraction to max_easy/hard_pool_fraction

d376f2b

Disambiguates the pool-occupancy caps from the resume-time easy_fraction / hard_fraction recycle knobs.

rasdani requested review from faresobeid, mikasenghaas, rasdani and samsja May 29, 2026 18:00

tonywu71 wants to merge 4 commits into PrimeIntellect-ai:main from tonywu71:fix-buffer-drain
