feat(orchestrator): per-env advantage strategy by hallerite · Pull Request #2721 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-05T13:42:22Z

What

Make advantage computation configurable per training environment instead of a single global setting.

Why

Mixed-env runs currently share one advantage function. Different envs often want different advantage computation — plain GRPO for one, a custom or length-penalized advantage for another. This lets each env carry its own.

Changes

- TrainEnvConfig gains an advantage field (default / custom), inheriting the top-level [orchestrator.advantage] when unset, the same inheritance pattern already used for group_size.
- TrainSink builds one advantage fn per env and applies it in process_group via self.advantage_fns[env_name]; the global advantage_config param and its call-site arg are removed.
- Moved the advantage / length-penalty config classes above EnvConfig so TrainEnvConfig can reference AdvantageConfig (the module doesn't use from __future__ import annotations, so annotations evaluate eagerly).
- Doc note in docs/algorithms.md.
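The inheritance rule in the first item can be sketched roughly as follows. This is a simplified illustration using stdlib dataclasses rather than the actual prime-rl Pydantic classes; the `resolve_advantage` method name, the `envs` field, and the field layout are assumptions made for the example, not the real implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AdvantageConfig:
    # Hypothetical stand-in: only `type` and `import_path` appear in the PR's TOML example.
    type: str = "default"
    import_path: Optional[str] = None


@dataclass
class TrainEnvConfig:
    id: str
    # None means "not set on this env", so the top-level value is inherited.
    advantage: Optional[AdvantageConfig] = None


@dataclass
class OrchestratorConfig:
    advantage: AdvantageConfig = field(default_factory=AdvantageConfig)
    envs: list[TrainEnvConfig] = field(default_factory=list)

    def resolve_advantage(self) -> None:
        """Fill in each env's advantage from the top-level config when unset."""
        for env in self.envs:
            if env.advantage is None:
                env.advantage = self.advantage


config = OrchestratorConfig(
    envs=[
        TrainEnvConfig(id="math-env"),  # inherits the top-level default
        TrainEnvConfig(
            id="agent-env",
            advantage=AdvantageConfig(
                type="custom", import_path="my_module.normalized_advantage"
            ),
        ),
    ]
)
config.resolve_advantage()
resolved = {env.id: env.advantage.type for env in config.envs}
```

After resolution every env carries a concrete advantage config, so downstream code never needs to consult the global setting.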

Behavior

Behavior-preserving: with a default config, every env inherits the global default → identical advantages to before.

[orchestrator.advantage]
type = "default"            # inherited by every env unless overridden

[[orchestrator.train.env]]
id = "math-env"             # inherits the default

[[orchestrator.train.env]]
id = "agent-env"
advantage = { type = "custom", import_path = "my_module.normalized_advantage" }
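For readers unfamiliar with `import_path`-style settings, a dotted path like the one above is typically resolved to a callable at startup. The helper below is a generic sketch of that pattern using only the standard library; `load_by_import_path` is a hypothetical name, and how prime-rl actually resolves the path is not shown in this PR.

```python
import importlib
from typing import Callable


def load_by_import_path(import_path: str) -> Callable:
    """Resolve 'package.module.attr' to the attribute it names."""
    module_name, _, attr_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a stdlib function, since my_module is not available here.
mean_fn = load_by_import_path("statistics.mean")
```

With the config above, the equivalent call would take `"my_module.normalized_advantage"`, provided `my_module` is importable from the orchestrator's environment.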

Testing

- ruff check + ruff format --check clean on edited files.
- pytest tests/unit/orchestrator/test_advantage.py → 17 passed; tests/unit/test_configs.py → 106 passed.
- Verified per-env inheritance, per-env override, and the rollouts_per_example alias via a config-construction check.
- Validated end-to-end on 2× RTX PRO 6000 (Blackwell): a single-env run (one env with a length-penalized advantage) and a multi-env run (two reverse-text envs with separate advantages: plain GRPO vs length-penalized) each completed 3 steps cleanly (exit 0, Error 0.0%, both envs trained, GPUs released).

🤖 Generated with Claude Code

Note

Medium Risk
Changes how advantages are wired for multi-env runs; misconfigured per-env overrides could alter gradients, though single-env default configs stay behavior-preserving.

Overview
Per-env advantage lets mixed training runs use different GRPO advantage strategies per environment instead of one global [orchestrator.advantage].

TrainEnvConfig adds an optional advantage field (default / custom, including length-penalty knobs). Unset envs inherit the top-level config in OrchestratorConfig.resolve_batching, mirroring group_size. Advantage-related Pydantic types are moved above EnvConfig so TrainEnvConfig can reference them.

TrainSink no longer takes a single advantage_config; it builds advantage_fns keyed by env name at init and calls assign_advantages(survivors, self.advantage_fns[env_name]) in process_group. The orchestrator drops the corresponding constructor argument.
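The per-env dispatch described above might look roughly like this. It is a minimal sketch: the advantage function signature (rewards and completion lengths in, advantages out), the mean-baseline stand-in for the default, and the `coef` length penalty are all assumptions for illustration, and the real `TrainSink` and `assign_advantages` take different arguments.

```python
from typing import Callable, Sequence

# Hypothetical signature: per-rollout rewards and completion lengths in,
# per-rollout advantages out. The real prime-rl signature may differ.
AdvantageFn = Callable[[Sequence[float], Sequence[int]], list[float]]


def mean_baseline(rewards: Sequence[float], lengths: Sequence[int]) -> list[float]:
    """Stand-in for the default: reward minus the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def length_penalized(
    rewards: Sequence[float], lengths: Sequence[int], coef: float = 0.01
) -> list[float]:
    """Illustrative variant: subtract coef * length before applying the baseline."""
    shaped = [r - coef * n for r, n in zip(rewards, lengths)]
    return mean_baseline(shaped, lengths)


class TrainSink:
    def __init__(self, advantage_fns: dict[str, AdvantageFn]) -> None:
        # One advantage function per env name, built once at init.
        self.advantage_fns = advantage_fns

    def process_group(
        self, env_name: str, rewards: Sequence[float], lengths: Sequence[int]
    ) -> list[float]:
        # Look up the env's own function instead of a single global one.
        return self.advantage_fns[env_name](rewards, lengths)


sink = TrainSink({"math-env": mean_baseline, "agent-env": length_penalized})
math_adv = sink.process_group("math-env", [1.0, 0.0], [10, 20])
agent_adv = sink.process_group("agent-env", [1.0, 0.0], [10, 20])
```

Because the mapping is keyed by env name, a group from an env that is missing from the mapping fails loudly with a `KeyError` rather than silently falling back to another env's strategy.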

docs/algorithms.md documents the TOML pattern for overrides. Default-only configs behave the same as before.

Reviewed by Cursor Bugbot for commit b561ec8. Bugbot is set up for automated code reviews on this repo.

Advantage was a single global config applied to every training env. Make it configurable per env: each `TrainEnvConfig` can set its own `advantage`, inheriting the top-level `orchestrator.advantage` when unset (same pattern as `group_size`). `TrainSink` resolves one advantage fn per env and applies it in `process_group` via `self.advantage_fns[env_name]`. The advantage/length-penalty config classes move above `EnvConfig` so `TrainEnvConfig` can reference `AdvantageConfig` (the module has no `from __future__ import annotations`, so annotations evaluate eagerly). Behavior-preserving: with the default config every env inherits the global default, producing identical advantages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hallerite marked this pull request as ready for review June 5, 2026 13:43

hallerite mentioned this pull request Jun 5, 2026

feat(orchestrator): per-env sample strategy + env-mix seam #2722

Draft

hallerite wants to merge 1 commit into main from feat/per-env-advantage
