
feat(orchestrator): per-env advantage strategy #2721

Open
hallerite wants to merge 1 commit into main from feat/per-env-advantage

@hallerite hallerite commented Jun 5, 2026

What

Make advantage computation configurable per training environment instead of a single global setting.

Why

Mixed-env runs currently share one advantage function, but different envs often want different advantage computation: plain GRPO for one, a custom or length-penalized advantage for another. This change lets each env carry its own advantage config.

Changes

  • TrainEnvConfig gains an advantage field (default / custom), inheriting the top-level [orchestrator.advantage] when unset — the same inheritance pattern already used for group_size.
  • TrainSink builds one advantage fn per env and applies it in process_group via self.advantage_fns[env_name]; the global advantage_config param and its call-site arg are removed.
  • Moved the advantage / length-penalty config classes above EnvConfig so TrainEnvConfig can reference AdvantageConfig (the module doesn't use from __future__ import annotations, so annotations evaluate eagerly).
  • Doc note in docs/algorithms.md.
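
The inheritance rule in the first bullet can be sketched roughly as follows. This is a minimal illustration using plain dataclasses rather than the project's Pydantic models; the field layout and the `resolve_advantage` method name are assumptions made for the example (the PR resolves inheritance inside `OrchestratorConfig.resolve_batching`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdvantageConfig:
    # Stand-in for the real config class; only `type` and `import_path`
    # appear in this PR's TOML example.
    type: str = "default"
    import_path: Optional[str] = None


@dataclass
class TrainEnvConfig:
    id: str
    # None means "not set in the config file": inherit the top-level value.
    advantage: Optional[AdvantageConfig] = None


@dataclass
class OrchestratorConfig:
    advantage: AdvantageConfig = field(default_factory=AdvantageConfig)
    envs: List[TrainEnvConfig] = field(default_factory=list)

    def resolve_advantage(self) -> None:
        # Hypothetical helper: fill in every env that did not override.
        for env in self.envs:
            if env.advantage is None:
                env.advantage = self.advantage


cfg = OrchestratorConfig(
    envs=[
        TrainEnvConfig("math-env"),
        TrainEnvConfig(
            "agent-env",
            AdvantageConfig("custom", "my_module.normalized_advantage"),
        ),
    ]
)
cfg.resolve_advantage()
```

After resolution every env carries a concrete advantage config, so downstream code never needs to consult the global setting.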

Behavior

Behavior-preserving: with a default config, every env inherits the global default → identical advantages to before.

[orchestrator.advantage]
type = "default"            # inherited by every env unless overridden

[[orchestrator.train.env]]
id = "math-env"             # inherits the default

[[orchestrator.train.env]]
id = "agent-env"
advantage = { type = "custom", import_path = "my_module.normalized_advantage" }
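
For readers unfamiliar with dotted `import_path` values like the one above, the sketch below shows one common way such a string can be turned into a callable with `importlib`. The helper name `load_callable` is hypothetical, and the PR does not show how the orchestrator actually resolves the path or what signature it expects of the function; `statistics.mean` is used only because `my_module` is not importable here.

```python
import importlib
from typing import Callable


def load_callable(import_path: str) -> Callable:
    """Resolve a dotted path such as "package.module.func" to the object it names."""
    module_name, _, attr = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {import_path!r}")
    module = importlib.import_module(module_name)
    fn = getattr(module, attr)
    if not callable(fn):
        raise TypeError(f"{import_path!r} does not name a callable")
    return fn


# Stand-in target from the standard library.
mean_fn = load_callable("statistics.mean")
```

Failing early on a malformed path or a non-callable target gives a clearer error at config-load time than a failure in the middle of training.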

Testing

  • ruff check + ruff format --check clean on edited files.
  • pytest tests/unit/orchestrator/test_advantage.py → 17 passed; tests/unit/test_configs.py → 106 passed.
  • Verified per-env inheritance, per-env override, and the rollouts_per_example alias via a config-construction check.
  • Validated end-to-end on 2× RTX PRO 6000 (Blackwell): a single-env run (one env with a length-penalized advantage) and a multi-env run (two reverse-text envs with separate advantages — plain GRPO vs length-penalized) each completed 3 steps cleanly (exit 0, Error 0.0%, both envs trained, GPUs released).

🤖 Generated with Claude Code


Note

Medium Risk
Changes how advantages are wired for multi-env runs; misconfigured per-env overrides could alter gradients, though single-env default configs stay behavior-preserving.

Overview
Per-env advantage lets mixed training runs use different GRPO advantage strategies per environment instead of one global [orchestrator.advantage].

TrainEnvConfig adds an optional advantage field (default / custom, including length-penalty knobs). Unset envs inherit the top-level config in OrchestratorConfig.resolve_batching, mirroring group_size. Advantage-related Pydantic types are moved above EnvConfig so TrainEnvConfig can reference them.

TrainSink no longer takes a single advantage_config; it builds advantage_fns keyed by env name at init and calls assign_advantages(survivors, self.advantage_fns[env_name]) in process_group. The orchestrator drops the corresponding constructor argument.
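
The per-env dispatch described above might look roughly like this. It is a simplified sketch: the real `assign_advantages` operates on rollout objects, whereas this version works on bare reward lists, and `TrainSinkSketch`, `grpo_advantage`, and `scaled_advantage` are hypothetical names. The GRPO variant shown subtracts the group mean only; some implementations also divide by the group standard deviation.

```python
from statistics import mean
from typing import Callable, Dict, List

AdvantageFn = Callable[[List[float]], List[float]]


def grpo_advantage(rewards: List[float]) -> List[float]:
    # Group-relative baseline: reward minus the mean reward of the group.
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def scaled_advantage(rewards: List[float]) -> List[float]:
    # Hypothetical stand-in for a custom per-env strategy.
    return [0.5 * a for a in grpo_advantage(rewards)]


class TrainSinkSketch:
    def __init__(self, advantage_fns: Dict[str, AdvantageFn]) -> None:
        # One advantage function per env name, built once at init.
        self.advantage_fns = advantage_fns

    def process_group(self, env_name: str, rewards: List[float]) -> List[float]:
        # Look up the env's own strategy instead of a single global one.
        return self.advantage_fns[env_name](rewards)


sink = TrainSinkSketch(
    {"math-env": grpo_advantage, "agent-env": scaled_advantage}
)
math_adv = sink.process_group("math-env", [1.0, 0.0, 0.0, 1.0])
agent_adv = sink.process_group("agent-env", [1.0, 0.0, 0.0, 1.0])
```

Keying the mapping by env name means an env missing from it raises a `KeyError` immediately, which surfaces a wiring mistake instead of silently falling back to another strategy.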

docs/algorithms.md documents the TOML pattern for overrides. Default-only configs behave the same as before.

Reviewed by Cursor Bugbot for commit b561ec8.

Advantage was a single global config applied to every training env. Make it
configurable per env: each `TrainEnvConfig` can set its own `advantage`,
inheriting the top-level `orchestrator.advantage` when unset (same pattern as
`group_size`). `TrainSink` resolves one advantage fn per env and applies it in
`process_group` via `self.advantage_fns[env_name]`.

The advantage/length-penalty config classes move above `EnvConfig` so
`TrainEnvConfig` can reference `AdvantageConfig` (the module has no
`from __future__ import annotations`, so annotations evaluate eagerly).

Behavior-preserving: with the default config every env inherits the global
default, producing identical advantages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite marked this pull request as ready for review June 5, 2026 13:43