feat(orchestrator): per-env advantage strategy #2721
Open
hallerite wants to merge 1 commit into
Advantage was a single global config applied to every training env. Make it configurable per env: each `TrainEnvConfig` can set its own `advantage`, inheriting the top-level `orchestrator.advantage` when unset (same pattern as `group_size`). `TrainSink` resolves one advantage fn per env and applies it in `process_group` via `self.advantage_fns[env_name]`. The advantage/length-penalty config classes move above `EnvConfig` so `TrainEnvConfig` can reference `AdvantageConfig` (the module has no `from __future__ import annotations`, so annotations evaluate eagerly).
Behavior-preserving: with the default config every env inherits the global default, producing identical advantages.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Make advantage computation configurable per training environment instead of a single global setting.
Why
Mixed-env runs currently share one advantage function. Different envs often want different advantage computation — plain GRPO for one, a custom or length-penalized advantage for another. This lets each env carry its own.
Changes
- `TrainEnvConfig` gains an `advantage` field (`default`/`custom`), inheriting the top-level `[orchestrator.advantage]` when unset, the same inheritance pattern already used for `group_size`.
- `TrainSink` builds one advantage fn per env and applies it in `process_group` via `self.advantage_fns[env_name]`; the global `advantage_config` param and its call-site arg are removed.
- The advantage/length-penalty config classes move above `EnvConfig` so `TrainEnvConfig` can reference `AdvantageConfig` (the module doesn't use `from __future__ import annotations`, so annotations evaluate eagerly).
- The per-env override pattern is documented in `docs/algorithms.md`.
Behavior
Behavior-preserving: with a default config, every env inherits the global default → identical advantages to before.
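The inheritance rule can be illustrated with a minimal sketch. It uses plain dataclasses rather than the project's Pydantic models, and the field names `type` and `length_penalty`, the `envs` list, and the `resolve_advantage` method are assumptions made for illustration, not the actual schema touched by this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AdvantageConfig:
    # Hypothetical fields; the real config may expose different knobs.
    type: str = "default"
    length_penalty: float = 0.0


@dataclass
class TrainEnvConfig:
    name: str
    # None means "inherit the top-level orchestrator advantage".
    advantage: Optional[AdvantageConfig] = None


@dataclass
class OrchestratorConfig:
    advantage: AdvantageConfig = field(default_factory=AdvantageConfig)
    envs: list[TrainEnvConfig] = field(default_factory=list)

    def resolve_advantage(self) -> None:
        # Hypothetical stand-in for the resolution step: fill unset
        # per-env values from the global default, leave overrides alone.
        for env in self.envs:
            if env.advantage is None:
                env.advantage = self.advantage


config = OrchestratorConfig(
    envs=[
        TrainEnvConfig("env_a"),  # no override, inherits the global config
        TrainEnvConfig("env_b", AdvantageConfig(length_penalty=0.1)),
    ]
)
config.resolve_advantage()
resolved = {env.name: env.advantage for env in config.envs}
```

With no overrides set, every entry in `resolved` is the same global config object, which is the sense in which a default-only config would be behavior-preserving under these assumptions.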
Testing
- `ruff check` + `ruff format --check` clean on edited files.
- `pytest tests/unit/orchestrator/test_advantage.py` → 17 passed; `tests/unit/test_configs.py` → 106 passed.
- … `rollouts_per_example` alias via a config-construction check.
- … `reverse-text` envs with separate advantages, plain GRPO vs length-penalized) each completed 3 steps cleanly (exit 0, `Error 0.0%`, both envs trained, GPUs released).
🤖 Generated with Claude Code
Note
Medium Risk
Changes how advantages are wired for multi-env runs; misconfigured per-env overrides could alter gradients, though single-env default configs stay behavior-preserving.
Overview
Per-env advantage lets mixed training runs use different GRPO advantage strategies per environment instead of one global `[orchestrator.advantage]`.
`TrainEnvConfig` adds an optional `advantage` field (`default`/`custom`, including length-penalty knobs). Unset envs inherit the top-level config in `OrchestratorConfig.resolve_batching`, mirroring `group_size`. Advantage-related Pydantic types are moved above `EnvConfig` so `TrainEnvConfig` can reference them.
`TrainSink` no longer takes a single `advantage_config`; it builds `advantage_fns` keyed by env name at init and calls `assign_advantages(survivors, self.advantage_fns[env_name])` in `process_group`. The orchestrator drops the corresponding constructor argument.
`docs/algorithms.md` documents the TOML pattern for overrides. Default-only configs behave the same as before.
Reviewed by Cursor Bugbot for commit b561ec8.
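The per-env dispatch described above can be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: the `TrainSink` here is a toy class, the advantage functions take plain reward and length lists, and a simple mean-baseline advantage stands in for GRPO (the real implementation may normalize differently).

```python
from statistics import mean
from typing import Callable

# Hypothetical signature: (rewards, completion lengths) -> advantages.
AdvantageFn = Callable[[list[float], list[int]], list[float]]


def mean_baseline_advantage(rewards: list[float], lengths: list[int]) -> list[float]:
    # Simplified stand-in for GRPO: subtract the group mean reward.
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def make_length_penalized(coef: float) -> AdvantageFn:
    # Hypothetical length penalty: subtract coef * length before centering.
    def fn(rewards: list[float], lengths: list[int]) -> list[float]:
        penalized = [r - coef * n for r, n in zip(rewards, lengths)]
        baseline = mean(penalized)
        return [p - baseline for p in penalized]

    return fn


class TrainSink:
    def __init__(self, advantage_fns: dict[str, AdvantageFn]) -> None:
        # One advantage fn per env name, resolved once at init.
        self.advantage_fns = advantage_fns

    def process_group(self, env_name: str, rewards: list[float], lengths: list[int]) -> list[float]:
        # Look up the env's own fn instead of using a single global one.
        return self.advantage_fns[env_name](rewards, lengths)


sink = TrainSink(
    {
        "env_a": mean_baseline_advantage,
        "env_b": make_length_penalized(0.01),
    }
)
plain = sink.process_group("env_a", [1.0, 0.0], [10, 30])
penalized = sink.process_group("env_b", [1.0, 0.0], [10, 30])
```

Keying the functions by env name at construction time means a misspelled or missing env fails with a `KeyError` at the lookup rather than silently falling back to a global default.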