Commit 4419b21

abrichr and claude authored
feat: add dual training backend support (standalone + verl-agent) (#51)
* feat: add dual training backend support (standalone + verl-agent)

  Add `backend` field to GRPOConfig ("standalone" or "verl") to support
  switching between training backends:
  - standalone: existing trainer.py (single-GPU, episode-level rewards)
  - verl: verl-agent/VAGEN integration (multi-GPU, GiGPO per-step credit)

  New verl_backend.py provides build_vagen_config() to map GRPOConfig to a
  VAGEN-compatible config, and train_with_verl() as the integration point
  (placeholder until full end-to-end is wired up).

  No existing function signatures or behavior modified.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: format verl_backend.py with ruff

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 99ae082 commit 4419b21

File tree

4 files changed: +167 -3 lines changed

openadapt_ml/training/grpo/__init__.py (27 additions, 3 deletions)

@@ -4,14 +4,21 @@
 Connects to openadapt-evals RLEnvironment for rollout collection and
 task evaluation against live Windows Agent Arena VMs.
 
+Supports two training backends (set via GRPOConfig.backend):
+- "standalone" (default): Built-in trainer using HuggingFace + PEFT.
+  Good for single-GPU prototyping and debugging. See trainer.py.
+- "verl": Integration with verl-agent/VAGEN for GiGPO and multi-GPU
+  distributed training. See verl_backend.py.
+
 Key components:
-- GRPOConfig: Training configuration dataclass
-- GRPOTrainer: Main training loop
+- GRPOConfig: Training configuration dataclass (includes backend field)
+- GRPOTrainer: Main training loop (standalone backend)
 - GRPORolloutCollector: Collects rollouts via RLEnvironment
 - reward functions: Binary task success + group-relative advantages
 - CoT warm-up: Chain-of-thought SFT before GRPO
+- verl_backend: verl-agent/VAGEN integration (verl backend)
 
-Example:
+Example (standalone):
     from openadapt_ml.training.grpo import GRPOConfig, GRPOTrainer
 
     config = GRPOConfig(
@@ -20,6 +27,17 @@
     )
     trainer = GRPOTrainer(config)
     trainer.train()
+
+Example (verl backend):
+    from openadapt_ml.training.grpo import GRPOConfig
+    from openadapt_ml.training.grpo.verl_backend import train_with_verl
+
+    config = GRPOConfig(
+        backend="verl",
+        task_ids=["notepad_1", "settings_1"],
+        num_training_steps=100,
+    )
+    train_with_verl(config)  # Prints instructions; raises NotImplementedError
 """
 
 from __future__ import annotations
@@ -44,6 +62,10 @@
     build_cot_sft_samples,
     generate_cot_annotations,
 )
+from openadapt_ml.training.grpo.verl_backend import (
+    build_vagen_config,
+    train_with_verl,
+)
 
 __all__ = [
     "GRPOConfig",
@@ -58,4 +80,6 @@
     "format_action_as_text",
     "build_cot_sft_samples",
     "generate_cot_annotations",
+    "build_vagen_config",
+    "train_with_verl",
 ]
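The docstring above describes selecting a backend via GRPOConfig.backend. A minimal sketch of how a caller might dispatch on that field (MiniConfig and select_backend are hypothetical stand-ins for illustration, not part of the commit):

```python
from dataclasses import dataclass, field


@dataclass
class MiniConfig:
    """Stripped-down stand-in for GRPOConfig (hypothetical)."""
    backend: str = "standalone"
    task_ids: list[str] = field(default_factory=list)


def select_backend(config: MiniConfig) -> str:
    """Dispatch on the backend field, mirroring how a caller might
    choose between GRPOTrainer and train_with_verl()."""
    if config.backend == "standalone":
        # Real code would run GRPOTrainer(config).train()
        return "GRPOTrainer"
    if config.backend == "verl":
        # Real code would call train_with_verl(config)
        return "train_with_verl"
    raise ValueError(f"Unknown backend: {config.backend!r}")


print(select_backend(MiniConfig(backend="verl", task_ids=["notepad_1"])))  # train_with_verl
```

Since no existing signatures changed in this commit, a dispatch like this can sit entirely in calling code.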

openadapt_ml/training/grpo/config.py (11 additions, 0 deletions)

@@ -2,6 +2,11 @@
 
 Follows the same pattern as TRLTrainingConfig in trl_trainer.py, with
 additional fields for GRPO-specific hyperparameters and environment setup.
+
+Supports two training backends:
+- "standalone" (default): Built-in GRPO trainer using HuggingFace + PEFT.
+- "verl": Integration point for verl-agent/VAGEN, which provides GiGPO
+  and multi-GPU support. See verl_backend.py for details.
 """
 
 from __future__ import annotations
@@ -16,6 +21,9 @@ class GRPOConfig:
     Groups model/LoRA defaults with TRLTrainingConfig for consistency.
 
     Attributes:
+        backend: Training backend to use. "standalone" for the built-in
+            HuggingFace + PEFT trainer, or "verl" for verl-agent/VAGEN
+            integration (requires separate installation).
         model_name: HuggingFace model identifier.
         load_in_4bit: Whether to use 4-bit quantization.
         lora_r: LoRA rank.
@@ -32,6 +40,9 @@ class GRPOConfig:
         stuck_window: Number of identical screenshots before early termination.
     """
 
+    # Backend: "standalone" (built-in HF+PEFT) or "verl" (verl-agent/VAGEN)
+    backend: str = "standalone"
+
     # Model
     model_name: str = "Qwen/Qwen2.5-VL-7B-Instruct"
     load_in_4bit: bool = True
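The new backend field is a plain str with no validation in this commit; a typo like "standalne" would only surface later. One way callers could guard against that is eager validation in a dataclass __post_init__ (a sketch; BackendConfig and VALID_BACKENDS are illustrative names, not part of the commit):

```python
from dataclasses import dataclass

# Assumption: the two values documented for GRPOConfig.backend
VALID_BACKENDS = ("standalone", "verl")


@dataclass
class BackendConfig:
    """Illustrative subset of GRPOConfig with eager backend validation."""
    backend: str = "standalone"

    def __post_init__(self) -> None:
        # Fail fast at construction time rather than deep in training setup.
        if self.backend not in VALID_BACKENDS:
            raise ValueError(
                f"backend must be one of {VALID_BACKENDS}, got {self.backend!r}"
            )
```

Dataclasses call __post_init__ automatically after field assignment, so misconfigured values are rejected before any model loading starts.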

openadapt_ml/training/grpo/trainer.py (4 additions, 0 deletions)

@@ -1,5 +1,9 @@
 """Minimal GRPO trainer bridging TRL/HuggingFace and openadapt-evals RLEnvironment.
 
+Note: This is the "standalone" backend. For the verl-agent backend (recommended
+for production training with GiGPO and multi-GPU support), see verl_backend.py
+or use the VAGEN training config in openadapt-evals/configs/train_waa_vagen.yaml.
+
 Uses REINFORCE with group-relative advantages (equivalent to single-epoch GRPO).
 The policy_gradient_loss function includes PPO-style clipping for future multi-epoch
 support, but with the current single-epoch design (old_logps == current_logps),
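The trainer docstring mentions group-relative advantages over binary task-success rewards. A minimal sketch of that baseline step (mean-only; GRPO implementations often also divide by the group's reward std, omitted here; the function name is illustrative):

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout = its reward minus the group mean.

    With binary task-success rewards, successes in a mixed group get
    positive advantage and failures get negative advantage; a group
    that is all-success or all-failure yields zero advantage everywhere.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```

This is why all-success and all-failure groups contribute no gradient signal under REINFORCE with this baseline: every advantage in the group is zero.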
openadapt_ml/training/grpo/verl_backend.py (new file, 125 additions)

"""verl-agent / VAGEN backend for GRPO training.

This module provides the integration point for training via verl-agent
(https://github.com/VAGEN), which offers:
- GiGPO (Generalized Group Relative Policy Optimization)
- Multi-GPU distributed training via veRL
- Desktop environment integration via WAADesktopEnv

The actual training loop is managed by verl-agent's own training script,
not by our GRPOTrainer. This module builds the VAGEN-compatible config
from our GRPOConfig and documents how to run training.

Usage:
    To train with the verl backend, set backend="verl" in GRPOConfig.
    The train_with_verl() function will print instructions and raise
    NotImplementedError until full integration is wired up.

    For now, training with verl-agent should be done via:
    1. Generate a VAGEN config: train_with_verl(config)
    2. Run verl-agent's training script with that config

See also:
    - openadapt-evals/configs/train_waa_vagen.yaml
    - docs/verl_agent_decision.md (if available)
"""

from __future__ import annotations

import logging
from typing import Any

from openadapt_ml.training.grpo.config import GRPOConfig

logger = logging.getLogger(__name__)

# Deferred import for openadapt-evals WAADesktopEnv (optional dependency)
try:
    from openadapt_evals.adapters.verl_env import WAADesktopEnv
except ImportError:
    WAADesktopEnv = None  # type: ignore[assignment, misc]


def build_vagen_config(config: GRPOConfig) -> dict[str, Any]:
    """Build a VAGEN-compatible config dict from GRPOConfig.

    Maps our config fields to the structure expected by verl-agent's
    training script. This dict can be serialized to YAML for use with
    VAGEN's CLI.

    Args:
        config: Our GRPO training configuration.

    Returns:
        Dict matching VAGEN's expected config structure.
    """
    return {
        "model": {
            "name": config.model_name,
            "load_in_4bit": config.load_in_4bit,
            "lora_r": config.lora_r,
            "lora_alpha": config.lora_alpha,
        },
        "training": {
            "learning_rate": config.learning_rate,
            "num_training_steps": config.num_training_steps,
            "save_every_steps": config.save_every_steps,
            "output_dir": config.output_dir,
            "num_rollouts_per_step": config.num_rollouts_per_step,
            "temperature": config.temperature,
        },
        "environment": {
            "type": "waa_desktop",
            "server_url": config.server_url,
            "task_ids": config.task_ids,
            "max_steps_per_episode": config.max_steps_per_episode,
            "screen_size": list(config.screen_size),
            "stuck_window": config.stuck_window,
        },
    }


def train_with_verl(config: GRPOConfig) -> None:
    """Entry point for verl-agent backend training.

    Currently a placeholder that documents the integration point.
    The actual training happens via verl-agent's own CLI/training script,
    not through this function.

    Args:
        config: GRPO training configuration with backend="verl".

    Raises:
        NotImplementedError: Always, until full verl-agent integration
            is wired up. The error message includes instructions for
            running training via verl-agent directly.
    """
    vagen_config = build_vagen_config(config)

    if WAADesktopEnv is not None:
        logger.info(
            "WAADesktopEnv is available. verl-agent can use it for "
            "desktop environment interaction."
        )
    else:
        logger.warning(
            "WAADesktopEnv not found. Install openadapt-evals to enable "
            "desktop environment support: uv add openadapt-evals"
        )

    logger.info("VAGEN config built from GRPOConfig:")
    logger.info("  Model: %s", vagen_config["model"]["name"])
    logger.info("  Tasks: %s", vagen_config["environment"]["task_ids"])
    logger.info("  Steps: %d", vagen_config["training"]["num_training_steps"])
    logger.info("")
    logger.info(
        "To train with verl-agent, use the VAGEN training script with "
        "a config derived from the above. Example:"
    )
    logger.info("  python -m vagen.train --config configs/train_waa_vagen.yaml")

    raise NotImplementedError(
        "verl-agent training requires running via VAGEN's training script. "
        "See docs/verl_agent_decision.md for setup instructions. "
        "Use build_vagen_config() to generate a compatible config dict."
    )
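build_vagen_config() returns a nested dict of plain JSON/YAML-serializable values, which is what makes the "serialize to YAML for VAGEN's CLI" step in its docstring straightforward. A sketch of that round trip using only the stdlib (the dict below is hand-written with illustrative values, not real output; yaml.safe_dump would be the usual choice when PyYAML is installed):

```python
import json

# Hypothetical shape of a build_vagen_config() result; values are
# illustrative defaults, not taken from a real GRPOConfig instance.
vagen_config = {
    "model": {
        "name": "Qwen/Qwen2.5-VL-7B-Instruct",
        "load_in_4bit": True,
        "lora_r": 16,
        "lora_alpha": 32,
    },
    "training": {"learning_rate": 1e-5, "num_training_steps": 100},
    "environment": {"type": "waa_desktop", "task_ids": ["notepad_1"]},
}

# Because every leaf is a str/bool/int/float/list, the dict survives a
# full serialize/deserialize round trip unchanged.
serialized = json.dumps(vagen_config, indent=2)
roundtrip = json.loads(serialized)
assert roundtrip == vagen_config
print(roundtrip["environment"]["type"])  # waa_desktop
```

Writing the serialized form to a file would give verl-agent's training script a config to consume, per the instructions train_with_verl() logs.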
