
Commit 0183321

abrichr and claude authored
feat: add VAGEN/verl-agent environment adapter for VLM RL training
* feat: add VAGEN/verl-agent environment adapter for VLM RL training

  WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling
  desktop GUI automation training with verl-agent's multi-turn VLM RL
  pipeline (GiGPO, GRPO, PPO).

  The adapter translates between openadapt-evals BenchmarkObservation
  (PNG bytes + a11y tree) and VAGEN's observation format
  (obs_str + multi_modal_input with PIL images).

  - Async interface (reset/step/close/system_prompt)
  - Action DSL parsing (CLICK, TYPE, KEY, SCROLL, WAIT, DONE)
  - Fractional coordinate support (0.0-1.0)
  - Lazy adapter initialization
  - 21 tests passing with mock adapter
  - Example VAGEN training config included

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add comprehensive verl-agent decision document

  Records the full reasoning chain for choosing verl-agent/VAGEN:
  - Framework comparison (TRL, standalone, verl-agent, VAGEN, OpenRLHF, Unsloth)
  - Key insight: per-step verification via GiGPO for long-horizon GUI tasks
  - TRL multi-turn VLM blocker (issues #5119, #5120)
  - "Environment is the moat" strategic framing
  - Architecture diagram and migration path

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add verl-agent as optional dependency

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: vendor GymImageEnv base classes from VAGEN

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: fact-check framework review in verl decision doc

  Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix
  with accurate details from thorough review:
  - OpenRLHF: document AgentTrainer multi-turn support and OpenRLHF-M fork
  - Unsloth: nuanced assessment; single-turn VLM works, multi-turn text
    via ART works, but multi-turn VLM blocked by rollout_func issue (#3573)
  - TRL: add note about OpenEnv/rollout_func for text models (VLM blocked)
  - Comparison matrix: add Unsloth column with footnotes

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
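
The commit message names the action vocabulary (CLICK, TYPE, KEY, SCROLL, WAIT, DONE) and fractional coordinates in the 0.0-1.0 range, but this page does not show the parser itself. The sketch below is one way such parsing might look. The call-style syntax (`CLICK(x=0.5, y=0.25)`), the `ParsedAction` type, and the helper names `parse_action` and `to_pixels` are all assumptions for illustration, not the adapter's actual code.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Action names are taken from the commit message. The call-style syntax
# (e.g. CLICK(x=0.5, y=0.25)) is an ASSUMPTION made for this sketch.
ACTION_NAMES = {"CLICK", "TYPE", "KEY", "SCROLL", "WAIT", "DONE"}
_ACTION_RE = re.compile(r"\b([A-Z]+)\((.*?)\)", re.DOTALL)
_ARG_RE = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|[-+]?\d*\.?\d+)')


@dataclass
class ParsedAction:
    """Hypothetical container for one parsed action."""
    name: str
    args: Dict[str, object] = field(default_factory=dict)


def parse_action(text: str) -> ParsedAction:
    """Return the first recognised action found in a model response."""
    for match in _ACTION_RE.finditer(text):
        name, raw_args = match.group(1), match.group(2)
        if name not in ACTION_NAMES:
            continue
        args: Dict[str, object] = {}
        for key, value in _ARG_RE.findall(raw_args):
            # Quoted values stay strings; everything else is numeric.
            args[key] = value[1:-1] if value.startswith('"') else float(value)
        return ParsedAction(name, args)
    raise ValueError(f"no recognised action in: {text!r}")


def to_pixels(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Map fractional (0.0-1.0) coordinates to pixel coordinates."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"fractional coordinates out of range: ({x}, {y})")
    # Clamp so that 1.0 maps to the last valid pixel, not one past it.
    return min(int(x * width), width - 1), min(int(y * height), height - 1)


action = parse_action("I will CLICK(x=0.5, y=0.25) on the button")
pixel = to_pixels(action.args["x"], action.args["y"], 1920, 1080)
```

The non-greedy argument match stops at the first closing parenthesis, so text arguments containing `)` would need a more careful grammar than this sketch provides.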
1 parent ffcb41d commit 0183321

9 files changed

Lines changed: 1203 additions & 0 deletions

configs/train_waa_vagen.yaml

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+# VAGEN training config for WAA desktop automation
+#
+# This trains a VLM (e.g., Qwen2.5-VL-3B) to automate Windows desktop tasks
+# using GRPO/GiGPO via the verl-agent framework.
+#
+# Prerequisites:
+#   1. WAA server running (via SSH tunnel): ssh -L 5001:localhost:5050 azureuser@<VM_IP>
+#   2. VAGEN installed: pip install vagen
+#   3. Register env: add to vagen's env_registry.yaml:
+#        WAADesktop: openadapt_evals.adapters.verl_env.WAADesktopEnv
+#
+# Usage:
+#   python -m vagen.train --config configs/train_waa_vagen.yaml
+#
+# For mock testing (no VM):
+#   Set server_url to "mock" and use WAAMockAdapter internally
+
+# --- Model ---
+model:
+  name: Qwen/Qwen2.5-VL-3B-Instruct
+  # For larger models with LoRA:
+  # name: Qwen/Qwen2.5-VL-7B-Instruct
+  # lora:
+  #   r: 16
+  #   alpha: 32
+  #   target_modules: [q_proj, k_proj, v_proj, o_proj]
+
+# --- Environment ---
+envs:
+  - name: WAADesktop
+    n_envs: 8  # Number of parallel environments (= GRPO group size)
+    data_source: waa
+    seed: [1, 100, 1]  # [start, end, step] for task selection
+    max_turns: 15  # Max actions per episode
+    response_length_per_turn: 512
+    config:
+      server_url: "http://localhost:5001"
+      task_id: "REPLACE_WITH_WAA_TASK_UUID"
+      max_steps: 15
+      evaluate_at_done: true
+      action_type: fractional  # VLM outputs normalized 0-1 coordinates
+
+# --- Training (GRPO) ---
+algorithm:
+  name: grpo  # or "gigpo" for step-level advantages
+  kl_coef: 0.0  # No KL penalty (DAPO/Open-Reasoner-Zero style)
+  epsilon: 0.2  # PPO clip range (inactive with single epoch)
+  gamma: 1.0  # No discounting for episodic tasks
+
+trainer:
+  total_epochs: 100
+  n_gpus_per_node: 2  # Minimum for VLM training
+  micro_batch_size: 4
+  gradient_accumulation_steps: 2
+
+# --- Rollout ---
+rollout:
+  temperature: 0.7
+  top_p: 0.95
+  mode: async  # async sglang rollout for throughput
+
+# --- Logging ---
+logging:
+  project: openadapt-waa-rl
+  log_interval: 1
+  save_interval: 10
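
The config leaves a placeholder `task_id` and describes `seed` as `[start, end, step]`, so a small pre-flight check before launching training may be useful. The sketch below works on a plain dict that mirrors the `envs` entry above. The helper names `expand_seeds` and `check_env` are hypothetical, treating `end` as exclusive is an assumption (the config does not say), and so is the idea that `max_turns` and `config.max_steps` are meant to agree.

```python
from typing import Any, Dict, List


def expand_seeds(spec: List[int]) -> List[int]:
    """Expand a [start, end, step] spec. ASSUMPTION: `end` is exclusive."""
    start, end, step = spec
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, end, step))


def check_env(env: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for one entry of the `envs` list."""
    warnings: List[str] = []
    cfg = env.get("config", {})
    if str(cfg.get("task_id", "")).startswith("REPLACE_WITH"):
        warnings.append("task_id is still the placeholder value")
    # ASSUMPTION: the per-episode turn limit and the env step limit should match.
    if env.get("max_turns") != cfg.get("max_steps"):
        warnings.append("max_turns and config.max_steps differ")
    if not expand_seeds(env["seed"]):
        warnings.append("seed spec expands to no seeds")
    return warnings


# Mirrors the single `envs` entry from the YAML above.
env_entry = {
    "name": "WAADesktop",
    "n_envs": 8,
    "seed": [1, 100, 1],
    "max_turns": 15,
    "config": {
        "server_url": "http://localhost:5001",
        "task_id": "REPLACE_WITH_WAA_TASK_UUID",
        "max_steps": 15,
    },
}

for warning in check_env(env_entry):
    print(f"warning: {warning}")
```

Run against the config as committed, the only warning is the unreplaced `task_id`, which is the one edit the file asks for before use.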

docs/verl_agent_decision.md

Lines changed: 328 additions & 0 deletions
Large diffs are not rendered by default.

openadapt_evals/adapters/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -39,6 +39,7 @@
     RLEnvironment,
     RolloutStep,
 )
+from openadapt_evals.adapters.verl_env import WAADesktopEnv
 from openadapt_evals.adapters.waa import (
     WAAAdapter,
     WAAConfig,
@@ -69,6 +70,8 @@
     "WAAMockAdapter",
     "WAALiveAdapter",
     "WAALiveConfig",
+    # verl-agent / VAGEN integration
+    "WAADesktopEnv",
     # Task ID validation
     "SyntheticTaskError",
     "is_real_waa_task_id",
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+"""Vendored pure-abstract base classes from VAGEN.
+
+These are copied from https://github.com/mll-lab-nu/VAGEN so that
+openadapt-evals can implement the GymImageEnv protocol without
+requiring the full VAGEN package (and its heavy transitive
+dependencies) to be installed.
+
+Only the abstract interface definitions are vendored here -- no
+concrete implementations or utilities.
+"""
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# Vendored from https://github.com/mll-lab-nu/VAGEN
+# These are pure abstract base classes with no heavy dependencies.
+# Vendored to avoid requiring the full VAGEN installation.
+# Last synced: 2026-03-02
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from typing import Any, Dict, Tuple
+
+
+class GymBaseEnv(ABC):
+    """
+    Abstract async environment API.
+    The handler does not assume any obs/data schema beyond what you return.
+
+    Contract:
+      - reset(seed) -> (obs, info)
+      - step(action_str) -> (obs, reward, done, info)
+    """
+
+    def __init__(self, env_config: Dict[str, Any]):
+        self.config = env_config
+
+    @abstractmethod
+    async def close(self) -> None:
+        """Async teardown."""
+        raise NotImplementedError
+
+    @abstractmethod
+    async def reset(self, seed: int):
+        raise NotImplementedError
+
+    @abstractmethod
+    async def step(self, action_str: str):
+        raise NotImplementedError
+
+    @abstractmethod
+    async def system_prompt(self):
+        raise NotImplementedError
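
To make the `reset(seed) -> (obs, info)` and `step(action_str) -> (obs, reward, done, info)` contract concrete, here is a minimal sketch of a subclass and a driver loop. `CountdownEnv` and `rollout` are hypothetical names invented for this example, and the `GymBaseEnv` stand-in is redefined locally only so that the snippet runs without the vendored package.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class GymBaseEnv(ABC):
    """Local stand-in mirroring the vendored abstract class."""

    def __init__(self, env_config: Dict[str, Any]):
        self.config = env_config

    @abstractmethod
    async def close(self) -> None: ...

    @abstractmethod
    async def reset(self, seed: int): ...

    @abstractmethod
    async def step(self, action_str: str): ...

    @abstractmethod
    async def system_prompt(self): ...


class CountdownEnv(GymBaseEnv):
    """Hypothetical toy env: succeed by replying DONE before steps run out."""

    async def system_prompt(self):
        return {"obs_str": "Reply DONE to finish."}

    async def reset(self, seed: int):
        self.steps_left = self.config.get("max_steps", 3)
        return {"obs_str": f"steps left: {self.steps_left}"}, {"seed": seed}

    async def step(self, action_str: str):
        self.steps_left -= 1
        success = action_str.strip() == "DONE"
        done = success or self.steps_left <= 0
        obs = {"obs_str": f"steps left: {self.steps_left}"}
        return obs, float(success), done, {"success": success}

    async def close(self) -> None:
        pass


async def rollout(env: GymBaseEnv, actions: List[str]):
    """Drive one episode with a fixed action list; return (total_reward, info)."""
    await env.system_prompt()
    _obs, info = await env.reset(seed=0)
    total = 0.0
    for action in actions:
        _obs, reward, done, info = await env.step(action)
        total += reward
        if done:
            break
    await env.close()
    return total, info


result = asyncio.run(rollout(CountdownEnv({"max_steps": 3}), ["WAIT", "DONE"]))
```

Because every method is abstract, instantiating the base class directly raises `TypeError`, which is what forces adapters to implement all four coroutines.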
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
+# Vendored from https://github.com/mll-lab-nu/VAGEN
+# These are pure abstract base classes with no heavy dependencies.
+# Vendored to avoid requiring the full VAGEN installation.
+# Last synced: 2026-03-02
+
+from __future__ import annotations
+
+from abc import abstractmethod
+from typing import Any, Dict, Tuple
+
+from .gym_base_env import GymBaseEnv
+
+
+class GymImageEnv(GymBaseEnv):
+    """
+    GymImageEnv is a base environment class that supports optional
+    **image-based multi-modal observations**, while keeping the same API
+    as GymBaseEnv.
+
+    --------------------------------------------------------------------
+    Observation Protocol
+    --------------------------------------------------------------------
+
+    WITH images
+    -------------------------------
+    If the environment returns images, the observation should follow:
+
+        obs = {
+            "obs_str": "... <image> ...",
+            "multi_modal_input": {
+                "<image>": [PIL.Image.Image, ...]
+            }
+        }
+
+    - Images are stored under obs["multi_modal_input"]["<image>"].
+    - "<image>" in obs_str is a placeholder indicating where each image
+      should appear in the prompt.
+    - The number of "<image>" in obs_str should match the number of
+      images in the list.
+
+    WITHOUT images:
+    ----------------------------------
+    Can simply use:
+
+        obs = {
+            "obs_str": "..."
+        }
+
+    - "multi_modal_input" is optional and may be omitted.
+    - obs_str should NOT contain "<image>" placeholders.
+
+
+    --------------------------------------------------------------------
+    Agent-Loop Rollout
+    --------------------------------------------------------------------
+    - sys : system prompt (from system_prompt()).
+    - init_obs : observation from reset().
+    - step_obs : observation from step().
+    - res_i : agent response at step i.
+
+    Concat mode (single growing context):
+        sys + init_obs + res_0 + step_obs_1 + res_1 + ...
+
+    Non-concat mode (step-wise independent contexts):
+        Step 0: sys + init_obs + res_0
+        Step 1: sys + step_obs_1 + res_1
+        Step 2: sys + step_obs_2 + res_2
+
+    --------------------------------------------------------------------
+    Info
+    --------------------------------------------------------------------
+    The `info` dict returned by reset() and step() may include:
+    - success (bool): whether the task/episode is considered
+      successful, this will be used for wandb logging.
+    """
+
+    def __init__(self, env_config: Dict[str, Any]):
+        """
+        Initialize the environment.
+
+        Args:
+            env_config (Dict[str, Any]):
+                Environment configuration. The exact schema is defined by
+                the concrete environment implementation and/or GymBaseEnv.
+
+        Side effects:
+            - Calls GymBaseEnv.__init__(env_config).
+        """
+        super().__init__(env_config)
+
+    @abstractmethod
+    async def close(self) -> None:
+        """
+        Close the environment and release all resources.
+
+        This should clean up anything created by the environment, e.g.:
+        - windows / renderers
+        - subprocesses
+        - file handles
+        - GPU memory / models
+
+        Returns:
+            None
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def system_prompt(self) -> Dict[str, Any]:
+        """
+        Return the system-level prompt/observation for the environment.
+
+        Returns:
+            obs (Dict[str, Any]):
+                A dict representing the system prompt observation.
+
+                If returning images, it must follow:
+
+                    obs = {
+                        "obs_str": "... <image> ...",
+                        "multi_modal_input": {
+                            "<image>": [PIL.Image.Image, ...]
+                        }
+                    }
+
+                If returning no images, it must follow:
+
+                    obs = {
+                        "obs_str": "..."
+                    }
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def reset(self, seed: int) -> Tuple[Dict[str, Any], Dict[str, Any]]:
+        """
+        Reset the environment to the initial state.
+
+        Args:
+            seed (int):
+                Random seed used to initialize the environment
+
+        Returns:
+            obs (Dict[str, Any]):
+                The initial observation after reset.
+
+                If returning images, it must follow:
+
+                    obs = {
+                        "obs_str": "... <image> ...",
+                        "multi_modal_input": {
+                            "<image>": [PIL.Image.Image, ...]
+                        }
+                    }
+
+                If returning no images, it must follow:
+
+                    obs = {
+                        "obs_str": "..."
+                    }
+
+            info (Dict[str, Any]):
+                A dict containing any additional metadata about the reset,
+                e.g. debug information, episode identifiers, etc.
+        """
+        raise NotImplementedError
+
+    @abstractmethod
+    async def step(
+        self, action_str: str
+    ) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
+        """
+        Execute one environment step using an agent-provided action.
+
+        Args:
+            action_str (str):
+                The action produced by the agent, in text form.
+
+        Returns:
+            obs (Dict[str, Any]):
+                The next observation after applying the action.
+
+                If returning images, it must follow:
+
+                    obs = {
+                        "obs_str": "... <image> ...",
+                        "multi_modal_input": {
+                            "<image>": [PIL.Image.Image, ...]
+                        }
+                    }
+
+                If returning no images, it must follow:
+
+                    obs = {
+                        "obs_str": "..."
+                    }
+
+            reward (float):
+                Scalar reward for the current step.
+
+            done (bool):
+                Whether the current episode has terminated after this
+                step.
+
+            info (Dict[str, Any]):
+                Additional step-level metadata.
+
+                Common optional keys:
+                - success (bool): whether the task/episode is
+                  considered successful, typically used for logging
+                  (e.g. wandb).
+        """
+        raise NotImplementedError
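
The class docstring states two rules that are easy to get wrong in an adapter: the number of `<image>` placeholders must equal the number of images, and the rollout context is assembled differently in concat and non-concat mode. The sketch below illustrates both under simplifying assumptions. `validate_obs` and `build_contexts` are hypothetical helpers, not part of VAGEN, and contexts are modelled as plain string concatenation rather than chat messages.

```python
from typing import Any, Dict, List

IMAGE_TOKEN = "<image>"


def validate_obs(obs: Dict[str, Any]) -> None:
    """Raise ValueError if obs breaks the placeholder/image count rule."""
    obs_str = obs.get("obs_str")
    if not isinstance(obs_str, str):
        raise ValueError('obs must contain a string under "obs_str"')
    # "multi_modal_input" is optional; a missing key means zero images.
    images = obs.get("multi_modal_input", {}).get(IMAGE_TOKEN, [])
    n_placeholders = obs_str.count(IMAGE_TOKEN)
    if n_placeholders != len(images):
        raise ValueError(
            f"{n_placeholders} placeholder(s) but {len(images)} image(s)"
        )


def build_contexts(
    sys_prompt: str,
    observations: List[str],  # [init_obs, step_obs_1, step_obs_2, ...]
    responses: List[str],     # [res_0, res_1, res_2, ...]
    concat: bool,
) -> List[str]:
    """Return the context seen at each step, per the docstring's two modes."""
    contexts: List[str] = []
    history = sys_prompt
    for obs, res in zip(observations, responses):
        if concat:
            # Single growing context: each step extends the previous one.
            history = history + obs + res
            contexts.append(history)
        else:
            # Independent contexts: every step restarts from the system prompt.
            contexts.append(sys_prompt + obs + res)
    return contexts


# In a real env the list would hold PIL.Image.Image objects; a plain
# object() stands in here so the sketch has no Pillow dependency.
validate_obs({"obs_str": "Screen: <image>", "multi_modal_input": {IMAGE_TOKEN: [object()]}})
validate_obs({"obs_str": "text only"})

concat_ctx = build_contexts("S|", ["o0|", "o1|"], ["r0|", "r1|"], concat=True)
split_ctx = build_contexts("S|", ["o0|", "o1|"], ["r0|", "r1|"], concat=False)
```

Non-concat mode keeps each step's prompt at a roughly constant length, which matters when every observation carries a screenshot; concat mode preserves the full history at the cost of a context that grows with each turn.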
