
Commit f298484

abrichr and claude committed
docs: add comprehensive verl-agent decision document
Records the full reasoning chain for choosing verl-agent/VAGEN:

- Framework comparison (TRL, standalone, verl-agent, VAGEN, OpenRLHF, Unsloth)
- Key insight: per-step verification via GiGPO for long-horizon GUI tasks
- TRL multi-turn VLM blocker (issues #5119, #5120)
- "Environment is the moat" strategic framing
- Architecture diagram and migration path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 4f3cec0 commit f298484

1 file changed

Lines changed: 256 additions & 0 deletions


docs/verl_agent_decision.md

@@ -0,0 +1,256 @@
1+
# Decision: verl-agent/VAGEN for VLM RL Training
2+
3+
**Date**: 2026-03-02
4+
**Status**: Adopted (spike complete, PR #84)
5+
**Stakeholders**: OpenAdapt engineering, RL training partners
6+
7+
## Summary
8+
9+
After a comprehensive review of available RL training frameworks for multi-turn
10+
VLM (Vision-Language Model) desktop automation, we chose **verl-agent/VAGEN**
11+
as our training backend. This document records the reasoning, alternatives
12+
considered, and the key architectural insight that drove the decision.
13+
14+
---
15+
16+
## The Key Insight
17+
18+
**verl-agent enables per-step verification within multi-step rollouts — a
19+
requirement for complex computer-usage tasks that no other framework handles
20+
well.**
21+
22+
Desktop automation tasks are inherently long-horizon: a 15-step episode might
23+
involve navigating menus, typing text, clicking buttons, and verifying results.
24+
Standard GRPO gives you a single reward at the end of the episode (did the task
25+
succeed?), but tells you nothing about which individual steps helped or hurt.
26+
27+
verl-agent's GiGPO (Group-in-Group Policy Optimization) solves this with
28+
**two-level advantage computation**:
29+
30+
1. **Episode-level** (standard GRPO): Did rollout A succeed while rollout B
31+
failed? Give higher advantage to A's actions.
32+
2. **Step-level** (GiGPO innovation): Across all rollouts, find steps where
33+
the agent was in the **same state** (same screenshot). Compare the actions
34+
taken from that state — which ones led to better outcomes? Assign
35+
per-step advantages accordingly.
36+
37+
This is uniquely valuable for desktop automation because:
38+
39+
- **Episodes are long** (15+ steps), so episode-level signal is diluted
40+
- **Only the final WAA (Windows Agent Arena) evaluator** tells you if the task succeeded (binary reward)
41+
- **The same intermediate state** (e.g., "File menu is open") appears across
42+
rollouts — GiGPO exploits this to figure out which click was correct
43+
- **No critic model needed** — GiGPO is critic-free, computing advantages purely
44+
from group comparisons, keeping GPU memory manageable for large VLMs
45+
46+
Without per-step credit assignment, GRPO on a 15-step episode is like giving a
47+
student a single grade on a 15-question exam without marking which answers were
48+
wrong. GiGPO marks each answer.
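
The two-level computation described above can be sketched in a few lines. This is a simplified illustration of the idea, not verl-agent's implementation: the function names, the `gamma` and `step_weight` parameters, and the use of a string key to stand for "same screenshot" are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Return (v - mean) / (std + eps) for each value in the group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def two_level_advantages(rollouts, gamma=0.95, step_weight=1.0):
    """Combine episode-level and step-level group advantages.

    rollouts: list of dicts with 'states' (hashable state keys, one per
    step) and 'reward' (final binary reward). Hypothetical data layout.
    Returns one list of per-step advantages per rollout.
    """
    # Episode level: compare final rewards across the rollout group.
    episode_adv = group_normalize([r["reward"] for r in rollouts])

    # Step level: bucket every (rollout, step) by the state it started
    # from, scoring each step by the discounted final reward ahead of it.
    buckets = defaultdict(list)
    for i, r in enumerate(rollouts):
        horizon = len(r["states"])
        for t, state in enumerate(r["states"]):
            ret = (gamma ** (horizon - 1 - t)) * r["reward"]
            buckets[state].append((i, t, ret))

    step_adv = [[0.0] * len(r["states"]) for r in rollouts]
    for members in buckets.values():
        if len(members) < 2:
            continue  # a state seen once has nothing to be compared against
        normed = group_normalize([ret for _, _, ret in members])
        for (i, t, _), a in zip(members, normed):
            step_adv[i][t] = a

    return [
        [episode_adv[i] + step_weight * a for a in step_adv[i]]
        for i in range(len(rollouts))
    ]


rollouts = [
    {"states": ["desktop", "file_menu", "save_dialog"], "reward": 1.0},
    {"states": ["desktop", "file_menu", "wrong_dialog"], "reward": 0.0},
]
advantages = two_level_advantages(rollouts)
```

In this toy group, the steps taken from the shared states (`desktop`, `file_menu`) receive an extra step-level term, while the final states, each seen only once, keep the episode-level advantage alone. A real system would derive the state key from the screenshot itself (for example a hash), which is also an assumption here.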
49+
50+
---
51+
52+
## The Strategic Framing
53+
54+
> "The environment is the moat, not the training math."
55+
56+
This principle, articulated during our architecture review, drove the decision:
57+
58+
1. **Our core value is the WAA RL environment**: `RLEnvironment` in
59+
openadapt-evals provides Gym-like reset/step/observe/evaluate for desktop
60+
automation. Nobody else has this as a turnkey package.
61+
62+
2. **Training math is commodity** — GRPO loss is 15 lines of PyTorch. Anyone
63+
can write it. The value is in having a standard interface to plug into.
64+
65+
3. **Build on what others have built** — verl-agent has multi-turn VLM support,
66+
GiGPO, distributed training (FSDP, Ray), vLLM/sglang acceleration. Why
67+
reimplement any of this?
68+
69+
4. **The training example should be a recipe, not a library** — Users
70+
`pip install openadapt-evals`, write a 50-line adapter, and train with
71+
verl-agent. They don't need to install openadapt-ml for GRPO.
72+
73+
> "What's the right way to implement this so that more people will adopt it?
74+
> Is less code better? Should we re-use standard libs and just focus on our
75+
> core value, which is the WAA automation?" — project lead
76+
77+
The answer: yes. Our adapter is ~250 lines of glue. Everything else is
78+
verl-agent's problem.
79+
80+
---
81+
82+
## Comprehensive Framework Review
83+
84+
We evaluated 6 approaches before selecting verl-agent/VAGEN:
85+
86+
### A. TRL GRPOTrainer (HuggingFace)
87+
88+
**Status**: Does NOT support multi-turn VLM GRPO (as of March 2026).
89+
90+
- **Single-turn VLM**: Works. `pixel_values` are buffered in `_buffered_inputs`
91+
and passed to the training forward pass. Tested with Qwen2.5-VL.
92+
- **Multi-turn VLM**: Broken. Chat templating is applied before the rollout
93+
logic, flattening structured multimodal data (text + images) into plain text.
94+
The `rollout_func` receives flattened text, losing image information.
95+
- **Open issues**:
96+
- [#5120](https://github.com/huggingface/trl/issues/5120): "Preserve
97+
structured multimodal messages through rollout and generation pipeline"
98+
(opened Feb 18, 2026, OPEN)
99+
- [#5119](https://github.com/huggingface/trl/issues/5119): "Decouple
100+
inference backend from rollout & agent logic" (OPEN)
101+
- [#4543](https://github.com/huggingface/trl/issues/4543): Multi-step
102+
training forces one shared prompt across all generations, but multi-step
103+
trajectories have different prefixes at each turn
104+
- **Verdict**: Not viable for our use case until #5120 is resolved. Monitoring.
105+
106+
### B. Standalone Loss Math (our initial approach)
107+
108+
**What we built**: 15-line `policy_gradient_loss` function + custom training
109+
loop (~546 lines in `openadapt_ml/training/grpo/trainer.py`).
110+
111+
- **Pros**: Works today, simple, no external dependencies beyond HF/PEFT
112+
- **Cons**:
113+
- Only episode-level rewards (no per-step credit assignment)
114+
- Reimplements model loading, LoRA setup, optimizer, checkpointing
115+
- No distributed training support
116+
- No vLLM/sglang acceleration
117+
- 546 lines of code we own and must maintain
118+
- **Verdict**: Was the right call when TRL couldn't do multi-turn VLM. Now
119+
superseded by verl-agent integration.
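
For a sense of how small the loss math is, here is a generic clipped policy-gradient objective of the kind used in GRPO-style training. It is a minimal sketch in plain Python, not the `policy_gradient_loss` function from `openadapt_ml`; the function name, argument layout, and `clip_eps` default are assumptions for illustration.

```python
import math


def clipped_policy_gradient_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped-surrogate loss over tokens (lower is better).

    new_logps / old_logps: per-token log-probabilities under the current
    and the rollout-time policy. advantages: per-token advantages.
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Importance ratio between the current and rollout-time policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps] to limit the update size.
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) surrogate, negated so it can be minimized.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

When the two policies agree, the ratio is 1 and the loss reduces to the negated mean advantage. A production version would operate on tensors and mask padding tokens; those details are omitted here.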
120+
121+
### C. verl-agent (selected)
122+
123+
**Repository**: [langfengQ/verl-agent](https://github.com/langfengQ/verl-agent)
124+
125+
- **Multi-turn VLM GRPO**: Yes, first-class support
126+
- **GiGPO**: Step-level credit assignment via two-level grouping
127+
- **Qwen2.5-VL tested**: Yes (Sokoban VLM example with `run_sokoban_qwen3vl.sh`)
128+
- **Architecture**: Step-wise interaction paradigm (no full history concatenation),
129+
customizable memory module per step
130+
- **Algorithms**: GiGPO, GRPO, PPO, DAPO, GSPO, RLOO, REINFORCE++
131+
- **Infrastructure**: Ray-based parallel environments, FSDP training, vLLM/sglang
132+
- **Requirements**: 2+ GPUs minimum, Ray, vLLM
133+
- **Verdict**: Best fit. Purpose-built for multi-turn VLM agent training.
134+
135+
### D. VAGEN / VAGEN-Lite
136+
137+
**Repository**: [mll-lab-nu/VAGEN](https://github.com/mll-lab-nu/VAGEN)
138+
139+
- Built on verl's `agent_loop` abstraction (same ecosystem as verl-agent)
140+
- **Bi-Level GAE** for turn-aware credit assignment
141+
- 3B model achieved 0.82 across 5 agent benchmarks (outperforming GPT-5 at 0.75)
142+
- VAGEN-Lite (Feb 2026): lightweight reimplementation for easier customization
143+
- **Environment protocol**: `GymImageEnv` — async `reset(seed)`, `step(action_str)`,
144+
`close()`, `system_prompt()`. This is the interface we implemented.
145+
- **Verdict**: Excellent. We implemented its `GymImageEnv` protocol. Compatible
146+
with both VAGEN and verl-agent.
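
To make the shape of that protocol concrete, the sketch below implements the four methods on a toy environment. Only the method names and the observation dictionary layout come from this document; the return tuple of `step`, the `(obs, info)` return of `reset`, making `close` and `system_prompt` coroutines, and the placeholder string standing in for a PIL image are all assumptions, so check the VAGEN source before relying on any of them.

```python
import asyncio


class ToyImageEnv:
    """Hypothetical environment shaped like the GymImageEnv protocol."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.steps_taken = 0

    async def system_prompt(self):
        return "You control a desktop. Reply with one action per turn."

    def _observation(self, text):
        # A real adapter would put PIL images in the list; a placeholder
        # string stands in so the sketch has no imaging dependency.
        return {
            "obs_str": text,
            "multi_modal_input": {"<image>": ["<screenshot>"]},
        }

    async def reset(self, seed):
        self.steps_taken = 0
        return self._observation(f"Task started (seed={seed}). <image>"), {}

    async def step(self, action_str):
        self.steps_taken += 1
        finished = action_str == "DONE"
        done = finished or self.steps_taken >= self.max_steps
        reward = 1.0 if finished else 0.0
        obs = self._observation(f"Executed {action_str!r}. <image>")
        return obs, reward, done, {"steps": self.steps_taken}

    async def close(self):
        self.steps_taken = 0


async def run_episode(env, actions, seed=0):
    """Drive one episode with a fixed action list and return total reward."""
    await env.reset(seed)
    total = 0.0
    for action in actions:
        _, reward, done, _ = await env.step(action)
        total += reward
        if done:
            break
    await env.close()
    return total


total = asyncio.run(run_episode(ToyImageEnv(), ["CLICK(10, 20)", "DONE"]))
```

The async methods matter because the trainer runs many environments concurrently; a blocking call to a desktop VM inside `step` would stall every other rollout sharing the event loop.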
147+
148+
### E. OpenRLHF
149+
150+
**Repository**: [OpenRLHF/OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
151+
152+
- Supports multimodal models, has LMM-R1 fork for multimodal RL
153+
- Implements GRPO, PPO, REINFORCE++ with Ray-based distributed training
154+
- **Multi-turn VLM**: Less documented, unclear if fully supported
155+
- **Verdict**: Viable alternative but less proven for multi-turn VLM specifically.
156+
157+
### F. Unsloth
158+
159+
- 1.5-2x faster training and 90% less VRAM reported for Qwen3-VL/Gemma 3
160+
- **Single-turn only** — not suitable for multi-step desktop automation
161+
- **Verdict**: Not applicable.
162+
163+
### Comparison Matrix
164+
165+
| Feature | TRL | Standalone | verl-agent | VAGEN | OpenRLHF |
166+
|-----------------------------|--------|------------|------------|--------|----------|
167+
| Single-turn VLM GRPO | Yes | Yes | Yes | Yes | Yes |
168+
| Multi-turn VLM GRPO | **No** | Yes* | **Yes** | **Yes**| Unclear |
169+
| Per-step credit assignment | No | No | **GiGPO** | **GAE**| No |
170+
| Distributed training | Yes | No | Yes | Yes | Yes |
171+
| vLLM/sglang acceleration | Yes | No | Yes | Yes | Yes |
172+
| Qwen2.5-VL tested | Yes | Yes | Yes | Yes | Yes |
173+
| Lines of code we maintain | ~200 | ~546 | **~250** | ~250 | ~200 |
174+
| Ease of adoption | High | N/A | Medium | Medium | Medium |
175+
176+
*Standalone multi-turn VLM works but only has episode-level rewards. Unsloth is omitted from the matrix because it is single-turn only.
177+
178+
---
179+
180+
## Architecture
181+
182+
```
183+
verl-agent / VAGEN openadapt-evals
184+
┌─────────────────────┐ ┌──────────────────────┐
185+
│ GRPOTrainer / GiGPO │ │ WAADesktopEnv │
186+
│ ↓ │ GymImageEnv │ ↓ │
187+
│ AgentLoop │ ──protocol──│ RLEnvironment │
188+
│ ↓ │ │ ↓ │
189+
│ rollout_worker │ │ WAALiveAdapter │
190+
│ (vLLM/sglang) │ │ ↓ │
191+
│ │ │ WAA Flask Server │
192+
│ They handle: │ │ We handle: │
193+
│ - VLM forward pass │ │ - Desktop automation │
194+
│ - Log-prob storage │ │ - Task setup/eval │
195+
│ - GiGPO advantages │ │ - Action translation │
196+
│ - FSDP training │ │ - Screenshot capture │
197+
│ - Checkpointing │ │ - Stuck detection │
198+
└─────────────────────┘ └──────────────────────┘
199+
```
200+
201+
Our adapter (`WAADesktopEnv`, ~250 lines) translates between:
202+
- **openadapt-evals**: `BenchmarkObservation` (PNG bytes + a11y tree)
203+
- **VAGEN**: `{"obs_str": "...", "multi_modal_input": {"<image>": [PIL.Image]}}`
204+
205+
The `RLEnvironment` in openadapt-evals is the **stable interface**. If
206+
verl-agent is superseded (e.g., TRL fixes #5120), we swap the training backend
207+
by writing a new 250-line adapter. The environment code doesn't change.
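
The translation step at the heart of the adapter can be sketched as follows. This is not the `WAADesktopEnv` code: `BenchmarkObservationStub`, its field names, the prompt wording, and the `max_tree_chars` truncation are hypothetical, and the real `BenchmarkObservation` may store the accessibility tree in a different form.

```python
import io
from dataclasses import dataclass


@dataclass
class BenchmarkObservationStub:
    """Hypothetical stand-in; the real BenchmarkObservation fields may differ."""

    screenshot_png: bytes
    a11y_tree: str


def decode_png(png_bytes):
    """Decode PNG bytes into a PIL image (requires Pillow)."""
    from PIL import Image  # imported lazily so the sketch loads without Pillow

    return Image.open(io.BytesIO(png_bytes)).convert("RGB")


def to_vagen_obs(obs, instruction, decode=decode_png, max_tree_chars=2000):
    """Build the VAGEN-style observation dict from one benchmark observation."""
    # Truncate the tree so a very large UI does not blow up the prompt.
    tree = obs.a11y_tree[:max_tree_chars]
    obs_str = (
        f"Task: {instruction}\n"
        "Screenshot: <image>\n"
        f"Accessibility tree:\n{tree}"
    )
    return {
        "obs_str": obs_str,
        "multi_modal_input": {"<image>": [decode(obs.screenshot_png)]},
    }
```

Passing the decoder in as a parameter keeps the function testable without a real screenshot, and the same seam is where a different backend's image format would be plugged in if the training framework is swapped.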
208+
209+
---
210+
211+
## What We Get for Free
212+
213+
By delegating to verl-agent, we avoid building and maintaining:
214+
215+
| Capability | Lines saved | Complexity saved |
216+
|-----------------------------------|-------------|------------------------|
217+
| Multi-turn VLM rollout collection | ~200 | Image tensor management|
218+
| GiGPO step-level advantages | ~300 | State grouping logic |
219+
| Distributed training (FSDP) | ~500 | Multi-GPU coordination |
220+
| vLLM/sglang inference | ~400 | Inference server mgmt |
221+
| Reference model management | ~100 | Weight synchronization |
222+
| Advanced logging (WandB, TB) | ~100 | Metric tracking |
223+
| **Total** | **~1600** | |
224+
225+
---
226+
227+
## Migration Path
228+
229+
1. **Current state**: Standalone trainer in openadapt-ml (PR #34, merged).
230+
Works, well-tested (56 unit tests + 5 E2E tests). Episode-level rewards only.
231+
232+
2. **Spike complete**: `WAADesktopEnv` adapter in openadapt-evals (PR #84).
233+
21 tests passing. Implements GymImageEnv protocol.
234+
235+
3. **Next**: Test end-to-end with verl-agent on a GPU machine. If successful,
236+
the standalone trainer becomes a reference implementation / fallback, and
237+
verl-agent becomes the recommended training path.
238+
239+
4. **Future**: If TRL resolves #5120 (multi-turn VLM support), evaluate whether
240+
to switch. TRL has broader adoption; switching would reduce the dependency
241+
footprint. But only if TRL also adds per-step credit assignment comparable
242+
to GiGPO.
243+
244+
---
245+
246+
## References
247+
248+
- [verl-agent](https://github.com/langfengQ/verl-agent) — GiGPO paper implementation
249+
- [VAGEN](https://github.com/mll-lab-nu/VAGEN) — Multi-turn VLM agent training
250+
- [verl](https://github.com/verl-project/verl) — Volcano Engine RL for LLMs
251+
- [GiGPO paper](https://arxiv.org/html/2505.10978) — Group-in-Group Policy Optimization
252+
- [VAGEN paper](https://arxiv.org/abs/2510.16907) — World Model Reasoning for VLM Agents
253+
- [TRL #5120](https://github.com/huggingface/trl/issues/5120) — Multimodal rollout pipeline
254+
- [TRL #5119](https://github.com/huggingface/trl/issues/5119) — Backend/rollout decoupling
255+
- [TRL GRPOTrainer docs](https://huggingface.co/docs/trl/main/grpo_trainer)
256+
- [TRL PR #3072](https://github.com/huggingface/trl/pull/3072) — Original VLM support
