|
| 1 | +# Decision: verl-agent/VAGEN for VLM RL Training |
| 2 | + |
| 3 | +**Date**: 2026-03-02 |
| 4 | +**Status**: Adopted (spike complete, PR #84) |
| 5 | +**Stakeholders**: OpenAdapt engineering, RL training partners |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +After a comprehensive review of available RL training frameworks for multi-turn |
| 10 | +VLM (Vision-Language Model) desktop automation, we chose **verl-agent/VAGEN** |
| 11 | +as our training backend. This document records the reasoning, alternatives |
| 12 | +considered, and the key architectural insight that drove the decision. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## The Key Insight |
| 17 | + |
| 18 | +**verl-agent enables per-step verification within multi-step rollouts — a |
| 19 | +requirement for complex computer-usage tasks that no other framework handles |
| 20 | +well.** |
| 21 | + |
| 22 | +Desktop automation tasks are inherently long-horizon: a 15-step episode might |
| 23 | +involve navigating menus, typing text, clicking buttons, and verifying results. |
| 24 | +Standard GRPO gives you a single reward at the end of the episode (did the task |
| 25 | +succeed?), but tells you nothing about which individual steps helped or hurt. |
| 26 | + |
| 27 | +verl-agent's GiGPO (Group-in-Group Policy Optimization) solves this with |
| 28 | +**two-level advantage computation**: |
| 29 | + |
| 30 | +1. **Episode-level** (standard GRPO): Did rollout A succeed while rollout B |
| 31 | + failed? Give higher advantage to A's actions. |
| 32 | +2. **Step-level** (GiGPO innovation): Across all rollouts, find steps where |
| 33 | + the agent was in the **same state** (same screenshot). Compare the actions |
| 34 | + taken from that state — which ones led to better outcomes? Assign |
| 35 | + per-step advantages accordingly. |
| 36 | + |
| 37 | +This is uniquely valuable for desktop automation because: |
| 38 | + |
| 39 | +- **Episodes are long** (15+ steps), so episode-level signal is diluted |
| 40 | +- **Only the final WAA evaluator** tells you if the task succeeded (binary reward) |
| 41 | +- **The same intermediate state** (e.g., "File menu is open") appears across |
| 42 | + rollouts — GiGPO exploits this to figure out which click was correct |
| 43 | +- **No critic model needed** — GiGPO is critic-free, computing advantages purely |
| 44 | + from group comparisons, keeping GPU memory manageable for large VLMs |
| 45 | + |
| 46 | +Without per-step credit assignment, GRPO on a 15-step episode is like giving a |
| 47 | +student a single grade on a 15-question exam without marking which answers were |
| 48 | +wrong. GiGPO marks each answer. |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## The Strategic Framing |
| 53 | + |
| 54 | +> "The environment is the moat, not the training math." |
| 55 | +
|
| 56 | +This principle, articulated during our architecture review, drove the decision: |
| 57 | + |
| 58 | +1. **Our core value is the WAA RL environment** — `RLEnvironment` in |
| 59 | + openadapt-evals provides Gym-like reset/step/observe/evaluate for desktop |
| 60 | + automation. Nobody else has this as a turnkey package. |
| 61 | + |
| 62 | +2. **Training math is commodity** — GRPO loss is 15 lines of PyTorch. Anyone |
| 63 | + can write it. The value is in having a standard interface to plug into. |
| 64 | + |
| 65 | +3. **Build on what others have built** — verl-agent has multi-turn VLM support, |
| 66 | + GiGPO, distributed training (FSDP, Ray), vLLM/sglang acceleration. Why |
| 67 | + reimplement any of this? |
| 68 | + |
| 69 | +4. **The training example should be a recipe, not a library** — Users |
| 70 | + `pip install openadapt-evals`, write a 50-line adapter, and train with |
| 71 | + verl-agent. They don't need to install openadapt-ml for GRPO. |
| 72 | + |
| 73 | +> "What's the right way to implement this so that more people will adopt it? |
| 74 | +> Is less code better? Should we re-use standard libs and just focus on our |
| 75 | +> core value, which is the WAA automation?" — project lead |
| 76 | +
|
| 77 | +The answer: yes. Our adapter is ~250 lines of glue. Everything else is |
| 78 | +verl-agent's problem. |
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +## Comprehensive Framework Review |
| 83 | + |
| 84 | +We evaluated 6 approaches before selecting verl-agent/VAGEN: |
| 85 | + |
| 86 | +### A. TRL GRPOTrainer (HuggingFace) |
| 87 | + |
| 88 | +**Status**: Does NOT support multi-turn VLM GRPO (as of March 2026). |
| 89 | + |
| 90 | +- **Single-turn VLM**: Works. `pixel_values` are buffered in `_buffered_inputs` |
| 91 | + and passed to the training forward pass. Tested with Qwen2.5-VL. |
| 92 | +- **Multi-turn VLM**: Broken. Chat templating is applied before the rollout |
| 93 | + logic, flattening structured multimodal data (text + images) into plain text. |
| 94 | + The `rollout_func` receives flattened text, losing image information. |
| 95 | +- **Open issues**: |
| 96 | + - [#5120](https://github.com/huggingface/trl/issues/5120): "Preserve |
| 97 | + structured multimodal messages through rollout and generation pipeline" |
| 98 | + (opened Feb 18, 2026, OPEN) |
| 99 | + - [#5119](https://github.com/huggingface/trl/issues/5119): "Decouple |
| 100 | + inference backend from rollout & agent logic" (OPEN) |
| 101 | + - [#4543](https://github.com/huggingface/trl/issues/4543): Multi-step |
| 102 | + training forces one shared prompt across all generations, but multi-step |
| 103 | + trajectories have different prefixes at each turn |
| 104 | +- **Verdict**: Not viable for our use case until #5120 is resolved. Monitoring. |
| 105 | + |
| 106 | +### B. Standalone Loss Math (our initial approach) |
| 107 | + |
| 108 | +**What we built**: 15-line `policy_gradient_loss` function + custom training |
| 109 | +loop (~546 lines in `openadapt_ml/training/grpo/trainer.py`). |
| 110 | + |
| 111 | +- **Pros**: Works today, simple, no external dependencies beyond HF/PEFT |
| 112 | +- **Cons**: |
| 113 | + - Only episode-level rewards (no per-step credit assignment) |
| 114 | + - Reimplements model loading, LoRA setup, optimizer, checkpointing |
| 115 | + - No distributed training support |
| 116 | + - No vLLM/sglang acceleration |
| 117 | + - 546 lines of code we own and must maintain |
| 118 | +- **Verdict**: Was the right call when TRL couldn't do multi-turn VLM. Now |
| 119 | + superseded by verl-agent integration. |
| 120 | + |
| 121 | +### C. verl-agent (selected) |
| 122 | + |
| 123 | +**Repository**: [langfengQ/verl-agent](https://github.com/langfengQ/verl-agent) |
| 124 | + |
| 125 | +- **Multi-turn VLM GRPO**: Yes, first-class support |
| 126 | +- **GiGPO**: Step-level credit assignment via two-level grouping |
| 127 | +- **Qwen2.5-VL tested**: Yes (Sokoban VLM example with `run_sokoban_qwen3vl.sh`) |
| 128 | +- **Architecture**: Step-wise interaction paradigm (no full history concatenation), |
| 129 | + customizable memory module per step |
| 130 | +- **Algorithms**: GiGPO, GRPO, PPO, DAPO, GSPO, RLOO, REINFORCE++ |
| 131 | +- **Infrastructure**: Ray-based parallel environments, FSDP training, vLLM/sglang |
| 132 | +- **Requirements**: 2+ GPUs minimum, Ray, vLLM |
| 133 | +- **Verdict**: Best fit. Purpose-built for multi-turn VLM agent training. |
| 134 | + |
| 135 | +### D. VAGEN / VAGEN-Lite |
| 136 | + |
| 137 | +**Repository**: [mll-lab-nu/VAGEN](https://github.com/mll-lab-nu/VAGEN) |
| 138 | + |
| 139 | +- Built on verl's `agent_loop` abstraction (same ecosystem as verl-agent) |
| 140 | +- **Bi-Level GAE** for turn-aware credit assignment |
| 141 | +- 3B model achieved 0.82 across 5 agent benchmarks (outperforming GPT-5 at 0.75) |
| 142 | +- VAGEN-Lite (Feb 2026): lightweight reimplementation for easier customization |
| 143 | +- **Environment protocol**: `GymImageEnv` — async `reset(seed)`, `step(action_str)`, |
| 144 | + `close()`, `system_prompt()`. This is the interface we implemented. |
| 145 | +- **Verdict**: Excellent. We implemented its `GymImageEnv` protocol. Compatible |
| 146 | + with both VAGEN and verl-agent. |
| 147 | + |
| 148 | +### E. OpenRLHF |
| 149 | + |
| 150 | +**Repository**: [OpenRLHF/OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) |
| 151 | + |
| 152 | +- Supports multimodal models, has LMM-R1 fork for multimodal RL |
| 153 | +- Implements GRPO, PPO, REINFORCE++ with Ray-based distributed training |
| 154 | +- **Multi-turn VLM**: Less documented, unclear if fully supported |
| 155 | +- **Verdict**: Viable alternative but less proven for multi-turn VLM specifically. |
| 156 | + |
| 157 | +### F. Unsloth |
| 158 | + |
| 159 | +- 1.5-2x speed, 90% less VRAM for Qwen3-VL/Gemma 3 |
| 160 | +- **Single-turn only** — not suitable for multi-step desktop automation |
| 161 | +- **Verdict**: Not applicable. |
| 162 | + |
| 163 | +### Comparison Matrix |
| 164 | + |
| 165 | +| Feature | TRL | Standalone | verl-agent | VAGEN | OpenRLHF | |
| 166 | +|-----------------------------|--------|------------|------------|--------|----------| |
| 167 | +| Single-turn VLM GRPO | Yes | Yes | Yes | Yes | Yes | |
| 168 | +| Multi-turn VLM GRPO | **No** | Yes* | **Yes** | **Yes**| Unclear | |
| 169 | +| Per-step credit assignment | No | No | **GiGPO** | **GAE**| No | |
| 170 | +| Distributed training | Yes | No | Yes | Yes | Yes | |
| 171 | +| vLLM/sglang acceleration | Yes | No | Yes | Yes | Yes | |
| 172 | +| Qwen2.5-VL tested | Yes | Yes | Yes | Yes | Yes | |
| 173 | +| Lines of code we maintain | ~200 | ~546 | **~250** | ~250 | ~200 | |
| 174 | +| Ease of adoption | High | N/A | Medium | Medium | Medium | |
| 175 | + |
| 176 | +*Standalone multi-turn VLM works but only has episode-level rewards. |
| 177 | + |
| 178 | +--- |
| 179 | + |
| 180 | +## Architecture |
| 181 | + |
| 182 | +``` |
| 183 | +verl-agent / VAGEN openadapt-evals |
| 184 | +┌─────────────────────┐ ┌──────────────────────┐ |
| 185 | +│ GRPOTrainer / GiGPO │ │ WAADesktopEnv │ |
| 186 | +│ ↓ │ GymImageEnv │ ↓ │ |
| 187 | +│ AgentLoop │ ──protocol──│ RLEnvironment │ |
| 188 | +│ ↓ │ │ ↓ │ |
| 189 | +│ rollout_worker │ │ WAALiveAdapter │ |
| 190 | +│ (vLLM/sglang) │ │ ↓ │ |
| 191 | +│ │ │ WAA Flask Server │ |
| 192 | +│ They handle: │ │ We handle: │ |
| 193 | +│ - VLM forward pass │ │ - Desktop automation │ |
| 194 | +│ - Log-prob storage │ │ - Task setup/eval │ |
| 195 | +│ - GiGPO advantages │ │ - Action translation │ |
| 196 | +│ - FSDP training │ │ - Screenshot capture │ |
| 197 | +│ - Checkpointing │ │ - Stuck detection │ |
| 198 | +└─────────────────────┘ └──────────────────────┘ |
| 199 | +``` |
| 200 | + |
| 201 | +Our adapter (`WAADesktopEnv`, ~250 lines) translates between: |
| 202 | +- **openadapt-evals**: `BenchmarkObservation` (PNG bytes + a11y tree) |
| 203 | +- **VAGEN**: `{"obs_str": "...", "multi_modal_input": {"<image>": [PIL.Image]}}` |
| 204 | + |
| 205 | +The `RLEnvironment` in openadapt-evals is the **stable interface**. If |
| 206 | +verl-agent is superseded (e.g., TRL fixes #5120), we swap the training backend |
| 207 | +by writing a new 250-line adapter. The environment code doesn't change. |
| 208 | + |
| 209 | +--- |
| 210 | + |
| 211 | +## What We Get for Free |
| 212 | + |
| 213 | +By delegating to verl-agent, we avoid building and maintaining: |
| 214 | + |
| 215 | +| Capability | Lines saved | Complexity saved | |
| 216 | +|-----------------------------------|-------------|------------------------| |
| 217 | +| Multi-turn VLM rollout collection | ~200 | Image tensor management| |
| 218 | +| GiGPO step-level advantages | ~300 | State grouping logic | |
| 219 | +| Distributed training (FSDP) | ~500 | Multi-GPU coordination | |
| 220 | +| vLLM/sglang inference | ~400 | Inference server mgmt | |
| 221 | +| Reference model management | ~100 | Weight synchronization | |
| 222 | +| Advanced logging (WandB, TB) | ~100 | Metric tracking | |
| 223 | +| **Total** | **~1600** | | |
| 224 | + |
| 225 | +--- |
| 226 | + |
| 227 | +## Migration Path |
| 228 | + |
| 229 | +1. **Current state**: Standalone trainer in openadapt-ml (PR #34, merged). |
| 230 | + Works, well-tested (56 unit tests + 5 E2E tests). Episode-level rewards only. |
| 231 | + |
| 232 | +2. **Spike complete**: `WAADesktopEnv` adapter in openadapt-evals (PR #84). |
| 233 | + 21 tests passing. Implements GymImageEnv protocol. |
| 234 | + |
| 235 | +3. **Next**: Test end-to-end with verl-agent on a GPU machine. If successful, |
| 236 | + the standalone trainer becomes a reference implementation / fallback, and |
| 237 | + verl-agent becomes the recommended training path. |
| 238 | + |
| 239 | +4. **Future**: If TRL resolves #5120 (multi-turn VLM support), evaluate whether |
| 240 | + to switch. TRL has broader adoption; switching would reduce the dependency |
| 241 | + footprint. But only if TRL also adds per-step credit assignment comparable |
| 242 | + to GiGPO. |
| 243 | + |
| 244 | +--- |
| 245 | + |
| 246 | +## References |
| 247 | + |
| 248 | +- [verl-agent](https://github.com/langfengQ/verl-agent) — GiGPO paper implementation |
| 249 | +- [VAGEN](https://github.com/mll-lab-nu/VAGEN) — Multi-turn VLM agent training |
| 250 | +- [verl](https://github.com/verl-project/verl) — Volcano Engine RL for LLMs |
| 251 | +- [GiGPO paper](https://arxiv.org/html/2505.10978) — Group-in-Group Policy Optimization |
| 252 | +- [VAGEN paper](https://arxiv.org/abs/2510.16907) — World Model Reasoning for VLM Agents |
| 253 | +- [TRL #5120](https://github.com/huggingface/trl/issues/5120) — Multimodal rollout pipeline |
| 254 | +- [TRL #5119](https://github.com/huggingface/trl/issues/5119) — Backend/rollout decoupling |
| 255 | +- [TRL GRPOTrainer docs](https://huggingface.co/docs/trl/main/grpo_trainer) |
| 256 | +- [TRL PR #3072](https://github.com/huggingface/trl/pull/3072) — Original VLM support |
0 commit comments