Commit b8b426d

abrichr and claude committed
docs: fact-check framework review in verl decision doc
Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from thorough review:

- OpenRLHF: document AgentTrainer multi-turn support and OpenRLHF-M fork
- Unsloth: nuanced assessment: single-turn VLM works, multi-turn text via ART works, but multi-turn VLM blocked by rollout_func issue (#3573)
- TRL: add note about OpenEnv/rollout_func for text models (VLM blocked)
- Comparison matrix: add Unsloth column with footnotes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e2b6d0f commit b8b426d

1 file changed

Lines changed: 41 additions & 15 deletions

docs/verl_agent_decision.md

@@ -101,6 +101,10 @@ We evaluated 6 approaches before selecting verl-agent/VAGEN:
 - [#4543](https://github.com/huggingface/trl/issues/4543): Multi-step
 training forces one shared prompt across all generations, but multi-step
 trajectories have different prefixes at each turn
+- **NEW (2026)**: TRL added OpenEnv integration with `rollout_func` for
+multi-turn environment training (Gym-style). Works for text models. VLM
+support blocked by #5120 (chat template flattens multimodal data before
+rollout).
 - **Verdict**: Not viable for our use case until #5120 is resolved. Monitoring.
 
 ### B. Standalone Loss Math (our initial approach)
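The mismatch described in #4543 in the hunk above can be illustrated with a short sketch. It is an illustration only, not TRL code: `turn_prefixes` and the toy trajectory are hypothetical, and observations are plain strings here, whereas a VLM agent would receive screenshots. The point is that each turn of one trajectory conditions on a different, growing prefix, so a single shared prompt cannot stand in for all turns.

```python
def turn_prefixes(task, turns):
    """Return the prompt the policy sees at each turn of one trajectory.

    `turns` is a list of (action, observation) pairs. The prompt for a
    turn contains the task plus every earlier action and observation.
    (Hypothetical helper for illustration; not part of TRL.)
    """
    prefixes = []
    history = [f"Task: {task}"]
    for action, observation in turns:
        # Snapshot the history *before* this turn's action is taken.
        prefixes.append("\n".join(history))
        history.append(f"Action: {action}")
        history.append(f"Observation: {observation}")
    return prefixes


# Toy two-turn desktop trajectory (made-up actions and observations).
trajectory = [
    ("click(Start)", "menu open"),
    ("type(notepad)", "search results"),
]
prefixes = turn_prefixes("Open Notepad", trajectory)

for turn_index, prefix in enumerate(prefixes):
    print(f"--- turn {turn_index} prompt ---")
    print(prefix)
```

Under these assumptions, the second turn's prompt starts with the first turn's prompt and then extends it, which is the per-turn prefix difference the issue refers to.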
@@ -149,31 +153,53 @@ loop (~546 lines in `openadapt_ml/training/grpo/trainer.py`).
 
 **Repository**: [OpenRLHF/OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
 
-- Supports multimodal models, has LMM-R1 fork for multimodal RL
+- Supports multimodal models via [OpenRLHF-M](https://github.com/OpenRLHF/OpenRLHF-M)
+fork (LMM-R1 lineage), tested with Qwen2.5-VL and InternVL
 - Implements GRPO, PPO, REINFORCE++ with Ray-based distributed training
-- **Multi-turn VLM**: Less documented, unclear if fully supported
-- **Verdict**: Viable alternative but less proven for multi-turn VLM specifically.
+- **Multi-turn agent support**: Added in 2025: `AgentTrainer` with `env_rollout`
+function for Gym-style interaction. Text-based multi-turn works; multi-turn
+VLM with per-step images less documented but architecturally feasible
+- **No per-step credit assignment**: Episode-level rewards only (same limitation
+as our standalone trainer)
+- **Verdict**: Viable alternative for multi-turn VLM with strong distributed
+training. Lacks GiGPO-style step-level credit assignment, which is the key
+differentiator for long-horizon desktop tasks.
 
 ### F. Unsloth
 
+**Repository**: [unslothai/unsloth](https://github.com/unslothai/unsloth)
+
 - 1.5-2x speed, 90% less VRAM for Qwen3-VL/Gemma 3
-- **Single-turn only**: not suitable for multi-step desktop automation
-- **Verdict**: Not applicable.
+- **Single-turn VLM GRPO**: Works. `UnslothGRPOTrainer` wraps TRL's GRPOTrainer
+with kernel optimizations. Tested with Qwen2.5-VL, Gemma 3, Llama 3.2-Vision.
+- **Multi-turn text**: Supported via ART (Agent Reinforcement Training, OpenPipe
+collaboration). Text-only multi-turn environments work with `rollout_func`.
+- **Multi-turn VLM**: NOT supported. `rollout_func` is silently ignored by
+`UnslothGRPOTrainer` ([#3573](https://github.com/unslothai/unsloth/issues/3573)),
+preventing custom environment interaction. Multi-GPU VLM training also broken
+([#3571](https://github.com/unslothai/unsloth/issues/3571)).
+- **Verdict**: Not applicable for our use case. Multi-turn VLM RL is blocked by
+the `rollout_func` issue. If resolved, Unsloth's VRAM savings could make it
+attractive for single-GPU experimentation, but it still lacks per-step credit
+assignment (GiGPO) and distributed training.
 
 ### Comparison Matrix
 
-| Feature                     | TRL    | Standalone | verl-agent | VAGEN  | OpenRLHF |
-|-----------------------------|--------|------------|------------|--------|----------|
-| Single-turn VLM GRPO        | Yes    | Yes        | Yes        | Yes    | Yes      |
-| Multi-turn VLM GRPO         | **No** | Yes*       | **Yes**    | **Yes**| Unclear  |
-| Per-step credit assignment  | No     | No         | **GiGPO**  | **GAE**| No       |
-| Distributed training        | Yes    | No         | Yes        | Yes    | Yes      |
-| vLLM/sglang acceleration    | Yes    | No         | Yes        | Yes    | Yes      |
-| Qwen2.5-VL tested           | Yes    | Yes        | Yes        | Yes    | Yes      |
-| Lines of code we maintain   | ~200   | ~546       | **~250**   | ~250   | ~200     |
-| Ease of adoption            | High   | N/A        | Medium     | Medium | Medium   |
+| Feature                     | TRL    | Standalone | verl-agent | VAGEN  | OpenRLHF | Unsloth  |
+|-----------------------------|--------|------------|------------|--------|----------|----------|
+| Single-turn VLM GRPO        | Yes    | Yes        | Yes        | Yes    | Yes      | Yes      |
+| Multi-turn VLM GRPO         | **No** | Yes*       | **Yes**    | **Yes**| Partial† | **No**‡  |
+| Per-step credit assignment  | No     | No         | **GiGPO**  | **GAE**| No       | No       |
+| Distributed training        | Yes    | No         | Yes        | Yes    | Yes      | No§      |
+| vLLM/sglang acceleration    | Yes    | No         | Yes        | Yes    | Yes      | No       |
+| Qwen2.5-VL tested           | Yes    | Yes        | Yes        | Yes    | Yes      | Yes      |
+| Lines of code we maintain   | ~200   | ~546       | **~250**   | ~250   | ~200     | ~200     |
+| Ease of adoption            | High   | N/A        | Medium     | Medium | Medium   | High     |
 
 *Standalone multi-turn VLM works but only has episode-level rewards.
+†OpenRLHF has AgentTrainer for multi-turn text; VLM multi-turn less documented.
+‡Unsloth `rollout_func` silently ignored (#3573), blocking multi-turn VLM.
+§Unsloth multi-GPU VLM broken (#3571).
 
 ---
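The "Per-step credit assignment" row separates frameworks that give every step of an episode the same learning signal from those that score steps individually. The sketch below is a simplified illustration under stated assumptions, not GiGPO or GAE as implemented in verl-agent or VAGEN. The first function broadcasts one group-normalized episode advantage to all steps (GRPO-style, episode-level); the second computes a discounted return-to-go per step as a stand-in for a step-level signal. The function names, the discount factor, and the toy rewards are all hypothetical.

```python
from statistics import mean, pstdev


def episode_level_advantages(episode_returns, steps_per_episode, eps=1e-8):
    """Episode-level credit: normalize returns across the group, then
    copy the same advantage onto every step of that episode."""
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)
    advantages = []
    for ret, n_steps in zip(episode_returns, steps_per_episode):
        adv = (ret - mu) / (sigma + eps)
        advantages.append([adv] * n_steps)
    return advantages


def step_level_returns(step_rewards, gamma=0.9):
    """Step-level signal: discounted return-to-go, one value per step.
    (A stand-in for illustration; GiGPO and GAE compute this differently.)"""
    returns = []
    running = 0.0
    for reward in reversed(step_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Two toy episodes: one succeeds (return 1.0, 3 steps), one fails (0.0, 2 steps).
group = episode_level_advantages([1.0, 0.0], [3, 2])

# One toy episode where only the final step is rewarded.
per_step = step_level_returns([0.0, 0.0, 1.0], gamma=0.5)

print(group)     # every step in an episode shares one value
print(per_step)  # later steps, closer to the reward, get larger values
```

With episode-level credit, an early mistake and a late correct action in the same episode receive identical advantages, which is the limitation the matrix attributes to TRL, the standalone trainer, OpenRLHF, and Unsloth.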
