@@ -101,6 +101,10 @@ We evaluated 6 approaches before selecting verl-agent/VAGEN:
101101 - [#4543](https://github.com/huggingface/trl/issues/4543): Multi-step
102102 training forces one shared prompt across all generations, but multi-step
103103 trajectories have different prefixes at each turn
104+ - **NEW (2026)**: TRL added OpenEnv integration with `rollout_func` for
105+ multi-turn environment training (Gym-style). Works for text models. VLM
106+ support is blocked by #5120 (the chat template flattens multimodal data
107+ before rollout).
104108- **Verdict**: Not viable for our use case until #5120 is resolved. Monitoring.
105109
106110### B. Standalone Loss Math (our initial approach)
@@ -149,31 +153,53 @@ loop (~546 lines in `openadapt_ml/training/grpo/trainer.py`).
149153
150154**Repository**: [OpenRLHF/OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
151155
152- - Supports multimodal models, has LMM-R1 fork for multimodal RL
156+ - Supports multimodal models via the [OpenRLHF-M](https://github.com/OpenRLHF/OpenRLHF-M)
157+ fork (LMM-R1 lineage), tested with Qwen2.5-VL and InternVL
153158- Implements GRPO, PPO, REINFORCE++ with Ray-based distributed training
154- - ** Multi-turn VLM** : Less documented, unclear if fully supported
155- - ** Verdict** : Viable alternative but less proven for multi-turn VLM specifically.
159+ - **Multi-turn agent support**: Added in 2025 as `AgentTrainer` with an `env_rollout`
160+ function for Gym-style interaction. Text-based multi-turn works; multi-turn
161+ VLM with per-step images is less documented but looks architecturally feasible
162+ - **No per-step credit assignment**: Episode-level rewards only (same limitation
163+ as our standalone trainer)
164+ - **Verdict**: Potentially viable alternative for multi-turn VLM, with strong
165+ distributed training. Lacks GiGPO-style step-level credit assignment, which is
166+ the key differentiator for long-horizon desktop tasks.
156167
157168### F. Unsloth
158169
170+ **Repository**: [unslothai/unsloth](https://github.com/unslothai/unsloth)
171+
159172- 1.5-2x speed, 90% less VRAM for Qwen3-VL/Gemma 3
160- - ** Single-turn only** — not suitable for multi-step desktop automation
161- - ** Verdict** : Not applicable.
173+ - **Single-turn VLM GRPO**: Works. `UnslothGRPOTrainer` wraps TRL's `GRPOTrainer`
174+ with kernel optimizations. Tested with Qwen2.5-VL, Gemma 3, Llama 3.2-Vision.
175+ - **Multi-turn text**: Supported via ART (Agent Reinforcement Training, OpenPipe
176+ collaboration). Text-only multi-turn environments work with `rollout_func`.
177+ - **Multi-turn VLM**: NOT supported. `rollout_func` is silently ignored by
178+ `UnslothGRPOTrainer` ([#3573](https://github.com/unslothai/unsloth/issues/3573)),
179+ preventing custom environment interaction. Multi-GPU VLM training is also broken
180+ ([#3571](https://github.com/unslothai/unsloth/issues/3571)).
181+ - **Verdict**: Not applicable for our use case. Multi-turn VLM RL is blocked by
182+ the `rollout_func` issue. If that is resolved, Unsloth's VRAM savings could make
183+ it attractive for single-GPU experimentation, but it still lacks per-step credit
184+ assignment (GiGPO) and distributed training.
162185
163186### Comparison Matrix
164187
165- | Feature | TRL | Standalone | verl-agent | VAGEN | OpenRLHF |
166- | -----------------------------| --------| ------------| ------------| --------| ----------|
167- | Single-turn VLM GRPO | Yes | Yes | Yes | Yes | Yes |
168- | Multi-turn VLM GRPO | ** No** | Yes* | ** Yes** | ** Yes** | Unclear |
169- | Per-step credit assignment | No | No | ** GiGPO** | ** GAE** | No |
170- | Distributed training | Yes | No | Yes | Yes | Yes |
171- | vLLM/sglang acceleration | Yes | No | Yes | Yes | Yes |
172- | Qwen2.5-VL tested | Yes | Yes | Yes | Yes | Yes |
173- | Lines of code we maintain | ~ 200 | ~ 546 | ** ~ 250** | ~ 250 | ~ 200 |
174- | Ease of adoption | High | N/A | Medium | Medium | Medium |
188+ | Feature                    | TRL    | Standalone | verl-agent | VAGEN   | OpenRLHF | Unsloth |
189+ |----------------------------|--------|------------|------------|---------|----------|---------|
190+ | Single-turn VLM GRPO       | Yes    | Yes        | Yes        | Yes     | Yes      | Yes     |
191+ | Multi-turn VLM GRPO        | **No** | Yes*       | **Yes**    | **Yes** | Partial† | **No**‡ |
192+ | Per-step credit assignment | No     | No         | **GiGPO**  | **GAE** | No       | No      |
193+ | Distributed training       | Yes    | No         | Yes        | Yes     | Yes      | No§     |
194+ | vLLM/sglang acceleration   | Yes    | No         | Yes        | Yes     | Yes      | No      |
195+ | Qwen2.5-VL tested          | Yes    | Yes        | Yes        | Yes     | Yes      | Yes     |
196+ | Lines of code we maintain  | ~200   | ~546       | **~250**   | ~250    | ~200     | ~200    |
197+ | Ease of adoption           | High   | N/A        | Medium     | Medium  | Medium   | High    |
175198
176199* Standalone multi-turn VLM works but only has episode-level rewards.
200+ †OpenRLHF has `AgentTrainer` for multi-turn text; multi-turn VLM is less documented.
201+ ‡Unsloth silently ignores `rollout_func` (#3573), blocking multi-turn VLM.
202+ §Unsloth multi-GPU VLM training is broken (#3571).
177203
178204---
179205