
Commit da17355

Authored by abrichr and claude
feat: add GPU training automation for verl-agent E2E workflow (#87)
* feat: add GPU training automation for verl-agent E2E workflow
  - Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
  - Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
  - Update find_available_size_and_region(gpu=True) on both providers + protocol
  - Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
  - Add scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
  - Add oa-vm gpu-setup and gpu-train CLI commands

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct verl-agent Hydra config paths and document integration gap

  Validated all 17 Hydra config paths against verl-agent's actual schema
  (ppo_trainer.yaml + make_envs()). Key fixes:
  - env.env_name: use 'waa_desktop' short name, not Python import path
    (verl-agent uses hardcoded dispatch, not dynamic imports)
  - Remove env.env_kwargs (doesn't exist), use env.waa.* sub-keys
  - Add data.train_files/val_files (required parquet, generated via
    data_preprocess.prepare --mode visual)
  - Add missing overrides: algorithm.gamma, gpu_memory_utilization,
    ppo_mini_batch_size, filter_overlong_prompts, test_freq
  - Add prepare_training_data() and patch_env_manager() steps
  - Document the EnvironmentManagerBase integration gap in decision doc

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace EnvironmentManagerBase with VAGEN registry-based env integration

  The previous implementation incorrectly assumed verl-agent uses an
  EnvironmentManagerBase ABC with a hardcoded make_envs() dispatch.
  Research reveals VAGEN actually uses:
  - GymImageEnv protocol (which WAADesktopEnv already implements)
  - YAML-based env registry (vagen/configs/env_registry.yaml)
  - GymAgentLoop for training-time rollout orchestration

  Changes:
  - Replace patch_env_manager() with register_waa_env() (YAML registry)
  - Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
  - Update launch_training() to generate proper VAGEN training config
  - Fix Integration Gap section in decision doc (no EnvironmentManagerBase)
  - Update training config YAML with architecture diagram
  - Add 5 new tests for registration helpers (40 total, all passing)
  - Export new helpers from adapters/__init__.py

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct is_action_valid logic, scroll_direction, stale refs, and DRY violation

  Review fixes for the GPU training automation branch:
  - Fix is_action_valid: was inverted (DONE()→invalid, garbage→valid), now
    uses regex match on the original action string
  - Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
  - Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
  - Fix stale branch ref: setup_gpu_training.sh referenced merged spike branch, now uses main
  - Fix stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in setup script
  - Add --recurse-submodules to git clone (verl is a VAGEN submodule)
  - Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
  - Deduplicate training command: vm_cli.py now delegates to launch_training()
  - Update test count in docs: 21 → 40+
  - Add 3 new tests for is_action_valid behavior
  - Add scroll_direction assertion to existing scroll test

  All 43 tests pass.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve lint errors (undefined use_fast, unused imports, f-strings)
  - Remove undefined `use_fast` guard — always log tried sizes on failure
  - Remove unused PoolManager import in vm_cli.py
  - Remove extraneous f-string prefixes
  - Remove unused boto3 and SSH_OPTS imports in aws_vm.py

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add evaluate_url support and E2E validation test

  WAADesktopEnv now correctly separates:
  - server_url (port 5000): Windows VM Flask API (/screenshot, /execute_windows)
  - evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)

  Previously, the single server_url default pointed at 5001 (evaluate server
  only), which caused 404s for screenshots and action execution.

  Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G)
  with UNIX socket bridge proxy chain to Azure WAA VM.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use Deep Learning AMI for GPU instances and fix setup issues
  - Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
  - Add gpu param to create_vm() to select DL AMI vs standard Ubuntu
  - Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3
    (Volta/V100) since the OSS NVIDIA driver requires GSP (Turing+)
  - Make OPENADAPT_EVALS_BRANCH configurable via env var in setup script
  - Add conda TOS acceptance step (required since Miniconda 2025)

  Validated on AWS g5.xlarge with NVIDIA A10G 24GB GPU.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add GPU E2E validation report with artifacts

  Documents the successful end-to-end validation of the verl-agent/VAGEN
  training pipeline on AWS g5.xlarge (A10G 24GB) connecting to Azure WAA VM.
  Includes architecture diagrams, proxy chain details, raw test output,
  version listings, and issues discovered during validation.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve port inconsistencies and add missing context in validation docs
  - Standardize evaluate_url port to 5051 (socat bridge) across all docs
  - Add Artifact Stage column to validation results table mapping tests to raw output
  - Add docs commit (c2555ef) to PR #87 commit list
  - Clarify 5050 vs 5051 port mapping in architecture diagrams and data flow
  - Expand e2e_test_output.txt Stage 7/8 with sub-steps matching README table
  - Add SSH tunnel tip about the socat bridge still being required

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clarify uvicorn version discrepancy and complete commit list
  - Add note to gpu_vm_stack_versions.txt explaining that the full pip list
    is from Stage 5 (vLLM install) and uvicorn was later downgraded by VAGEN
  - Add b7efb4f to the commit list in README.md

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: guard flash-attn install for Ampere+ GPUs and validate training data
  - Check GPU compute capability before installing flash-attn; V100s (sm_70)
    don't support Flash Attention 2 (requires sm_80+) and would fail at
    build or runtime
  - Add post-preparation validation to prepare_training_data() ensuring the
    expected parquet files exist and are non-empty, rather than silently
    proceeding with missing data

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update test to match server_url default port 5000

  The generate_env_spec() default server_url is http://localhost:5000 (WAA
  Flask API port), not 5001. The test expectation was stale.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: split server_url/evaluate_url in training config and CLI args

  The two-port WAA architecture uses separate endpoints:
  - server_url (port 5000): WAA Flask API for screenshots and actions
  - evaluate_url (port 5001): evaluate_server for setup and evaluate

  Previously --waa-server defaulted to port 5001 and was assigned to
  server_url, conflating the two endpoints. This fixes:
  - train_verl_e2e.py: --waa-server default 5000, add --evaluate-server
  - vm_cli.py gpu-train: same CLI arg fixes, pass evaluate_url through
  - train_waa_vagen.yaml: correct server_url to 5000, add evaluate_url
  - Fix nested single quotes in register_waa_env (heredoc instead)
  - Replace fragile sys.path.insert with importlib.util

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct stale port in verl_env docstring and SSH tunnel comment
  - verl_env.py docstring: server_url example 5001 -> 5000, add evaluate_url
  - train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (socat bridge, not
    broken Docker port)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
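The is_action_valid and scroll_direction fixes described in the commits above can be sketched as follows. The action grammar (CLICK/TYPE/SCROLL/DONE), the regex, and the `parse_action` helper are illustrative assumptions, not the repository's actual code; only `BenchmarkAction.scroll_direction` and the "match the original action string" behavior come from the commit message.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical action grammar; the real verl_env parsing may differ.
_ACTION_RE = re.compile(
    r"^(?:"
    r"CLICK\(\s*[\d.]+\s*,\s*[\d.]+\s*\)"   # CLICK(x, y) with 0-1 coords
    r"|TYPE\(\".*\"\)"                       # TYPE("some text")
    r"|SCROLL\((?:up|down|left|right)\)"     # SCROLL(direction)
    r"|DONE\(\)"                             # episode termination
    r")$"
)

@dataclass
class BenchmarkAction:
    raw: str
    scroll_direction: Optional[str] = None

def is_action_valid(action: str) -> bool:
    # Match the ORIGINAL action string; the inverted version rejected DONE()
    # and accepted garbage.
    return _ACTION_RE.match(action.strip()) is not None

def parse_action(action: str) -> BenchmarkAction:
    parsed = BenchmarkAction(raw=action.strip())
    m = re.match(r"^SCROLL\((up|down|left|right)\)$", parsed.raw)
    if m:
        # Previously this field was left unset by SCROLL parsing.
        parsed.scroll_direction = m.group(1)
    return parsed
```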
1 parent 0b8f599 commit da17355
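The GPU fallback mechanism this commit adds (later reordered to prefer g5/Ampere over p3/Volta) amounts to walking an ordered candidate list until a provider reports capacity. A minimal sketch; the function name, region list, and exact ordering here are illustrative assumptions:

```python
from typing import Callable, Iterable, Optional, Tuple

# Ordering mirrors the later commit: prefer g5 (Ampere/A10G) over p3
# (Volta/V100), since the OSS NVIDIA driver requires GSP (Turing+).
GPU_INSTANCE_TYPE_FALLBACKS = ["g5.12xlarge", "g5.xlarge", "p3.8xlarge", "p3.2xlarge"]

def find_available_gpu_instance(
    has_capacity: Callable[[str, str], bool],
    regions: Iterable[str] = ("us-east-1", "us-west-2"),
) -> Optional[Tuple[str, str]]:
    """Return the first (instance_type, region) pair with capacity, or None."""
    for instance_type in GPU_INSTANCE_TYPE_FALLBACKS:
        for region in regions:
            if has_capacity(instance_type, region):
                return instance_type, region
    return None
```

The same shape works for Azure by swapping the candidate list for VM sizes (NC48ads_A100_v4, etc.).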

19 files changed: 2023 additions & 64 deletions

configs/train_waa_vagen.yaml

Lines changed: 41 additions & 18 deletions
```diff
@@ -1,19 +1,38 @@
 # VAGEN training config for WAA desktop automation
 #
 # This trains a VLM (e.g., Qwen2.5-VL-3B) to automate Windows desktop tasks
-# using GRPO/GiGPO via the verl-agent framework.
+# using GRPO/GiGPO via the VAGEN framework (verl-agent).
 #
 # Prerequisites:
-# 1. WAA server running (via SSH tunnel): ssh -L 5001:localhost:5050 azureuser@<VM_IP>
-# 2. VAGEN installed: pip install vagen
-# 3. Register env: add to vagen's env_registry.yaml:
+# 1. WAA server reachable (via SSH tunnel if needed):
+#    ssh -N -L 5000:localhost:5000 -L 5001:localhost:5051 azureuser@<VM_IP>
+#    Port 5000: WAA Flask API (/screenshot, /execute_windows)
+#    Port 5001: evaluate_server (/setup, /evaluate) via socat bridge from container :5050
+# 2. VAGEN installed on GPU VM (see scripts/setup_gpu_training.sh)
+# 3. openadapt-evals installed on GPU VM (pip install openadapt-evals)
+# 4. Register env in VAGEN's env_registry.yaml:
 #    WAADesktop: openadapt_evals.adapters.verl_env.WAADesktopEnv
+#    (automated by: scripts/train_verl_e2e.py or oa-vm gpu-train)
+#
+# Architecture:
+#   GPU VM                                 CPU VM
+#   ┌──────────────────────┐               ┌──────────────────┐
+#   │ VAGEN / verl         │               │ Docker           │
+#   │   GymAgentLoop       │     HTTP      │   QEMU (Win 11)  │
+#   │   WAADesktopEnv ─────│──────────────>│   WAA Flask API  │
+#   │   GiGPO/GRPO trainer │               │                  │
+#   │   vLLM inference     │               │                  │
+#   └──────────────────────┘               └──────────────────┘
 #
 # Usage:
-#   python -m vagen.train --config configs/train_waa_vagen.yaml
+#   # Via orchestration script (recommended):
+#   python scripts/train_verl_e2e.py --cloud aws --task-id <UUID>
+#
+#   # Via CLI:
+#   oa-vm gpu-train --cloud aws --task-id <UUID>
 #
 # For mock testing (no VM):
-#   Set server_url to "mock" and use WAAMockAdapter internally
+#   Set server_url to "mock" in env config

 # --- Model ---
 model:
@@ -26,41 +45,45 @@ model:
 #   target_modules: [q_proj, k_proj, v_proj, o_proj]

 # --- Environment ---
+# VAGEN loads envs from env_registry.yaml using these specs.
+# WAADesktopEnv implements GymImageEnv (async reset/step/close/system_prompt).
+# Each env instance connects to the WAA server independently via HTTP.
 envs:
   - name: WAADesktop
     n_envs: 8  # Number of parallel environments (= GRPO group size)
     data_source: waa
-    seed: [1, 100, 1]  # [start, end, step] for task selection
+    seed: [1, 100, 1]  # [start, end, step] for deterministic seeding
     max_turns: 15  # Max actions per episode
     response_length_per_turn: 512
     config:
-      server_url: "http://localhost:5001"
+      server_url: "http://localhost:5000"    # WAA Flask API (screenshots, actions)
+      evaluate_url: "http://localhost:5001"  # evaluate_server (setup, evaluate)
       task_id: "REPLACE_WITH_WAA_TASK_UUID"
      max_steps: 15
       evaluate_at_done: true
       action_type: fractional  # VLM outputs normalized 0-1 coordinates

-# --- Training (GRPO) ---
+# --- Training (GRPO/GiGPO) ---
 algorithm:
   name: grpo  # or "gigpo" for step-level advantages
   kl_coef: 0.0  # No KL penalty (DAPO/Open-Reasoner-Zero style)
-  epsilon: 0.2  # PPO clip range (inactive with single epoch)
-  gamma: 1.0  # No discounting for episodic tasks
+  epsilon: 0.2  # PPO clip range
+  gamma: 1.0  # No discounting for episodic tasks (use 0.95 for gigpo)

 trainer:
   total_epochs: 100
   n_gpus_per_node: 2  # Minimum for VLM training
   micro_batch_size: 4
   gradient_accumulation_steps: 2
+  test_freq: 5  # Evaluate every N epochs
+  experiment_name: grpo_waa_desktop
+  project_name: openadapt-waa-rl
+  logger:
+    - console
+    - wandb

 # --- Rollout ---
 rollout:
   temperature: 0.7
   top_p: 0.95
-  mode: async  # async sglang rollout for throughput
-
-# --- Logging ---
-logging:
-  project: openadapt-waa-rl
-  log_interval: 1
-  save_interval: 10
+  mode: async  # Async sglang rollout for throughput
```
docs/gpu_e2e_validation/README.md

Lines changed: 128 additions & 0 deletions
# GPU E2E Validation Report

**Date**: 2026-03-04
**Status**: VALIDATED
**PR**: [#87](https://github.com/OpenAdaptAI/openadapt-evals/pull/87) (`feat/gpu-training-automation`)
**Author**: OpenAdapt engineering
## Summary

End-to-end validation of the verl-agent/VAGEN training pipeline on AWS
g5.xlarge (NVIDIA A10G, 24 GB VRAM). The full integration chain
`WAADesktopEnv -> RLEnvironment -> WAALiveAdapter -> WAA Flask API` was
confirmed working with the GPU VM connecting to an Azure WAA VM
(`waa-pool-00`) via a two-port proxy architecture. Five issues were
discovered and resolved during validation.
## Architecture

```
GPU VM (AWS g5.xlarge)                 WAA VM (Azure waa-pool-00)
+---------------------------+          +---------------------------+
| verl-agent / VAGEN        |          | Docker                    |
|  +- WAADesktopEnv         |  HTTP    |  +- QEMU (Windows 11)     |
|  +- RLEnvironment         | -------> |  +- WAA Flask API         |
|  +- WAALiveAdapter        |  :5000   |     /screenshot           |
|                           |  :5051*  |     /execute_windows      |
| PyTorch 2.8.0             |          |  +- evaluate_server       |
| vLLM 0.11.0               |          |     /setup                |
| Ray 2.54.0                |          |     /evaluate             |
+---------------------------+          +---------------------------+
3.236.121.184                          172.173.66.131

* evaluate_server.py listens on port 5050 inside the Docker container.
  Docker port forwarding for 5050 is broken by QEMU NET_ADMIN, so a
  socat/nsenter UNIX socket bridge exposes it as port 5051 on the VM host.
  See architecture.md for details.
```

See [architecture.md](architecture.md) for the proxy chain deep dive.
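The host-side bridge described in the footnote follows the standard pattern of joining the container's network namespace with `nsenter` and relaying connections with `socat`. A sketch of assembling that command from Python; the container PID lookup, exact socat addresses, and the UNIX-socket intermediate hop used in the validated setup are simplified assumptions here:

```python
def build_bridge_command(container_pid: int,
                         listen_port: int = 5051,
                         target_port: int = 5050) -> str:
    """Build a host-side command bridging a container-internal TCP port.

    socat listens on the host; for each connection it execs nsenter, which
    joins the container's network namespace and relays to target_port.
    (Illustrative sketch of the pattern, not the exact validated command.)
    """
    return (
        f"socat TCP-LISTEN:{listen_port},fork,reuseaddr "
        f"EXEC:'nsenter -t {container_pid} -n "
        f"socat STDIO TCP\\:127.0.0.1\\:{target_port}'"
    )
```

In socat's EXEC address, colons inside the nested address must be backslash-escaped, since unescaped colons separate socat address components.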
## Environment

### GPU VM Specs

| Component     | Value                                                          |
|---------------|----------------------------------------------------------------|
| Instance type | g5.xlarge                                                      |
| GPU           | NVIDIA A10G Tensor Core (24 GB VRAM, Ampere, CC 8.6)           |
| vCPU          | 4 (AMD EPYC 7R13)                                              |
| Memory        | 16 GiB                                                         |
| OS            | Ubuntu 22.04 LTS                                               |
| AMI           | Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (20260222) |
| Region        | us-east-1                                                      |

### Software Stack

| Package      | Version |
|--------------|---------|
| PyTorch      | 2.8.0   |
| vLLM         | 0.11.0  |
| Ray          | 2.54.0  |
| VAGEN        | 26.2.5  |
| Transformers | 5.2.0   |
| CUDA Toolkit | 12.8    |
| cuDNN        | 9.10.2  |
| Python       | 3.12    |

Full version listing: [artifacts/gpu_vm_stack_versions.txt](artifacts/gpu_vm_stack_versions.txt)
## Validation Steps and Results

| #  | Test                              | Artifact Stage | Result               |
|----|-----------------------------------|----------------|----------------------|
| 1  | GPU detected (`nvidia-smi`)       | Stage 1        | PASS                 |
| 2  | Miniconda + conda env creation    | Stages 2-3b    | PASS (after TOS fix) |
| 3  | V100 -> A10G instance swap        | Stage 4        | PASS                 |
| 4  | vLLM 0.11.0 install + import      | Stage 5        | PASS                 |
| 5  | PyTorch 2.8.0 CUDA available      | Stage 6        | PASS                 |
| 6  | VAGEN install + env registry load | Stage 7*       | PASS                 |
| 7  | Docker port 5050 socat bridge     | Stage 7        | PASS                 |
| 8  | WAADesktopEnv reset + screenshot  | Stage 8        | PASS                 |
| 9  | WAALiveAdapter execute action     | Stage 8        | PASS                 |
| 10 | Full RLEnvironment step loop      | Stage 8        | PASS                 |

\* VAGEN install output also in [artifacts/vagen_registry_output.txt](artifacts/vagen_registry_output.txt).
## Issues Discovered

| # | Issue                    | Root Cause                                                  | Fix Applied                                          |
|---|--------------------------|-------------------------------------------------------------|------------------------------------------------------|
| 1 | Conda TOS error          | Miniconda 2025 requires explicit TOS acceptance             | `conda tos accept --override-channels --channel ...` |
| 2 | PyTorch version conflict | vLLM 0.11.0 pins `torch==2.8.0`; pip pulled 2.10.0          | `pip install torch==2.8.0 --upgrade`                 |
| 3 | V100 GPU incompatible    | V100 lacks GSP, required by the OSS NVIDIA driver (Turing+) | Switched p3.2xlarge (V100) to g5.xlarge (A10G)       |
| 4 | Docker port 5050 broken  | QEMU `NET_ADMIN` breaks Docker bridge networking            | UNIX socket bridge via `nsenter` + `socat`           |
| 5 | AMI selection            | Multiple DL AMI variants; wrong one wastes setup time       | Standardized on OSS Nvidia Driver + PyTorch 2.7 AMI  |

Details in [artifacts/e2e_test_output.txt](artifacts/e2e_test_output.txt).
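Issue 3 is also why this PR guards the flash-attn install: Flash Attention 2 requires compute capability 8.0+ (Ampere), while the V100 is sm_70. A minimal sketch of such a guard, assuming the capability string comes from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` (the real check lives in setup_gpu_training.sh and may differ):

```python
def supports_flash_attn_2(compute_cap: str) -> bool:
    """Flash Attention 2 needs sm_80+ (Ampere); V100 is sm_70 and must skip it.

    compute_cap is a string like "8.6", as printed by
    `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.
    """
    major, _, minor = compute_cap.strip().partition(".")
    return (int(major), int(minor or 0)) >= (8, 0)
```

The setup script can then skip `pip install flash-attn` entirely instead of failing at build or runtime on Volta.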
## Cost

| Metric             | Value                  |
|--------------------|------------------------|
| Instance cost      | $1.006/hr              |
| Validation runtime | ~30 min                |
| Estimated cost     | ~$0.50                 |
| Auto-shutdown      | 30 min post-validation |
## Commits (PR #87)

```
f9e5804 feat: add GPU training automation for verl-agent E2E workflow
dda3fb2 fix: correct verl-agent Hydra config paths and document integration gap
dc4f088 fix: replace EnvironmentManagerBase with VAGEN registry-based env integration
dc1f81f fix: correct is_action_valid logic, scroll_direction, stale refs, and DRY violation
308cade fix: resolve lint errors (undefined use_fast, unused imports, f-strings)
e73df70 fix: add evaluate_url support and E2E validation test
17c919b fix: use Deep Learning AMI for GPU instances and fix setup issues
c2555ef docs: add GPU E2E validation report with artifacts
b7efb4f fix: resolve port inconsistencies and add missing context in validation docs
```
## Next Steps

1. Merge PR #87 once CI passes
2. Relax the openadapt-ml PyTorch requirement to `>=2.8.0` (the current `>=2.9.1` conflicts with vLLM's `torch==2.8.0` pin)
3. Document the UNIX socket bridge in the deployment runbook
4. Evaluate spot instances for cost optimization during training runs
5. Run the first GRPO/GiGPO training loop on the validated stack
