Commit da17355
feat: add GPU training automation for verl-agent E2E workflow (#87)
* feat: add GPU training automation for verl-agent E2E workflow
- Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
- Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
- Update find_available_size_and_region(gpu=True) on both providers + protocol
- Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
- Add scripts/train_verl_e2e.py: provisions GPU VM, uploads setup, launches training
- Add oa-vm gpu-setup and gpu-train CLI commands
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
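The fallback behavior described above can be sketched as follows. This is an illustrative reconstruction, not the actual `find_available_size_and_region` implementation; `check_availability` stands in for the provider API call, and the size list mirrors the Azure names from the commit.

```python
# Hypothetical sketch of the GPU-size fallback strategy: try each size in
# preference order and return the first one the region can provision.
GPU_VM_SIZE_FALLBACKS = [
    "Standard_NC48ads_A100_v4",
    "Standard_NC24ads_A100_v4",
    "Standard_NC12s_v3",
]

def find_available_gpu_size(check_availability, sizes=GPU_VM_SIZE_FALLBACKS):
    """Return the first available size; raise with the full tried list on failure."""
    tried = []
    for size in sizes:
        tried.append(size)
        if check_availability(size):  # provider capacity/quota probe (stand-in)
            return size
    raise RuntimeError("No GPU capacity; tried: " + ", ".join(tried))
```

The AWS side follows the same pattern with `GPU_INSTANCE_TYPE_FALLBACKS` in place of the size list.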
* fix: correct verl-agent Hydra config paths and document integration gap
Validated all 17 Hydra config paths against verl-agent's actual schema
(ppo_trainer.yaml + make_envs()). Key fixes:
- env.env_name: use 'waa_desktop' short name, not Python import path
(verl-agent uses hardcoded dispatch, not dynamic imports)
- Remove env.env_kwargs (doesn't exist), use env.waa.* sub-keys
- Add data.train_files/val_files (required parquet, generated via
data_preprocess.prepare --mode visual)
- Add missing overrides: algorithm.gamma, gpu_memory_utilization,
ppo_mini_batch_size, filter_overlong_prompts, test_freq
- Add prepare_training_data() and patch_env_manager() steps
- Document the EnvironmentManagerBase integration gap in decision doc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
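Assembling those overrides into a launch command looks roughly like this. The override keys below are the ones named in the commit placed at plausible positions in verl's `ppo_trainer.yaml` schema; treat the exact key paths and values as illustrative assumptions.

```python
# Illustrative Hydra override assembly for a verl-agent training launch.
# Key paths and values are assumptions, not the project's verified config.
overrides = {
    "env.env_name": "waa_desktop",  # short registry name, not an import path
    "data.train_files": "data/waa/train.parquet",
    "data.val_files": "data/waa/val.parquet",
    "algorithm.gamma": 0.99,
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.6,
    "actor_rollout_ref.actor.ppo_mini_batch_size": 8,
    "data.filter_overlong_prompts": True,
    "trainer.test_freq": 5,
}

def hydra_cli(overrides):
    """Render key=value Hydra overrides onto the trainer entrypoint."""
    args = [f"{k}={v}" for k, v in overrides.items()]
    return ["python", "-m", "verl.trainer.main_ppo"] + args
```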
* fix: replace EnvironmentManagerBase with VAGEN registry-based env integration
The previous implementation incorrectly assumed verl-agent uses an
EnvironmentManagerBase ABC with a hardcoded make_envs() dispatch.
Research reveals VAGEN actually uses:
- GymImageEnv protocol (which WAADesktopEnv already implements)
- YAML-based env registry (vagen/configs/env_registry.yaml)
- GymAgentLoop for training-time rollout orchestration
Changes:
- Replace patch_env_manager() with register_waa_env() (YAML registry)
- Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
- Update launch_training() to generate proper VAGEN training config
- Fix Integration Gap section in decision doc (no EnvironmentManagerBase)
- Update training config YAML with architecture diagram
- Add 5 new tests for registration helpers (40 total, all passing)
- Export new helpers from adapters/__init__.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
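A minimal sketch of the YAML-registry registration step, assuming a registry entry of roughly this shape (the real schema lives in `vagen/configs/env_registry.yaml`, and the field names here are illustrative):

```python
from pathlib import Path

# Assumed shape of a VAGEN env_registry.yaml entry for the WAA desktop env.
ENTRY = """\
waa_desktop:
  env_class: openadapt_evals.adapters.verl_env.WAADesktopEnv
  env_config:
    server_url: http://localhost:5000
    evaluate_url: http://localhost:5001
"""

def register_waa_env(registry_path):
    """Append the WAA entry to the registry file; idempotent on re-runs."""
    path = Path(registry_path)
    existing = path.read_text() if path.exists() else ""
    if "waa_desktop:" in existing:
        return False  # already registered
    path.write_text(existing + ENTRY)
    return True
```

Registering by key name is what lets `env.env_name: waa_desktop` resolve without any hardcoded dispatch in VAGEN itself.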
* fix: correct is_action_valid logic, scroll_direction, stale refs, and DRY violation
Review fixes for the GPU training automation branch:
- Fix is_action_valid: was inverted (DONE()→invalid, garbage→valid), now uses
regex match on original action string
- Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
- Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
- Fix stale branch ref: setup_gpu_training.sh referenced merged spike branch, now uses main
- Fix stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in setup script
- Add --recurse-submodules to git clone (verl is a VAGEN submodule)
- Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
- Deduplicate training command: vm_cli.py now delegates to launch_training()
- Update test count in docs: 21 → 40+
- Add 3 new tests for is_action_valid behavior
- Add scroll_direction assertion to existing scroll test
All 43 tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
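The corrected validity check and scroll parsing can be sketched together. The action grammar below is a stand-in (the project's real patterns may differ); the point is that validity comes from a regex match on the original action string, and that SCROLL parsing yields a direction.

```python
import re

# Illustrative action grammar; validity is a match on the raw string.
_ACTION_RE = re.compile(
    r"^(CLICK\(\d+,\s*\d+\)|TYPE\(.+\)|SCROLL\((up|down)\)|DONE\(\))$"
)

def is_action_valid(action: str) -> bool:
    """True for well-formed actions (including DONE()), False for garbage."""
    return bool(_ACTION_RE.match(action.strip()))

def parse_scroll_direction(action: str):
    """Extract 'up'/'down' from a SCROLL action, else None."""
    m = re.match(r"^SCROLL\((up|down)\)$", action.strip())
    return m.group(1) if m else None
```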
* fix: resolve lint errors (undefined use_fast, unused imports, f-strings)
- Remove undefined `use_fast` guard; always log tried sizes on failure
- Remove unused PoolManager import in vm_cli.py
- Remove extraneous f-string prefixes
- Remove unused boto3 and SSH_OPTS imports in aws_vm.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add evaluate_url support and E2E validation test
WAADesktopEnv now correctly separates:
- server_url (port 5000): Windows VM Flask API (/screenshot, /execute_windows)
- evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)
Previously, the single server_url default pointed at 5001 (evaluate server only),
which caused 404s for screenshots and action execution.
Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G)
with UNIX socket bridge proxy chain to Azure WAA VM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
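The two-endpoint split can be sketched as a small holder class; the class shape is illustrative, but the ports and routes mirror the ones named above.

```python
# Sketch of the server_url / evaluate_url split. Routes per endpoint:
#   server_url   (5000): WAA Flask API  -> /screenshot, /execute_windows
#   evaluate_url (5001): evaluate_server -> /setup, /evaluate, /probe
class WAAEndpoints:
    def __init__(self,
                 server_url="http://localhost:5000",
                 evaluate_url="http://localhost:5001"):
        self.server_url = server_url
        self.evaluate_url = evaluate_url

    def screenshot_endpoint(self):
        return f"{self.server_url}/screenshot"

    def evaluate_endpoint(self):
        return f"{self.evaluate_url}/evaluate"
```

With a single URL defaulting to 5001, screenshot requests hit the evaluate server, which explains the 404s described above.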
* fix: use Deep Learning AMI for GPU instances and fix setup issues
- Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
- Add gpu param to create_vm() to select DL AMI vs standard Ubuntu
- Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3
(Volta/V100) since OSS NVIDIA driver requires GSP (Turing+)
- Make OPENADAPT_EVALS_BRANCH configurable via env var in setup script
- Add conda TOS acceptance step (required since Miniconda 2025)
Validated on AWS g5.xlarge with NVIDIA A10G 24GB GPU.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
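The "latest DL AMI" selection reduces to sorting `describe_images`-style results by creation date. The boto3 call itself is omitted here (it would filter image names like `Deep Learning*GPU AMI*`, an assumption about the exact pattern); the sketch shows only the selection logic on its results.

```python
# Pick the newest AMI from describe_images-style records.
# ISO-8601 CreationDate strings sort correctly as plain strings.
def latest_dl_ami(images):
    if not images:
        raise RuntimeError("No Deep Learning AMI found")
    return max(images, key=lambda img: img["CreationDate"])["ImageId"]
```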
* docs: add GPU E2E validation report with artifacts
Documents the successful end-to-end validation of the verl-agent/VAGEN
training pipeline on AWS g5.xlarge (A10G 24GB) connecting to Azure WAA VM.
Includes architecture diagrams, proxy chain details, raw test output,
version listings, and issues discovered during validation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: resolve port inconsistencies and add missing context in validation docs
- Standardize evaluate_url port to 5051 (socat bridge) across all docs
- Add Artifact Stage column to validation results table mapping tests to raw output
- Add docs commit (c2555ef) to PR #87 commit list
- Clarify 5050 vs 5051 port mapping in architecture diagrams and data flow
- Expand e2e_test_output.txt Stage 7/8 with sub-steps matching README table
- Add SSH tunnel tip about socat bridge still being required
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: clarify uvicorn version discrepancy and complete commit list
- Add note to gpu_vm_stack_versions.txt explaining that the full pip list
is from Stage 5 (vLLM install) and uvicorn was later downgraded by VAGEN
- Add b7efb4f to the commit list in README.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: guard flash-attn install for Ampere+ GPUs and validate training data
- Check GPU compute capability before installing flash-attn; V100s (sm_70)
don't support Flash Attention 2 (requires sm_80+) and would fail at build
or runtime
- Add post-preparation validation to prepare_training_data() ensuring the
expected parquet files exist and are non-empty, rather than silently
proceeding with missing data
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
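The compute-capability guard is a simple version comparison: Flash Attention 2 needs sm_80 or newer. A sketch of the check, assuming the capability arrives as a `"major.minor"` string (the form reported by `nvidia-smi --query-gpu=compute_cap` on recent drivers):

```python
# Flash Attention 2 requires Ampere or newer (sm_80+); V100 is sm_70.
def supports_flash_attn2(compute_capability: str) -> bool:
    major, _, minor = compute_capability.strip().partition(".")
    return (int(major), int(minor or 0)) >= (8, 0)
```

In the setup script this gates the `pip install flash-attn` step, so a V100 box skips the install instead of failing at build or runtime.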
* fix: update test to match server_url default port 5000
The generate_env_spec() default server_url is http://localhost:5000
(WAA Flask API port), not 5001. The test expectation was stale.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: split server_url/evaluate_url in training config and CLI args
The two-port WAA architecture uses separate endpoints:
- server_url (port 5000): WAA Flask API for screenshots and actions
- evaluate_url (port 5001): evaluate_server for setup and evaluate
Previously --waa-server defaulted to port 5001 and was assigned to
server_url, conflating the two endpoints. This fixes:
- train_verl_e2e.py: --waa-server default 5000, add --evaluate-server
- vm_cli.py gpu-train: same CLI arg fixes, pass evaluate_url through
- train_waa_vagen.yaml: correct server_url to 5000, add evaluate_url
- Fix nested single quotes in register_waa_env (heredoc instead)
- Replace fragile sys.path.insert with importlib.util
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
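The corrected CLI surface can be sketched with argparse. Flag names match the commit; defaults mirror the two-port layout, and the help strings are paraphrases.

```python
import argparse

# Illustrative argparse shape for the server_url / evaluate_url split.
def build_parser():
    p = argparse.ArgumentParser(description="Launch WAA training (sketch)")
    p.add_argument("--waa-server", default="http://localhost:5000",
                   help="WAA Flask API: screenshots and action execution")
    p.add_argument("--evaluate-server", default="http://localhost:5001",
                   help="evaluate_server: task setup and evaluation")
    return p
```

Each flag feeds its own config key (`server_url` and `evaluate_url` respectively), rather than one flag being assigned to both.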
* fix: correct stale port in verl_env docstring and SSH tunnel comment
- verl_env.py docstring: server_url example 5001 -> 5000, add evaluate_url
- train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (socat bridge, not
broken Docker port)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0b8f599, commit da17355
19 files changed
Lines changed: 2023 additions & 64 deletions
File tree
- configs
- docs
- gpu_e2e_validation
- artifacts
- openadapt_evals
- adapters
- _vendored
- benchmarks
- infrastructure
- scripts
- tests