# Training Runbook: First GRPO/GiGPO Training Loop on WAA

## Overview

Step-by-step playbook for running verl-agent/VAGEN RL training on a GPU VM
connected to a WAA Windows VM. Validated on AWS g5.xlarge (A10G 24GB) with
an Azure WAA VM (waa-pool-00).

**Stack**: PyTorch 2.8.0, vLLM 0.11.0, Ray 2.54.0, VAGEN, Qwen2.5-VL-3B-Instruct

## Pre-Flight Checklist

### Azure WAA VM (waa-pool-00)

```
[ ] VM running:
    az vm show -n waa-pool-00 -g openadapt-agents --query powerState -o tsv

[ ] IP confirmed:
    az vm show -n waa-pool-00 -g openadapt-agents -d --query publicIps -o tsv

[ ] Docker container running:
    ssh azureuser@<WAA_IP> "docker ps --format '{{.Names}} {{.Status}}'"

[ ] Port 5000 (Flask API):
    curl -s http://<WAA_IP>:5000/probe | head -5

[ ] Port 5051 (socat bridge for evaluate_server):
    curl -s http://<WAA_IP>:5051/probe | head -5
    If it fails, re-establish the bridge:
      CONTAINER_PID=$(ssh azureuser@<WAA_IP> "docker inspect --format '{{.State.Pid}}' <container>")
      ssh azureuser@<WAA_IP> "rm -f /tmp/waa-bridge.sock"
      ssh azureuser@<WAA_IP> "nsenter -t $CONTAINER_PID -n socat UNIX-LISTEN:/tmp/waa-bridge.sock,fork TCP:localhost:5050 &"
      ssh azureuser@<WAA_IP> "socat TCP-LISTEN:5051,fork,reuseaddr UNIX-CONNECT:/tmp/waa-bridge.sock &"

[ ] Task setup works:
    curl -s -X POST http://<WAA_IP>:5051/setup \
      -H "Content-Type: application/json" \
      -d '{"task_id":"<TASK_UUID>"}'
```
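
The port probes above can be scripted for a one-shot pass/fail report. A minimal Python sketch; the `/probe` endpoints and the `<WAA_IP>` placeholder come from this runbook, while the helper names (`probe_url`, `is_up`, `preflight`) are mine:

```python
import urllib.request
import urllib.error

def probe_url(ip: str, port: int) -> str:
    """Build the health-check URL for a WAA service port."""
    return f"http://{ip}:{port}/probe"

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False

def preflight(ip: str) -> dict:
    """Probe the Flask API (5000) and the socat bridge (5051)."""
    return {port: is_up(probe_url(ip, port)) for port in (5000, 5051)}

if __name__ == "__main__":
    ip = "<WAA_IP>"  # replace with the IP from the az vm show command above
    if "<" in ip:
        print("Set ip to the real WAA address first.")
    else:
        for port, ok in sorted(preflight(ip).items()):
            print(f"port {port}: {'OK' if ok else 'FAILED'}")
```

If port 5000 passes but 5051 fails, only the socat bridge needs re-establishing; the container itself is fine.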

### AWS GPU VM

```
[ ] nvidia-smi works and shows expected GPU(s)
[ ] conda activate verl-agent works
[ ] python -c "import vagen; print(vagen.__file__)" succeeds
[ ] python -c "from openadapt_evals.adapters.verl_env import WAADesktopEnv" succeeds
[ ] WAADesktop registered in ~/verl-agent/vagen/configs/env_registry.yaml
[ ] WAA VM reachable: curl -s http://<WAA_IP>:5000/probe
[ ] wandb configured: wandb login --verify
[ ] Disk space: df -h / (need 50GB+ free)
```

### Connectivity Smoke Test

```bash
# From GPU VM
conda run -n verl-agent python3 -c "
import asyncio
from openadapt_evals.adapters.verl_env import WAADesktopEnv
env = WAADesktopEnv({
    'server_url': 'http://<WAA_IP>:5000',
    'evaluate_url': 'http://<WAA_IP>:5051',
    'task_id': '<TASK_UUID>',
    'max_steps': 3,
    'evaluate_at_done': True,
    'action_type': 'fractional',
})
obs, info = asyncio.run(env.reset(seed=42))
print('Reset OK, obs keys:', obs.keys())
obs, reward, done, info = asyncio.run(env.step('CLICK(x=0.5, y=0.5)'))
print(f'Step OK, reward={reward}, done={done}')
asyncio.run(env.close())
print('Smoke test passed!')
"
```

## Instance Selection

| Instance | GPUs | VRAM | $/hr (OD) | $/hr (Spot) | Use Case |
|----------|------|------|-----------|-------------|----------|
| g5.xlarge | 1x A10G | 24GB | $1.006 | $0.43 | Smoke test, single-GPU dev |
| g5.2xlarge | 1x A10G | 24GB | $1.21 | ~$0.52 | Single-GPU with more RAM |
| g5.12xlarge | 4x A10G | 96GB | $5.67 | $2.90 | Multi-GPU training (recommended) |
| g6.12xlarge | 4x L4 | 96GB | $4.60 | $2.26 | Budget multi-GPU alternative |

## Key Architecture Constraints

1. **n_envs must be 1** — only one WAA VM; multiple envs would clobber state
2. **Use `rollout.n` for GRPO group size** — generates N responses sequentially, not parallel envs
3. **Entry point is `vagen.main_ppo`**, not `verl.trainer.main_ppo` — VAGEN extends verl with multi-turn agent support
4. **Hydra config system** — use `--config-path` and `--config-name=vagen_multiturn`
5. **Use `rollout.name=vllm`** — already validated; VAGEN examples use sglang but vLLM works

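Constraint 2 matters because GRPO computes advantages by normalizing rewards within each group of `rollout.n` responses to the same prompt. A rough sketch of that group normalization (my own illustration of the idea, not VAGEN's actual code):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of rollout.n responses.
    If every response gets the same reward (std == 0), all advantages are
    ~0 and the group contributes no learning signal -- hence the
    `train/reward_std > 0` health check in the Monitoring section."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# With rollout.n=4: one success in the group gives it a positive advantage
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
# All-zero group: no signal for this prompt
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))
```

This is also why a task the 3B model never solves yields no gradient signal: every group is all zeros.
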
## Launch Commands

### Step 1: Create training data YAML on GPU VM

```bash
cat > ~/verl-agent/train_waa.yaml << 'EOF'
envs:
  - name: WAADesktop
    n_envs: 1
    data_source: waa
    seed: [1, 100, 1]
    max_turns: 15
    response_length_per_turn: 512
    config:
      server_url: "http://<WAA_IP>:5000"
      evaluate_url: "http://<WAA_IP>:5051"
      task_id: "<TASK_UUID>"
      max_steps: 15
      evaluate_at_done: true
      action_type: fractional
EOF
cp ~/verl-agent/train_waa.yaml ~/verl-agent/val_waa.yaml
```
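
Before launching, it is worth sanity-checking the YAML against the constraints above (placeholders replaced, `n_envs: 1`, `max_turns` consistent with `max_steps`). A sketch; the rules mirror this runbook, `validate_env` is my own helper, and PyYAML is assumed for the file-loading part:

```python
def validate_env(env: dict) -> list[str]:
    """Return a list of problems with one env entry from train_waa.yaml."""
    problems = []
    if env.get("n_envs") != 1:
        problems.append("n_envs must be 1 (single WAA VM)")
    cfg = env.get("config", {})
    for key in ("server_url", "evaluate_url", "task_id"):
        val = str(cfg.get(key, ""))
        if not val or "<" in val:  # unreplaced <WAA_IP>/<TASK_UUID> placeholder
            problems.append(f"config.{key} looks unset: {val!r}")
    if env.get("max_turns") != cfg.get("max_steps"):
        problems.append("max_turns and config.max_steps disagree")
    return problems

if __name__ == "__main__":
    import os
    if os.path.exists("train_waa.yaml"):
        import yaml  # PyYAML -- assumed available in the verl-agent env
        with open("train_waa.yaml") as f:
            for env in yaml.safe_load(f)["envs"]:
                for problem in validate_env(env):
                    print("WARN:", problem)
```

Catching an unreplaced `<WAA_IP>` here is much cheaper than discovering it mid-rollout on a 4-GPU instance.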

### Step 2: Launch GRPO training

```bash
cd ~/verl-agent && \
PYTHONUNBUFFERED=1 conda run -n verl-agent python3 -m vagen.main_ppo \
  --config-path=$(pwd)/vagen/configs \
  --config-name=vagen_multiturn \
  data.train_files=$(pwd)/train_waa.yaml \
  data.val_files=$(pwd)/val_waa.yaml \
  data.train_batch_size=1 \
  data.max_prompt_length=2048 \
  data.max_response_length=512 \
  data.return_raw_chat=True \
  data.return_multi_modal_inputs=True \
  algorithm.adv_estimator=grpo \
  algorithm.kl_ctrl.kl_coef=0.0 \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-VL-3B-Instruct \
  actor_rollout_ref.model.enable_gradient_checkpointing=True \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.actor.ppo_mini_batch_size=1 \
  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
  actor_rollout_ref.actor.fsdp_config.param_offload=True \
  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.mode=async \
  actor_rollout_ref.rollout.n=4 \
  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
  actor_rollout_ref.rollout.enforce_eager=True \
  actor_rollout_ref.rollout.enable_chunked_prefill=True \
  actor_rollout_ref.rollout.multi_turn.enable=True \
  actor_rollout_ref.rollout.agent.agent_loop_config_path=$(pwd)/vagen/configs/agent.yaml \
  actor_rollout_ref.ref.fsdp_config.param_offload=True \
  trainer.n_gpus_per_node=4 \
  trainer.nnodes=1 \
  trainer.total_training_steps=10 \
  trainer.test_freq=5 \
  trainer.save_freq=10 \
  trainer.val_before_train=True \
  trainer.logger=[console,wandb] \
  trainer.project_name=openadapt-waa-rl \
  trainer.experiment_name=grpo_waa_smoke \
  2>&1 | tee ~/grpo_waa_training.log
```

## Monitoring

### WandB Metrics

| Metric | Healthy Sign |
|--------|-------------|
| `train/reward_mean` | Increasing from ~0.0 |
| `train/reward_std` | > 0 (needed for GRPO signal) |
| `train/reward_max` | Hits 1.0 = first success |
| `train/entropy` | Decreasing (more decisive) |
| `rollout/episode_length` | Varying (not all hitting max) |

### GPU Health

```bash
watch -n 5 nvidia-smi  # Memory <20GB/GPU with offloading
tail -f ~/grpo_waa_training.log
```
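
For a quick look at reward trends without opening wandb, the tee'd log can be grepped. The exact line format depends on verl's console logger, so the regex below is an assumption to adapt to what the log actually contains:

```python
import re

# Assumed log-line shape, e.g. "... train/reward_mean: 0.125 ...";
# adjust the pattern to match the real console-logger output.
METRIC_RE = re.compile(r"reward_mean[=:\s]+(-?\d+(?:\.\d+)?)")

def extract_reward_means(lines) -> list[float]:
    """Pull reward_mean values, in order, from an iterable of log lines."""
    out = []
    for line in lines:
        m = METRIC_RE.search(line)
        if m:
            out.append(float(m.group(1)))
    return out

if __name__ == "__main__":
    import os
    path = os.path.expanduser("~/grpo_waa_training.log")  # path from the tee above
    if os.path.exists(path):
        with open(path) as f:
            means = extract_reward_means(f)
        print(f"{len(means)} values, last 5: {means[-5:]}")
```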

## Iteration Plan

| Run | Steps | Instance | Cost | Goal |
|-----|-------|----------|------|------|
| 0 (Smoke) | 2-3 | g5.xlarge | ~$5 | Pipeline runs without crashes |
| 1 (Signal) | 10 | g5.12xlarge | ~$50 | Rewards computed, wandb logs |
| 2 (Training) | 50 | g5.12xlarge | ~$250 | Look for reward_mean trend |
| 3 (Extended) | 100+ | g5.12xlarge | ~$500 | Only if Run 2 shows signal |

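The cost column follows from wall-clock-hours times instance rate. A sketch of the arithmetic; the minutes-per-step figure is a guess to replace with your own measurement from Run 0:

```python
def run_cost(steps: int, minutes_per_step: float, rate_per_hour: float) -> float:
    """Estimated dollar cost of a run: wall-clock hours x hourly instance rate."""
    return steps * minutes_per_step / 60 * rate_per_hour

# e.g. Run 2: 50 steps at ~50 min/step (guess -- multi-turn desktop rollouts
# are slow) on g5.12xlarge on-demand ($5.67/hr), roughly the ~$250 in the table
print(round(run_cost(50, 50, 5.67), 2))
```

Measuring minutes-per-step on the smoke run before committing to Run 2 or 3 keeps the budget honest.
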
## Common Failure Modes

| Issue | Symptom | Fix |
|-------|---------|-----|
| OOM | `CUDA out of memory` | Reduce `gpu_memory_utilization` to 0.4, reduce `rollout.n` to 2 |
| WAA unresponsive | Timeout/ConnectionError | Check Docker, re-establish the socat bridge. NEVER `az vm restart` |
| PyAutoGUI fail-safe | `FailSafeException` | `curl -X POST .../execute -d '{"command":"python -c \"import pyautogui; pyautogui.FAILSAFE=False; pyautogui.moveTo(500,400)\""}'` |
| WAADesktop not found | `KeyError: 'WAADesktop'` | Re-register in env_registry.yaml, verify import path |
| All rewards 0.0 | No learning signal | Check the evaluate endpoint; the task may be too hard for a 3B model |
| Ray issues | Dead workers | `ray stop --force && ray start --head --num-gpus=4` |

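Transient WAA timeouts during rollouts can be absorbed with a retry wrapper instead of failing the whole episode. A sketch (the wrapper is my own; the "never restart the VM" guidance is from the table above):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 2.0):
    """Retry a flaky WAA call with exponential backoff.
    Connection errors usually mean the socat bridge dropped; if retries
    keep failing, re-establish the bridge (see Pre-Flight Checklist)
    rather than `az vm restart`."""
    for i in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError, OSError):
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # 2s, 4s, 8s, ...
```

Wrapping only the HTTP calls (not the whole episode) keeps a single dropped packet from discarding an otherwise good rollout.
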
## Success Criteria

For the first run (pipeline validation):
- Pipeline runs without crashes
- Rollouts complete (episodes reach DONE or max_steps)
- `reward_std > 0` (variance in outcomes)
- Actions are parseable (`is_action_valid` mostly True)

For the 04d9aeaf task (LibreOffice Calc, extremely hard):
- **Minimum**: at least 1 episode scores 1.0 within 100 steps
- **Good**: reward_mean > 0.1 after 100 steps
- **Note**: even Claude Sonnet scored 0/1 on this task. Consider easier tasks (notepad, settings) for faster iteration.