docs: add first training run runbook with pre-flight checklist (#99)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
# Training Runbook: First GRPO/GiGPO Training Loop on WAA

## Overview

Step-by-step playbook for running verl-agent/VAGEN RL training on a GPU VM
connected to a WAA Windows VM. Validated on AWS g5.xlarge (A10G 24GB) with
an Azure WAA VM (waa-pool-00).

**Stack**: PyTorch 2.8.0, vLLM 0.11.0, Ray 2.54.0, VAGEN, Qwen2.5-VL-3B-Instruct

## Pre-Flight Checklist

### Azure WAA VM (waa-pool-00)
```
[ ] VM running:
    az vm show -n waa-pool-00 -g openadapt-agents --query powerState -o tsv

[ ] IP confirmed:
    az vm show -n waa-pool-00 -g openadapt-agents -d --query publicIps -o tsv

[ ] Docker container running:
    ssh azureuser@<WAA_IP> "docker ps --format '{{.Names}} {{.Status}}'"

[ ] Port 5000 (Flask API):
    curl -s http://<WAA_IP>:5000/probe | head -5

[ ] Port 5051 (socat bridge for evaluate_server):
    curl -s http://<WAA_IP>:5051/probe | head -5
    If it fails, re-establish the bridge:
    CONTAINER_PID=$(ssh azureuser@<WAA_IP> "docker inspect --format '{{.State.Pid}}' <container>")
    ssh azureuser@<WAA_IP> "rm -f /tmp/waa-bridge.sock"
    ssh azureuser@<WAA_IP> "nsenter -t $CONTAINER_PID -n socat UNIX-LISTEN:/tmp/waa-bridge.sock,fork TCP:localhost:5050 &"
    ssh azureuser@<WAA_IP> "socat TCP-LISTEN:5051,fork,reuseaddr UNIX-CONNECT:/tmp/waa-bridge.sock &"

[ ] Task setup works:
    curl -s -X POST http://<WAA_IP>:5051/setup \
      -H "Content-Type: application/json" \
      -d '{"task_id":"<TASK_UUID>"}'
```
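The task-setup call can also be issued from Python. A minimal sketch using only the standard library (the `/setup` endpoint and JSON payload come from the curl command above; the IP and UUID below are placeholders):

```python
import json
import urllib.request

def build_setup_request(waa_ip: str, task_id: str) -> urllib.request.Request:
    """Build the same POST /setup request the checklist issues via curl."""
    body = json.dumps({"task_id": task_id}).encode()
    return urllib.request.Request(
        f"http://{waa_ip}:5051/setup",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with urllib.request.urlopen(req, timeout=30) once <WAA_IP> is known.
req = build_setup_request("203.0.113.10", "00000000-0000-0000-0000-000000000000")
print(req.full_url)  # http://203.0.113.10:5051/setup
```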

### AWS GPU VM

```
[ ] nvidia-smi works and shows expected GPU(s)
[ ] conda activate verl-agent works
[ ] python -c "import vagen; print(vagen.__file__)" succeeds
[ ] python -c "from openadapt_evals.adapters.verl_env import WAADesktopEnv" succeeds
[ ] WAADesktop registered in ~/verl-agent/vagen/configs/env_registry.yaml
[ ] WAA VM reachable: curl -s http://<WAA_IP>:5000/probe
[ ] wandb configured: wandb login --verify
[ ] Disk space: df -h / (need 50GB+ free)
```
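The disk-space item can be checked programmatically as well. A small sketch (the 50GB threshold comes from the checklist above; the helper name is ours):

```python
import shutil

def enough_disk(path: str = "/", need_gb: float = 50.0) -> bool:
    """True if `path` has at least `need_gb` GB free (mirrors `df -h /`)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= need_gb

print(enough_disk("/"))  # expect True before launching a run
```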

### Connectivity Smoke Test

```bash
# From the GPU VM
conda run -n verl-agent python3 -c "
import asyncio
from openadapt_evals.adapters.verl_env import WAADesktopEnv
env = WAADesktopEnv({
    'server_url': 'http://<WAA_IP>:5000',
    'evaluate_url': 'http://<WAA_IP>:5051',
    'task_id': '<TASK_UUID>',
    'max_steps': 3,
    'evaluate_at_done': True,
    'action_type': 'fractional',
})
obs, info = asyncio.run(env.reset(seed=42))
print('Reset OK, obs keys:', obs.keys())
obs, reward, done, info = asyncio.run(env.step('CLICK(x=0.5, y=0.5)'))
print(f'Step OK, reward={reward}, done={done}')
asyncio.run(env.close())
print('Smoke test passed!')
"
```

## Instance Selection

| Instance | GPUs | VRAM | $/hr (OD) | $/hr (Spot) | Use Case |
|----------|------|------|-----------|-------------|----------|
| g5.xlarge | 1x A10G | 24GB | $1.006 | $0.43 | Smoke test, single-GPU dev |
| g5.2xlarge | 1x A10G | 24GB | $1.21 | ~$0.52 | Single-GPU with more RAM |
| g5.12xlarge | 4x A10G | 96GB | $5.67 | $2.90 | Multi-GPU training (recommended) |
| g6.12xlarge | 4x L4 | 96GB | $4.60 | $2.26 | Budget multi-GPU alternative |

## Key Architecture Constraints

1. **n_envs must be 1** — only one WAA VM; multiple envs would clobber state
2. **Use `rollout.n` for GRPO group size** — generates N responses sequentially, not parallel envs
3. **Entry point is `vagen.main_ppo`**, not `verl.trainer.main_ppo` — VAGEN extends verl with multi-turn agent support
4. **Hydra config system** — use `--config-path` and `--config-name=vagen_multiturn`
5. **Use `rollout.name=vllm`** — already validated; VAGEN examples use sglang but vLLM works
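Constraint 2 is why reward variance matters: GRPO computes each trajectory's advantage relative to the group of `rollout.n` responses sampled for the same prompt. A schematic sketch of the group normalization (illustrative only; verl's actual implementation differs in details such as epsilon handling and token-level broadcasting):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Normalize each reward against its own rollout group's mean/std."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0.0:
        # All rollout.n responses got the same reward (e.g. all 0.0):
        # every advantage is zero and the step teaches nothing.
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]

print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))  # the one success stands out
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

This is also why the monitoring section below treats `reward_std > 0` as a health signal.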

## Launch Commands

### Step 1: Create training data YAML on GPU VM

```bash
cat > ~/verl-agent/train_waa.yaml << 'EOF'
envs:
  - name: WAADesktop
    n_envs: 1
    data_source: waa
    seed: [1, 100, 1]
    max_turns: 15
    response_length_per_turn: 512
    config:
      server_url: "http://<WAA_IP>:5000"
      evaluate_url: "http://<WAA_IP>:5051"
      task_id: "<TASK_UUID>"
      max_steps: 15
      evaluate_at_done: true
      action_type: fractional
EOF
cp ~/verl-agent/train_waa.yaml ~/verl-agent/val_waa.yaml
```
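A config typo here tends to surface only minutes into a run, so a quick sanity check can pay for itself. A sketch (the required keys mirror the `config:` block above; the helper and its schema assumptions are ours, not VAGEN's):

```python
REQUIRED_KEYS = {"server_url", "evaluate_url", "task_id",
                 "max_steps", "evaluate_at_done", "action_type"}

def check_env_config(cfg: dict, n_envs: int = 1) -> None:
    """Raise early on config mistakes that would otherwise fail mid-run."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config missing keys: {sorted(missing)}")
    if n_envs != 1:
        raise ValueError("n_envs must be 1: only one WAA VM is available")

check_env_config({
    "server_url": "http://203.0.113.10:5000",
    "evaluate_url": "http://203.0.113.10:5051",
    "task_id": "00000000-0000-0000-0000-000000000000",
    "max_steps": 15,
    "evaluate_at_done": True,
    "action_type": "fractional",
})  # passes silently
```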

### Step 2: Launch GRPO training

```bash
cd ~/verl-agent && \
PYTHONUNBUFFERED=1 conda run -n verl-agent python3 -m vagen.main_ppo \
    --config-path=$(pwd)/vagen/configs \
    --config-name=vagen_multiturn \
    data.train_files=$(pwd)/train_waa.yaml \
    data.val_files=$(pwd)/val_waa.yaml \
    data.train_batch_size=1 \
    data.max_prompt_length=2048 \
    data.max_response_length=512 \
    data.return_raw_chat=True \
    data.return_multi_modal_inputs=True \
    algorithm.adv_estimator=grpo \
    algorithm.kl_ctrl.kl_coef=0.0 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-VL-3B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=1 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.multi_turn.enable=True \
    actor_rollout_ref.rollout.agent.agent_loop_config_path=$(pwd)/vagen/configs/agent.yaml \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_training_steps=10 \
    trainer.test_freq=5 \
    trainer.save_freq=10 \
    trainer.val_before_train=True \
    trainer.logger=[console,wandb] \
    trainer.project_name=openadapt-waa-rl \
    trainer.experiment_name=grpo_waa_smoke \
    2>&1 | tee ~/grpo_waa_training.log
```
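The flags above imply a fixed rollout volume (assuming one trajectory per prompt per sampled response; exact semantics may vary across verl versions): each step collects `train_batch_size × rollout.n` episodes, and with `n_envs=1` they run sequentially against the single WAA VM, so wall-clock time per step scales linearly with the group size.

```python
train_batch_size = 1   # data.train_batch_size
group_size = 4         # actor_rollout_ref.rollout.n
total_steps = 10       # trainer.total_training_steps

episodes_per_step = train_batch_size * group_size
total_episodes = episodes_per_step * total_steps
print(episodes_per_step, total_episodes)  # 4 40
```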

## Monitoring

### WandB Metrics

| Metric | Healthy Sign |
|--------|--------------|
| `train/reward_mean` | Increasing from ~0.0 |
| `train/reward_std` | > 0 (needed for GRPO signal) |
| `train/reward_max` | Hits 1.0 = first success |
| `train/entropy` | Decreasing (more decisive) |
| `rollout/episode_length` | Varying (not all hitting max) |

### GPU Health

```bash
watch -n 5 nvidia-smi   # Memory <20GB/GPU with offloading
tail -f ~/grpo_waa_training.log
```

## Iteration Plan

| Run | Steps | Instance | Cost | Goal |
|-----|-------|----------|------|------|
| 0 (Smoke) | 2-3 | g5.xlarge | ~$5 | Pipeline runs without crashes |
| 1 (Signal) | 10 | g5.12xlarge | ~$50 | Rewards computed, wandb logs |
| 2 (Training) | 50 | g5.12xlarge | ~$250 | Look for reward_mean trend |
| 3 (Extended) | 100+ | g5.12xlarge | ~$500 | Only if Run 2 shows signal |

## Common Failure Modes

| Issue | Symptom | Fix |
|-------|---------|-----|
| OOM | `CUDA out of memory` | Reduce `gpu_memory_utilization` to 0.4, reduce `rollout.n` to 2 |
| WAA unresponsive | Timeout/ConnectionError | Check Docker, re-establish socat bridge. NEVER `az vm restart` |
| PyAutoGUI fail-safe | `FailSafeException` | `curl -X POST .../execute -d '{"command":"python -c \"import pyautogui; pyautogui.FAILSAFE=False; pyautogui.moveTo(500,400)\""}'` |
| WAADesktop not found | `KeyError: 'WAADesktop'` | Re-register in env_registry.yaml, verify import path |
| All rewards 0.0 | No learning signal | Check evaluate endpoint; task may be too hard for 3B |
| Ray issues | Dead workers | `ray stop --force && ray start --head --num-gpus=4` |
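For the "WAA unresponsive" row, a retry with exponential backoff is often enough before resorting to re-establishing the bridge. A stdlib sketch (the `/probe` URL comes from the pre-flight checklist; the helper itself is illustrative):

```python
import time
import urllib.error
import urllib.request

def probe_with_backoff(url: str, attempts: int = 5, base_delay: float = 2.0) -> bool:
    """Return True on the first HTTP 200 from `url`; back off 2s, 4s, 8s, ..."""
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused / timed out: wait and retry
        time.sleep(base_delay * 2 ** i)
    return False

# probe_with_backoff("http://<WAA_IP>:5051/probe")
```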

## Success Criteria

For the first run (pipeline validation):

- Pipeline runs without crashes
- Rollouts complete (episodes reach DONE or max_steps)
- `reward_std > 0` (variance in outcomes)
- Actions are parseable (`is_action_valid` mostly True)

For the 04d9aeaf task (LibreOffice Calc, extremely hard):

- **Minimum**: at least 1 episode scores 1.0 within 100 steps
- **Good**: reward_mean > 0.1 after 100 steps
- **Note**: even Claude Sonnet scored 0/1 on this task. Consider easier tasks (notepad, settings) for faster iteration.