From 808e4690d6caa03d7040e9b08f33067bf34e4656 Mon Sep 17 00:00:00 2001
From: Richard Abrich
Date: Tue, 3 Mar 2026 23:09:22 -0500
Subject: [PATCH] docs: add first training run runbook with pre-flight checklist

Co-Authored-By: Claude Opus 4.6
---
 docs/training_runbook.md | 215 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 215 insertions(+)
 create mode 100644 docs/training_runbook.md

diff --git a/docs/training_runbook.md b/docs/training_runbook.md
new file mode 100644
index 0000000..8b26881
--- /dev/null
+++ b/docs/training_runbook.md
@@ -0,0 +1,215 @@
+# Training Runbook: First GRPO/GiGPO Training Loop on WAA
+
+## Overview
+
+Step-by-step playbook for running verl-agent/VAGEN RL training on a GPU VM
+connected to a WAA Windows VM. Validated on AWS g5.xlarge (A10G 24GB) with
+Azure WAA VM (waa-pool-00).
+
+**Stack**: PyTorch 2.8.0, vLLM 0.11.0, Ray 2.54.0, VAGEN, Qwen2.5-VL-3B-Instruct
+
+## Pre-Flight Checklist
+
+### Azure WAA VM (waa-pool-00)
+
+```
+[ ] VM running:
+    az vm show -n waa-pool-00 -g openadapt-agents -d --query powerState -o tsv
+
+[ ] IP confirmed:
+    az vm show -n waa-pool-00 -g openadapt-agents -d --query publicIps -o tsv
+
+[ ] Docker container running:
+    ssh azureuser@ "docker ps --format '{{.Names}} {{.Status}}'"
+
+[ ] Port 5000 (Flask API):
+    curl -s http://:5000/probe | head -5
+
+[ ] Port 5051 (socat bridge for evaluate_server):
+    curl -s http://:5051/probe | head -5
+    If this fails, re-establish the bridge:
+    CONTAINER_PID=$(ssh azureuser@ "docker inspect --format '{{.State.Pid}}' ")
+    ssh azureuser@ "rm -f /tmp/waa-bridge.sock"
+    ssh azureuser@ "nsenter -t $CONTAINER_PID -n socat UNIX-LISTEN:/tmp/waa-bridge.sock,fork TCP:localhost:5050 &"
+    ssh azureuser@ "socat TCP-LISTEN:5051,fork,reuseaddr UNIX-CONNECT:/tmp/waa-bridge.sock &"
+
+[ ] Task setup works:
+    curl -s -X POST http://:5051/setup \
+      -H "Content-Type: application/json" \
+      -d '{"task_id":""}'
+```
+
+### AWS GPU VM
+
+```
+[ ] nvidia-smi works and shows expected GPU(s)
+[ ] conda activate verl-agent works
+[ ] python -c "import vagen; print(vagen.__file__)" succeeds
+[ ] python -c "from openadapt_evals.adapters.verl_env import WAADesktopEnv" succeeds
+[ ] WAADesktop registered in ~/verl-agent/vagen/configs/env_registry.yaml
+[ ] WAA VM reachable: curl -s http://:5000/probe
+[ ] wandb configured: wandb login --verify
+[ ] Disk space: df -h / (need 50GB+ free)
+```
+
+### Connectivity Smoke Test
+
+```bash
+# From GPU VM
+conda run -n verl-agent python3 -c "
+import asyncio
+from openadapt_evals.adapters.verl_env import WAADesktopEnv
+env = WAADesktopEnv({
+    'server_url': 'http://:5000',
+    'evaluate_url': 'http://:5051',
+    'task_id': '',
+    'max_steps': 3,
+    'evaluate_at_done': True,
+    'action_type': 'fractional',
+})
+obs, info = asyncio.run(env.reset(seed=42))
+print('Reset OK, obs keys:', obs.keys())
+obs, reward, done, info = asyncio.run(env.step('CLICK(x=0.5, y=0.5)'))
+print(f'Step OK, reward={reward}, done={done}')
+asyncio.run(env.close())
+print('Smoke test passed!')
+"
+```
+
+## Instance Selection
+
+| Instance | GPUs | VRAM | $/hr (OD) | $/hr (Spot) | Use Case |
+|----------|------|------|-----------|-------------|----------|
+| g5.xlarge | 1x A10G | 24GB | $1.006 | $0.43 | Smoke test, single-GPU dev |
+| g5.2xlarge | 1x A10G | 24GB | $1.21 | ~$0.52 | Single-GPU with more RAM |
+| g5.12xlarge | 4x A10G | 96GB | $5.67 | $2.90 | Multi-GPU training (recommended) |
+| g6.12xlarge | 4x L4 | 96GB | $4.60 | $2.26 | Budget multi-GPU alternative |
+
+## Key Architecture Constraints
+
+1. **n_envs must be 1**: there is only one WAA VM, so multiple envs would clobber each other's state
+2. **Use `rollout.n` for GRPO group size**: it generates N responses sequentially, not in parallel envs
+3. **Entry point is `vagen.main_ppo`**, not `verl.trainer.main_ppo`: VAGEN extends verl with multi-turn agent support
+4. **Hydra config system**: use `--config-path` and `--config-name=vagen_multiturn`
+5. **Use `rollout.name=vllm`**: already validated; VAGEN examples use sglang, but vLLM works
+
+## Launch Commands
+
+### Step 1: Create training data YAML on GPU VM
+
+```bash
+cat > ~/verl-agent/train_waa.yaml << 'EOF'
+envs:
+  - name: WAADesktop
+    n_envs: 1
+    data_source: waa
+    seed: [1, 100, 1]
+    max_turns: 15
+    response_length_per_turn: 512
+    config:
+      server_url: "http://:5000"
+      evaluate_url: "http://:5051"
+      task_id: ""
+      max_steps: 15
+      evaluate_at_done: true
+      action_type: fractional
+EOF
+cp ~/verl-agent/train_waa.yaml ~/verl-agent/val_waa.yaml
+```
+
+### Step 2: Launch GRPO training
+
+```bash
+cd ~/verl-agent && \
+PYTHONUNBUFFERED=1 conda run --no-capture-output -n verl-agent python3 -m vagen.main_ppo \
+    --config-path=$(pwd)/vagen/configs \
+    --config-name=vagen_multiturn \
+    data.train_files=$(pwd)/train_waa.yaml \
+    data.val_files=$(pwd)/val_waa.yaml \
+    data.train_batch_size=1 \
+    data.max_prompt_length=2048 \
+    data.max_response_length=512 \
+    data.return_raw_chat=True \
+    data.return_multi_modal_inputs=True \
+    algorithm.adv_estimator=grpo \
+    algorithm.kl_ctrl.kl_coef=0.0 \
+    actor_rollout_ref.model.path=Qwen/Qwen2.5-VL-3B-Instruct \
+    actor_rollout_ref.model.enable_gradient_checkpointing=True \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=1 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.mode=async \
+    actor_rollout_ref.rollout.n=4 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
+    actor_rollout_ref.rollout.enforce_eager=True \
+    actor_rollout_ref.rollout.enable_chunked_prefill=True \
+    actor_rollout_ref.rollout.multi_turn.enable=True \
+    actor_rollout_ref.rollout.agent.agent_loop_config_path=$(pwd)/vagen/configs/agent.yaml \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    trainer.n_gpus_per_node=4 \
+    trainer.nnodes=1 \
+    trainer.total_training_steps=10 \
+    trainer.test_freq=5 \
+    trainer.save_freq=10 \
+    trainer.val_before_train=True \
+    trainer.logger='[console,wandb]' \
+    trainer.project_name=openadapt-waa-rl \
+    trainer.experiment_name=grpo_waa_smoke \
+    2>&1 | tee ~/grpo_waa_training.log
+```
+
+## Monitoring
+
+### WandB Metrics
+
+| Metric | Healthy Sign |
+|--------|-------------|
+| `train/reward_mean` | Increasing from ~0.0 |
+| `train/reward_std` | > 0 (needed for GRPO signal) |
+| `train/reward_max` | Hits 1.0 = first success |
+| `train/entropy` | Decreasing (more decisive) |
+| `rollout/episode_length` | Varying (not all hitting max) |
+
+### GPU Health
+
+```bash
+watch -n 5 nvidia-smi  # Memory <20GB/GPU with offloading
+tail -f ~/grpo_waa_training.log
+```
+
+## Iteration Plan
+
+| Run | Steps | Instance | Cost | Goal |
+|-----|-------|----------|------|------|
+| 0 (Smoke) | 2-3 | g5.xlarge | ~$5 | Pipeline runs without crashes |
+| 1 (Signal) | 10 | g5.12xlarge | ~$50 | Rewards computed, wandb logs |
+| 2 (Training) | 50 | g5.12xlarge | ~$250 | Look for reward_mean trend |
+| 3 (Extended) | 100+ | g5.12xlarge | ~$500 | Only if Run 2 shows signal |
+
+## Common Failure Modes
+
+| Issue | Symptom | Fix |
+|-------|---------|-----|
+| OOM | `CUDA out of memory` | Reduce `gpu_memory_utilization` to 0.4, reduce `rollout.n` to 2 |
+| WAA unresponsive | Timeout/ConnectionError | Check Docker, re-establish socat bridge. NEVER `az vm restart` |
+| PyAutoGUI fail-safe | `FailSafeException` | `curl -X POST .../execute -d '{"command":"python -c \"import pyautogui; pyautogui.FAILSAFE=False; pyautogui.moveTo(500,400)\""}'` |
+| WAADesktop not found | `KeyError: 'WAADesktop'` | Re-register in env_registry.yaml, verify import path |
+| All rewards 0.0 | No learning signal | Check evaluate endpoint; task may be too hard for 3B |
+| Ray issues | Dead workers | `ray stop --force && ray start --head --num-gpus=4` |
+
+## Success Criteria
+
+For first run (pipeline validation):
+- Pipeline runs without crashes
+- Rollouts complete (episodes reach DONE or max_steps)
+- `reward_std > 0` (variance in outcomes)
+- Actions are parseable (`is_action_valid` mostly True)
+
+For 04d9aeaf task (LibreOffice Calc, extremely hard):
+- **Minimum**: At least 1 episode scores 1.0 in 100 steps
+- **Good**: reward_mean > 0.1 after 100 steps
+- **Note**: Even Claude Sonnet scored 0/1 on this task. Consider easier tasks (notepad, settings) for faster iteration.
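
The `reward_std > 0` criterion follows from how GRPO normalizes rewards within a rollout group. The sketch below assumes the commonly cited group-relative form, (r - mean) / (std + eps). The function name `grpo_advantages` and the `eps` value are illustrative, not taken from VAGEN or verl, and the trainer's exact normalization may differ.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one rollout group (illustrative helper).

    Assumes the common (r - mean) / (std + eps) normalization; this is not
    VAGEN's code, only a sketch of the idea.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# One success out of rollout.n=4 responses: the success gets a positive
# advantage and the failures get negative ones.
mixed = grpo_advantages([0.0, 0.0, 1.0, 0.0])

# All rewards identical: every advantage is zero, so there is no gradient signal.
flat = grpo_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:", mixed)
print("flat:", flat)
```

Under this assumption, a group where every episode scores 0.0 contributes nothing to the update, which is why the "All rewards 0.0" failure mode needs attention before longer runs.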