@@ -57,23 +57,28 @@ uv run python -m openadapt_evals.benchmarks.cli up

 ---

-## 🎯 WAA BENCHMARK WORKFLOW (COMPLETE GUIDE)
+## WAA BENCHMARK WORKFLOW (COMPLETE GUIDE)

 ### Architecture Overview

-The WAA setup spans TWO repos with distinct responsibilities:
+All evaluation infrastructure lives in openadapt-evals. Three CLI entry points:
+
+- `oa` — unified CLI (`oa evals run`, `oa evals mock`)
+- `oa-vm` — VM/pool management (`oa-vm pool-create`, `oa-vm status`)
+- `openadapt-evals` — legacy entry point

 ```
-LOCAL MACHINE
-├── openadapt-ml CLI (VM management)
-│   - vm setup-waa        # Create VM + Docker + WAA
-│   - vm monitor          # Dashboard + SSH tunnels
-│   - vm deallocate       # Stop billing
+LOCAL MACHINE (openadapt-evals)
+├── oa-vm CLI (VM + pool management)
+│   - create / delete              # Single VM lifecycle
+│   - pool-create / pool-cleanup   # Multi-VM pools
+│   - vm monitor                   # Dashboard + SSH tunnels
+│   - pool-run                     # Distributed benchmark execution
 │
-├── openadapt-evals CLI (benchmark execution)
-│   - run                 # Simplified benchmark run
-│   - live                # Full control live eval
-│   - mock                # No VM needed
+├── oa CLI (benchmark execution)
+│   - evals run           # Simplified benchmark run
+│   - evals live          # Full control live eval
+│   - evals mock          # No VM needed
 │
 └── SSH Tunnels (auto-managed by vm monitor)
     - localhost:5001 → VM:5000 (WAA Flask API)
@@ -89,21 +94,24 @@ AZURE VM (Ubuntu)

 ### Step-by-Step Workflow

-**Step 1: Setup VM (from openadapt-ml, first time only)**
-```bash
-cd /Users/abrichr/oa/src/openadapt-ml
-uv run python -m openadapt_ml.benchmarks.cli vm setup-waa ```
+All commands run from openadapt-evals (`cd /Users/abrichr/oa/src/openadapt-evals`).

-**Step 2: Start Dashboard and Tunnels (from openadapt-ml)**
+**Step 1: Create VM Pool**
 ```bash
-uv run python -m openadapt_ml.benchmarks.cli vm monitor
+# Single VM for quick tests
+oa-vm pool-create --workers 1
+
+# Multiple VMs for parallel evaluation
+oa-vm pool-create --workers 3
 ```
-Keep this running! It manages SSH tunnels automatically.

-**Step 3: Run Benchmark (from openadapt-evals)**
+**Step 2: Wait for WAA Ready**
 ```bash
-cd /Users/abrichr/oa/src/openadapt-evals
+oa-vm pool-wait
+```

+**Step 3: Run Benchmark**
+```bash
 # Quick smoke test (no API key needed)
 uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1

@@ -113,24 +121,23 @@ uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task no
 # With Claude
 uv run python -m openadapt_evals.benchmarks.cli run --agent api-claude --task notepad_1

-# Multiple tasks
-uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2
+# Distributed across pool
+oa-vm pool-run --tasks 10
 ```

 **Step 4: View Results**
 ```bash
 uv run python -m openadapt_evals.benchmarks.cli view --run-name live_eval
 ```

-**Step 5: Stop VM (from openadapt-ml)**
+**Step 5: Cleanup (Stop Billing)**
 ```bash
-cd /Users/abrichr/oa/src/openadapt-ml
-uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
+oa-vm pool-cleanup -y
 ```

 ### Key Points

-1. **Two CLIs** - openadapt-ml manages VM/Docker, openadapt-evals runs benchmarks
+1. **One repo** - all VM management AND benchmark execution in openadapt-evals
 2. **SSH tunnels required** - Azure NSG blocks direct port access
 3. **Default server is localhost:5001** - The `run` command uses this automatically
 4. **WAA runs INSIDE Windows** - Not on the Ubuntu host
@@ -140,6 +147,8 @@ uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y

 ## CLI Commands

+### Benchmark CLI (`openadapt_evals.benchmarks.cli`)
+
 | Command | Description |
 |---------|-------------|
 | `run` | **Simplified live evaluation** (uses localhost:5001 by default) |
@@ -151,10 +160,24 @@ uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
 | `estimate` | Estimate Azure costs |
 | `dashboard` | Generate VM usage dashboard |
 | `up` | All-in-one: Start VM + WAA server + wait until ready |
-| `vm-start` | Start an Azure VM |
-| `vm-stop` | Stop (deallocate) an Azure VM |
-| `vm-status` | Check Azure VM status and IP |
-| `vm-setup` | Full WAA container setup (automated) |
+
+### VM/Pool CLI (`oa-vm`)
+
+| Command | Description |
+|---------|-------------|
+| `pool-create --workers N` | Create N VMs with Docker + WAA |
+| `pool-wait` | Wait for WAA server ready on all workers |
+| `pool-run --tasks N` | Run N tasks distributed across workers |
+| `pool-status` | Show status of all pool VMs |
+| `pool-vnc` | Open VNC to pool workers |
+| `pool-logs` | Stream logs from all workers |
+| `pool-cleanup -y` | Delete all pool VMs and resources |
+| `create --fast` | Create single VM |
+| `delete` | Delete VM and all resources |
+| `status` | Show VM status and IP |
+| `vm monitor` | Dashboard + SSH tunnels |
+| `deallocate` | Stop VM (preserves disk, stops billing) |
+| `azure-ml-quota-wait` | Wait for Azure quota approval |

 ### `run` Command (Recommended for Live Evaluation)

@@ -193,14 +216,28 @@ openadapt_evals/
 │   ├── base.py              # BenchmarkAdapter ABC
 │   ├── waa.py               # WAAAdapter, WAAMockAdapter
 │   └── waa_live.py          # WAALiveAdapter (HTTP)
+├── infrastructure/          # Azure VM/pool management (migrated from openadapt-ml)
+│   ├── azure_vm.py          # AzureVMManager (SDK + az CLI)
+│   ├── pool.py              # PoolManager (multi-VM orchestration)
+│   ├── vm_monitor.py        # VMMonitor dashboard
+│   ├── azure_ops_tracker.py # Azure operations tracking
+│   ├── resource_tracker.py  # Cost tracking
+│   └── ssh_tunnel.py        # SSH tunnel manager
+├── waa_deploy/              # Docker agent deployment (migrated from openadapt-ml)
+│   ├── api_agent.py         # ApiAgent for WAA container
+│   └── Dockerfile           # WAA Docker image
 ├── server/                  # WAA server extensions
 │   ├── evaluate_endpoint.py # /evaluate endpoint
 │   └── waa_server_patch.py  # Deploy script
 ├── benchmarks/              # Evaluation utilities
 │   ├── runner.py            # evaluate_agent_on_benchmark()
 │   ├── azure.py             # AzureWAAOrchestrator
-│   ├── cli.py               # Unified CLI
+│   ├── cli.py               # Benchmark CLI (run, mock, live, view)
+│   ├── vm_cli.py            # VM/Pool CLI (oa-vm entry point, 50+ commands)
+│   ├── pool_viewer.py       # Pool results HTML viewer
+│   ├── trace_export.py      # Training data export
 │   └── viewer.py            # HTML viewer
+├── config.py                # Settings (pydantic-settings, .env loading)
 └── __init__.py
 ```

@@ -227,7 +264,11 @@ agent = ApiAgent(provider="anthropic", demo="Step 1: Click Start menu\n...")
 | `agents/retrieval_agent.py` | Auto demo selection |
 | `adapters/waa_live.py` | HTTP adapter for WAA server |
 | `benchmarks/azure.py` | Azure orchestrator with cost optimization |
-| `benchmarks/cli.py` | CLI entry point |
+| `benchmarks/cli.py` | Benchmark CLI entry point |
+| `benchmarks/vm_cli.py` | VM/Pool CLI (`oa-vm`, 50+ commands) |
+| `infrastructure/azure_vm.py` | AzureVMManager (SDK + az CLI fallback) |
+| `infrastructure/pool.py` | PoolManager for parallel evaluation |
+| `config.py` | Settings (pydantic-settings, .env loading) |

 ## Azure Dashboard

@@ -239,11 +280,11 @@ Shows: real-time costs, VM status, activity logs, start/stop controls.
 ## WAA Container Setup

 ```bash
-uv run python -m openadapt_evals.benchmarks.cli vm-setup --auto-verify
+oa-vm vm setup-waa
 ```

 Automated WAA deployment (95%+ reliability). Fresh VM: 15-20 min, existing: 2-5 min.
-Implementation: bash script in cli.py. Use `--help` for troubleshooting.
+Implementation: in `benchmarks/vm_cli.py`. Use `--help` for troubleshooting.

 ## Screenshot Requirements

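The pool workflow introduced in the diff above (create, wait, run, cleanup) can be sketched as a small Python wrapper. This is an illustrative sketch only, not part of the commit: the `oa-vm` subcommands and flags are copied from the diff, while `pool_commands`, `run_pool_workflow`, and the injectable `run` parameter are hypothetical names, and the sketch assumes `oa-vm` is on `PATH` and exits nonzero on failure. Cleanup sits in a `finally` block so that the pool is deleted, and billing stops, even if an earlier step fails.

```python
import subprocess
from typing import Callable, List, Optional

Command = List[str]

# Step 5 from the diff: delete all pool VMs and resources without prompting.
CLEANUP_COMMAND: Command = ["oa-vm", "pool-cleanup", "-y"]


def pool_commands(workers: int, tasks: int) -> List[Command]:
    """Return the commands for Steps 1-3 (create, wait, run), in order."""
    return [
        ["oa-vm", "pool-create", "--workers", str(workers)],
        ["oa-vm", "pool-wait"],
        ["oa-vm", "pool-run", "--tasks", str(tasks)],
    ]


def _run_checked(cmd: Command) -> None:
    # Assumption: a nonzero exit status means the step failed.
    subprocess.run(cmd, check=True)


def run_pool_workflow(
    workers: int,
    tasks: int,
    run: Optional[Callable[[Command], object]] = None,
) -> None:
    """Run the pool workflow, always attempting cleanup at the end.

    `run` is a hypothetical hook: by default each command is executed with
    subprocess, but any callable that accepts a command list can be supplied.
    """
    runner = run if run is not None else _run_checked
    try:
        for cmd in pool_commands(workers, tasks):
            runner(cmd)
    finally:
        # Runs on success and on failure, so VMs are not left billing.
        runner(CLEANUP_COMMAND)
```

Calling `run_pool_workflow(workers=3, tasks=10)` would execute the four commands in order. Passing a list's `append` method as `run` records the commands without touching Azure, which is one way to do a dry run.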