 # OpenAdapt-ML

-[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml)
+[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml)
 [![PyPI version](https://img.shields.io/pypi/v/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -30,6 +30,38 @@ The design is described in detail in [`docs/design.md`](docs/design.md).

 ---

+## Parallel WAA Benchmark Evaluation (New in v0.3.0)
+
+Run Windows Agent Arena benchmarks across multiple Azure VMs in parallel for faster evaluation:
+
+```bash
+# Create a pool of 5 workers
+uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
+
+# Wait for all workers to be ready
+uv run python -m openadapt_ml.benchmarks.cli pool-wait
+
+# Run 154 tasks distributed across workers (~5x faster)
+uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
+```
+**Key features:**
+- **Parallel execution**: Distribute 154 WAA tasks across N workers
+- **Automatic task distribution**: Uses WAA's native `--worker_id`/`--num_workers` for round-robin assignment
+- **VNC access**: View each Windows VM via SSH tunnels (`localhost:8006`, `localhost:8007`, etc.)
+- **Cost tracking**: Monitor Azure VM costs in real time
+
+**Performance:**
+| Workers | Estimated Time (154 tasks) |
+|---------|----------------------------|
+| 1 | ~50-80 hours |
+| 5 | ~10-16 hours |
+| 10 | ~5-8 hours |
+
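The estimates in the table are consistent with dividing the single-worker range by the number of workers. The sketch below reproduces them under that assumption of ideal linear scaling; `estimate_hours` is a hypothetical helper, not part of the OpenAdapt-ML CLI, and it ignores VM provisioning, Windows boot time, and uneven task durations.

```python
def estimate_hours(single_worker_hours, workers):
    """Scale a (low, high) single-worker estimate by an ideal 1/N speedup.

    Assumption: tasks split evenly and workers never sit idle.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    low, high = single_worker_hours
    return (low / workers, high / workers)


# 50-80 hours is the single-worker range from the table above.
for n in (1, 5, 10):
    low, high = estimate_hours((50, 80), n)
    print(f"{n:>2} workers: ~{low:g}-{high:g} hours")
```

Real runs will likely land somewhat above these numbers, since the slowest worker determines the wall-clock time.
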
+See [WAA Benchmark Workflow](#waa-benchmark-workflow) for complete setup instructions.
+
+---
+
 ## 1. Installation

 ### 1.1 From PyPI (recommended)
@@ -971,7 +1003,108 @@ uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-t

 ---

-## 14. Limitations & Notes
+## 14. WAA Benchmark Workflow
+
+<a id="waa-benchmark-workflow"></a>
+
+Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. OpenAdapt-ML provides infrastructure to run WAA evaluations on Azure VMs with parallel execution.
+
+### 14.1 Prerequisites
+
+1. **Azure CLI**: `brew install azure-cli && az login`
+2. **OpenAI API key**: Set in the `.env` file (`OPENAI_API_KEY=sk-...`)
+3. **Azure quota**: Ddsv5-family VMs (8+ vCPUs per worker)
+
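The first two prerequisites can be checked locally before any VM is provisioned. The following is a minimal sketch, not something the CLI is known to ship: `missing_prerequisites` is a hypothetical helper, and it assumes the key may live either in the process environment or in a simple `KEY=value` style `.env` file. It does not verify `az login` state or Azure quota.

```python
import os
import shutil
from pathlib import Path


def missing_prerequisites(env_file=".env", environ=None, which=shutil.which):
    """Return a list of problems; an empty list means the basics are present."""
    environ = os.environ if environ is None else environ
    problems = []

    # Prerequisite 1: the Azure CLI binary must be on PATH.
    if which("az") is None:
        problems.append("Azure CLI (`az`) not found on PATH")

    # Prerequisite 2: look for the key in the environment, then in the .env file.
    key = environ.get("OPENAI_API_KEY", "")
    env_path = Path(env_file)
    if not key and env_path.is_file():
        for line in env_path.read_text().splitlines():
            name, _, value = line.strip().partition("=")
            if name == "OPENAI_API_KEY" and value:
                key = value
                break
    if not key:
        problems.append(f"OPENAI_API_KEY not set in the environment or {env_file}")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("MISSING:", problem)
```

The `environ` and `which` parameters exist only so the check can be exercised without touching the real machine state.
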
+### 14.2 Single VM Workflow
+
+For quick testing or small runs:
+
+```bash
+# Set up a VM with WAA
+uv run python -m openadapt_ml.benchmarks.cli vm setup-waa
+
+# Start the monitoring dashboard (auto-opens VNC, manages SSH tunnels)
+uv run python -m openadapt_ml.benchmarks.cli vm monitor
+
+# Run the benchmark
+uv run python -m openadapt_ml.benchmarks.cli waa --num-tasks 10
+
+# Deallocate when done (stops compute billing)
+uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
+```
+
+### 14.3 Parallel Pool Workflow (Recommended)
+
+For full 154-task evaluations, use multiple VMs:
+
+```bash
+# 1. Create pool (provisions N Azure VMs with Docker + WAA)
+uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
+
+# 2. Wait for all workers to be ready (Windows boot + WAA server startup)
+uv run python -m openadapt_ml.benchmarks.cli pool-wait
+
+# 3. Run benchmark across all workers
+#    Tasks are distributed using WAA's native --worker_id/--num_workers
+uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
+
+# 4. Monitor progress
+uv run python -m openadapt_ml.benchmarks.cli pool-status
+uv run python -m openadapt_ml.benchmarks.cli pool-logs
+
+# 5. Clean up (delete all VMs - IMPORTANT to stop billing!)
+uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
+```
+
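Step 3 relies on round-robin assignment keyed on a worker index and a worker count. As an illustration of the idea only, and not WAA's actual implementation, the sketch below shows one common way to shard a task list that way; `shard_round_robin` and the integer task IDs are assumptions made for the example.

```python
def shard_round_robin(task_ids, worker_id, num_workers):
    """Return the tasks for one worker: every num_workers-th task, offset by worker_id."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return task_ids[worker_id::num_workers]


# Stand-in IDs for the 154 WAA tasks, split across a 5-worker pool.
tasks = list(range(154))
shards = [shard_round_robin(tasks, w, 5) for w in range(5)]
for worker_id, shard in enumerate(shards):
    print(f"worker {worker_id}: {len(shard)} tasks, first three {shard[:3]}")
```

Because 154 is not divisible by 5, the shards differ in size by at most one task, so no worker is handed a noticeably larger share.
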
+### 14.4 VNC Access to Workers
+
+View what each Windows VM is doing:
+
+```bash
+# Set up SSH tunnels (created automatically; shown here for manual setup)
+ssh -f -N -L 8006:localhost:8006 azureuser@<worker-0-ip>  # localhost:8006
+ssh -f -N -L 8007:localhost:8006 azureuser@<worker-1-ip>  # localhost:8007
+# etc.
+
+# Open in browser
+open http://localhost:8006  # Worker 0
+open http://localhost:8007  # Worker 1
+```
+
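The manual tunnels above follow a simple pattern: worker `i` is exposed on local port `8006 + i`, and each tunnel forwards to port 8006 on its VM. A minimal sketch that generates those commands for any pool size is shown below; `tunnel_commands` is a hypothetical helper and the IP addresses are documentation placeholders, not real workers.

```python
import shlex


def tunnel_commands(worker_ips, base_port=8006, remote_port=8006, user="azureuser"):
    """Build one `ssh -f -N -L` command (as an argument list) per worker.

    Worker i is reachable on local port base_port + i and forwards to
    remote_port on that worker's VM.
    """
    commands = []
    for index, ip in enumerate(worker_ips):
        forward = f"{base_port + index}:localhost:{remote_port}"
        commands.append(["ssh", "-f", "-N", "-L", forward, f"{user}@{ip}"])
    return commands


# Placeholder addresses from the RFC 5737 documentation range.
for command in tunnel_commands(["203.0.113.10", "203.0.113.11"]):
    print(shlex.join(command))
```

Returning argument lists rather than strings means the commands can be passed straight to `subprocess.run` without shell quoting concerns.
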
+### 14.5 Architecture
+
+```
+Local Machine
+├── openadapt-ml CLI (pool-create, pool-wait, pool-run)
+│   └── SSH tunnels to each worker
+│
+Azure (N VMs, Standard_D8ds_v5)
+├── waa-pool-00
+│   └── Docker
+│       └── windowsarena/winarena:latest
+│           └── QEMU (Windows 11)
+│               ├── WAA Flask server (port 5000)
+│               └── Navi agent (GPT-4o-mini)
+├── waa-pool-01
+│   └── ...
+└── waa-pool-N
+    └── ...
+```
+
+### 14.6 Cost Estimates
+
+| VM Size | vCPUs | RAM | Cost/hr | 5 VMs for 10 hrs |
+|---------|-------|-----|---------|------------------|
+| Standard_D8ds_v5 | 8 | 32 GB | ~$0.38 | ~$19 |
+
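The last column of the table is simply workers times hours times the hourly rate. The sketch below reproduces that arithmetic; `pool_cost` is a hypothetical helper, the $0.38 rate is the approximate figure from the table, and disk, network, and OpenAI API charges are assumed to be out of scope.

```python
def pool_cost(workers, hours, hourly_rate=0.38):
    """Approximate compute cost in USD for a pool that runs for `hours`."""
    return workers * hours * hourly_rate


# The table's example: 5 VMs for 10 hours.
print(f"5 VMs x 10 h: ~${pool_cost(5, 10):.2f}")

# Same number of VM-hours with a larger pool and a shorter run.
print(f"10 VMs x 5 h: ~${pool_cost(10, 5):.2f}")
```

Since the cost depends only on total VM-hours, a larger pool that finishes proportionally sooner costs about the same under this model.
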
+**Tips:**
+- Always run `pool-delete -y` when done
+- Use `vm deallocate` (not delete) to pause billing but keep the disk
+- Set `--auto-shutdown-hours 2` on `vm monitor` for safety
+
+---
+
+## 15. Limitations & Notes

 - **Apple Silicon / bitsandbytes**:
   - Example configs are sized for CPU / Apple Silicon development runs; see
@@ -995,7 +1128,7 @@ For deeper architectural details, see [`docs/design.md`](docs/design.md).

 ---

-## 15. Roadmap
+## 16. Roadmap

 For the up-to-date, prioritized roadmap (including concrete implementation
 targets and agent-executable acceptance criteria), see