Commit b389d3e

abrichr and claude committed
docs(readme): add parallel WAA evaluation section, fix build badge
- Fix broken build badge (publish.yml → release.yml)
- Add prominent "Parallel WAA Benchmark Evaluation" section near top
- Add detailed "WAA Benchmark Workflow" section (#14) with:
  - Single VM and parallel pool workflows
  - VNC access instructions
  - Architecture diagram
  - Cost estimates
- Update section numbering (Limitations → 15, Roadmap → 16)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent cbf3397 commit b389d3e

1 file changed: +136 -3 lines changed

README.md

Lines changed: 136 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# OpenAdapt-ML
22

3-
[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/publish.yml)
3+
[![Build Status](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml/badge.svg)](https://github.com/OpenAdaptAI/openadapt-ml/actions/workflows/release.yml)
44
[![PyPI version](https://img.shields.io/pypi/v/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
55
[![Downloads](https://img.shields.io/pypi/dm/openadapt-ml.svg)](https://pypi.org/project/openadapt-ml/)
66
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -30,6 +30,38 @@ The design is described in detail in [`docs/design.md`](docs/design.md).
3030

3131
---
3232

33+
## Parallel WAA Benchmark Evaluation (New in v0.3.0)
34+
35+
Run Windows Agent Arena benchmarks across multiple Azure VMs in parallel for faster evaluation:
36+
37+
```bash
38+
# Create a pool of 5 workers
39+
uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
40+
41+
# Wait for all workers to be ready
42+
uv run python -m openadapt_ml.benchmarks.cli pool-wait
43+
44+
# Run 154 tasks distributed across workers (~5x faster)
45+
uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
46+
```
47+
48+
**Key features:**
49+
- **Parallel execution**: Distribute 154 WAA tasks across N workers
50+
- **Automatic task distribution**: Uses WAA's native `--worker_id`/`--num_workers` for round-robin assignment
51+
- **VNC access**: View each Windows VM via SSH tunnels (`localhost:8006`, `localhost:8007`, etc.)
52+
- **Cost tracking**: Monitor Azure VM costs in real-time
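
To make the round-robin assignment above concrete, here is a minimal Python sketch. It assumes task index `i` goes to worker `i % num_workers`; WAA's actual `--worker_id`/`--num_workers` logic may order or chunk tasks differently, and `assign_round_robin` is a hypothetical helper, not part of OpenAdapt-ML or WAA.

```python
def assign_round_robin(num_tasks: int, num_workers: int) -> dict[int, list[int]]:
    """Map each worker id to the task indices it would run (hypothetical helper)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Assumption: task i is handled by worker i % num_workers.
    return {w: list(range(w, num_tasks, num_workers)) for w in range(num_workers)}


shares = assign_round_robin(num_tasks=154, num_workers=5)
for worker_id, tasks in shares.items():
    print(f"worker {worker_id}: {len(tasks)} tasks, first few {tasks[:3]}")
```

Under this assumption, 154 tasks do not divide evenly across 5 workers, so some workers receive one more task than others.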
53+
54+
**Performance:**
55+
| Workers | Estimated Time (154 tasks) |
56+
|---------|---------------------------|
57+
| 1 | ~50-80 hours |
58+
| 5 | ~10-16 hours |
59+
| 10 | ~5-8 hours |
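
The figures in this table are consistent with dividing the single-worker range by the number of workers. The sketch below reproduces that arithmetic; it assumes ideal linear scaling and ignores VM provisioning time and uneven task durations, so treat the results as rough lower bounds. `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(single_worker_range, num_workers):
    """Scale a (low, high) single-worker estimate by the worker count (ideal case)."""
    low, high = single_worker_range
    return (low / num_workers, high / num_workers)


SINGLE_WORKER_HOURS = (50, 80)  # from the 1-worker row of the table

for n in (1, 5, 10):
    low, high = estimate_hours(SINGLE_WORKER_HOURS, n)
    print(f"{n:>2} workers: ~{low:g}-{high:g} hours")
```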
60+
61+
See [WAA Benchmark Workflow](#waa-benchmark-workflow) for complete setup instructions.
62+
63+
---
64+
3365
## 1. Installation
3466

3567
### 1.1 From PyPI (recommended)
@@ -971,7 +1003,108 @@ uv run python -m openadapt_ml.benchmarks.cli screenshot --target terminal --no-t
9711003

9721004
---
9731005

974-
## 14. Limitations & Notes
1006+
## 14. WAA Benchmark Workflow
1007+
1008+
<a id="waa-benchmark-workflow"></a>
1009+
1010+
Windows Agent Arena (WAA) is a benchmark of 154 tasks across 11 Windows domains. OpenAdapt-ML provides infrastructure to run WAA evaluations on Azure VMs with parallel execution.
1011+
1012+
### 14.1 Prerequisites
1013+
1014+
1. **Azure CLI**: `brew install azure-cli && az login` (Homebrew shown; on other platforms, install the Azure CLI with your package manager first)
1015+
2. **OpenAI API Key**: Set in `.env` file (`OPENAI_API_KEY=sk-...`)
1016+
3. **Azure quota**: enough Ddsv5-family quota for at least 8 vCPUs per worker (e.g. 40 vCPUs for a 5-worker pool)
1017+
1018+
### 14.2 Single VM Workflow
1019+
1020+
For quick testing or small runs:
1021+
1022+
```bash
1023+
# Set up VM with WAA
1024+
uv run python -m openadapt_ml.benchmarks.cli vm setup-waa
1025+
1026+
# Start monitoring dashboard (auto-opens VNC, manages SSH tunnels)
1027+
uv run python -m openadapt_ml.benchmarks.cli vm monitor
1028+
1029+
# Run benchmark
1030+
uv run python -m openadapt_ml.benchmarks.cli waa --num-tasks 10
1031+
1032+
# Deallocate when done (stops billing)
1033+
uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
1034+
```
1035+
1036+
### 14.3 Parallel Pool Workflow (Recommended)
1037+
1038+
For full 154-task evaluations, use multiple VMs:
1039+
1040+
```bash
1041+
# 1. Create pool (provisions N Azure VMs with Docker + WAA)
1042+
uv run python -m openadapt_ml.benchmarks.cli pool-create --workers 5
1043+
1044+
# 2. Wait for all workers to be ready (Windows boot + WAA server startup)
1045+
uv run python -m openadapt_ml.benchmarks.cli pool-wait
1046+
1047+
# 3. Run benchmark across all workers
1048+
# Tasks are distributed using WAA's native --worker_id/--num_workers
1049+
uv run python -m openadapt_ml.benchmarks.cli pool-run --tasks 154
1050+
1051+
# 4. Monitor progress
1052+
uv run python -m openadapt_ml.benchmarks.cli pool-status
1053+
uv run python -m openadapt_ml.benchmarks.cli pool-logs
1054+
1055+
# 5. Clean up (delete all VMs; IMPORTANT to stop billing!)
1056+
uv run python -m openadapt_ml.benchmarks.cli pool-delete -y
1057+
```
1058+
1059+
### 14.4 VNC Access to Workers
1060+
1061+
View what each Windows VM is doing:
1062+
1063+
```bash
1064+
# Set up SSH tunnels (tunnels are created automatically, but you can also do this manually)
1065+
ssh -f -N -L 8006:localhost:8006 azureuser@<worker-0-ip> # localhost:8006
1066+
ssh -f -N -L 8007:localhost:8006 azureuser@<worker-1-ip> # localhost:8007
1067+
# etc.
1068+
1069+
# Open in browser
1070+
open http://localhost:8006 # Worker 0
1071+
open http://localhost:8007 # Worker 1
1072+
```
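
For larger pools, the manual tunnel commands above can be generated rather than typed. The following Python sketch assumes the pattern shown continues for further workers (worker `i` is forwarded to local port `8006 + i`, with the remote port fixed at 8006). `tunnel_commands` is a hypothetical helper and the IP addresses are placeholders.

```python
def tunnel_commands(worker_ips, base_port=8006, user="azureuser"):
    """Build one ssh port-forward command per worker (hypothetical helper)."""
    commands = []
    for index, ip in enumerate(worker_ips):
        # Assumption: worker i maps to local port base_port + i.
        local_port = base_port + index
        commands.append(f"ssh -f -N -L {local_port}:localhost:8006 {user}@{ip}")
    return commands


# Placeholder addresses from the RFC 5737 documentation range, not real workers.
cmds = tunnel_commands(["203.0.113.10", "203.0.113.11", "203.0.113.12"])
for cmd in cmds:
    print(cmd)
```

The script only prints the commands, so they can be reviewed before being run in a shell.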
1073+
1074+
### 14.5 Architecture
1075+
1076+
```
1077+
Local Machine
1078+
├── openadapt-ml CLI (pool-create, pool-wait, pool-run)
1079+
│ └── SSH tunnels to each worker
1080+
1081+
Azure (N VMs, Standard_D8ds_v5)
1082+
├── waa-pool-00
1083+
│ └── Docker
1084+
│ └── windowsarena/winarena:latest
1085+
│ └── QEMU (Windows 11)
1086+
│ ├── WAA Flask server (port 5000)
1087+
│ └── Navi agent (GPT-4o-mini)
1088+
├── waa-pool-01
1089+
│ └── ...
1090+
└── waa-pool-N
1091+
└── ...
1092+
```
1093+
1094+
### 14.6 Cost Estimates
1095+
1096+
| VM Size | vCPUs | RAM | Cost/hr | 5 VMs for 10hrs |
1097+
|---------|-------|-----|---------|-----------------|
1098+
| Standard_D8ds_v5 | 8 | 32GB | ~$0.38 | ~$19 |
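
The last column is simply hourly rate × number of VMs × hours. A small Python sketch of that arithmetic follows, using the approximate rate from the table; actual Azure pricing varies by region and over time, and this covers VM compute only. `pool_cost_usd` is a hypothetical helper.

```python
def pool_cost_usd(hourly_rate, num_vms, hours):
    """Approximate compute cost of a pool: rate x VMs x hours."""
    return hourly_rate * num_vms * hours


RATE_D8DS_V5 = 0.38  # approximate USD/hour, taken from the table above

print(f"5 VMs for 10 h:  ~${pool_cost_usd(RATE_D8DS_V5, 5, 10):.2f}")
print(f"10 VMs for 8 h:  ~${pool_cost_usd(RATE_D8DS_V5, 10, 8):.2f}")
```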
1099+
1100+
**Tips:**
1101+
- Always run `pool-delete -y` when done
1102+
- Use `vm deallocate` (not delete) to stop compute charges while keeping the disk; the disk itself continues to incur storage charges
1103+
- Set `--auto-shutdown-hours 2` on `vm monitor` for safety
1104+
1105+
---
1106+
1107+
## 15. Limitations & Notes
9751108

9761109
- **Apple Silicon / bitsandbytes**:
9771110
- Example configs are sized for CPU / Apple Silicon development runs; see
@@ -995,7 +1128,7 @@ For deeper architectural details, see [`docs/design.md`](docs/design.md).
9951128

9961129
---
9971130

998-
## 15. Roadmap
1131+
## 16. Roadmap
9991132

10001133
For the up-to-date, prioritized roadmap (including concrete implementation
10011134
targets and agent-executable acceptance criteria), see
