Commit 57a86a9

Revert "feat(azure): implement Azure ML parallelization for WAA evaluation (#24)" (#25)
This reverts commit 077f339.
1 parent 1d568a7 commit 57a86a9

13 files changed

Lines changed: 139 additions & 2821 deletions

.beads/issues.jsonl

Lines changed: 0 additions & 2 deletions
@@ -1,11 +1,9 @@
-{"id":"bd-poolfix-1770402918","title":"CRITICAL: Never break working code - match exactly","description":"When pool-wait broke (Feb 2026), wasted hours because I created waa-auto instead of copying the working waa command exactly.\n\nLESSON LEARNED:\n1. Find EXISTING WORKING CODE that does the same thing\n2. COPY IT EXACTLY - same image, same flags, same IPs\n3. If you see an error, understand WHY working code works despite it\n\nThe working waa command uses:\n- windowsarena/winarena:latest with --entrypoint /bin/bash\n- --prepare-image false --start-client false (SKIPS ISO download)\n- Probes at 172.30.0.2:5000\n\nPool commands MUST use identical parameters.","status":"open","priority":1,"issue_type":"lesson","created_at":"2026-02-06T18:35:18Z","created_by":"claude","updated_at":"2026-02-06T18:35:18Z"}
 {"id":"openadapt-evals-0an","title":"CLI: aws-costs and waa-image delete commands added","notes":"openadapt-evals PR #24: Added aws-costs command, waa-image delete action, changed default to Docker Hub","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.612486-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T16:17:03.612486-05:00"}
 {"id":"openadapt-evals-0dt","title":"Add pre-flight check for Windows install issues","description":"Detect product key prompts or stuck installations BEFORE 10-minute timeout. Check container logs for specific error patterns.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.24338-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T18:57:42.24338-05:00"}
 {"id":"openadapt-evals-0ms","title":"Run 20-50 task evaluation","description":"Run WAA benchmark on 20-50 tasks to measure baseline success rate. Target is \u003e80% success rate. This provides quantitative data on agent performance.","notes":"2026-01-29: Azure quota limits parallelization to 2 workers max (10 vCPUs / 4 vCPUs per worker). 10-worker test failed with ClusterCoreQuotaReached. User declined manual portal quota increase. Waiting for api-openai test results before full 154-task run.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:26.461765-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T00:28:02.609085-05:00","dependencies":[{"issue_id":"openadapt-evals-0ms","depends_on_id":"openadapt-evals-c3f","type":"blocks","created_at":"2026-01-20T17:44:26.462904-05:00","created_by":"Richard Abrich"}]}
 {"id":"openadapt-evals-2ar","title":"Implement permanent fix for Windows unattended install","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.544113-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.634857-05:00","closed_at":"2026-01-20T20:32:06.634857-05:00","close_reason":"Duplicate of openadapt-evals-b3l"}
 {"id":"openadapt-evals-5o8","title":"Analyze evaluation results","description":"Analyze WAA evaluation results to identify failure modes, success patterns, and improvement opportunities. Document findings and create actionable next steps.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:29.782932-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T17:44:29.782932-05:00","dependencies":[{"issue_id":"openadapt-evals-5o8","depends_on_id":"openadapt-evals-0ms","type":"blocks","created_at":"2026-01-20T17:44:29.783756-05:00","created_by":"Richard Abrich"}]}
 {"id":"openadapt-evals-5t1","title":"WAA 500 error root cause: Navi agent method signature mismatch","notes":"FILED: https://github.com/microsoft/WindowsAgentArena/issues/79","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-28T20:16:39.141187-05:00","created_by":"Richard Abrich","updated_at":"2026-01-28T20:29:38.780227-05:00","closed_at":"2026-01-28T20:29:38.780227-05:00","close_reason":"Issue filed upstream","labels":["bug","upstream","waa"]}
-{"id":"openadapt-evals-aml","title":"Azure ML SDK v2 migration for macOS compatibility","description":"Migrated run_azure.py from SDK v1 to SDK v2 to remove azureml-dataset-runtime dependency (no macOS ARM64 wheel). Enables parallel WAA evaluation from macOS.","notes":"2026-02-02: COMPLETED SDK v2 migration.\n\nATTEMPTS:\n1. D4_v3 (100GB temp): FAILED - disk full\n2. D8ds_v5 (300GB temp): FAILED - DDSv5 quota is 0\n3. D8ds_v4 (300GB temp): FAILED - Ddsv4 quota is 4 vCPUs (needs 8)\n4. D4ds_v4 (150GB temp): RUNNING - fits in 4 vCPU quota\n\nVM SIZE REFERENCE:\n- D4_v3: 100GB temp, Standard D Family quota\n- D4ds_v4: 150GB temp, Ddsv4 quota (4 vCPUs max)\n- D8ds_v4: 300GB temp, Ddsv4 quota (need 8 vCPUs)\n- D4ds_v5: 150GB temp, DDSv5 quota (0 available)\n- D8ds_v5: 300GB temp, DDSv5 quota (0 available)\n- D8ds_v6: 0GB temp (no local SSD!)\n\nCURRENT: Job olive_tomato_ky0y4lw7rn running on w0Expeval02022219 with D4ds_v4","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-03T03:25:00-05:00","created_by":"Claude","updated_at":"2026-02-03T03:25:00-05:00","labels":["azure-ml","parallelization","sdk"]}
 {"id":"openadapt-evals-b3l","title":"Implement permanent fix for Windows unattended install","description":"ROOT CAUSE FOUND: Using dev mode (UNC paths \\\\host.lan\\Data) instead of Azure mode (C:\\oem). Dev mode had UNC escaping bug in patch_xml.py. FIX: Simplified Dockerfile using vanilla WAA Azure mode approach - native OEM mechanism, no samba.sh patching, no custom FirstLogonCommands.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.092949-05:00","created_by":"Richard Abrich","updated_at":"2026-01-21T12:47:07.710012-05:00","comments":[{"id":1,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22: Confirmed issue recurred because we were booting from corrupted data.img created with dev mode. Fix: delete /data/waa-storage/* and let vanilla windowsarena/winarena create fresh install.","created_at":"2026-01-22T23:45:59Z"},{"id":2,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22 FIXED: Issues were (1) CLI storage path mismatch /mnt vs /data, (2) booting from corrupted data.img. Fix: standardized paths + deleted corrupted image. Fresh vanilla WAA install now at 18%+ and progressing.","created_at":"2026-01-22T23:56:59Z"}]}
 {"id":"openadapt-evals-c3f","title":"Complete WAA validation","description":"Validate that the WAA benchmark setup works end-to-end. Run a single task to confirm the infrastructure is operational before scaling up to full evaluation.","notes":"2026-01-29: 500 error root cause identified - NOT QEMU version (10.0.6 is fine). Root cause is Navi agent method signature mismatch: computer.mouse.drag(x=, y=, x_end=) vs drag(screen_x, screen_y). Our api-openai/api-claude agents should avoid this since they use pyautogui directly. Testing with api-openai agent (agent a72af46 running).","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:18.817497-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T00:28:02.609757-05:00"}
 {"id":"openadapt-evals-czj","title":"Docker installation fails on Azure VM - pkgProblemResolver error","description":"vm setup-waa fails to install Docker. Error: pkgProblemResolver::Resolve generated breaks. Need to investigate root cause before attempting fix.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T22:48:59.527637-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T22:48:59.527637-05:00"}
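
The notes on issue `openadapt-evals-0ms` in this file state that a 10 vCPU quota with 4 vCPUs per worker caps parallelization at 2 workers. A minimal sketch of that arithmetic follows, assuming vCPU quota is the only constraint. The function names `max_workers` and `required_vcpus` are hypothetical and are not part of `openadapt-evals` or the Azure SDK.

```python
def max_workers(quota_vcpus: int, vcpus_per_worker: int) -> int:
    """Largest number of workers that fits inside a vCPU quota."""
    if quota_vcpus < 0 or vcpus_per_worker <= 0:
        raise ValueError("quota must be >= 0 and vcpus_per_worker must be > 0")
    # Workers are whole VMs, so any leftover vCPUs are unusable.
    return quota_vcpus // vcpus_per_worker


def required_vcpus(workers: int, vcpus_per_worker: int) -> int:
    """Total vCPUs needed to run the requested number of workers."""
    if workers < 0 or vcpus_per_worker <= 0:
        raise ValueError("workers must be >= 0 and vcpus_per_worker must be > 0")
    return workers * vcpus_per_worker


if __name__ == "__main__":
    # Figures taken from the issue notes: 10 vCPU quota, 4 vCPUs per worker.
    print(max_workers(10, 4))     # 2
    print(required_vcpus(10, 4))  # 40
```

With these figures a 10-worker run needs 40 vCPUs, four times the quota in the notes, which is consistent with the `ClusterCoreQuotaReached` failure they record.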

README.md

Lines changed: 30 additions & 38 deletions
@@ -5,44 +5,41 @@
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
-Evaluation infrastructure for GUI agent benchmarks. **Simplified CLI toolkit for Windows Agent Arena.**
+[![Azure Success Rate](https://img.shields.io/badge/Azure%20Success%20Rate-95%25%2B-success)](https://github.com/OpenAdaptAI/openadapt-evals)
+[![Cost Savings](https://img.shields.io/badge/Cost%20Savings-67%25-brightgreen)](https://github.com/OpenAdaptAI/openadapt-evals/blob/main/COST_OPTIMIZATION.md)
+
+Evaluation infrastructure for GUI agent benchmarks.
 
 ## Overview
 
 `openadapt-evals` provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.
 
-## Windows Agent Arena (WAA) - Headline Feature
-
-> **Status**: Actively running full 154-task evaluation. Results coming soon.
-
-A **simplified CLI toolkit** for the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, providing:
-- Easy Azure VM setup and SSH tunnel management
-- Agent adapters for Claude, GPT-4o, and custom agents
-- Results viewer with per-domain breakdown
-- Parallelization support for faster evaluations
-
-See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.
+## Recent Improvements
 
-## Roadmap (In Progress)
+We've made significant improvements to reliability, cost-efficiency, and observability:
 
-The following features are under active development:
-
-### Azure Reliability (`[IN PROGRESS]`)
-- **Goal**: 95%+ task completion rate (vs. early issues with 0%)
-- **VM Configuration**: Using `Standard_D4s_v5` with nested virtualization (configurable)
+### Azure Reliability (v0.2.0 - January 2026)
+- **95%+ Success Rate Target**: Fixed nested virtualization issues that caused 0% task completion
+- **VM Configuration**: Upgraded to `Standard_D4s_v5` with proper nested virtualization support
 - **Health Monitoring**: Automatic detection and retry of stuck jobs
-
-### Cost Optimization (`[IN PROGRESS]`)
-- **Goal**: Reduce per-evaluation cost from ~$7.68 to ~$2.50 (154 tasks)
-- **Tiered VM Sizing**: Match VM size to task complexity
-- **Spot Instance Support**: Use preemptible VMs for 70-80% discount
-- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) for design
-
-### Benchmark Viewer (Available)
-- **Real Benchmark Screenshots**: Viewer displays actual WAA evaluation screenshots
+- **Fast Failure Detection**: 10-minute timeout instead of 8+ hour hangs
+- See [PR #11](https://github.com/OpenAdaptAI/openadapt-evals/pull/11) for details
+
+### Cost Optimization (v0.2.0 - January 2026)
+- **67% Cost Reduction**: From $7.68 to $2.50 per full evaluation (154 tasks)
+- **Tiered VM Sizing**: Automatic VM size selection based on task complexity (37% savings)
+- **Spot Instance Support**: 70-80% discount on compute costs (64% savings with tiered VMs)
+- **Azure Container Registry**: 10x faster image pulls (1-2 min vs 8-12 min)
+- **Real-time Cost Tracking**: Monitor costs during evaluation
+- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) and [PR #13](https://github.com/OpenAdaptAI/openadapt-evals/pull/13) for details
+
+### Screenshot Validation & Viewer (v0.2.0 - January 2026)
+- **Real Benchmark Screenshots**: Viewer now displays actual WAA evaluation screenshots
 - **Auto-Screenshot Tool**: Automated screenshot generation with Playwright
+- **Screenshot Validation**: Manifest-based validation ensuring correctness
 - **Execution Logs**: Step-by-step logs with search and filtering
-- **Live Monitoring**: Real-time progress tracking
+- **Live Monitoring**: Real-time Azure ML job monitoring with auto-refresh
+- See [PR #6](https://github.com/OpenAdaptAI/openadapt-evals/pull/6) for details
 
 ## Installation
 
@@ -82,7 +79,7 @@ adapter = WAALiveAdapter(config)
 agent = ApiAgent(provider="anthropic") # or "openai" for GPT-5.1
 
 # Run evaluation
-results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS"])
+results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
 
 # Compute metrics
 metrics = compute_metrics(results)
@@ -265,7 +262,7 @@ The package provides a CLI for running WAA evaluations:
 python -m openadapt_evals.benchmarks.cli probe --server http://vm-ip:5000
 
 # Run live evaluation against a WAA server
-python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS,notepad_a7d4b6c5-569b-452e-9e1d-ffdb3d431d15-WOS
+python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_1,notepad_2
 
 # Generate HTML viewer for results
 python -m openadapt_evals.benchmarks.cli view --run-name my_eval_run
@@ -301,7 +298,7 @@ if not adapter.check_connection():
     print("WAA server not ready")
 
 # Run evaluation
-results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS"])
+results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
 ```
 
 ### Local WAA Evaluation
@@ -321,11 +318,6 @@ results = evaluate_agent_on_benchmark(agent, adapter, task_ids=[t.task_id for t
 
 Run WAA at scale using Azure ML compute with optimized costs:
 
-> **⚠️ Quota Requirements**: Parallel evaluation requires sufficient Azure vCPU quota.
-> - Default VM: `Standard_D4s_v5` (4 vCPUs per worker)
-> - 10 workers = 40 vCPUs required
-> - Default quota is typically 10 vCPUs - [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations
-
 ```bash
 # Install Azure dependencies
 pip install openadapt-evals[azure]
@@ -366,7 +358,7 @@ results = orchestrator.run_evaluation(
 )
 ```
 
-**Azure Reliability**: The orchestrator uses `Standard_D4s_v5` VMs with nested virtualization support and automatic health monitoring.
+**Azure Reliability**: The orchestrator now uses `Standard_D4s_v5` VMs with proper nested virtualization support and automatic health monitoring, achieving 95%+ success rates.
 
 ### Live Monitoring
 
@@ -379,7 +371,7 @@ pip install openadapt-evals[viewer]
 # Start an Azure evaluation (in terminal 1)
 python -m openadapt_evals.benchmarks.cli azure \
 --workers 1 \
---task-ids notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS,chrome_2ae9ba84-3a0d-4d4c-8338-3a1478dc5fe3-wos \
+--task-ids notepad_1,browser_1 \
 --waa-path /path/to/WAA
 
 # Monitor job logs in real-time (in terminal 2)
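
The README text in this diff quotes a drop from $7.68 to $2.50 per full 154-task evaluation and labels it a 67% cost reduction. A small sketch that checks the headline percentage and derives the implied per-task cost is shown here. It uses only the figures quoted in the diff; `percent_reduction` is a hypothetical helper, and the 37% and 64% sub-figures are not reproduced because the diff does not say how they were computed.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Figures quoted in the README diff.
    cost_before = 7.68  # USD per full evaluation
    cost_after = 2.50   # USD per full evaluation
    tasks = 154

    print(f"reduction: {percent_reduction(cost_before, cost_after):.0f}%")
    print(f"per task before: ${cost_before / tasks:.3f}")
    print(f"per task after:  ${cost_after / tasks:.3f}")
```

Under these figures the reduction rounds to 67%, matching the badge and bullet text, and the per-task cost falls from roughly 5 cents to under 2 cents.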
