.beads/issues.jsonl
2 lines changed: 0 additions & 2 deletions
@@ -1,11 +1,9 @@
-{"id":"bd-poolfix-1770402918","title":"CRITICAL: Never break working code - match exactly","description":"When pool-wait broke (Feb 2026), wasted hours because I created waa-auto instead of copying the working waa command exactly.\n\nLESSON LEARNED:\n1. Find EXISTING WORKING CODE that does the same thing\n2. COPY IT EXACTLY - same image, same flags, same IPs\n3. If you see an error, understand WHY working code works despite it\n\nThe working waa command uses:\n- windowsarena/winarena:latest with --entrypoint /bin/bash\n- --prepare-image false --start-client false (SKIPS ISO download)\n- Probes at 172.30.0.2:5000\n\nPool commands MUST use identical parameters.","status":"open","priority":1,"issue_type":"lesson","created_at":"2026-02-06T18:35:18Z","created_by":"claude","updated_at":"2026-02-06T18:35:18Z"}
{"id":"openadapt-evals-0dt","title":"Add pre-flight check for Windows install issues","description":"Detect product key prompts or stuck installations BEFORE 10-minute timeout. Check container logs for specific error patterns.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.24338-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T18:57:42.24338-05:00"}
{"id":"openadapt-evals-0ms","title":"Run 20-50 task evaluation","description":"Run WAA benchmark on 20-50 tasks to measure baseline success rate. Target is \u003e80% success rate. This provides quantitative data on agent performance.","notes":"2026-01-29: Azure quota limits parallelization to 2 workers max (10 vCPUs / 4 vCPUs per worker). 10-worker test failed with ClusterCoreQuotaReached. User declined manual portal quota increase. Waiting for api-openai test results before full 154-task run.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:26.461765-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T00:28:02.609085-05:00","dependencies":[{"issue_id":"openadapt-evals-0ms","depends_on_id":"openadapt-evals-c3f","type":"blocks","created_at":"2026-01-20T17:44:26.462904-05:00","created_by":"Richard Abrich"}]}
{"id":"openadapt-evals-2ar","title":"Implement permanent fix for Windows unattended install","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.544113-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.634857-05:00","closed_at":"2026-01-20T20:32:06.634857-05:00","close_reason":"Duplicate of openadapt-evals-b3l"}
{"id":"openadapt-evals-5o8","title":"Analyze evaluation results","description":"Analyze WAA evaluation results to identify failure modes, success patterns, and improvement opportunities. Document findings and create actionable next steps.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:29.782932-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T17:44:29.782932-05:00","dependencies":[{"issue_id":"openadapt-evals-5o8","depends_on_id":"openadapt-evals-0ms","type":"blocks","created_at":"2026-01-20T17:44:29.783756-05:00","created_by":"Richard Abrich"}]}
{"id":"openadapt-evals-aml","title":"Azure ML SDK v2 migration for macOS compatibility","description":"Migrated run_azure.py from SDK v1 to SDK v2 to remove azureml-dataset-runtime dependency (no macOS ARM64 wheel). Enables parallel WAA evaluation from macOS.","notes":"2026-02-02: COMPLETED SDK v2 migration.\n\nATTEMPTS:\n1. D4_v3 (100GB temp): FAILED - disk full\n2. D8ds_v5 (300GB temp): FAILED - DDSv5 quota is 0\n3. D8ds_v4 (300GB temp): FAILED - Ddsv4 quota is 4 vCPUs (needs 8)\n4. D4ds_v4 (150GB temp): RUNNING - fits in 4 vCPU quota\n\nVM SIZE REFERENCE:\n- D4_v3: 100GB temp, Standard D Family quota\n- D4ds_v4: 150GB temp, Ddsv4 quota (4 vCPUs max)\n- D8ds_v4: 300GB temp, Ddsv4 quota (need 8 vCPUs)\n- D4ds_v5: 150GB temp, DDSv5 quota (0 available)\n- D8ds_v5: 300GB temp, DDSv5 quota (0 available)\n- D8ds_v6: 0GB temp (no local SSD!)\n\nCURRENT: Job olive_tomato_ky0y4lw7rn running on w0Expeval02022219 with D4ds_v4","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-03T03:25:00-05:00","created_by":"Claude","updated_at":"2026-02-03T03:25:00-05:00","labels":["azure-ml","parallelization","sdk"]}
{"id":"openadapt-evals-b3l","title":"Implement permanent fix for Windows unattended install","description":"ROOT CAUSE FOUND: Using dev mode (UNC paths \\\\host.lan\\Data) instead of Azure mode (C:\\oem). Dev mode had UNC escaping bug in patch_xml.py. FIX: Simplified Dockerfile using vanilla WAA Azure mode approach - native OEM mechanism, no samba.sh patching, no custom FirstLogonCommands.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.092949-05:00","created_by":"Richard Abrich","updated_at":"2026-01-21T12:47:07.710012-05:00","comments":[{"id":1,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22: Confirmed issue recurred because we were booting from corrupted data.img created with dev mode. Fix: delete /data/waa-storage/* and let vanilla windowsarena/winarena create fresh install.","created_at":"2026-01-22T23:45:59Z"},{"id":2,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22 FIXED: Issues were (1) CLI storage path mismatch /mnt vs /data, (2) booting from corrupted data.img. Fix: standardized paths + deleted corrupted image. Fresh vanilla WAA install now at 18%+ and progressing.","created_at":"2026-01-22T23:56:59Z"}]}
{"id":"openadapt-evals-c3f","title":"Complete WAA validation","description":"Validate that the WAA benchmark setup works end-to-end. Run a single task to confirm the infrastructure is operational before scaling up to full evaluation.","notes":"2026-01-29: 500 error root cause identified - NOT QEMU version (10.0.6 is fine). Root cause is Navi agent method signature mismatch: computer.mouse.drag(x=, y=, x_end=) vs drag(screen_x, screen_y). Our api-openai/api-claude agents should avoid this since they use pyautogui directly. Testing with api-openai agent (agent a72af46 running).","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:18.817497-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T00:28:02.609757-05:00"}
{"id":"openadapt-evals-czj","title":"Docker installation fails on Azure VM - pkgProblemResolver error","description":"vm setup-waa fails to install Docker. Error: pkgProblemResolver::Resolve generated breaks. Need to investigate root cause before attempting fix.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T22:48:59.527637-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T22:48:59.527637-05:00"}
Evaluation infrastructure for GUI agent benchmarks.
## Overview
`openadapt-evals` provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.
-## Windows Agent Arena (WAA) - Headline Feature
-
-> **Status**: Actively running full 154-task evaluation. Results coming soon.
-
-A **simplified CLI toolkit** for the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, providing:
-- Easy Azure VM setup and SSH tunnel management
-- Agent adapters for Claude, GPT-4o, and custom agents
-- Results viewer with per-domain breakdown
-- Parallelization support for faster evaluations
-
-See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.
+## Recent Improvements

-## Roadmap (In Progress)
+We've made significant improvements to reliability, cost-efficiency, and observability:

-The following features are under active development:
-
-### Azure Reliability (`[IN PROGRESS]`)
-- **Goal**: 95%+ task completion rate (vs. early issues with 0%)
-- **VM Configuration**: Using `Standard_D4s_v5` with nested virtualization (configurable)
> - Default VM: `Standard_D4s_v5` (4 vCPUs per worker)
-> - 10 workers = 40 vCPUs required
-> - Default quota is typically 10 vCPUs - [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations
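
Reviewer note: the quota arithmetic in the lines above (4 vCPUs per `Standard_D4s_v5` worker, 10 workers needing 40 vCPUs, a 10-vCPU quota allowing only 2 workers) can be sketched as below. This is an illustrative sketch only; the function names `max_parallel_workers` and `required_vcpus` are hypothetical and are not part of `openadapt-evals` or any Azure SDK.

```python
# Hypothetical helpers illustrating the vCPU quota arithmetic; not project code.

def required_vcpus(workers: int, vcpus_per_worker: int = 4) -> int:
    """Total vCPUs needed to run `workers` VMs of the given size."""
    if workers < 0 or vcpus_per_worker <= 0:
        raise ValueError("workers must be >= 0 and vcpus_per_worker must be > 0")
    return workers * vcpus_per_worker


def max_parallel_workers(quota_vcpus: int, vcpus_per_worker: int = 4) -> int:
    """Largest worker count that fits inside a vCPU quota (rounded down)."""
    if quota_vcpus < 0 or vcpus_per_worker <= 0:
        raise ValueError("quota_vcpus must be >= 0 and vcpus_per_worker must be > 0")
    return quota_vcpus // vcpus_per_worker


if __name__ == "__main__":
    # Assumes the 4-vCPU worker size and 10-vCPU default quota quoted above.
    print(required_vcpus(10))        # vCPUs needed for 10 workers
    print(max_parallel_workers(10))  # workers that fit in a 10-vCPU quota
```

With the figures quoted in the README lines, a 10-worker run needs 40 vCPUs, while a 10-vCPU quota fits 2 workers, which matches the `ClusterCoreQuotaReached` note in `openadapt-evals-0ms`.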
**Azure Reliability**: The orchestrator uses `Standard_D4s_v5` VMs with nested virtualization support and automatic health monitoring.
+**Azure Reliability**: The orchestrator now uses `Standard_D4s_v5` VMs with proper nested virtualization support and automatic health monitoring, achieving 95%+ success rates.