
Commit 44697a1

abrichr and claude authored
feat(wandb): add Weights & Biases integration with fixtures and reports (#21)
* docs: add WAA integration guide for vanilla approach

  Documents the minimal-patches approach to WAA integration:
  - 5 lines of patches to vendor/WindowsAgentArena
  - Auto-ISO download via VERSION=11e
  - IP address fix for modern dockurr/windows
  - Architecture diagram showing wrapper layers
  - Quick start guides for local and Azure deployment

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(wandb): add Weights & Biases integration with fixtures and reports

  Add comprehensive W&B integration for experiment tracking and benchmark visualization:
  - `openadapt_evals/integrations/wandb_logger.py`: Core logging class that handles run initialization, metric logging, artifact uploads, and per-domain breakdown statistics
  - `openadapt_evals/integrations/fixtures.py`: Synthetic data generators for testing/demos with scenarios: noise (10%), best (85%), worst (5%), and median (20% - SOTA-like) success rates
  - `openadapt_evals/integrations/wandb_reports.py`: Programmatic report generation via W&B Reports API with charts for success rate, domain breakdown, step distribution, and error analysis
  - `openadapt_evals/integrations/demo_wandb.py`: Demo script to populate wandb with synthetic evaluation data across all scenarios
  - CLI commands: wandb-demo, wandb-report, wandb-log for easy CLI access
  - Add wandb as optional dependency in pyproject.toml
  - Add WANDB_API_KEY to .env.example with documentation
  - Add docs/wandb_integration.md with usage guide and report design

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat(cli): add simplified `run` command for live evaluation

  - Add `run` command with good defaults (localhost:5001, 15 steps)
  - Update CLAUDE.md with comprehensive two-repo workflow guide
  - Document API key auto-loading from .env via config.py
  - Add --api-key optional override syntax

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
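The fixture scenarios named in the commit message (noise 10%, best 85%, worst 5%, median 20%) could be generated along the following lines. This is a minimal sketch and not the contents of `fixtures.py`, which are not shown on this page; `SCENARIO_SUCCESS_RATES`, `generate_results`, and `success_rate` are hypothetical names, and the result dict layout is an assumption.

```python
import random

# Target success rates per scenario, taken from the commit message.
SCENARIO_SUCCESS_RATES = {"noise": 0.10, "best": 0.85, "worst": 0.05, "median": 0.20}


def generate_results(scenario, num_tasks=100, seed=0):
    """Return one synthetic result dict per task for the given scenario.

    Hypothetical helper: the real generator may emit richer records.
    """
    rate = SCENARIO_SUCCESS_RATES[scenario]
    rng = random.Random(seed)  # seeded so repeated demo runs match
    return [
        {"task_id": f"task_{i}", "success": rng.random() < rate}
        for i in range(num_tasks)
    ]


def success_rate(results):
    """Fraction of results marked successful (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r["success"] for r in results) / len(results)
```

Seeding the generator keeps demo dashboards stable between runs, so a chart that changes reflects a code change and not sampling noise.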
1 parent cbe26c9 commit 44697a1

12 files changed

Lines changed: 3588 additions & 102 deletions

File tree

.beads/issues.jsonl

Lines changed: 2 additions & 2 deletions
@@ -2,10 +2,10 @@
22
{"id":"openadapt-evals-0ms","title":"Run 20-50 task evaluation","description":"Run WAA benchmark on 20-50 tasks to measure baseline success rate. Target is \u003e80% success rate. This provides quantitative data on agent performance.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:26.461765-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T17:44:26.461765-05:00","dependencies":[{"issue_id":"openadapt-evals-0ms","depends_on_id":"openadapt-evals-c3f","type":"blocks","created_at":"2026-01-20T17:44:26.462904-05:00","created_by":"Richard Abrich"}]}
33
{"id":"openadapt-evals-2ar","title":"Implement permanent fix for Windows unattended install","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.544113-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.634857-05:00","closed_at":"2026-01-20T20:32:06.634857-05:00","close_reason":"Duplicate of openadapt-evals-b3l"}
44
{"id":"openadapt-evals-5o8","title":"Analyze evaluation results","description":"Analyze WAA evaluation results to identify failure modes, success patterns, and improvement opportunities. Document findings and create actionable next steps.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:29.782932-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T17:44:29.782932-05:00","dependencies":[{"issue_id":"openadapt-evals-5o8","depends_on_id":"openadapt-evals-0ms","type":"blocks","created_at":"2026-01-20T17:44:29.783756-05:00","created_by":"Richard Abrich"}]}
5-
{"id":"openadapt-evals-b3l","title":"Implement permanent fix for Windows unattended install","description":"ROOT CAUSE FOUND: Using dev mode (UNC paths \\\\host.lan\\Data) instead of Azure mode (C:\\oem). Dev mode had UNC escaping bug in patch_xml.py. FIX: Simplified Dockerfile using vanilla WAA Azure mode approach - native OEM mechanism, no samba.sh patching, no custom FirstLogonCommands.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.092949-05:00","created_by":"Richard Abrich","updated_at":"2026-01-21T12:47:07.710012-05:00"}
5+
{"id":"openadapt-evals-b3l","title":"Implement permanent fix for Windows unattended install","description":"ROOT CAUSE FOUND: Using dev mode (UNC paths \\\\host.lan\\Data) instead of Azure mode (C:\\oem). Dev mode had UNC escaping bug in patch_xml.py. FIX: Simplified Dockerfile using vanilla WAA Azure mode approach - native OEM mechanism, no samba.sh patching, no custom FirstLogonCommands.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.092949-05:00","created_by":"Richard Abrich","updated_at":"2026-01-21T12:47:07.710012-05:00","comments":[{"id":7,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22: Confirmed issue recurred because we were booting from corrupted data.img created with dev mode. Fix: delete /data/waa-storage/* and let vanilla windowsarena/winarena create fresh install.","created_at":"2026-01-22T23:45:59Z"},{"id":8,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22 FIXED: Issues were (1) CLI storage path mismatch /mnt vs /data, (2) booting from corrupted data.img. Fix: standardized paths + deleted corrupted image. Fresh vanilla WAA install now at 18%+ and progressing.","created_at":"2026-01-22T23:56:59Z"}]}
66
{"id":"openadapt-evals-c3f","title":"Complete WAA validation","description":"Validate that the WAA benchmark setup works end-to-end. Run a single task to confirm the infrastructure is operational before scaling up to full evaluation.","notes":"2026-01-22: Attempted end-to-end live smoke run on Azure VM.\n\n- Command: uv run python -m openadapt_evals.benchmarks.cli smoke-live --vm-name waa-eval-vm --resource-group OPENADAPT-AGENTS --task-id notepad_1\n- VM start + public IP succeeded (172.171.112.41)\n- Blocker: az vm run-command invoke timed out while running 'docker start winarena' (container start never returned)\n- Result: WAA server never became reachable on :5000; live eval could not connect\n- Cleanup: VM deallocated at end to stop spend\n\nNext: run remote docker diagnostics (docker ps -a, docker logs winarena, systemctl status docker, disk space) and fix underlying image/container hang (likely winarena pull/extract / docker stuck).","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:18.817497-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:31:57.790605-05:00"}
77
{"id":"openadapt-evals-czj","title":"Docker installation fails on Azure VM - pkgProblemResolver error","description":"vm setup-waa fails to install Docker. Error: pkgProblemResolver::Resolve generated breaks. Need to investigate root cause before attempting fix.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T22:48:59.527637-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T22:48:59.527637-05:00"}
88
{"id":"openadapt-evals-dke","title":"SYSTEM: Create knowledge persistence workflow using Beads","description":"Every fix/approach must be logged as a Beads issue with:\n1. Problem description\n2. Attempted solution\n3. Result (worked/failed/partial)\n4. Root cause if known\n5. Files changed\n\nBefore any fix attempt, agent MUST:\n1. Run 'bd list --labels=fix,approach' to see prior attempts\n2. Review what was tried before\n3. Document new attempt BEFORE implementing\n\nAfter context compaction, first action:\n1. Run 'bd ready' for current tasks\n2. Run 'bd list --labels=recurring' for known recurring issues\n3. Check docs/RECURRING_ISSUES.md for patterns","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T19:00:18.155796-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T19:00:18.155796-05:00"}
9-
{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:32:01.038825-05:00","labels":["testing","waa"]}
9+
{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:32:01.038825-05:00","labels":["testing","waa"],"comments":[{"id":1,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Session Recovery 2026-01-22 17:58: Previous agents killed during compaction. VM state: Docker/containerd unhealthy, disk /mnt only 32GB (need 47GB+ for vanilla WAA). Git-lfs failing. User feedback: 1) use beads, 2) larger disk, 3) clean up CLI, 4) vanilla WAA config.","created_at":"2026-01-22T18:05:45Z"},{"id":2,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Launched 3 parallel agents: ae159fc (VM disk upgrade), aabad47 (CLI cleanup), aee4e8a (fix containerd). 
Check /private/tmp/claude/-Users-abrichr-oa-src-openadapt-ml/tasks/*.output for results.","created_at":"2026-01-22T18:06:18Z"},{"id":3,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"WORKFLOW DOCUMENTED: VM config changes = delete VM -\u003e update code -\u003e relaunch. Added to CLAUDE.md. Default VM size now D8ds_v5 (300GB). Launching fresh VM now.","created_at":"2026-01-22T18:09:12Z"},{"id":4,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:20: VM resources cleaned up, launched agent a9be1f8 to add auto-cleanup to CLI, WAA setup retrying in background (b04fcbe). Workflow documented in CLAUDE.md and STATUS.md.","created_at":"2026-01-22T18:11:56Z"},{"id":5,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:30: VM created with D8s_v3 fallback (D8ds_v5 quota 0), IP 20.120.37.97. Restored waa_deploy symlink. Docker image building. W\u0026B integration agent a21c3ef running.","created_at":"2026-01-22T18:25:29Z"},{"id":6,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 19:05: WAA Docker image built successfully! Container running. Windows booting. VM: 20.120.37.97, VNC: http://20.120.37.97:8006","created_at":"2026-01-22T18:47:03Z"}]}
1010
{"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
1111
{"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}
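The records above are one JSON object per line, so the open issues and what they wait on can be listed with a few lines of Python. This is a sketch only: `open_blocked_issues` is a hypothetical helper, and it assumes that a dependency entry of type `blocks` means the issue named in `depends_on_id` must finish first, which matches the titles above but is not confirmed by the Beads tooling here.

```python
import json


def open_blocked_issues(jsonl_text):
    """Map each open issue id to the ids it depends on via "blocks" links."""
    blocked = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        issue = json.loads(line)
        if issue.get("status") != "open":
            continue
        deps = [
            dep["depends_on_id"]
            for dep in issue.get("dependencies", [])
            if dep.get("type") == "blocks"
        ]
        if deps:
            blocked[issue["id"]] = deps
    return blocked
```

Applied to the file in this diff, the two entries with a `dependencies` field (`openadapt-evals-0ms` and `openadapt-evals-5o8`) would be the ones reported.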

.env.example

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
# =============================================================================
2+
# API Keys for VLM backends
3+
# =============================================================================
4+
ANTHROPIC_API_KEY="your-anthropic-api-key"
5+
OPENAI_API_KEY="your-openai-api-key"
6+
GOOGLE_API_KEY="your-google-api-key"
7+
8+
# =============================================================================
9+
# Lambda Labs (for GPU training)
10+
# =============================================================================
11+
LAMBDA_API_KEY="your-lambda-api-key"
12+
13+
# =============================================================================
14+
# Weights & Biases (for experiment tracking)
15+
# =============================================================================
16+
# Get your API key from: https://wandb.ai/authorize
17+
WANDB_API_KEY="your-wandb-api-key"
18+
19+
# Optional wandb settings
20+
# WANDB_PROJECT="openadapt-evals" # Default project name
21+
# WANDB_ENTITY="your-team" # Team/organization name
22+
# WANDB_MODE="online" # online, offline, or disabled
23+
24+
# =============================================================================
25+
# Azure Credentials (auto-generated by setup_azure.py)
26+
# =============================================================================
27+
# AZURE_CLIENT_ID=
28+
# AZURE_CLIENT_SECRET=
29+
# AZURE_TENANT_ID=
30+
# AZURE_SUBSCRIPTION_ID=
31+
# AZURE_ML_RESOURCE_GROUP=
32+
# AZURE_ML_WORKSPACE_NAME=
33+
# AZURE_DOCKER_IMAGE=
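The W&B variables in this file could be read with defaults as sketched below. This is an illustration and not the project's `config.py`; `read_wandb_settings` and `VALID_WANDB_MODES` are hypothetical names. The project default comes from the comment in the file, and treating `online` as the default mode is an assumption.

```python
import os

# Modes listed in the .env.example comment.
VALID_WANDB_MODES = {"online", "offline", "disabled"}


def read_wandb_settings(env=None):
    """Collect W&B settings from an environment mapping, applying defaults."""
    env = os.environ if env is None else env
    mode = env.get("WANDB_MODE", "online")  # assumed default
    if mode not in VALID_WANDB_MODES:
        raise ValueError(f"WANDB_MODE must be one of {sorted(VALID_WANDB_MODES)}, got {mode!r}")
    return {
        "api_key": env.get("WANDB_API_KEY"),  # None when not configured
        "project": env.get("WANDB_PROJECT", "openadapt-evals"),
        "entity": env.get("WANDB_ENTITY"),
        "mode": mode,
    }
```

Passing a plain dict as `env` makes the function easy to exercise in tests without touching the real process environment.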

CLAUDE.md

Lines changed: 128 additions & 6 deletions
@@ -42,8 +42,11 @@ uv sync
4242
# Run mock evaluation (no VM required)
4343
uv run python -m openadapt_evals.benchmarks.cli mock --tasks 10
4444

45-
# Run live evaluation against WAA server
46-
uv run python -m openadapt_evals.benchmarks.cli live --agent api-claude --server http://vm-ip:5000 --task-ids notepad_1
45+
# Run live evaluation (simplified - uses localhost:5001 by default)
46+
uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1
47+
48+
# Run live evaluation (full control)
49+
uv run python -m openadapt_evals.benchmarks.cli live --agent api-claude --server http://localhost:5001 --task-ids notepad_1
4750

4851
# Azure parallel evaluation
4952
uv run python -m openadapt_evals.benchmarks.cli azure --workers 10 --waa-path /path/to/WAA
@@ -52,12 +55,96 @@ uv run python -m openadapt_evals.benchmarks.cli azure --workers 10 --waa-path /p
5255
uv run python -m openadapt_evals.benchmarks.cli up
5356
```
5457

58+
---
59+
60+
## 🎯 WAA BENCHMARK WORKFLOW (COMPLETE GUIDE)
61+
62+
### Architecture Overview
63+
64+
The WAA setup spans TWO repos with distinct responsibilities:
65+
66+
```
67+
LOCAL MACHINE
68+
├── openadapt-ml CLI (VM management)
69+
│ - vm setup-waa # Create VM + Docker + WAA
70+
│ - vm monitor # Dashboard + SSH tunnels
71+
│ - vm deallocate # Stop billing
72+
73+
├── openadapt-evals CLI (benchmark execution)
74+
│ - run # Simplified benchmark run
75+
│ - live # Full control live eval
76+
│ - mock # No VM needed
77+
78+
└── SSH Tunnels (auto-managed by vm monitor)
79+
- localhost:5001 → VM:5000 (WAA Flask API)
80+
- localhost:8006 → VM:8006 (noVNC)
81+
82+
AZURE VM (Ubuntu)
83+
└── Docker
84+
└── windowsarena/winarena:latest
85+
└── QEMU (Windows 11)
86+
├── WAA Flask server (port 5000)
87+
└── Navi agent (executes tasks)
88+
```
89+
90+
### Step-by-Step Workflow
91+
92+
**Step 1: Setup VM (from openadapt-ml, first time only)**
93+
```bash
94+
cd /Users/abrichr/oa/src/openadapt-ml
95+
uv run python -m openadapt_ml.benchmarks.cli vm setup-waa
96+
```
**Step 2: Start Dashboard and Tunnels (from openadapt-ml)**
98+
```bash
99+
uv run python -m openadapt_ml.benchmarks.cli vm monitor
100+
```
101+
Keep this running! It manages SSH tunnels automatically.
102+
103+
**Step 3: Run Benchmark (from openadapt-evals)**
104+
```bash
105+
cd /Users/abrichr/oa/src/openadapt-evals
106+
107+
# Quick smoke test (no API key needed)
108+
uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1
109+
110+
# With OpenAI
111+
uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1
112+
113+
# With Claude
114+
uv run python -m openadapt_evals.benchmarks.cli run --agent api-claude --task notepad_1
115+
116+
# Multiple tasks
117+
uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2
118+
```
119+
120+
**Step 4: View Results**
121+
```bash
122+
uv run python -m openadapt_evals.benchmarks.cli view --run-name live_eval
123+
```
124+
125+
**Step 5: Stop VM (from openadapt-ml)**
126+
```bash
127+
cd /Users/abrichr/oa/src/openadapt-ml
128+
uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
129+
```
130+
131+
### Key Points
132+
133+
1. **Two CLIs** - openadapt-ml manages VM/Docker, openadapt-evals runs benchmarks
134+
2. **SSH tunnels required** - Azure NSG blocks direct port access
135+
3. **Default server is localhost:5001** - The `run` command uses this automatically
136+
4. **WAA runs INSIDE Windows** - Not on the Ubuntu host
137+
5. **Results in benchmark_results/** - Use `view` command to see them
138+
139+
---
140+
55141
## CLI Commands
56142

57143
| Command | Description |
58144
|---------|-------------|
145+
| `run` | **Simplified live evaluation** (uses localhost:5001 by default) |
59146
| `mock` | Run with mock adapter (testing, no VM) |
60-
| `live` | Run against live WAA server |
147+
| `live` | Run against live WAA server (full control) |
61148
| `azure` | Run parallel evaluation on Azure |
62149
| `probe` | Check if WAA server is ready |
63150
| `view` | Generate HTML viewer for results |
@@ -69,6 +156,30 @@ uv run python -m openadapt_evals.benchmarks.cli up
69156
| `vm-status` | Check Azure VM status and IP |
70157
| `vm-setup` | Full WAA container setup (automated) |
71158

159+
### `run` Command (Recommended for Live Evaluation)
160+
161+
The `run` command is a simplified wrapper around `live` with good defaults:
162+
163+
```bash
164+
# Single task
165+
uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1
166+
167+
# Multiple tasks
168+
uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2
169+
170+
# Smoke test (no API key)
171+
uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1
172+
173+
# With custom server
174+
uv run python -m openadapt_evals.benchmarks.cli run --server http://localhost:5001 --agent api-claude --task notepad_1
175+
```
176+
177+
**Defaults:**
178+
- `--server http://localhost:5001` (matches openadapt-ml tunnel)
179+
- `--max-steps 15`
180+
- `--output benchmark_results`
181+
- `--run-name live_eval`
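The defaults listed above can be pictured as an `argparse` parser. This is a sketch under stated assumptions and not the actual CLI source: `build_run_parser` and `task_ids` are hypothetical names, and treating `--task` and `--tasks` as mutually exclusive and `--agent` as required are guesses based on the usage examples.

```python
import argparse


def build_run_parser():
    """Build a parser mirroring the documented `run` flags and defaults."""
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument("--agent", required=True)  # e.g. noop, api-openai, api-claude
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--task")   # single task id
    group.add_argument("--tasks")  # comma-separated task ids
    parser.add_argument("--server", default="http://localhost:5001")
    parser.add_argument("--max-steps", type=int, default=15)
    parser.add_argument("--output", default="benchmark_results")
    parser.add_argument("--run-name", default="live_eval")
    return parser


def task_ids(args):
    """Normalize --task / --tasks into a list of task ids."""
    if args.task:
        return [args.task]
    return [t.strip() for t in args.tasks.split(",") if t.strip()]
```

For example, `build_run_parser().parse_args(["--agent", "noop", "--task", "notepad_1"])` yields the documented server, step limit, output directory, and run name without any of them being passed.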
182+
72183
## Architecture
73184

74185
```
@@ -182,14 +293,24 @@ adapter = WAALiveAdapter(server_url="http://vm:5000")
182293

183294
## Environment Variables
184295

296+
**Auto-loaded from `.env` via `config.py`** - no need to pass explicitly on CLI.
297+
298+
```bash
299+
# .env file (create in repo root, not committed to git)
300+
OPENAI_API_KEY=sk-...
301+
ANTHROPIC_API_KEY=sk-ant-...
302+
```
303+
185304
| Variable | Description |
186305
|----------|-------------|
187-
| `ANTHROPIC_API_KEY` | For Claude agents |
188-
| `OPENAI_API_KEY` | For GPT agents |
306+
| `ANTHROPIC_API_KEY` | For Claude agents (api-claude) |
307+
| `OPENAI_API_KEY` | For GPT agents (api-openai) |
189308
| `AZURE_SUBSCRIPTION_ID` | Azure subscription |
190309
| `AZURE_ML_RESOURCE_GROUP` | Azure ML resource group |
191310
| `AZURE_ML_WORKSPACE_NAME` | Azure ML workspace |
192311

312+
Optional override on any command: `[--api-key KEY]`
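The precedence implied here (an explicit `--api-key` wins, otherwise the value loaded from `.env` is used) might look like the sketch below. It is illustrative only: `resolve_api_key` and `AGENT_ENV_VARS` are hypothetical names, and the agent-to-variable mapping is taken from the table above.

```python
import os

# Which environment variable each API-backed agent reads (from the table above).
AGENT_ENV_VARS = {
    "api-claude": "ANTHROPIC_API_KEY",
    "api-openai": "OPENAI_API_KEY",
}


def resolve_api_key(agent, cli_key=None, env=None):
    """Return the key to use for an agent, preferring the CLI override."""
    env = os.environ if env is None else env
    if cli_key:
        return cli_key
    var_name = AGENT_ENV_VARS.get(agent)
    if var_name is None:
        return None  # e.g. the noop agent needs no key
    return env.get(var_name)
```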
313+
193314
## Azure Quota Management
194315

195316
Stale compute instances exhaust quota. Use cleanup:
@@ -206,7 +327,8 @@ Auto-cleanup is enabled by default. Only use `--no-cleanup` for debugging.
206327

207328
## WAA /evaluate Endpoint
208329

209-
Deploy the endpoint to WAA server:
330+
Deploy the endpoint to the WAA server. WAALiveAdapter requires `/evaluate` to
331+
be available; evaluations fail without it:
210332

211333
```bash
212334
scp openadapt_evals/server/waa_server_patch.py azureuser@vm:/tmp/
