OpenAdaptAI
diff --git a/‎.beads/issues.jsonl‎
Lines changed: 2 additions & 2 deletions b/‎.beads/issues.jsonl‎
diff --git a/‎.env.example‎
Lines changed: 33 additions & 0 deletions b/‎.env.example‎
diff --git a/‎CLAUDE.md‎
Lines changed: 128 additions & 6 deletions b/‎CLAUDE.md‎
@@ -2,10 +2,10 @@
 {"id":"openadapt-evals-0ms","title":"Run 20-50 task evaluation","description":"Run WAA benchmark on 20-50 tasks to measure baseline success rate. Target is \u003e80% success rate. This provides quantitative data on agent performance.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:26.461765-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T17:44:26.461765-05:00","dependencies":[{"issue_id":"openadapt-evals-0ms","depends_on_id":"openadapt-evals-c3f","type":"blocks","created_at":"2026-01-20T17:44:26.462904-05:00","created_by":"Richard Abrich"}]}
 {"id":"openadapt-evals-2ar","title":"Implement permanent fix for Windows unattended install","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.544113-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.634857-05:00","closed_at":"2026-01-20T20:32:06.634857-05:00","close_reason":"Duplicate of openadapt-evals-b3l"}
 {"id":"openadapt-evals-5o8","title":"Analyze evaluation results","description":"Analyze WAA evaluation results to identify failure modes, success patterns, and improvement opportunities. Document findings and create actionable next steps.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:29.782932-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T17:44:29.782932-05:00","dependencies":[{"issue_id":"openadapt-evals-5o8","depends_on_id":"openadapt-evals-0ms","type":"blocks","created_at":"2026-01-20T17:44:29.783756-05:00","created_by":"Richard Abrich"}]}
-{"id":"openadapt-evals-b3l","title":"Implement permanent fix for Windows unattended install","description":"ROOT CAUSE FOUND: Using dev mode (UNC paths \\\\host.lan\\Data) instead of Azure mode (C:\\oem). Dev mode had UNC escaping bug in patch_xml.py. FIX: Simplified Dockerfile using vanilla WAA Azure mode approach - native OEM mechanism, no samba.sh patching, no custom FirstLogonCommands.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.092949-05:00","created_by":"Richard Abrich","updated_at":"2026-01-21T12:47:07.710012-05:00"}
+{"id":"openadapt-evals-b3l","title":"Implement permanent fix for Windows unattended install","description":"ROOT CAUSE FOUND: Using dev mode (UNC paths \\\\host.lan\\Data) instead of Azure mode (C:\\oem). Dev mode had UNC escaping bug in patch_xml.py. FIX: Simplified Dockerfile using vanilla WAA Azure mode approach - native OEM mechanism, no samba.sh patching, no custom FirstLogonCommands.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.092949-05:00","created_by":"Richard Abrich","updated_at":"2026-01-21T12:47:07.710012-05:00","comments":[{"id":7,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22: Confirmed issue recurred because we were booting from corrupted data.img created with dev mode. Fix: delete /data/waa-storage/* and let vanilla windowsarena/winarena create fresh install.","created_at":"2026-01-22T23:45:59Z"},{"id":8,"issue_id":"openadapt-evals-b3l","author":"Richard Abrich","text":"Jan 22 FIXED: Issues were (1) CLI storage path mismatch /mnt vs /data, (2) booting from corrupted data.img. Fix: standardized paths + deleted corrupted image. Fresh vanilla WAA install now at 18%+ and progressing.","created_at":"2026-01-22T23:56:59Z"}]}
 {"id":"openadapt-evals-c3f","title":"Complete WAA validation","description":"Validate that the WAA benchmark setup works end-to-end. Run a single task to confirm the infrastructure is operational before scaling up to full evaluation.","notes":"2026-01-22: Attempted end-to-end live smoke run on Azure VM.\n\n- Command: uv run python -m openadapt_evals.benchmarks.cli smoke-live --vm-name waa-eval-vm --resource-group OPENADAPT-AGENTS --task-id notepad_1\n- VM start + public IP succeeded (172.171.112.41)\n- Blocker: az vm run-command invoke timed out while running 'docker start winarena' (container start never returned)\n- Result: WAA server never became reachable on :5000; live eval could not connect\n- Cleanup: VM deallocated at end to stop spend\n\nNext: run remote docker diagnostics (docker ps -a, docker logs winarena, systemctl status docker, disk space) and fix underlying image/container hang (likely winarena pull/extract / docker stuck).","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:18.817497-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:31:57.790605-05:00"}
 {"id":"openadapt-evals-czj","title":"Docker installation fails on Azure VM - pkgProblemResolver error","description":"vm setup-waa fails to install Docker. Error: pkgProblemResolver::Resolve generated breaks. Need to investigate root cause before attempting fix.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T22:48:59.527637-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T22:48:59.527637-05:00"}
 {"id":"openadapt-evals-dke","title":"SYSTEM: Create knowledge persistence workflow using Beads","description":"Every fix/approach must be logged as a Beads issue with:\n1. Problem description\n2. Attempted solution\n3. Result (worked/failed/partial)\n4. Root cause if known\n5. Files changed\n\nBefore any fix attempt, agent MUST:\n1. Run 'bd list --labels=fix,approach' to see prior attempts\n2. Review what was tried before\n3. Document new attempt BEFORE implementing\n\nAfter context compaction, first action:\n1. Run 'bd ready' for current tasks\n2. Run 'bd list --labels=recurring' for known recurring issues\n3. Check docs/RECURRING_ISSUES.md for patterns","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T19:00:18.155796-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T19:00:18.155796-05:00"}
-{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:32:01.038825-05:00","labels":["testing","waa"]}
+{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:32:01.038825-05:00","labels":["testing","waa"],"comments":[{"id":1,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Session Recovery 2026-01-22 17:58: Previous agents killed during compaction. VM state: Docker/containerd unhealthy, disk /mnt only 32GB (need 47GB+ for vanilla WAA). Git-lfs failing. User feedback: 1) use beads, 2) larger disk, 3) clean up CLI, 4) vanilla WAA config.","created_at":"2026-01-22T18:05:45Z"},{"id":2,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Launched 3 parallel agents: ae159fc (VM disk upgrade), aabad47 (CLI cleanup), aee4e8a (fix containerd). 
Check /private/tmp/claude/-Users-abrichr-oa-src-openadapt-ml/tasks/*.output for results.","created_at":"2026-01-22T18:06:18Z"},{"id":3,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"WORKFLOW DOCUMENTED: VM config changes = delete VM -\u003e update code -\u003e relaunch. Added to CLAUDE.md. Default VM size now D8ds_v5 (300GB). Launching fresh VM now.","created_at":"2026-01-22T18:09:12Z"},{"id":4,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:20: VM resources cleaned up, launched agent a9be1f8 to add auto-cleanup to CLI, WAA setup retrying in background (b04fcbe). Workflow documented in CLAUDE.md and STATUS.md.","created_at":"2026-01-22T18:11:56Z"},{"id":5,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:30: VM created with D8s_v3 fallback (D8ds_v5 quota 0), IP 20.120.37.97. Restored waa_deploy symlink. Docker image building. W\u0026B integration agent a21c3ef running.","created_at":"2026-01-22T18:25:29Z"},{"id":6,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 19:05: WAA Docker image built successfully! Container running. Windows booting. VM: 20.120.37.97, VNC: http://20.120.37.97:8006","created_at":"2026-01-22T18:47:03Z"}]}
 {"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
 {"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}
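Reviewer sketch, not part of the diff: the issue records above are one JSON object per line, with optional `dependencies` entries that carry a `depends_on_id`. The snippet below is a minimal illustration of reading such lines and listing open issues that still wait on another open issue. The function names `load_issues` and `blocked_open_issues` are hypothetical, and the sample records are trimmed to the fields used here; this is not how the `bd` tool itself works.

```python
import json


def load_issues(lines):
    """Parse JSONL text lines into a dict keyed by issue id."""
    issues = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        issues[record["id"]] = record
    return issues


def blocked_open_issues(issues):
    """Map each open issue id to the open issues it depends on."""
    result = {}
    for issue_id, record in issues.items():
        if record.get("status") != "open":
            continue
        blockers = [
            dep["depends_on_id"]
            for dep in record.get("dependencies", [])
            if issues.get(dep["depends_on_id"], {}).get("status") == "open"
        ]
        if blockers:
            result[issue_id] = blockers
    return result


# Trimmed-down records modeled on the entries in the hunk above.
sample = [
    '{"id": "openadapt-evals-0ms", "status": "open", "dependencies": '
    '[{"issue_id": "openadapt-evals-0ms", "depends_on_id": "openadapt-evals-c3f", "type": "blocks"}]}',
    '{"id": "openadapt-evals-c3f", "status": "open"}',
    '{"id": "openadapt-evals-sz4", "status": "closed"}',
]

print(blocked_open_issues(load_issues(sample)))
```

Because every record must sit on a single line, a literal newline inside a string value would make `json.loads` fail for that line.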
@@ -0,0 +1,33 @@
+# =============================================================================
+# API Keys for VLM backends
+# =============================================================================
+ANTHROPIC_API_KEY="your-anthropic-api-key"
+OPENAI_API_KEY="your-openai-api-key"
+GOOGLE_API_KEY="your-google-api-key"
+
+# =============================================================================
+# Lambda Labs (for GPU training)
+# =============================================================================
+LAMBDA_API_KEY="your-lambda-api-key"
+
+# =============================================================================
+# Weights & Biases (for experiment tracking)
+# =============================================================================
+# Get your API key from: https://wandb.ai/authorize
+WANDB_API_KEY="your-wandb-api-key"
+
+# Optional wandb settings
+# WANDB_PROJECT="openadapt-evals"  # Default project name
+# WANDB_ENTITY="your-team"         # Team/organization name
+# WANDB_MODE="online"              # online, offline, or disabled
+
+# =============================================================================
+# Azure Credentials (auto-generated by setup_azure.py)
+# =============================================================================
+# AZURE_CLIENT_ID=
+# AZURE_CLIENT_SECRET=
+# AZURE_TENANT_ID=
+# AZURE_SUBSCRIPTION_ID=
+# AZURE_ML_RESOURCE_GROUP=
+# AZURE_ML_WORKSPACE_NAME=
+# AZURE_DOCKER_IMAGE=
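Reviewer sketch, not part of the diff: the CLAUDE.md changes later in this PR say these values are auto-loaded from `.env` via `config.py`, but that file is not shown here. The snippet below is only a rough illustration of what loading such a file can look like, under the assumption that already-set environment variables should win. `parse_env` and `apply_env` are hypothetical names, and the parser deliberately ignores trailing comments on uncommented lines.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def apply_env(values, environ=os.environ):
    """Copy parsed values into environ without overriding existing keys."""
    for key, value in values.items():
        environ.setdefault(key, value)


example = """
# API Keys for VLM backends
ANTHROPIC_API_KEY="your-anthropic-api-key"
# WANDB_MODE="online"
# AZURE_TENANT_ID=
"""

parsed = parse_env(example)
print(parsed)
```

Using `setdefault` means a key exported in the shell takes precedence over the file, which is one common convention; the project's real loader may behave differently.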
@@ -42,8 +42,11 @@ uv sync
 # Run mock evaluation (no VM required)
 uv run python -m openadapt_evals.benchmarks.cli mock --tasks 10
 
-# Run live evaluation against WAA server
-uv run python -m openadapt_evals.benchmarks.cli live --agent api-claude --server http://vm-ip:5000 --task-ids notepad_1
+# Run live evaluation (simplified - uses localhost:5001 by default)
+uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1
+
+# Run live evaluation (full control)
+uv run python -m openadapt_evals.benchmarks.cli live --agent api-claude --server http://localhost:5001 --task-ids notepad_1
 
 # Azure parallel evaluation
 uv run python -m openadapt_evals.benchmarks.cli azure --workers 10 --waa-path /path/to/WAA
@@ -52,12 +55,96 @@ uv run python -m openadapt_evals.benchmarks.cli azure --workers 10 --waa-path /p
 uv run python -m openadapt_evals.benchmarks.cli up
 ```
 
+---
+
+## 🎯 WAA BENCHMARK WORKFLOW (COMPLETE GUIDE)
+
+### Architecture Overview
+
+The WAA setup spans TWO repos with distinct responsibilities:
+
+```
+LOCAL MACHINE
+├── openadapt-ml CLI (VM management)
+│   - vm setup-waa    # Create VM + Docker + WAA
+│   - vm monitor      # Dashboard + SSH tunnels
+│   - vm deallocate   # Stop billing
+│
+├── openadapt-evals CLI (benchmark execution)
+│   - run             # Simplified benchmark run
+│   - live            # Full control live eval
+│   - mock            # No VM needed
+│
+└── SSH Tunnels (auto-managed by vm monitor)
+    - localhost:5001 → VM:5000 (WAA Flask API)
+    - localhost:8006 → VM:8006 (noVNC)
+
+AZURE VM (Ubuntu)
+└── Docker
+    └── windowsarena/winarena:latest
+        └── QEMU (Windows 11)
+            ├── WAA Flask server (port 5000)
+            └── Navi agent (executes tasks)
+```
+
+### Step-by-Step Workflow
+
+**Step 1: Setup VM (from openadapt-ml, first time only)**
+```bash
+cd /Users/abrichr/oa/src/openadapt-ml
+uv run python -m openadapt_ml.benchmarks.cli vm setup-waa
+```
+
+**Step 2: Start Dashboard and Tunnels (from openadapt-ml)**
+```bash
+uv run python -m openadapt_ml.benchmarks.cli vm monitor
+```
+Keep this running! It manages SSH tunnels automatically.
+
+**Step 3: Run Benchmark (from openadapt-evals)**
+```bash
+cd /Users/abrichr/oa/src/openadapt-evals
+
+# Quick smoke test (no API key needed)
+uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1
+
+# With OpenAI
+uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1
+
+# With Claude
+uv run python -m openadapt_evals.benchmarks.cli run --agent api-claude --task notepad_1
+
+# Multiple tasks
+uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2
+```
+
+**Step 4: View Results**
+```bash
+uv run python -m openadapt_evals.benchmarks.cli view --run-name live_eval
+```
+
+**Step 5: Stop VM (from openadapt-ml)**
+```bash
+cd /Users/abrichr/oa/src/openadapt-ml
+uv run python -m openadapt_ml.benchmarks.cli vm deallocate -y
+```
+
+### Key Points
+
+1. **Two CLIs** - openadapt-ml manages VM/Docker, openadapt-evals runs benchmarks
+2. **SSH tunnels required** - Azure NSG blocks direct port access
+3. **Default server is localhost:5001** - The `run` command uses this automatically
+4. **WAA runs INSIDE Windows** - Not on the Ubuntu host
+5. **Results in benchmark_results/** - Use `view` command to see them
+
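Reviewer sketch, not part of the diff: since the benchmark depends on the tunnels to `localhost:5001` and `localhost:8006` being up, a quick TCP check before running can save a failed evaluation. The helper below is a minimal illustration using only the standard library. `port_open` is a hypothetical name, and a successful TCP connect only shows that something is listening, not that the WAA server inside Windows is ready (the `probe` command exists for that).

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Local tunnel endpoints listed in the architecture overview.
    for name, port in {"WAA Flask API": 5001, "noVNC": 8006}.items():
        state = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port}: {state}")
```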
+---
+
 ## CLI Commands
 
 | Command | Description |
 |---------|-------------|
+| `run` | **Simplified live evaluation** (uses localhost:5001 by default) |
 | `mock` | Run with mock adapter (testing, no VM) |
-| `live` | Run against live WAA server |
+| `live` | Run against live WAA server (full control) |
 | `azure` | Run parallel evaluation on Azure |
 | `probe` | Check if WAA server is ready |
 | `view` | Generate HTML viewer for results |
@@ -69,6 +156,30 @@ uv run python -m openadapt_evals.benchmarks.cli up
 | `vm-status` | Check Azure VM status and IP |
 | `vm-setup` | Full WAA container setup (automated) |
 
+### `run` Command (Recommended for Live Evaluation)
+
+The `run` command is a simplified wrapper around `live` with good defaults:
+
+```bash
+# Single task
+uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --task notepad_1
+
+# Multiple tasks
+uv run python -m openadapt_evals.benchmarks.cli run --agent api-openai --tasks notepad_1,notepad_2
+
+# Smoke test (no API key)
+uv run python -m openadapt_evals.benchmarks.cli run --agent noop --task notepad_1
+
+# With custom server
+uv run python -m openadapt_evals.benchmarks.cli run --server http://localhost:5001 --agent api-claude --task notepad_1
+```
+
+**Defaults:**
+- `--server http://localhost:5001` (matches openadapt-ml tunnel)
+- `--max-steps 15`
+- `--output benchmark_results`
+- `--run-name live_eval`
+
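Reviewer sketch, not part of the diff: the defaults listed above suggest `run` fills in arguments and delegates to `live`. The snippet below illustrates that wrapper pattern with `argparse`. It is an assumption about the shape of the code, not the actual CLI implementation; `build_run_parser` and `to_live_args` are hypothetical names, and only the `live` flags that appear in this PR (`--agent`, `--server`, `--task-ids`) are forwarded.

```python
import argparse


def build_run_parser():
    """Parser mirroring the documented `run` flags and defaults."""
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument("--agent", required=True)
    # Exactly one of --task / --tasks must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--task")
    group.add_argument("--tasks")
    parser.add_argument("--server", default="http://localhost:5001")
    parser.add_argument("--max-steps", type=int, default=15)
    parser.add_argument("--output", default="benchmark_results")
    parser.add_argument("--run-name", default="live_eval")
    return parser


def to_live_args(ns):
    """Translate parsed `run` arguments into a `live` argument list.

    Whether `live` also accepts --max-steps, --output, or --run-name is
    not shown in this diff, so those values stay on the namespace only.
    """
    task_ids = ns.task if ns.task else ns.tasks
    return ["live", "--agent", ns.agent, "--server", ns.server, "--task-ids", task_ids]


ns = build_run_parser().parse_args(["--agent", "noop", "--task", "notepad_1"])
print(to_live_args(ns))
```

Keeping the defaults in one parser makes the documented values easy to audit against the list above.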
 ## Architecture
 
 ```
@@ -182,14 +293,24 @@ adapter = WAALiveAdapter(server_url="http://vm:5000")
 
 ## Environment Variables
 
+**Auto-loaded from `.env` via `config.py`** - no need to pass explicitly on CLI.
+
+```bash
+# .env file (create in repo root, not committed to git)
+OPENAI_API_KEY=sk-...
+ANTHROPIC_API_KEY=sk-ant-...
+```
+
 | Variable | Description |
 |----------|-------------|
-| `ANTHROPIC_API_KEY` | For Claude agents |
-| `OPENAI_API_KEY` | For GPT agents |
+| `ANTHROPIC_API_KEY` | For Claude agents (api-claude) |
+| `OPENAI_API_KEY` | For GPT agents (api-openai) |
 | `AZURE_SUBSCRIPTION_ID` | Azure subscription |
 | `AZURE_ML_RESOURCE_GROUP` | Azure ML resource group |
 | `AZURE_ML_WORKSPACE_NAME` | Azure ML workspace |
 
+Optional override on any command: `[--api-key KEY]`
+
 ## Azure Quota Management
 
 Stale compute instances exhaust quota. Use cleanup:
@@ -206,7 +327,8 @@ Auto-cleanup is enabled by default. Only use `--no-cleanup` for debugging.
 
 ## WAA /evaluate Endpoint
 
-Deploy the endpoint to WAA server:
+Deploy the endpoint to the WAA server. WAALiveAdapter requires `/evaluate` to
+be available; evaluations fail without it:
 
 ```bash
 scp openadapt_evals/server/waa_server_patch.py azureuser@vm:/tmp/