feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation by abrichr · Pull Request #35 · OpenAdaptAI/openadapt-evals

abrichr · 2026-02-19T20:55:52Z

Summary

End-to-end improvements to the WAA evaluation pipeline, from infrastructure fixes through instrumentation to automated demo-conditioned evaluation.

Infrastructure and Pool Automation

Fix WAA probe IP, add QMP support, add pool-auto command
Use waa-auto Docker image instead of broken windowsarena/winarena
Replace fragile streaming SSH with docker exec + tail
Add dedicated evaluate server with socat proxy
Add 120s timeout to OpenAI API calls

Agent Fixes (6 targeted reliability improvements)

Handle double_click, right_click, and drag in action parser
Fix coordinate normalization: auto-detect screen size from screenshot
Update system prompt to reference screenshot dimensions
Kill OneDrive/notification overlays before each task
Add loop detector improvements
Filter a11y noise from server console logs in prompts
Demo format preference: use JSON annotated demos when available

Eval-Suite CLI Command

New eval-suite command automates full evaluation cycle
Supports --no-pool-create / --no-pool-cleanup for existing VMs
Runs zero-shot and demo-conditioned variants automatically

Visualization and Instrumentation

Add agent instrumentation per step
Add comparison viewer for side-by-side ZS vs DC replay
Enhance viewer with Agent Thinking panel, heatmap overlay

Evaluation Results

Two full eval rounds (GPT-5.1) across 3 tasks x 2 conditions = 12 runs
DC agent shows more purposeful behavior
All WAA scores 0.00 -- architectural mismatch identified (raw coords vs SoM element IDs)

Test plan

216 tests pass
Mock eval passes
Pool create/wait/eval-suite/cleanup end-to-end on Azure (2 full runs)
12 live WAA evals completed
Viewers render correctly
eval-suite --help shows all options

The `while True: pass` loop burned an entire CPU core during recording. Replace with `time.sleep(0.5)` to yield CPU while waiting for Ctrl+C. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Call recorder.wait_for_ready() before entering the wait loop - Use recorder.is_recording check and 1s sleep to match CLI behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The third WAA task requires .docx files in Documents. The script now creates empty report.docx, meeting_notes.docx, and proposal.docx before recording that task, and cleans up any Archive folder from previous runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence) - Clarify wormhole send instructions (each send blocks until received) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The DOCKER_SETUP_SCRIPT builds waa-auto:latest (based on dockurr/windows:latest which can auto-download Windows ISO) but WAA_START_SCRIPT and setup-waa were starting windowsarena/winarena:latest which uses the old dockurr/windows v0.00 that cannot download the ISO, causing "ISO file not found" error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three bugs prevented pool-run from working: 1. WAA probe used 172.30.0.2 (QEMU guest IP) but Docker port-forwards to localhost — pool-wait timed out every time. Changed to localhost in pool.py and vm_monitor.py. 2. dockurr/windows base image doesn't configure QMP (QEMU Machine Protocol). WAA client needs QMP on port 7200 for VM status. Added ARGUMENTS env var to inject -qmp flag into QEMU startup. 3. Config defaults had Standard_D2_v3 (8GB, OOMs) and old windowsarena/winarena image. Fixed to D8ds_v5 and waa-auto. Also adds: - pool-auto command: single oa-vm pool-auto --workers N --tasks M chains create → wait → run - /evaluate endpoint injection in waa_deploy Dockerfile - Handle WAA server wrapping 404 in 500 responses (live.py) - openai dependency for API agents Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tion Replace fragile streaming SSH with docker exec -d (detached) for starting benchmarks. Logs stream via tail -f --pid which auto-exits when the benchmark finishes. On SSH drop, reconnects and resumes. Also adds 120s timeout to OpenAI API calls to prevent infinite hangs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

WAA's run.py ignores --tasks and runs all 154 tasks based on worker_id/num_workers. Fix by creating a subset test JSON with only the requested number of tasks and passing it via --test_all_meta_path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a standalone evaluate server (port 5050) that runs inside the WAA Docker container and has direct access to WAA evaluator modules. This avoids needing to patch the WAA Flask server's /evaluate endpoint. - Add evaluate_server.py and start_with_evaluate.sh - Add evaluate_url config to WAALiveConfig - Set up socat proxy (5051→5050) for Docker bridge networking - Add SSH tunnel for evaluate port - Simplify Dockerfile Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ments Instrumentation (captures richer data per step): - Propagate agent logs (LLM response, parse strategy, demo info, loop detection, memory) from ApiAgent to execution trace - Add per-step timing (agent_think_ms, env_execute_ms) - Capture token counts from OpenAI/Anthropic API responses Viewer enhancements (viewer.py): - Agent Thinking panel showing LLM response, memory, parse strategy - Action timeline bar color-coded by action type - Click heatmap overlay showing click frequency hotspots - Click marker using raw pixel coords for correct positioning Comparison viewer (new): - comparison_viewer.py generates side-by-side HTML comparisons - Synchronized step slider, click markers, action diffs - First-divergence detection, action type distribution charts - CLI 'compare' command for generating comparisons - Demo prompts and initial eval results for 3 WAA tasks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

_parse_computer_action() only handled click, type, press, hotkey, and scroll. Any other action (double_click, right_click, drag) fell through to the default return of type="done", which prematurely terminated the task. This caused the demo-conditioned notepad eval to stop after 1 step when the agent correctly issued computer.double_click() to open Notepad. Also add a warning log when an unrecognized action falls through, and update viewer regexes to handle double_click/right_click coordinates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…dcoded config WAALiveConfig defaulted to 1920x1200 but actual VM screen is 1280x720. This caused stored action.x/y to be normalized against the wrong resolution. Now detects real dimensions from the screenshot via PIL, uses them for viewport, denormalization, window_rect, and drag coordinates. Viewers use a divergence check for backward compatibility with old data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ZS vs demo-conditioned on 3 WAA tasks (GPT-5.1). DC agent signals completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps) while ZS hits max steps on all 3. Includes Playwright screenshots of comparison viewers and step-by-step screenshots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace inline 25-line Dockerfile in pool.py with SCP of waa_deploy/ build context. This eliminates drift between the inline and full Dockerfile, and ensures evaluate_server.py + Flask are included in the container image. Adds evaluate server health check during pool-wait. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

WAA evaluator getters (get_vm_file, get_cloud_file) expect env.cache_dir for downloading/caching files during evaluation. Without it, the compare_text_file metric fails with AttributeError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

WAA tasks use a 'config' array with preconditions (file downloads, app launches, sleeps) that must run before the agent starts. Previously _run_task_setup() looked for non-existent 'setup'/'init' keys, so task preconditions were never executed — causing Archive and other tasks with file dependencies to always score 0. - Add /setup endpoint to evaluate_server.py with 11 handlers mirroring WAA's SetupController (download, launch, sleep, execute, open, etc.) - Add requests-toolbelt to Dockerfile for multipart file uploads - Rewrite _run_task_setup() in live.py to POST config array to evaluate server's /setup endpoint - Increase reset delay from 1s to 5s to match WAA defaults Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

New `eval-suite` CLI command that automates the full WAA evaluation cycle: pool-create → pool-wait → SSH tunnel → run task×condition matrix → comparison summary → pool-cleanup. Replaces ~20 manual commands with a single invocation. Features: - Auto-creates Azure VM pool and waits for WAA readiness - Builds eval matrix: ZS for all tasks, DC for tasks with matching demos - Runs evals sequentially, prints comparison table at end - SSH tunnels managed automatically via SSHTunnelManager - Supports --no-pool-create/--no-pool-cleanup for existing VMs - Also adds anthropic as a direct dependency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Kill OneDrive notifications during environment reset (dominated a11y tree) - Loop detector: don't substitute Escape for hotkey loops (was destroying Save As dialogs in near-successful DC Notepad runs) - Loop detector: progressive directional offsets instead of fixed +50px - A11y tree: filter notification noise + increase truncation limit to 8000 - Demo discovery: prefer .txt (natural language) over .json (normalized coords) - Pool-wait timeout: increase default from 40 to 50 minutes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove _filter_a11y_noise and _A11Y_NOISE_PATTERNS — the a11y data from the WAA /accessibility endpoint is real UIA XML, not server logs. Pass it through as-is instead of trying to heuristically filter notification noise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ing mode Implement Qwen3VLAgent for local inference using Qwen3-VL-8B-Instruct. Supports [0,1000] coordinate normalization, full action space (click, type, press, scroll, drag, wait, finished), optional <think> blocks, and demo-conditioned inference. Register qwen3vl in all CLI commands (mock, run, live, eval-suite) with --model-path and --use-thinking args. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move system prompt to system role message in _run_inference() instead of cramming it into the user turn. _build_prompt() now returns only the user turn text (instruction + history + output instruction), matching the training data format produced by convert_demos.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keep evaluate server and socat proxy from feature branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implements ClaudeComputerUseAgent using Anthropic's native computer_use tool (computer_20251124 beta). Key features: - Structured tool_use/tool_result protocol (no regex parsing) - Multi-turn conversation maintained across steps - Internal loop for screenshot/wait actions: when Claude requests a screenshot, the agent sends the current screen back and calls the API again, instead of returning "done" to the runner (this was causing premature episode termination after 1 step) - Demo injection for demo-conditioned inference - Coordinate normalization (pixel → [0,1]) Also includes: - 28 unit tests for all action types, conversation management, demo injection, screenshot encoding, and edge cases - VM pool optimization design doc (pre-baked image, deallocate/resume, Windows disk persistence, ACR integration) - Hybrid agent architecture design doc (Track 1: Claude CU, Track 2: Qwen3-VL) - Cleanup: remove .swp files, cost_report.json, update .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Claude Computer Use (Sonnet 4.6) achieves 100% success on all 3 WAA tasks in both zero-shot and demo-conditioned modes after the screenshot/wait internal retry fix (commit 0b185eb). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…cycle Phase 1 of VM pool optimization: stop compute billing without destroying VMs. Deallocated VMs keep their disks (~$0.25/day vs $0.38/hr running). Resume takes ~5 min vs ~42 min for full pool-create. New commands: - `oa-vm pool-pause` — deallocate all pool VMs - `oa-vm pool-resume` — start VMs, wait for WAA readiness New AzureVMManager methods: deallocate_vm(), start_vm() (SDK + CLI fallback) New PoolManager methods: pause(), resume() Updated resource_tracker for paused pool cost awareness. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…commands Extend record_waa_demos.py with three new fire subcommands: - record-waa: interactive recording via WAA API + VNC with step-by-step screenshot capture, redo support, and prefix-matched task IDs - annotate: VLM annotation of recorded before/after screenshots using the same prompt templates and provider abstraction from openadapt-ml - eval: delegates to eval-suite with --demo-dir for demo-conditioned runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…mprovements - Add image-create/image-list/image-delete CLI commands for Azure Managed Images - Support --image flag on pool-create to skip Docker setup (golden images) - Support --use-acr flag to pull waa-auto from ACR instead of building on VM - Add ACR config settings (acr_name, acr_login_server) - Fix WAA storage path: /home/azureuser/waa-storage instead of /mnt - Add auto-pause timer tracking (auto_pause_at, auto_pause_hours on VMPool) - Add stale pool warnings (7/14 day thresholds) in pool-status and resource tracker - Show accumulated idle cost in pool-status Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…dling, exit code - Fix drag actions mapped as type="click" instead of type="drag" in ApiAgent - Add raise_for_status() to all screenshot requests in record-waa via helper - Propagate eval-suite subprocess exit code in cmd_eval_dc Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds GitHub Actions workflow that runs pytest on push to main and on PRs. Excludes tests requiring openadapt-ml (not installed in CI) and tests depending on missing fixture files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abrichr and others added 13 commits February 18, 2026 18:52

fix(recording): replace busy-wait loop with time.sleep

639f005

The `while True: pass` loop burned an entire CPU core during recording. Replace with `time.sleep(0.5)` to yield CPU while waiting for Ctrl+C. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add wait_for_ready() and match CLI recording loop pattern

428fd9c

- Call recorder.wait_for_ready() before entering the wait loop - Use recorder.is_recording check and 1s sleep to match CLI behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: update stop instructions and clarify wormhole send flow

a155e48

- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence) - Clarify wormhole send instructions (each send blocks until received) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abrichr changed the title ~~fix(pool): fix WAA pool automation end-to-end~~ feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation Feb 22, 2026

abrichr and others added 16 commits February 22, 2026 01:30

fix(pool): resolve merge conflict in WAA startup script

2fc8546

Keep evaluate server and socat proxy from feature branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: update beads local state

01395ea

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abrichr and others added 2 commits February 24, 2026 01:31

ci: add test workflow for PR checks

3034a2f

Adds GitHub Actions workflow that runs pytest on push to main and on PRs. Excludes tests requiring openadapt-ml (not installed in CI) and tests depending on missing fixture files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(ci): install dev extras for pytest in test workflow

d2ea481

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abrichr merged commit 19a11ee into main Feb 24, 2026
1 check passed

abrichr mentioned this pull request Feb 24, 2026

fix(recording): replace busy-wait loop with time.sleep #32

Closed

1 task

abrichr deleted the fix/pool-automation branch February 28, 2026 14:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation#35

feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation#35
abrichr merged 31 commits into
mainfrom
fix/pool-automation

abrichr commented Feb 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abrichr commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Infrastructure and Pool Automation

Agent Fixes (6 targeted reliability improvements)

Eval-Suite CLI Command

Visualization and Instrumentation

Evaluation Results

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

abrichr commented Feb 19, 2026 •

edited

Loading