
feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation #35

Merged
abrichr merged 31 commits into main from fix/pool-automation
Feb 24, 2026

Conversation

@abrichr
Member

@abrichr abrichr commented Feb 19, 2026

Summary

End-to-end improvements to the WAA evaluation pipeline, from infrastructure fixes through instrumentation to automated demo-conditioned evaluation.

Infrastructure and Pool Automation

  • Fix WAA probe IP, add QMP support, add pool-auto command
  • Use waa-auto Docker image instead of broken windowsarena/winarena
  • Replace fragile streaming SSH with docker exec + tail
  • Add dedicated evaluate server with socat proxy
  • Add 120s timeout to OpenAI API calls

Agent Fixes (7 targeted reliability improvements)

  • Handle double_click, right_click, and drag in action parser
  • Fix coordinate normalization: auto-detect screen size from screenshot
  • Update system prompt to reference screenshot dimensions
  • Kill OneDrive/notification overlays before each task
  • Add loop detector improvements
  • Filter a11y noise from server console logs in prompts
  • Demo format preference: use JSON annotated demos when available

Eval-Suite CLI Command

  • New eval-suite command automates full evaluation cycle
  • Supports --no-pool-create / --no-pool-cleanup for existing VMs
  • Runs zero-shot and demo-conditioned variants automatically

Visualization and Instrumentation

  • Add agent instrumentation per step
  • Add comparison viewer for side-by-side ZS vs DC replay
  • Enhance viewer with Agent Thinking panel, heatmap overlay

Evaluation Results

  • Two full eval rounds (GPT-5.1): 2 rounds x 3 tasks x 2 conditions = 12 runs
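  • DC agent shows more purposeful behavior: it signals completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps), while ZS hits max steps on all 3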
  • All WAA scores 0.00 -- architectural mismatch identified (raw coords vs SoM element IDs)

Test plan

  • 216 tests pass
  • Mock eval passes
  • Pool create/wait/eval-suite/cleanup end-to-end on Azure (2 full runs)
  • 12 live WAA evals completed
  • Viewers render correctly
  • eval-suite --help shows all options

abrichr and others added 13 commits February 18, 2026 18:52
The `while True: pass` loop burned an entire CPU core during recording.
Replace with `time.sleep(0.5)` to yield CPU while waiting for Ctrl+C.
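
A minimal sketch of the sleep-based wait described above. The function name, the `stop_event` parameter, and the returned poll count are illustrative assumptions, not the script's actual code.

```python
import threading
import time


def wait_until_stopped(stop_event: threading.Event, poll_interval: float = 0.5) -> int:
    """Sleep in short intervals until stop_event is set or Ctrl+C arrives.

    Returns the number of sleeps performed, which is handy for testing.
    """
    polls = 0
    try:
        while not stop_event.is_set():
            time.sleep(poll_interval)  # yields the CPU instead of spinning
            polls += 1
    except KeyboardInterrupt:
        pass  # Ctrl+C ends the wait; the caller then stops the recorder
    return polls
```

Unlike `while True: pass`, the thread is descheduled during each sleep, so the wait costs almost no CPU.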

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Call recorder.wait_for_ready() before entering the wait loop
- Use recorder.is_recording check and 1s sleep to match CLI behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The third WAA task requires .docx files in Documents. The script now
creates empty report.docx, meeting_notes.docx, and proposal.docx
before recording that task, and cleans up any Archive folder from
previous runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence)
- Clarify wormhole send instructions (each send blocks until received)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DOCKER_SETUP_SCRIPT builds waa-auto:latest (based on dockurr/windows:latest
which can auto-download Windows ISO) but WAA_START_SCRIPT and setup-waa were
starting windowsarena/winarena:latest which uses the old dockurr/windows v0.00
that cannot download the ISO, causing "ISO file not found" error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs prevented pool-run from working:

1. WAA probe used 172.30.0.2 (QEMU guest IP) but Docker port-forwards
   to localhost — pool-wait timed out every time. Changed to localhost
   in pool.py and vm_monitor.py.

2. dockurr/windows base image doesn't configure QMP (QEMU Machine
   Protocol). WAA client needs QMP on port 7200 for VM status. Added
   ARGUMENTS env var to inject -qmp flag into QEMU startup.

3. Config defaults had Standard_D2_v3 (8GB, OOMs) and old
   windowsarena/winarena image. Fixed to D8ds_v5 and waa-auto.

Also adds:
- pool-auto command: a single `oa-vm pool-auto --workers N --tasks M`
  chains create → wait → run
- /evaluate endpoint injection in waa_deploy Dockerfile
- Handle WAA server wrapping 404 in 500 responses (live.py)
- openai dependency for API agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Replace fragile streaming SSH with docker exec -d (detached) for
starting benchmarks. Logs stream via tail -f --pid which auto-exits
when the benchmark finishes. On SSH drop, reconnects and resumes.
Also adds 120s timeout to OpenAI API calls to prevent infinite hangs.
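
The detached-start-plus-tail pattern could be assembled roughly as below. This is a sketch under assumptions: the helper names, the log path, and the way the benchmark PID is obtained are hypothetical, and it assumes the log is tailed inside the container with GNU `tail`. Only `docker exec -d` and `tail -f --pid` come from the commit message.

```python
import shlex


def start_command(container: str, benchmark_cmd: str, log_path: str) -> list[str]:
    """Build a detached `docker exec` that writes benchmark output to a log file."""
    inner = f"{benchmark_cmd} > {shlex.quote(log_path)} 2>&1"
    return ["docker", "exec", "-d", container, "sh", "-c", inner]


def follow_command(container: str, log_path: str, pid: int) -> list[str]:
    """Build a `tail -f --pid` that exits once the benchmark process is gone."""
    return ["docker", "exec", container, "tail", "-f", f"--pid={pid}", log_path]


# Hypothetical container name, command, log path, and PID.
start = start_command("waa", "python run.py", "/tmp/benchmark.log")
follow = follow_command("waa", "/tmp/benchmark.log", 1234)
```

Because the benchmark runs detached, a dropped SSH session only interrupts the `tail`; re-running the follow command resumes the log stream.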

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA's run.py ignores --tasks and runs all 154 tasks based on
worker_id/num_workers. Fix by creating a subset test JSON with
only the requested number of tasks and passing it via
--test_all_meta_path.
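
A sketch of the subset-file approach. It assumes the metadata JSON maps each domain to a list of task IDs; the function name, output filename, and example IDs are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_task_subset(all_meta: dict[str, list[str]], num_tasks: int, out_dir: str) -> Path:
    """Write a JSON file containing only the first num_tasks task IDs."""
    subset: dict[str, list[str]] = {}
    remaining = num_tasks
    for domain, task_ids in all_meta.items():
        if remaining <= 0:
            break
        chosen = task_ids[:remaining]
        subset[domain] = chosen
        remaining -= len(chosen)
    out_path = Path(out_dir) / f"test_subset_{num_tasks}.json"
    out_path.write_text(json.dumps(subset, indent=2))
    return out_path


# Hypothetical metadata; real WAA task IDs differ.
example_meta = {"notepad": ["task-a", "task-b"], "settings": ["task-c", "task-d"]}
with tempfile.TemporaryDirectory() as tmp:
    subset_path = write_task_subset(example_meta, 3, tmp)
    subset = json.loads(subset_path.read_text())
```

The resulting path is what would be passed via `--test_all_meta_path`.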

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a standalone evaluate server (port 5050) that runs inside the WAA
Docker container and has direct access to WAA evaluator modules. This
avoids needing to patch the WAA Flask server's /evaluate endpoint.

- Add evaluate_server.py and start_with_evaluate.sh
- Add evaluate_url config to WAALiveConfig
- Set up socat proxy (5051→5050) for Docker bridge networking
- Add SSH tunnel for evaluate port
- Simplify Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ments

Instrumentation (captures richer data per step):
- Propagate agent logs (LLM response, parse strategy, demo info,
  loop detection, memory) from ApiAgent to execution trace
- Add per-step timing (agent_think_ms, env_execute_ms)
- Capture token counts from OpenAI/Anthropic API responses
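
Per-step timing of this kind can be captured with `time.perf_counter()`. The sketch below is illustrative: only the `agent_think_ms` and `env_execute_ms` field names come from the commit, while `timed_step` and its callables are assumptions.

```python
import time
from typing import Any, Callable


def timed_step(
    agent_act: Callable[[Any], Any],
    env_step: Callable[[Any], Any],
    observation: Any,
) -> dict:
    """Run one agent/environment step and record wall-clock durations in ms."""
    t0 = time.perf_counter()
    action = agent_act(observation)
    t1 = time.perf_counter()
    next_observation = env_step(action)
    t2 = time.perf_counter()
    return {
        "action": action,
        "observation": next_observation,
        "agent_think_ms": (t1 - t0) * 1000.0,
        "env_execute_ms": (t2 - t1) * 1000.0,
    }


# Stand-in agent and environment for illustration.
record = timed_step(lambda obs: {"type": "click"}, lambda act: "next-screen", "screen")
```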

Viewer enhancements (viewer.py):
- Agent Thinking panel showing LLM response, memory, parse strategy
- Action timeline bar color-coded by action type
- Click heatmap overlay showing click frequency hotspots
- Click marker using raw pixel coords for correct positioning

Comparison viewer (new):
- comparison_viewer.py generates side-by-side HTML comparisons
- Synchronized step slider, click markers, action diffs
- First-divergence detection, action type distribution charts
- CLI 'compare' command for generating comparisons
- Demo prompts and initial eval results for 3 WAA tasks
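
First-divergence detection between a ZS and a DC run might look like the sketch below. Representing each step as a comparable action string is an assumption; the viewer's actual comparison logic is not shown in this PR text.

```python
from typing import Optional, Sequence


def first_divergence(zs_actions: Sequence[str], dc_actions: Sequence[str]) -> Optional[int]:
    """Return the first step index where the two action sequences differ.

    If one sequence is a strict prefix of the other, the divergence is the
    length of the shorter one. Returns None when the sequences are identical.
    """
    for i, (a, b) in enumerate(zip(zs_actions, dc_actions)):
        if a != b:
            return i
    if len(zs_actions) != len(dc_actions):
        return min(len(zs_actions), len(dc_actions))
    return None
```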

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_parse_computer_action() only handled click, type, press, hotkey, and
scroll. Any other action (double_click, right_click, drag) fell through
to the default return of type="done", which prematurely terminated the
task. This caused the demo-conditioned notepad eval to stop after 1 step
when the agent correctly issued computer.double_click() to open Notepad.

Also add a warning log when an unrecognized action falls through,
and update viewer regexes to handle double_click/right_click coordinates.
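
The fix amounts to giving every supported action an explicit branch and logging whatever still falls through. A simplified sketch follows; the argument names and returned dict layout are assumptions, and the real `_parse_computer_action()` parses LLM output rather than taking a pre-split name and arguments.

```python
import logging

logger = logging.getLogger(__name__)

POINTER_ACTIONS = {"click", "double_click", "right_click"}


def parse_computer_action(name: str, args: dict) -> dict:
    """Map a parsed action name and its arguments to an action dict."""
    if name in POINTER_ACTIONS:
        return {"type": name, "x": args["x"], "y": args["y"]}
    if name == "drag":
        return {
            "type": "drag",
            "x": args["start_x"],
            "y": args["start_y"],
            "end_x": args["end_x"],
            "end_y": args["end_y"],
        }
    if name == "type":
        return {"type": "type", "text": args["text"]}
    if name in {"press", "hotkey"}:
        return {"type": name, "keys": args["keys"]}
    if name == "scroll":
        return {"type": "scroll", "amount": args["amount"]}
    if name == "done":
        return {"type": "done"}
    # Unknown action: warn loudly instead of silently ending the task.
    logger.warning("Unrecognized action %r; treating as done", name)
    return {"type": "done", "unrecognized": name}
```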

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dcoded config

WAALiveConfig defaulted to 1920x1200 but actual VM screen is 1280x720.
This caused stored action.x/y to be normalized against the wrong resolution.
Now detects real dimensions from the screenshot via PIL, uses them for
viewport, denormalization, window_rect, and drag coordinates. Viewers use
a divergence check for backward compatibility with old data.
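
To illustrate why the detected size matters, the sketch below reads the width and height from a PNG screenshot and denormalizes against them. The commit reads the size via PIL; this example swaps in a stdlib read of the PNG header so it runs without Pillow. The function names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(png_bytes: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk of a PNG byte string."""
    if png_bytes[:8] != PNG_SIGNATURE or png_bytes[12:16] != b"IHDR":
        raise ValueError("not a PNG screenshot")
    width, height = struct.unpack(">II", png_bytes[16:24])
    return width, height


def denormalize(x_norm: float, y_norm: float, size: tuple[int, int]) -> tuple[int, int]:
    """Convert normalized [0, 1] coordinates to pixels for the detected size."""
    width, height = size
    return round(x_norm * width), round(y_norm * height)


# Minimal PNG header for a 1280x720 image (header only, not a full image).
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 1280, 720)
size = png_size(header)
```

With the old 1920x1200 default, a normalized (0.5, 0.5) maps to pixel (960, 600); against the real 1280x720 screen it maps to (640, 360).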

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ZS vs demo-conditioned on 3 WAA tasks (GPT-5.1). DC agent signals
completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps) while
ZS hits max steps on all 3. Includes Playwright screenshots of
comparison viewers and step-by-step screenshots.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr changed the title from "fix(pool): fix WAA pool automation end-to-end" to "feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation" on Feb 22, 2026
abrichr and others added 16 commits February 22, 2026 01:30
Replace inline 25-line Dockerfile in pool.py with SCP of waa_deploy/
build context. This eliminates drift between the inline and full
Dockerfile, and ensures evaluate_server.py + Flask are included in the
container image. Adds evaluate server health check during pool-wait.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA evaluator getters (get_vm_file, get_cloud_file) expect env.cache_dir
for downloading/caching files during evaluation. Without it, the
compare_text_file metric fails with AttributeError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA tasks use a 'config' array with preconditions (file downloads, app
launches, sleeps) that must run before the agent starts. Previously
_run_task_setup() looked for non-existent 'setup'/'init' keys, so task
preconditions were never executed — causing Archive and other tasks with
file dependencies to always score 0.

- Add /setup endpoint to evaluate_server.py with 11 handlers mirroring
  WAA's SetupController (download, launch, sleep, execute, open, etc.)
- Add requests-toolbelt to Dockerfile for multipart file uploads
- Rewrite _run_task_setup() in live.py to POST config array to evaluate
  server's /setup endpoint
- Increase reset delay from 1s to 5s to match WAA defaults
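
A handler-dispatch sketch for the `config` array. It assumes each entry looks like `{"type": ..., "parameters": {...}}`; the registry, the two handlers, and the result format are hypothetical stand-ins for the 11 handlers in `evaluate_server.py`.

```python
import time
from typing import Callable

# Handler registry keyed by the entry's "type" field.
SETUP_HANDLERS: dict[str, Callable[[dict], str]] = {}


def setup_handler(name: str):
    """Decorator that registers a function as the handler for one step type."""
    def register(fn):
        SETUP_HANDLERS[name] = fn
        return fn
    return register


@setup_handler("sleep")
def _sleep(params: dict) -> str:
    time.sleep(float(params.get("seconds", 0)))
    return "slept"


@setup_handler("launch")
def _launch(params: dict) -> str:
    # A real handler would start the process in the VM; this one only reports.
    return f"would launch {params['command']}"


def run_setup(config: list[dict]) -> list[dict]:
    """Run each precondition in order and collect per-entry results."""
    results = []
    for entry in config:
        step_type = entry.get("type")
        handler = SETUP_HANDLERS.get(step_type)
        if handler is None:
            results.append({"type": step_type, "ok": False, "error": "no handler"})
            continue
        detail = handler(entry.get("parameters", {}))
        results.append({"type": step_type, "ok": True, "detail": detail})
    return results


results = run_setup([
    {"type": "sleep", "parameters": {"seconds": 0}},
    {"type": "launch", "parameters": {"command": "notepad.exe"}},
    {"type": "unknown_step"},
])
```

Reporting unknown step types explicitly, rather than skipping them, makes a silently unexecuted precondition visible in the results.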

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New `eval-suite` CLI command that automates the full WAA evaluation
cycle: pool-create → pool-wait → SSH tunnel → run task×condition matrix
→ comparison summary → pool-cleanup. Replaces ~20 manual commands with
a single invocation.

Features:
- Auto-creates Azure VM pool and waits for WAA readiness
- Builds eval matrix: ZS for all tasks, DC for tasks with matching demos
- Runs evals sequentially, prints comparison table at end
- SSH tunnels managed automatically via SSHTunnelManager
- Supports --no-pool-create/--no-pool-cleanup for existing VMs
- Also adds anthropic as a direct dependency
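
The task×condition matrix rule (ZS for all tasks, DC only where a demo matches) can be sketched as follows. The function name, the `"zs"`/`"dc"` labels, and the example task IDs are assumptions.

```python
def build_eval_matrix(task_ids: list[str], demo_task_ids: set[str]) -> list[tuple[str, str]]:
    """Return (task_id, condition) pairs: ZS for every task, DC where a demo exists."""
    matrix = [(task_id, "zs") for task_id in task_ids]
    matrix += [(task_id, "dc") for task_id in task_ids if task_id in demo_task_ids]
    return matrix


# Hypothetical task IDs; "archive" has no matching demo here.
matrix = build_eval_matrix(["notepad", "settings", "archive"], {"notepad", "settings"})
```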

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Kill OneDrive notifications during environment reset (dominated a11y tree)
- Loop detector: don't substitute Escape for hotkey loops (was destroying
  Save As dialogs in near-successful DC Notepad runs)
- Loop detector: progressive directional offsets instead of fixed +50px
- A11y tree: filter notification noise + increase truncation limit to 8000
- Demo discovery: prefer .txt (natural language) over .json (normalized coords)
- Pool-wait timeout: increase default from 40 to 50 minutes
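
One way to read "progressive directional offsets" is that each repeated click is nudged in a rotating direction, with the distance growing after every full rotation. The sketch below is only that interpretation; the step size, direction order, and function name are assumptions, not the loop detector's actual values.

```python
DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # right, down, left, up


def nudged_click(x: int, y: int, repeat_count: int, step_px: int = 20) -> tuple[int, int]:
    """Offset a repeated click; repeat_count starts at 1 for the first repeat."""
    dx, dy = DIRECTIONS[(repeat_count - 1) % len(DIRECTIONS)]
    distance = step_px * ((repeat_count - 1) // len(DIRECTIONS) + 1)
    return x + dx * distance, y + dy * distance
```

Compared with a fixed +50px shift, this probes around the original target instead of moving steadily away from it.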

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove _filter_a11y_noise and _A11Y_NOISE_PATTERNS — the a11y data from
the WAA /accessibility endpoint is real UIA XML, not server logs. Pass it
through as-is instead of trying to heuristically filter notification noise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ing mode

Implement Qwen3VLAgent for local inference using Qwen3-VL-8B-Instruct.
Supports [0,1000] coordinate normalization, full action space (click,
type, press, scroll, drag, wait, finished), optional <think> blocks,
and demo-conditioned inference. Register qwen3vl in all CLI commands
(mock, run, live, eval-suite) with --model-path and --use-thinking args.
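
Converting a point on the [0,1000] grid to screen pixels is a simple rescale. The sketch below is illustrative; the function name and the clamping to the last valid pixel are assumptions.

```python
def qwen_to_pixels(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Map a point on a [0, 1000] grid to pixel coordinates, clamped to the screen."""
    px = round(x / 1000 * width)
    py = round(y / 1000 * height)
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)
```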

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move system prompt to system role message in _run_inference() instead of
cramming it into the user turn. _build_prompt() now returns only the user
turn text (instruction + history + output instruction), matching the
training data format produced by convert_demos.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep evaluate server and socat proxy from feature branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements ClaudeComputerUseAgent using Anthropic's native computer_use
tool (computer_20251124 beta). Key features:
- Structured tool_use/tool_result protocol (no regex parsing)
- Multi-turn conversation maintained across steps
- Internal loop for screenshot/wait actions: when Claude requests a
  screenshot, the agent sends the current screen back and calls the API
  again, instead of returning "done" to the runner (this was causing
  premature episode termination after 1 step)
- Demo injection for demo-conditioned inference
- Coordinate normalization (pixel → [0,1])
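
The internal loop for screenshot/wait actions can be sketched without the API. Here `call_model` and `take_screenshot` are hypothetical callables standing in for the Anthropic request and the screen capture, and the turn limit with its "done" fallback is an assumption.

```python
from typing import Callable

INTERNAL_ACTIONS = {"screenshot", "wait"}


def next_env_action(
    call_model: Callable[[bytes], dict],
    take_screenshot: Callable[[], bytes],
    max_internal_turns: int = 5,
) -> dict:
    """Call the model until it returns an action the environment can execute."""
    for _ in range(max_internal_turns):
        action = call_model(take_screenshot())
        if action["type"] not in INTERNAL_ACTIONS:
            return action
        # screenshot/wait: loop again with a fresh screen instead of returning.
    return {"type": "done", "reason": "internal turn limit reached"}


# Scripted stand-in for the API: asks for a screenshot first, then clicks.
scripted = iter([{"type": "screenshot"}, {"type": "click", "x": 0.5, "y": 0.5}])
action = next_env_action(lambda screen: next(scripted), lambda: b"fake-png")
```

The runner only ever sees the click, so a screenshot request no longer ends the episode.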

Also includes:
- 28 unit tests for all action types, conversation management, demo
  injection, screenshot encoding, and edge cases
- VM pool optimization design doc (pre-baked image, deallocate/resume,
  Windows disk persistence, ACR integration)
- Hybrid agent architecture design doc (Track 1: Claude CU, Track 2: Qwen3-VL)
- Cleanup: remove .swp files, cost_report.json, update .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Computer Use (Sonnet 4.6) achieves 100% success on all 3 WAA tasks
in both zero-shot and demo-conditioned modes after the screenshot/wait
internal retry fix (commit 0b185eb).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cycle

Phase 1 of VM pool optimization: stop compute billing without destroying
VMs. Deallocated VMs keep their disks (~$0.25/day vs $0.38/hr running).
Resume takes ~5 min vs ~42 min for full pool-create.
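
As a rough worked example using only the two rates quoted above (actual Azure pricing varies by region and VM size):

```python
RUNNING_USD_PER_HOUR = 0.38  # from the commit message
PAUSED_USD_PER_DAY = 0.25    # from the commit message (disk only)


def daily_cost(running: bool) -> float:
    """Approximate cost of one day in the given state, in USD."""
    return round(RUNNING_USD_PER_HOUR * 24, 2) if running else PAUSED_USD_PER_DAY


savings_per_day = round(daily_cost(True) - daily_cost(False), 2)
```

At these rates a full day running costs about $9.12 against $0.25 deallocated, a difference of roughly $8.87 per day.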

New commands:
- `oa-vm pool-pause` — deallocate all pool VMs
- `oa-vm pool-resume` — start VMs, wait for WAA readiness

New AzureVMManager methods: deallocate_vm(), start_vm() (SDK + CLI fallback)
New PoolManager methods: pause(), resume()
Updated resource_tracker for paused pool cost awareness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…commands

Extend record_waa_demos.py with three new fire subcommands:
- record-waa: interactive recording via WAA API + VNC with step-by-step
  screenshot capture, redo support, and prefix-matched task IDs
- annotate: VLM annotation of recorded before/after screenshots using
  the same prompt templates and provider abstraction from openadapt-ml
- eval: delegates to eval-suite with --demo-dir for demo-conditioned runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mprovements

- Add image-create/image-list/image-delete CLI commands for Azure Managed Images
- Support --image flag on pool-create to skip Docker setup (golden images)
- Support --use-acr flag to pull waa-auto from ACR instead of building on VM
- Add ACR config settings (acr_name, acr_login_server)
- Fix WAA storage path: /home/azureuser/waa-storage instead of /mnt
- Add auto-pause timer tracking (auto_pause_at, auto_pause_hours on VMPool)
- Add stale pool warnings (7/14 day thresholds) in pool-status and resource tracker
- Show accumulated idle cost in pool-status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dling, exit code

- Fix drag actions mapped as type="click" instead of type="drag" in ApiAgent
- Add raise_for_status() to all screenshot requests in record-waa via helper
- Propagate eval-suite subprocess exit code in cmd_eval_dc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abrichr and others added 2 commits February 24, 2026 01:31
Adds GitHub Actions workflow that runs pytest on push to main and on PRs.
Excludes tests requiring openadapt-ml (not installed in CI) and tests
depending on missing fixture files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr merged commit 19a11ee into main Feb 24, 2026
1 check passed
@abrichr abrichr deleted the fix/pool-automation branch February 28, 2026 14:53

1 participant