feat: WAA eval pipeline fixes, instrumentation, and demo-conditioned evaluation #35
Merged
The `while True: pass` loop burned an entire CPU core during recording. Replace with `time.sleep(0.5)` to yield CPU while waiting for Ctrl+C. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
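The busy-wait fix described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `wait_for_interrupt` and the injectable `sleep` parameter are hypothetical and exist only to make the sketch testable.

```python
import time


def wait_for_interrupt(poll_seconds: float = 0.5, sleep=time.sleep) -> str:
    """Block until Ctrl+C, sleeping between wake-ups instead of spinning."""
    try:
        while True:
            # Sleeping yields the CPU; a bare `pass` here would busy-wait
            # and keep one core at 100% for the whole recording.
            sleep(poll_seconds)
    except KeyboardInterrupt:
        return "stopped"
```

A 0.5 s poll interval keeps Ctrl+C responsive while leaving CPU usage negligible.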
- Call recorder.wait_for_ready() before entering the wait loop
- Use recorder.is_recording check and 1s sleep to match CLI behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The third WAA task requires .docx files in Documents. The script now creates empty report.docx, meeting_notes.docx, and proposal.docx before recording that task, and cleans up any Archive folder from previous runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence)
- Clarify wormhole send instructions (each send blocks until received)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DOCKER_SETUP_SCRIPT builds waa-auto:latest (based on dockurr/windows:latest, which can auto-download the Windows ISO), but WAA_START_SCRIPT and setup-waa were starting windowsarena/winarena:latest, which uses the old dockurr/windows v0.00 that cannot download the ISO. This caused the "ISO file not found" error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs prevented pool-run from working:

1. WAA probe used 172.30.0.2 (QEMU guest IP) but Docker port-forwards to localhost, so pool-wait timed out every time. Changed to localhost in pool.py and vm_monitor.py.
2. dockurr/windows base image doesn't configure QMP (QEMU Machine Protocol). WAA client needs QMP on port 7200 for VM status. Added ARGUMENTS env var to inject -qmp flag into QEMU startup.
3. Config defaults had Standard_D2_v3 (8GB, OOMs) and old windowsarena/winarena image. Fixed to D8ds_v5 and waa-auto.

Also adds:

- pool-auto command: single `oa-vm pool-auto --workers N --tasks M` chains create → wait → run
- /evaluate endpoint injection in waa_deploy Dockerfile
- Handle WAA server wrapping 404 in 500 responses (live.py)
- openai dependency for API agents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion Replace fragile streaming SSH with docker exec -d (detached) for starting benchmarks. Logs stream via tail -f --pid which auto-exits when the benchmark finishes. On SSH drop, reconnects and resumes. Also adds 120s timeout to OpenAI API calls to prevent infinite hangs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA's run.py ignores --tasks and runs all 154 tasks based on worker_id/num_workers. Fix by creating a subset test JSON with only the requested number of tasks and passing it via --test_all_meta_path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
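The subset-file workaround might look roughly like the sketch below. It assumes the task metadata is a JSON object mapping a domain name to a list of task IDs; that layout, the function name `write_task_subset`, and the output filename `test_subset.json` are assumptions for illustration, not details confirmed by this PR.

```python
import json
from pathlib import Path


def write_task_subset(all_tasks: dict[str, list[str]], max_tasks: int, out_dir: str) -> Path:
    """Write a metadata file containing only the first `max_tasks` tasks."""
    subset: dict[str, list[str]] = {}
    remaining = max_tasks
    for domain, task_ids in all_tasks.items():
        if remaining <= 0:
            break
        chosen = task_ids[:remaining]  # take as many as still fit
        subset[domain] = chosen
        remaining -= len(chosen)
    out_path = Path(out_dir) / "test_subset.json"  # hypothetical filename
    out_path.write_text(json.dumps(subset, indent=2))
    return out_path
```

The resulting path is what would be handed to `--test_all_meta_path`, so run.py only ever sees the requested tasks.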
Add a standalone evaluate server (port 5050) that runs inside the WAA Docker container and has direct access to WAA evaluator modules. This avoids needing to patch the WAA Flask server's /evaluate endpoint.

- Add evaluate_server.py and start_with_evaluate.sh
- Add evaluate_url config to WAALiveConfig
- Set up socat proxy (5051→5050) for Docker bridge networking
- Add SSH tunnel for evaluate port
- Simplify Dockerfile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ments

Instrumentation (captures richer data per step):

- Propagate agent logs (LLM response, parse strategy, demo info, loop detection, memory) from ApiAgent to execution trace
- Add per-step timing (agent_think_ms, env_execute_ms)
- Capture token counts from OpenAI/Anthropic API responses

Viewer enhancements (viewer.py):

- Agent Thinking panel showing LLM response, memory, parse strategy
- Action timeline bar color-coded by action type
- Click heatmap overlay showing click frequency hotspots
- Click marker using raw pixel coords for correct positioning

Comparison viewer (new):

- comparison_viewer.py generates side-by-side HTML comparisons
- Synchronized step slider, click markers, action diffs
- First-divergence detection, action type distribution charts
- CLI 'compare' command for generating comparisons
- Demo prompts and initial eval results for 3 WAA tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_parse_computer_action() only handled click, type, press, hotkey, and scroll. Any other action (double_click, right_click, drag) fell through to the default return of type="done", which prematurely terminated the task. This caused the demo-conditioned notepad eval to stop after 1 step when the agent correctly issued computer.double_click() to open Notepad. Also add a warning log when an unrecognized action falls through, and update viewer regexes to handle double_click/right_click coordinates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
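The fall-through bug is easiest to see in a stripped-down parser. The sketch below is not the real `_parse_computer_action()`: the regex, the `parse_action` name, and the choice to return `type="unknown"` for unrecognized calls are illustrative assumptions. The point is that `done` is only emitted when the agent explicitly asks for it, and anything unrecognized is logged instead of silently ending the task.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Matches calls such as computer.double_click(100, 200)
_CALL_RE = re.compile(r"computer\.(\w+)\((.*)\)", re.DOTALL)
_KNOWN_ACTIONS = {
    "click", "double_click", "right_click", "drag",
    "type", "press", "hotkey", "scroll",
}


def parse_action(text: str) -> dict:
    """Parse one `computer.<name>(<args>)` call into an action dict."""
    match = _CALL_RE.search(text)
    if match is None:
        logger.warning("No computer.* call found in: %r", text)
        return {"type": "unknown", "raw": text}
    name, args = match.group(1), match.group(2).strip()
    if name == "done":
        return {"type": "done"}
    if name not in _KNOWN_ACTIONS:
        # Warn instead of defaulting to "done", which would end the episode.
        logger.warning("Unrecognized action %r", name)
        return {"type": "unknown", "raw": text}
    return {"type": name, "args": args}
```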
…dcoded config WAALiveConfig defaulted to 1920x1200 but actual VM screen is 1280x720. This caused stored action.x/y to be normalized against the wrong resolution. Now detects real dimensions from the screenshot via PIL, uses them for viewport, denormalization, window_rect, and drag coordinates. Viewers use a divergence check for backward compatibility with old data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
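As a rough illustration of why the detected size matters, the sketch below reads the real dimensions from a PNG screenshot and denormalizes a [0,1] coordinate against them. The commit uses PIL for detection; this sketch swaps in a direct read of the PNG IHDR header so it runs with the standard library alone, and the names `png_size` and `denormalize` are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the IHDR chunk of a PNG byte string."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are big-endian 32-bit integers at bytes 16..24.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def denormalize(x: float, y: float, size: tuple[int, int]) -> tuple[int, int]:
    """Map normalized [0,1] coordinates to pixels for the given screen size."""
    width, height = size
    return round(x * width), round(y * height)
```

With a hardcoded 1920x1200 config, the normalized point (0.5, 0.5) would denormalize to (960, 600), which is well off-center on a 1280x720 screen whose true center is (640, 360).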
Zero-shot (ZS) vs demo-conditioned (DC) on 3 WAA tasks (GPT-5.1). The DC agent signals completion on 2/3 tasks (Settings: 11 steps, Notepad: 8 steps) while ZS hits max steps on all 3. Includes Playwright screenshots of comparison viewers and step-by-step screenshots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline 25-line Dockerfile in pool.py with SCP of waa_deploy/ build context. This eliminates drift between the inline and full Dockerfile, and ensures evaluate_server.py + Flask are included in the container image. Adds evaluate server health check during pool-wait. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA evaluator getters (get_vm_file, get_cloud_file) expect env.cache_dir for downloading/caching files during evaluation. Without it, the compare_text_file metric fails with AttributeError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WAA tasks use a 'config' array with preconditions (file downloads, app launches, sleeps) that must run before the agent starts. Previously _run_task_setup() looked for non-existent 'setup'/'init' keys, so task preconditions were never executed, causing Archive and other tasks with file dependencies to always score 0.

- Add /setup endpoint to evaluate_server.py with 11 handlers mirroring WAA's SetupController (download, launch, sleep, execute, open, etc.)
- Add requests-toolbelt to Dockerfile for multipart file uploads
- Rewrite _run_task_setup() in live.py to POST config array to evaluate server's /setup endpoint
- Increase reset delay from 1s to 5s to match WAA defaults

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
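A handler-dispatch loop over the config array could be sketched like this. It assumes each entry has the shape `{"type": ..., "parameters": {...}}`; that shape, the `run_setup` name, and the handler signatures are assumptions for illustration rather than the actual evaluate_server.py code.

```python
from typing import Any, Callable

Handler = Callable[..., Any]


def run_setup(config: list[dict], handlers: dict[str, Handler]) -> list[str]:
    """Run each precondition in order and return the step types executed."""
    executed: list[str] = []
    for step in config:
        step_type = step["type"]
        handler = handlers.get(step_type)
        if handler is None:
            # Fail loudly: a silently skipped precondition leaves the task
            # in a state where it can only score 0.
            raise ValueError(f"no setup handler for {step_type!r}")
        handler(**step.get("parameters", {}))
        executed.append(step_type)
    return executed
```

Raising on an unknown step type is a deliberate choice in this sketch, since the original bug was precisely that preconditions were skipped without any error.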
New `eval-suite` CLI command that automates the full WAA evaluation cycle: pool-create → pool-wait → SSH tunnel → run task×condition matrix → comparison summary → pool-cleanup. Replaces ~20 manual commands with a single invocation.

Features:

- Auto-creates Azure VM pool and waits for WAA readiness
- Builds eval matrix: ZS for all tasks, DC for tasks with matching demos
- Runs evals sequentially, prints comparison table at end
- SSH tunnels managed automatically via SSHTunnelManager
- Supports --no-pool-create/--no-pool-cleanup for existing VMs
- Also adds anthropic as a direct dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
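The matrix rule (ZS for every task, DC only where a demo exists) can be expressed in a few lines. The function name `build_eval_matrix` and the `"zs"`/`"dc"` labels are illustrative assumptions, not the command's actual internals.

```python
def build_eval_matrix(task_ids: list[str], demo_task_ids: set[str]) -> list[tuple[str, str]]:
    """Return (task_id, condition) pairs: ZS for all tasks, DC where a demo exists."""
    matrix = [(task_id, "zs") for task_id in task_ids]
    matrix += [(task_id, "dc") for task_id in task_ids if task_id in demo_task_ids]
    return matrix
```

For three tasks of which two have demos, this yields five runs, executed one after another.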
- Kill OneDrive notifications during environment reset (dominated a11y tree)
- Loop detector: don't substitute Escape for hotkey loops (was destroying Save As dialogs in near-successful DC Notepad runs)
- Loop detector: progressive directional offsets instead of fixed +50px
- A11y tree: filter notification noise + increase truncation limit to 8000
- Demo discovery: prefer .txt (natural language) over .json (normalized coords)
- Pool-wait timeout: increase default from 40 to 50 minutes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
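One way to read "progressive directional offsets" is sketched below: each repeated click is nudged in a different direction, and the distance grows after every full cycle of directions. The direction order, the 20 px step, and the name `nudge_click` are all assumptions; the PR does not spell out the exact scheme.

```python
# Right, down, left, up (screen coordinates: y grows downward).
_DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def nudge_click(x: int, y: int, repeat: int, width: int, height: int, step: int = 20) -> tuple[int, int]:
    """Offset a repeated click; `repeat` is 1 for the first detected repeat."""
    dx, dy = _DIRECTIONS[(repeat - 1) % len(_DIRECTIONS)]
    # Distance grows by one step after each full cycle of directions.
    magnitude = step * ((repeat - 1) // len(_DIRECTIONS) + 1)
    new_x = min(max(x + dx * magnitude, 0), width - 1)  # clamp to screen
    new_y = min(max(y + dy * magnitude, 0), height - 1)
    return new_x, new_y
```

Compared with a fixed +50px shift, varying the direction avoids walking the click steadily off the target in one direction.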
Remove _filter_a11y_noise and _A11Y_NOISE_PATTERNS — the a11y data from the WAA /accessibility endpoint is real UIA XML, not server logs. Pass it through as-is instead of trying to heuristically filter notification noise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ing mode Implement Qwen3VLAgent for local inference using Qwen3-VL-8B-Instruct. Supports [0,1000] coordinate normalization, full action space (click, type, press, scroll, drag, wait, finished), optional <think> blocks, and demo-conditioned inference. Register qwen3vl in all CLI commands (mock, run, live, eval-suite) with --model-path and --use-thinking args. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
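The [0,1000] coordinate convention implies a conversion step before an action can be executed on the real screen. A minimal sketch, assuming simple linear scaling with clamping to the last valid pixel (the function name `qwen_to_pixels` is hypothetical):

```python
def qwen_to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert model coordinates in [0, 1000] to pixel coordinates."""
    px = round(x / 1000 * width)
    py = round(y / 1000 * height)
    # Clamp so that an output of 1000 maps to the last valid pixel.
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)
```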
Move system prompt to system role message in _run_inference() instead of cramming it into the user turn. _build_prompt() now returns only the user turn text (instruction + history + output instruction), matching the training data format produced by convert_demos.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep evaluate server and socat proxy from feature branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements ClaudeComputerUseAgent using Anthropic's native computer_use tool (computer_20251124 beta).

Key features:

- Structured tool_use/tool_result protocol (no regex parsing)
- Multi-turn conversation maintained across steps
- Internal loop for screenshot/wait actions: when Claude requests a screenshot, the agent sends the current screen back and calls the API again, instead of returning "done" to the runner (this was causing premature episode termination after 1 step)
- Demo injection for demo-conditioned inference
- Coordinate normalization (pixel → [0,1])

Also includes:

- 28 unit tests for all action types, conversation management, demo injection, screenshot encoding, and edge cases
- VM pool optimization design doc (pre-baked image, deallocate/resume, Windows disk persistence, ACR integration)
- Hybrid agent architecture design doc (Track 1: Claude CU, Track 2: Qwen3-VL)
- Cleanup: remove .swp files, cost_report.json, update .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
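The internal screenshot/wait loop can be sketched independently of the Anthropic SDK. Here `request_action` and `take_screenshot` are hypothetical callables standing in for the API call and the environment; the turn cap and the error action are also assumptions, added so the sketch cannot loop forever.

```python
from typing import Callable


def next_real_action(
    request_action: Callable[[bytes], dict],
    take_screenshot: Callable[[], bytes],
    max_internal_turns: int = 5,
) -> dict:
    """Keep answering screenshot/wait requests until the model acts."""
    observation = take_screenshot()
    for _ in range(max_internal_turns):
        action = request_action(observation)
        if action["type"] not in ("screenshot", "wait"):
            return action  # a real action for the runner to execute
        # Send a fresh screen back instead of reporting "done" to the runner.
        observation = take_screenshot()
    return {"type": "error", "reason": "too many screenshot/wait requests"}
```

Handling these requests inside the agent keeps the runner's step count tied to actions that actually change the environment.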
Claude Computer Use (Sonnet 4.6) achieves 100% success on all 3 WAA tasks in both zero-shot and demo-conditioned modes after the screenshot/wait internal retry fix (commit 0b185eb). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cycle

Phase 1 of VM pool optimization: stop compute billing without destroying VMs. Deallocated VMs keep their disks (~$0.25/day vs $0.38/hr running). Resume takes ~5 min vs ~42 min for full pool-create.

New commands:

- `oa-vm pool-pause`: deallocate all pool VMs
- `oa-vm pool-resume`: start VMs, wait for WAA readiness

New AzureVMManager methods: deallocate_vm(), start_vm() (SDK + CLI fallback)

New PoolManager methods: pause(), resume()

Updated resource_tracker for paused pool cost awareness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
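The quoted rates make the savings easy to estimate. The sketch below uses only the two approximate figures from the commit message and assumes they apply per VM; the function names are illustrative.

```python
RUNNING_USD_PER_HOUR = 0.38  # approximate compute rate quoted above
PAUSED_USD_PER_DAY = 0.25    # approximate disk-only rate quoted above


def idle_cost(hours: float, paused: bool) -> float:
    """Estimated cost of one idle VM over `hours`, paused or running."""
    if paused:
        return PAUSED_USD_PER_DAY * hours / 24
    return RUNNING_USD_PER_HOUR * hours


def pause_savings(hours: float) -> float:
    """Estimated amount saved by pausing instead of leaving the VM running."""
    return idle_cost(hours, paused=False) - idle_cost(hours, paused=True)
```

At these rates, one VM left idle for a day costs about $9.12 running versus $0.25 paused, so pausing saves roughly $8.87 per VM per day.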
…commands

Extend record_waa_demos.py with three new fire subcommands:

- record-waa: interactive recording via WAA API + VNC with step-by-step screenshot capture, redo support, and prefix-matched task IDs
- annotate: VLM annotation of recorded before/after screenshots using the same prompt templates and provider abstraction from openadapt-ml
- eval: delegates to eval-suite with --demo-dir for demo-conditioned runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
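Prefix matching of task IDs could work along the lines of the sketch below, which accepts a prefix only when it identifies exactly one task. The function name `resolve_task_id` and the error behavior are assumptions; the actual matching rules in record-waa may differ.

```python
def resolve_task_id(prefix: str, task_ids: list[str]) -> str:
    """Expand a task-ID prefix to the single full ID that starts with it."""
    matches = [task_id for task_id in task_ids if task_id.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"no task ID starts with {prefix!r}")
    raise ValueError(f"prefix {prefix!r} is ambiguous: {matches}")
```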
…mprovements

- Add image-create/image-list/image-delete CLI commands for Azure Managed Images
- Support --image flag on pool-create to skip Docker setup (golden images)
- Support --use-acr flag to pull waa-auto from ACR instead of building on VM
- Add ACR config settings (acr_name, acr_login_server)
- Fix WAA storage path: /home/azureuser/waa-storage instead of /mnt
- Add auto-pause timer tracking (auto_pause_at, auto_pause_hours on VMPool)
- Add stale pool warnings (7/14 day thresholds) in pool-status and resource tracker
- Show accumulated idle cost in pool-status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dling, exit code

- Fix drag actions mapped as type="click" instead of type="drag" in ApiAgent
- Add raise_for_status() to all screenshot requests in record-waa via helper
- Propagate eval-suite subprocess exit code in cmd_eval_dc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds GitHub Actions workflow that runs pytest on push to main and on PRs. Excludes tests requiring openadapt-ml (not installed in CI) and tests depending on missing fixture files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
End-to-end improvements to the WAA evaluation pipeline, from infrastructure fixes through instrumentation to automated demo-conditioned evaluation.
Infrastructure and Pool Automation
Agent Fixes (6 targeted reliability improvements)
Eval-Suite CLI Command
Visualization and Instrumentation
Evaluation Results
Test plan