1 | 1 | # CHANGELOG |
2 | 2 |
3 | 3 |
| 4 | +## v0.4.0 (2026-02-24) |
| 5 | + |
| 6 | +### Features |
| 7 | + |
| 8 | +- WAA eval pipeline: recording, annotation, golden images, and CI |
| 9 | + ([#35](https://github.com/OpenAdaptAI/openadapt-evals/pull/35), |
| 10 | + [`19a11ee`](https://github.com/OpenAdaptAI/openadapt-evals/commit/19a11ee36938d4adb3b585e25ffb972424ea52db)) |
| 11 | + |
| 12 | +* fix(recording): replace busy-wait loop with time.sleep |
| 13 | + |
| 14 | +The `while True: pass` loop burned an entire CPU core during recording. Replace with |
| 15 | + `time.sleep(0.5)` to yield CPU while waiting for Ctrl+C. |
| 16 | + |
| 17 | +Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
| 18 | + |
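The busy-wait fix above can be illustrated with a minimal sketch. The function name `wait_until_stopped` and its `should_stop` callback are hypothetical and not taken from the repository; only the `time.sleep(0.5)` interval comes from the commit message.

```python
import time
from typing import Callable


def wait_until_stopped(should_stop: Callable[[], bool], interval: float = 0.5) -> int:
    """Block until should_stop() returns True or the user presses Ctrl+C.

    Returns the number of sleep cycles, which makes the loop easy to test.
    """
    cycles = 0
    try:
        while not should_stop():
            # Sleeping yields the CPU; a bare `while True: pass` keeps a core busy.
            time.sleep(interval)
            cycles += 1
    except KeyboardInterrupt:
        # Ctrl+C is treated as a normal way to end an interactive wait.
        pass
    return cycles
```

Passing the stop condition as a callback keeps the sketch independent of any particular recorder object.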
| 19 | +* fix: add wait_for_ready() and match CLI recording loop pattern |
| 20 | + |
| 21 | +- Call recorder.wait_for_ready() before entering the wait loop |
| 22 | +- Use recorder.is_recording check and 1s sleep to match CLI behavior |
| 23 | + |
| 24 | +* fix: auto-create dummy .docx files for archive task |
| 25 | + |
| 26 | +The third WAA task requires .docx files in Documents. The script now creates empty report.docx, |
| 27 | + meeting_notes.docx, and proposal.docx before recording that task, and cleans up any Archive folder |
| 28 | + from previous runs. |
| 29 | + |
| 30 | +* fix: update stop instructions and clarify wormhole send flow |
| 31 | + |
| 32 | +- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence) |
| 33 | +- Clarify wormhole send instructions (each send blocks until received) |
| 34 | + |
| 35 | +* fix(pool): use waa-auto image instead of broken windowsarena/winarena |
| 36 | + |
| 37 | +The DOCKER_SETUP_SCRIPT builds waa-auto:latest (based on dockurr/windows:latest, which can |
| 38 | + auto-download the Windows ISO), but WAA_START_SCRIPT and setup-waa were starting |
| 39 | + windowsarena/winarena:latest, which uses the old dockurr/windows v0.00 that cannot download the |
| 40 | + ISO, causing an "ISO file not found" error. |
| 41 | + |
| 42 | +* fix(pool): fix WAA probe IP, add QMP support, add pool-auto command |
| 43 | + |
| 44 | +Three bugs prevented pool-run from working: |
| 45 | + |
| 46 | +1. WAA probe used 172.30.0.2 (QEMU guest IP) but Docker port-forwards to localhost — pool-wait timed |
| 47 | + out every time. Changed to localhost in pool.py and vm_monitor.py. |
| 48 | + |
| 49 | +2. dockurr/windows base image doesn't configure QMP (QEMU Machine Protocol). WAA client needs QMP on |
| 50 | + port 7200 for VM status. Added ARGUMENTS env var to inject -qmp flag into QEMU startup. |
| 51 | + |
| 52 | +3. Config defaults had Standard_D2_v3 (8GB, OOMs) and old windowsarena/winarena image. Fixed to |
| 53 | + D8ds_v5 and waa-auto. |
| 54 | + |
| 55 | +Also adds: - pool-auto command: single oa-vm pool-auto --workers N --tasks M chains create → wait → |
| 56 | + run - /evaluate endpoint injection in waa_deploy Dockerfile - Handle WAA server wrapping 404 in |
| 57 | + 500 responses (live.py) - openai dependency for API agents |
| 58 | + |
| 59 | +* fix(pool): use docker exec -d + tail -f for resilient benchmark execution |
| 60 | + |
| 61 | +Replace fragile streaming SSH with docker exec -d (detached) for starting benchmarks. Logs stream |
| 62 | + via tail -f --pid which auto-exits when the benchmark finishes. On SSH drop, reconnects and |
| 63 | + resumes. Also adds 120s timeout to OpenAI API calls to prevent infinite hangs. |
| 64 | + |
| 65 | +* fix(pool): limit tasks with --test_all_meta_path subset JSON |
| 66 | + |
| 67 | +WAA's run.py ignores --tasks and runs all 154 tasks based on worker_id/num_workers. Fix by creating |
| 68 | + a subset test JSON with only the requested number of tasks and passing it via |
| 69 | + --test_all_meta_path. |
| 70 | + |
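The subset approach described above might look roughly like the following. This is a sketch under assumptions: it assumes the meta file maps a domain name to a list of task IDs, and the helper name `write_task_subset` and the output file name `test_subset.json` are hypothetical.

```python
import json
from pathlib import Path


def write_task_subset(all_tasks: dict[str, list[str]], max_tasks: int, out_dir: str) -> Path:
    """Write a JSON file with at most max_tasks task IDs, keeping the domain grouping."""
    subset: dict[str, list[str]] = {}
    remaining = max_tasks
    for domain, task_ids in all_tasks.items():
        if remaining <= 0:
            break
        chosen = task_ids[:remaining]
        if chosen:
            subset[domain] = chosen
            remaining -= len(chosen)
    # Hypothetical file name; the path would then be passed via --test_all_meta_path.
    out_path = Path(out_dir) / "test_subset.json"
    out_path.write_text(json.dumps(subset, indent=2))
    return out_path
```

Limiting the input file, rather than relying on a task-count flag, works even when the runner derives its task list purely from the meta file and worker index.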
| 71 | +* feat(pool): add dedicated evaluate server with socat proxy |
| 72 | + |
| 73 | +Add a standalone evaluate server (port 5050) that runs inside the WAA Docker container and has |
| 74 | + direct access to WAA evaluator modules. This avoids needing to patch the WAA Flask server's |
| 75 | + /evaluate endpoint. |
| 76 | + |
| 77 | +- Add evaluate_server.py and start_with_evaluate.sh - Add evaluate_url config to WAALiveConfig - Set |
| 78 | + up socat proxy (5051→5050) for Docker bridge networking - Add SSH tunnel for evaluate port - |
| 79 | + Simplify Dockerfile |
| 80 | + |
| 81 | +* feat(viz): add instrumentation, comparison viewer, and viewer enhancements |
| 82 | + |
| 83 | +Instrumentation (captures richer data per step): - Propagate agent logs (LLM response, parse |
| 84 | + strategy, demo info, loop detection, memory) from ApiAgent to execution trace - Add per-step |
| 85 | + timing (agent_think_ms, env_execute_ms) - Capture token counts from OpenAI/Anthropic API responses |
| 86 | + |
| 87 | +Viewer enhancements (viewer.py): - Agent Thinking panel showing LLM response, memory, parse strategy |
| 88 | + - Action timeline bar color-coded by action type - Click heatmap overlay showing click frequency |
| 89 | + hotspots - Click marker using raw pixel coords for correct positioning |
| 90 | + |
| 91 | +Comparison viewer (new): - comparison_viewer.py generates side-by-side HTML comparisons - |
| 92 | + Synchronized step slider, click markers, action diffs - First-divergence detection, action type |
| 93 | + distribution charts - CLI 'compare' command for generating comparisons - Demo prompts and initial |
| 94 | + eval results for 3 WAA tasks |
| 95 | + |
| 96 | +* fix(agent): handle double_click, right_click, and drag in action parser |
| 97 | + |
| 98 | +_parse_computer_action() only handled click, type, press, hotkey, and scroll. Any other action |
| 99 | + (double_click, right_click, drag) fell through to the default return of type="done", which |
| 100 | + prematurely terminated the task. This caused the demo-conditioned notepad eval to stop after 1 |
| 101 | + step when the agent correctly issued computer.double_click() to open Notepad. |
| 102 | + |
| 103 | +Also add a warning log when an unrecognized action falls through, and update viewer regexes to |
| 104 | + handle double_click/right_click coordinates. |
| 105 | + |
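A minimal sketch of the fall-through behavior described above, assuming actions are identified by name. `classify_action` and `SUPPORTED_ACTIONS` are hypothetical names for illustration and do not reproduce `_parse_computer_action()`.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical action table; the real parser may recognize a different set.
SUPPORTED_ACTIONS = {
    "click", "double_click", "right_click", "drag",
    "type", "press", "hotkey", "scroll",
}


def classify_action(name: str) -> str:
    """Return the action type for a parsed action name."""
    if name in SUPPORTED_ACTIONS:
        return name
    if name == "done":
        return "done"
    # Unknown names still end the task, but no longer do so silently.
    logger.warning("Unrecognized action %r; treating it as 'done'", name)
    return "done"
```

With `double_click`, `right_click`, and `drag` in the table, only genuinely unknown actions reach the default branch, and each of those leaves a log line.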
| 106 | +* fix(coords): detect actual screen size from screenshot instead of hardcoded config |
| 107 | + |
| 108 | +WAALiveConfig defaulted to 1920x1200 but actual VM screen is 1280x720. This caused stored action.x/y |
| 109 | + to be normalized against the wrong resolution. Now detects real dimensions from the screenshot via |
| 110 | + PIL, uses them for viewport, denormalization, window_rect, and drag coordinates. Viewers use a |
| 111 | + divergence check for backward compatibility with old data. |
| 112 | + |
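The effect of normalizing against the wrong resolution can be shown with a small worked example. The helper names are hypothetical; the changelog says the real dimensions are read from the screenshot via PIL, whereas this sketch takes width and height as plain integers so that it stays dependency-free.

```python
def normalize(x_px: float, y_px: float, width: int, height: int) -> tuple[float, float]:
    """Convert pixel coordinates to the [0, 1] range for a given screen size."""
    return x_px / width, y_px / height


def denormalize(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert [0, 1] coordinates back to whole pixels for a given screen size."""
    return round(x * width), round(y * height)


# A click at the center of the actual 1280x720 screen.
click = (640, 360)

# Normalized against the hardcoded 1920x1200 default, then drawn on the real screen.
wrong = denormalize(*normalize(*click, 1920, 1200), 1280, 720)

# Normalized against the detected 1280x720 size, then drawn on the real screen.
right = denormalize(*normalize(*click, 1280, 720), 1280, 720)

print(wrong, right)
```

In this example the mismatched round trip moves the click from (640, 360) to (427, 216), which is the kind of offset the fix removes.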
| 113 | +* docs: add Feb 21 eval results with comparison screenshots |
| 114 | + |
| 115 | +ZS vs demo-conditioned on 3 WAA tasks (GPT-5.1). DC agent signals completion on 2/3 tasks (Settings: |
| 116 | + 11 steps, Notepad: 8 steps) while ZS hits max steps on all 3. Includes Playwright screenshots of |
| 117 | + comparison viewers and step-by-step screenshots. |
| 118 | + |
| 119 | +* fix(pool): consolidate Dockerfiles and deploy evaluate server |
| 120 | + |
| 121 | +Replace inline 25-line Dockerfile in pool.py with SCP of waa_deploy/ build context. This eliminates |
| 122 | + drift between the inline and full Dockerfile, and ensures evaluate_server.py + Flask are included |
| 123 | + in the container image. Adds evaluate server health check during pool-wait. |
| 124 | + |
| 125 | +* fix(evaluate): add cache_dir to MockEnv for WAA file getters |
| 126 | + |
| 127 | +WAA evaluator getters (get_vm_file, get_cloud_file) expect env.cache_dir for downloading/caching |
| 128 | + files during evaluation. Without it, the compare_text_file metric fails with AttributeError. |
| 129 | + |
| 130 | +* feat(setup): implement WAA task setup config array processing |
| 131 | + |
| 132 | +WAA tasks use a 'config' array with preconditions (file downloads, app launches, sleeps) that must |
| 133 | + run before the agent starts. Previously _run_task_setup() looked for non-existent 'setup'/'init' |
| 134 | + keys, so task preconditions were never executed — causing Archive and other tasks with file |
| 135 | + dependencies to always score 0. |
| 136 | + |
| 137 | +- Add /setup endpoint to evaluate_server.py with 11 handlers mirroring WAA's SetupController (download, launch, sleep, execute, open, etc.) |
| 138 | +- Add requests-toolbelt to Dockerfile for multipart file uploads |
| 139 | +- Rewrite _run_task_setup() in live.py to POST config array to evaluate server's /setup endpoint |
| 140 | +- Increase reset delay from 1s to 5s to match WAA defaults |
| 141 | + |
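Processing a task's `config` array, as described above, can be sketched as a dispatch table keyed by step type. This is an illustration under assumptions: each entry is assumed to look like `{"type": ..., "parameters": {...}}`, and the handler names, parameter keys, and return strings are hypothetical rather than the handlers in `evaluate_server.py`.

```python
import time
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], str]


def handle_sleep(params: dict[str, Any]) -> str:
    """Pause for the requested number of seconds."""
    seconds = float(params.get("seconds", 0))
    time.sleep(seconds)
    return f"slept {seconds}s"


def handle_launch(params: dict[str, Any]) -> str:
    """Placeholder: a real handler would start the process inside the VM."""
    return f"would launch {params.get('command')}"


# Only two of the step types are sketched here.
HANDLERS: dict[str, Handler] = {"sleep": handle_sleep, "launch": handle_launch}


def run_setup(config: list[dict[str, Any]]) -> list[str]:
    """Run each precondition in order, failing loudly on unknown step types."""
    results = []
    for entry in config:
        step_type = entry.get("type")
        handler = HANDLERS.get(step_type)
        if handler is None:
            raise ValueError(f"no setup handler for type {step_type!r}")
        results.append(handler(entry.get("parameters", {})))
    return results
```

Raising on an unknown type, instead of skipping it, makes a missing precondition visible before the agent starts.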
| 142 | +* feat(cli): add eval-suite command for automated full-cycle evaluation |
| 143 | + |
| 144 | +New `eval-suite` CLI command that automates the full WAA evaluation cycle: pool-create → pool-wait → |
| 145 | + SSH tunnel → run task×condition matrix → comparison summary → pool-cleanup. |
| 146 | + |
| 147 | +Replaces ~20 manual commands with a single invocation. |
| 148 | + |
| 149 | +Features: - Auto-creates Azure VM pool and waits for WAA readiness - Builds eval matrix: ZS for all |
| 150 | + tasks, DC for tasks with matching demos - Runs evals sequentially, prints comparison table at end |
| 151 | + - SSH tunnels managed automatically via SSHTunnelManager - Supports |
| 152 | + --no-pool-create/--no-pool-cleanup for existing VMs - Also adds anthropic as a direct dependency |
| 153 | + |
| 154 | +* fix(agent): improve eval reliability with 6 targeted fixes |
| 155 | + |
| 156 | +- Kill OneDrive notifications during environment reset (they dominated the a11y tree) |
| 157 | +- Loop detector: don't substitute Escape for hotkey loops (this was destroying Save As dialogs in near-successful DC Notepad runs) |
| 158 | +- Loop detector: use progressive directional offsets instead of a fixed +50px |
| 159 | +- A11y tree: filter notification noise and increase the truncation limit to 8000 |
| 160 | +- Demo discovery: prefer .txt (natural language) over .json (normalized coords) |
| 161 | +- Pool-wait timeout: increase the default from 40 to 50 minutes |
| 162 | + |
| 163 | +* fix(agent): pass through raw a11y tree without filtering |
| 164 | + |
| 165 | +Remove _filter_a11y_noise and _A11Y_NOISE_PATTERNS — the a11y data from the WAA /accessibility |
| 166 | + endpoint is real UIA XML, not server logs. Pass it through as-is instead of trying to |
| 167 | + heuristically filter notification noise. |
| 168 | + |
| 169 | +* feat(agent): add Qwen3-VL agent with normalized coordinates and thinking mode |
| 170 | + |
| 171 | +Implement Qwen3VLAgent for local inference using Qwen3-VL-8B-Instruct. Supports [0,1000] coordinate |
| 172 | + normalization, full action space (click, type, press, scroll, drag, wait, finished), optional |
| 173 | + <think> blocks, and demo-conditioned inference. Register qwen3vl in all CLI commands (mock, run, |
| 174 | + live, eval-suite) with --model-path and --use-thinking args. |
| 175 | + |
| 176 | +* fix(agent): align training and inference prompt formats |
| 177 | + |
| 178 | +Move system prompt to system role message in _run_inference() instead of cramming it into the user |
| 179 | + turn. _build_prompt() now returns only the user turn text (instruction + history + output |
| 180 | + instruction), matching the training data format produced by convert_demos.py. |
| 181 | + |
| 182 | +* feat(agent): add ClaudeComputerUseAgent with screenshot/wait loop fix |
| 183 | + |
| 184 | +Implements ClaudeComputerUseAgent using Anthropic's native computer_use tool (computer_20251124 |
| 185 | + beta). Key features: - Structured tool_use/tool_result protocol (no regex parsing) - Multi-turn |
| 186 | + conversation maintained across steps - Internal loop for screenshot/wait actions: when Claude |
| 187 | + requests a screenshot, the agent sends the current screen back and calls the API again, instead of |
| 188 | + returning "done" to the runner (this was causing premature episode termination after 1 step) - |
| 189 | + Demo injection for demo-conditioned inference - Coordinate normalization (pixel → [0,1]) |
| 190 | + |
| 191 | +Also includes: - 28 unit tests for all action types, conversation management, demo injection, |
| 192 | + screenshot encoding, and edge cases - VM pool optimization design doc (pre-baked image, |
| 193 | + deallocate/resume, Windows disk persistence, ACR integration) - Hybrid agent architecture design |
| 194 | + doc (Track 1: Claude CU, Track 2: Qwen3-VL) - Cleanup: remove .swp files, cost_report.json, update |
| 195 | + .gitignore |
| 196 | + |
| 197 | +* docs: add eval suite v2 results — 6/6 tasks scored 1.00 |
| 198 | + |
| 199 | +Claude Computer Use (Sonnet 4.6) achieves 100% success on all 3 WAA tasks in both zero-shot and |
| 200 | + demo-conditioned modes after the screenshot/wait internal retry fix (commit 0b185eb). |
| 201 | + |
| 202 | +* feat(pool): add pool-pause and pool-resume for deallocate/resume lifecycle |
| 203 | + |
| 204 | +Phase 1 of VM pool optimization: stop compute billing without destroying VMs. Deallocated VMs keep |
| 205 | + their disks (~$0.25/day vs $0.38/hr running). Resume takes ~5 min vs ~42 min for full pool-create. |
| 206 | + |
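The cost figures quoted above can be compared on a per-day basis with a little arithmetic. This sketch assumes both rates apply to a single VM and that a running VM is billed for all 24 hours; the variable names are illustrative.

```python
RUNNING_PER_HOUR = 0.38      # approximate rate quoted for a running VM
DEALLOCATED_PER_DAY = 0.25   # approximate disk-only rate quoted for a paused VM

# Put both rates on the same per-day footing.
running_per_day = round(RUNNING_PER_HOUR * 24, 2)
saved_per_idle_day = round(running_per_day - DEALLOCATED_PER_DAY, 2)

print(f"running: ${running_per_day}/day, paused: ${DEALLOCATED_PER_DAY}/day")
print(f"saved per idle day: ${saved_per_idle_day}")
```

Under these assumptions a running VM costs about $9.12 per day, so pausing saves roughly $8.87 for each full idle day.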
| 207 | +New commands: - `oa-vm pool-pause` — deallocate all pool VMs - `oa-vm pool-resume` — start VMs, wait |
| 208 | + for WAA readiness |
| 209 | + |
| 210 | +New AzureVMManager methods: deallocate_vm(), start_vm() (SDK + CLI fallback) New PoolManager |
| 211 | + methods: pause(), resume() Updated resource_tracker for paused pool cost awareness. |
| 212 | + |
| 213 | +* feat(scripts): add WAA API recording, VLM annotation, and DC eval subcommands |
| 214 | + |
| 215 | +Extend record_waa_demos.py with three new fire subcommands: - record-waa: interactive recording via |
| 216 | + WAA API + VNC with step-by-step screenshot capture, redo support, and prefix-matched task IDs - |
| 217 | + annotate: VLM annotation of recorded before/after screenshots using the same prompt templates and |
| 218 | + provider abstraction from openadapt-ml - eval: delegates to eval-suite with --demo-dir for |
| 219 | + demo-conditioned runs |
| 220 | + |
| 221 | +* feat(infra): add golden image support, ACR pull, and pool lifecycle improvements |
| 222 | + |
| 223 | +- Add image-create/image-list/image-delete CLI commands for Azure Managed Images - Support --image |
| 224 | + flag on pool-create to skip Docker setup (golden images) - Support --use-acr flag to pull waa-auto |
| 225 | + from ACR instead of building on VM - Add ACR config settings (acr_name, acr_login_server) - Fix |
| 226 | + WAA storage path: /home/azureuser/waa-storage instead of /mnt - Add auto-pause timer tracking |
| 227 | + (auto_pause_at, auto_pause_hours on VMPool) - Add stale pool warnings (7/14 day thresholds) in |
| 228 | + pool-status and resource tracker - Show accumulated idle cost in pool-status |
| 229 | + |
| 230 | +* chore: update beads local state |
| 231 | + |
| 232 | +* fix: address review findings — drag action type, screenshot error handling, exit code |
| 233 | + |
| 234 | +- Fix drag actions mapped as type="click" instead of type="drag" in ApiAgent |
| 235 | +- Add raise_for_status() to all screenshot requests in record-waa via helper |
| 236 | +- Propagate eval-suite subprocess exit code in cmd_eval_dc |
| 237 | + |
| 238 | +* ci: add test workflow for PR checks |
| 239 | + |
| 240 | +Adds GitHub Actions workflow that runs pytest on push to main and on PRs. Excludes tests requiring |
| 241 | + openadapt-ml (not installed in CI) and tests depending on missing fixture files. |
| 242 | + |
| 243 | +* fix(ci): install dev extras for pytest in test workflow |
| 244 | + |
| 245 | +--------- |
| 246 | + |
| 247 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
| 248 | + |
| 249 | + |
4 | 250 | ## v0.3.3 (2026-02-18) |
5 | 251 |
6 | 252 | ### Bug Fixes |