Skip to content

Commit e54ef1b

Browse files
author
semantic-release
committed
chore: release 0.4.0
1 parent 19a11ee commit e54ef1b

2 files changed

Lines changed: 247 additions & 1 deletion

File tree

CHANGELOG.md

Lines changed: 246 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,252 @@
11
# CHANGELOG
22

33

4+
## v0.4.0 (2026-02-24)
5+
6+
### Features
7+
8+
- Waa eval pipeline — recording, annotation, golden images, and CI
9+
([#35](https://github.com/OpenAdaptAI/openadapt-evals/pull/35),
10+
[`19a11ee`](https://github.com/OpenAdaptAI/openadapt-evals/commit/19a11ee36938d4adb3b585e25ffb972424ea52db))
11+
12+
* fix(recording): replace busy-wait loop with time.sleep
13+
14+
The `while True: pass` loop burned an entire CPU core during recording. Replace with
15+
`time.sleep(0.5)` to yield CPU while waiting for Ctrl+C.
16+
17+
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
18+
19+
* fix: add wait_for_ready() and match CLI recording loop pattern
20+
21+
- Call recorder.wait_for_ready() before entering the wait loop - Use recorder.is_recording check and
22+
1s sleep to match CLI behavior
23+
24+
* fix: auto-create dummy .docx files for archive task
25+
26+
The third WAA task requires .docx files in Documents. The script now creates empty report.docx,
27+
meeting_notes.docx, and proposal.docx before recording that task, and cleans up any Archive folder
28+
from previous runs.
29+
30+
* fix: update stop instructions and clarify wormhole send flow
31+
32+
- Change "Press Ctrl+C" to "press Ctrl 3 times" (matches stop sequence) - Clarify wormhole send
33+
instructions (each send blocks until received)
34+
35+
* fix(pool): use waa-auto image instead of broken windowsarena/winarena
36+
37+
The DOCKER_SETUP_SCRIPT builds waa-auto:latest (based on dockurr/windows:latest which can
38+
auto-download Windows ISO) but WAA_START_SCRIPT and setup-waa were starting
39+
windowsarena/winarena:latest which uses the old dockurr/windows v0.00 that cannot download the
40+
ISO, causing "ISO file not found" error.
41+
42+
* fix(pool): fix WAA probe IP, add QMP support, add pool-auto command
43+
44+
Three bugs prevented pool-run from working:
45+
46+
1. WAA probe used 172.30.0.2 (QEMU guest IP) but Docker port-forwards to localhost — pool-wait timed
47+
out every time. Changed to localhost in pool.py and vm_monitor.py.
48+
49+
2. dockurr/windows base image doesn't configure QMP (QEMU Machine Protocol). WAA client needs QMP on
50+
port 7200 for VM status. Added ARGUMENTS env var to inject -qmp flag into QEMU startup.
51+
52+
3. Config defaults had Standard_D2_v3 (8GB, OOMs) and old windowsarena/winarena image. Fixed to
53+
D8ds_v5 and waa-auto.
54+
55+
Also adds: - pool-auto command: single oa-vm pool-auto --workers N --tasks M chains create → wait →
56+
run - /evaluate endpoint injection in waa_deploy Dockerfile - Handle WAA server wrapping 404 in
57+
500 responses (live.py) - openai dependency for API agents
58+
59+
* fix(pool): use docker exec -d + tail -f for resilient benchmark execution
60+
61+
Replace fragile streaming SSH with docker exec -d (detached) for starting benchmarks. Logs stream
62+
via tail -f --pid which auto-exits when the benchmark finishes. On SSH drop, reconnects and
63+
resumes. Also adds 120s timeout to OpenAI API calls to prevent infinite hangs.
64+
65+
* fix(pool): limit tasks with --test_all_meta_path subset JSON
66+
67+
WAA's run.py ignores --tasks and runs all 154 tasks based on worker_id/num_workers. Fix by creating
68+
a subset test JSON with only the requested number of tasks and passing it via
69+
--test_all_meta_path.
70+
71+
* feat(pool): add dedicated evaluate server with socat proxy
72+
73+
Add a standalone evaluate server (port 5050) that runs inside the WAA Docker container and has
74+
direct access to WAA evaluator modules. This avoids needing to patch the WAA Flask server's
75+
/evaluate endpoint.
76+
77+
- Add evaluate_server.py and start_with_evaluate.sh - Add evaluate_url config to WAALiveConfig - Set
78+
up socat proxy (5051→5050) for Docker bridge networking - Add SSH tunnel for evaluate port -
79+
Simplify Dockerfile
80+
81+
* feat(viz): add instrumentation, comparison viewer, and viewer enhancements
82+
83+
Instrumentation (captures richer data per step): - Propagate agent logs (LLM response, parse
84+
strategy, demo info, loop detection, memory) from ApiAgent to execution trace - Add per-step
85+
timing (agent_think_ms, env_execute_ms) - Capture token counts from OpenAI/Anthropic API responses
86+
87+
Viewer enhancements (viewer.py): - Agent Thinking panel showing LLM response, memory, parse strategy
88+
- Action timeline bar color-coded by action type - Click heatmap overlay showing click frequency
89+
hotspots - Click marker using raw pixel coords for correct positioning
90+
91+
Comparison viewer (new): - comparison_viewer.py generates side-by-side HTML comparisons -
92+
Synchronized step slider, click markers, action diffs - First-divergence detection, action type
93+
distribution charts - CLI 'compare' command for generating comparisons - Demo prompts and initial
94+
eval results for 3 WAA tasks
95+
96+
* fix(agent): handle double_click, right_click, and drag in action parser
97+
98+
_parse_computer_action() only handled click, type, press, hotkey, and scroll. Any other action
99+
(double_click, right_click, drag) fell through to the default return of type="done", which
100+
prematurely terminated the task. This caused the demo-conditioned notepad eval to stop after 1
101+
step when the agent correctly issued computer.double_click() to open Notepad.
102+
103+
Also add a warning log when an unrecognized action falls through, and update viewer regexes to
104+
handle double_click/right_click coordinates.
105+
106+
* fix(coords): detect actual screen size from screenshot instead of hardcoded config
107+
108+
WAALiveConfig defaulted to 1920x1200 but actual VM screen is 1280x720. This caused stored action.x/y
109+
to be normalized against the wrong resolution. Now detects real dimensions from the screenshot via
110+
PIL, uses them for viewport, denormalization, window_rect, and drag coordinates. Viewers use a
111+
divergence check for backward compatibility with old data.
112+
113+
* docs: add Feb 21 eval results with comparison screenshots
114+
115+
ZS vs demo-conditioned on 3 WAA tasks (GPT-5.1). DC agent signals completion on 2/3 tasks (Settings:
116+
11 steps, Notepad: 8 steps) while ZS hits max steps on all 3. Includes Playwright screenshots of
117+
comparison viewers and step-by-step screenshots.
118+
119+
* fix(pool): consolidate Dockerfiles and deploy evaluate server
120+
121+
Replace inline 25-line Dockerfile in pool.py with SCP of waa_deploy/ build context. This eliminates
122+
drift between the inline and full Dockerfile, and ensures evaluate_server.py + Flask are included
123+
in the container image. Adds evaluate server health check during pool-wait.
124+
125+
* fix(evaluate): add cache_dir to MockEnv for WAA file getters
126+
127+
WAA evaluator getters (get_vm_file, get_cloud_file) expect env.cache_dir for downloading/caching
128+
files during evaluation. Without it, the compare_text_file metric fails with AttributeError.
129+
130+
* feat(setup): implement WAA task setup config array processing
131+
132+
WAA tasks use a 'config' array with preconditions (file downloads, app launches, sleeps) that must
133+
run before the agent starts. Previously _run_task_setup() looked for non-existent 'setup'/'init'
134+
keys, so task preconditions were never executed — causing Archive and other tasks with file
135+
dependencies to always score 0.
136+
137+
- Add /setup endpoint to evaluate_server.py with 11 handlers mirroring WAA's SetupController
138+
(download, launch, sleep, execute, open, etc.) - Add requests-toolbelt to Dockerfile for multipart
139+
file uploads - Rewrite _run_task_setup() in live.py to POST config array to evaluate server's
140+
/setup endpoint - Increase reset delay from 1s to 5s to match WAA defaults
141+
142+
* feat(cli): add eval-suite command for automated full-cycle evaluation
143+
144+
New `eval-suite` CLI command that automates the full WAA evaluation cycle: pool-create → pool-wait →
145+
SSH tunnel → run task×condition matrix
146+
147+
→ comparison summary → pool-cleanup. Replaces ~20 manual commands with a single invocation.
148+
149+
Features: - Auto-creates Azure VM pool and waits for WAA readiness - Builds eval matrix: ZS for all
150+
tasks, DC for tasks with matching demos - Runs evals sequentially, prints comparison table at end
151+
- SSH tunnels managed automatically via SSHTunnelManager - Supports
152+
--no-pool-create/--no-pool-cleanup for existing VMs - Also adds anthropic as a direct dependency
153+
154+
* fix(agent): improve eval reliability with 6 targeted fixes
155+
156+
- Kill OneDrive notifications during environment reset (dominated a11y tree) - Loop detector: don't
157+
substitute Escape for hotkey loops (was destroying Save As dialogs in near-successful DC Notepad
158+
runs) - Loop detector: progressive directional offsets instead of fixed +50px - A11y tree: filter
159+
notification noise + increase truncation limit to 8000 - Demo discovery: prefer .txt (natural
160+
language) over .json (normalized coords) - Pool-wait timeout: increase default from 40 to 50
161+
minutes
162+
163+
* fix(agent): pass through raw a11y tree without filtering
164+
165+
Remove _filter_a11y_noise and _A11Y_NOISE_PATTERNS — the a11y data from the WAA /accessibility
166+
endpoint is real UIA XML, not server logs. Pass it through as-is instead of trying to
167+
heuristically filter notification noise.
168+
169+
* feat(agent): add Qwen3-VL agent with normalized coordinates and thinking mode
170+
171+
Implement Qwen3VLAgent for local inference using Qwen3-VL-8B-Instruct. Supports [0,1000] coordinate
172+
normalization, full action space (click, type, press, scroll, drag, wait, finished), optional
173+
<think> blocks, and demo-conditioned inference. Register qwen3vl in all CLI commands (mock, run,
174+
live, eval-suite) with --model-path and --use-thinking args.
175+
176+
* fix(agent): align training and inference prompt formats
177+
178+
Move system prompt to system role message in _run_inference() instead of cramming it into the user
179+
turn. _build_prompt() now returns only the user turn text (instruction + history + output
180+
instruction), matching the training data format produced by convert_demos.py.
181+
182+
* feat(agent): add ClaudeComputerUseAgent with screenshot/wait loop fix
183+
184+
Implements ClaudeComputerUseAgent using Anthropic's native computer_use tool (computer_20251124
185+
beta). Key features: - Structured tool_use/tool_result protocol (no regex parsing) - Multi-turn
186+
conversation maintained across steps - Internal loop for screenshot/wait actions: when Claude
187+
requests a screenshot, the agent sends the current screen back and calls the API again, instead of
188+
returning "done" to the runner (this was causing premature episode termination after 1 step) -
189+
Demo injection for demo-conditioned inference - Coordinate normalization (pixel → [0,1])
190+
191+
Also includes: - 28 unit tests for all action types, conversation management, demo injection,
192+
screenshot encoding, and edge cases - VM pool optimization design doc (pre-baked image,
193+
deallocate/resume, Windows disk persistence, ACR integration) - Hybrid agent architecture design
194+
doc (Track 1: Claude CU, Track 2: Qwen3-VL) - Cleanup: remove .swp files, cost_report.json, update
195+
.gitignore
196+
197+
* docs: add eval suite v2 results — 6/6 tasks scored 1.00
198+
199+
Claude Computer Use (Sonnet 4.6) achieves 100% success on all 3 WAA tasks in both zero-shot and
200+
demo-conditioned modes after the screenshot/wait internal retry fix (commit 0b185eb).
201+
202+
* feat(pool): add pool-pause and pool-resume for deallocate/resume lifecycle
203+
204+
Phase 1 of VM pool optimization: stop compute billing without destroying VMs. Deallocated VMs keep
205+
their disks (~$0.25/day vs $0.38/hr running). Resume takes ~5 min vs ~42 min for full pool-create.
206+
207+
New commands: - `oa-vm pool-pause` — deallocate all pool VMs - `oa-vm pool-resume` — start VMs, wait
208+
for WAA readiness
209+
210+
New AzureVMManager methods: deallocate_vm(), start_vm() (SDK + CLI fallback) New PoolManager
211+
methods: pause(), resume() Updated resource_tracker for paused pool cost awareness.
212+
213+
* feat(scripts): add WAA API recording, VLM annotation, and DC eval subcommands
214+
215+
Extend record_waa_demos.py with three new fire subcommands: - record-waa: interactive recording via
216+
WAA API + VNC with step-by-step screenshot capture, redo support, and prefix-matched task IDs -
217+
annotate: VLM annotation of recorded before/after screenshots using the same prompt templates and
218+
provider abstraction from openadapt-ml - eval: delegates to eval-suite with --demo-dir for
219+
demo-conditioned runs
220+
221+
* feat(infra): add golden image support, ACR pull, and pool lifecycle improvements
222+
223+
- Add image-create/image-list/image-delete CLI commands for Azure Managed Images - Support --image
224+
flag on pool-create to skip Docker setup (golden images) - Support --use-acr flag to pull waa-auto
225+
from ACR instead of building on VM - Add ACR config settings (acr_name, acr_login_server) - Fix
226+
WAA storage path: /home/azureuser/waa-storage instead of /mnt - Add auto-pause timer tracking
227+
(auto_pause_at, auto_pause_hours on VMPool) - Add stale pool warnings (7/14 day thresholds) in
228+
pool-status and resource tracker - Show accumulated idle cost in pool-status
229+
230+
* chore: update beads local state
231+
232+
* fix: address review findings — drag action type, screenshot error handling, exit code
233+
234+
- Fix drag actions mapped as type="click" instead of type="drag" in ApiAgent - Add
235+
raise_for_status() to all screenshot requests in record-waa via helper - Propagate eval-suite
236+
subprocess exit code in cmd_eval_dc
237+
238+
* ci: add test workflow for PR checks
239+
240+
Adds GitHub Actions workflow that runs pytest on push to main and on PRs. Excludes tests requiring
241+
openadapt-ml (not installed in CI) and tests depending on missing fixture files.
242+
243+
* fix(ci): install dev extras for pytest in test workflow
244+
245+
---------
246+
247+
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
248+
249+
4250
## v0.3.3 (2026-02-18)
5251

6252
### Bug Fixes

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
44

55
[project]
66
name = "openadapt-evals"
7-
version = "0.3.3"
7+
version = "0.4.0"
88
description = "Evaluation infrastructure for GUI agent benchmarks"
99
readme = "README.md"
1010
requires-python = ">=3.10"

0 commit comments

Comments
 (0)