
Commit 16a0e35

Author: semantic-release (committed)
chore: release 0.30.0
1 parent f1f3870 commit 16a0e35

2 files changed

Lines changed: 246 additions & 1 deletion

CHANGELOG.md

Lines changed: 245 additions & 0 deletions
@@ -1,6 +1,251 @@
# CHANGELOG

## v0.30.0 (2026-03-04)

### Bug Fixes

- **controller**: Prevent plan step drift and reduce VLM false negatives
  ([#97](https://github.com/OpenAdaptAI/openadapt-evals/pull/97),
  [`f1f3870`](https://github.com/OpenAdaptAI/openadapt-evals/commit/f1f3870c3d0dd1740b2943b9d25b28b14583e4a4))

* fix(controller): prevent plan step drift and reduce VLM false negatives

Two improvements to the closed-loop demo-conditioned controller:

1. Plan step tracking drift prevention: _advance_plan_steps() now compares only the current step
   against the next step, advancing at most one step per call (see the sketch after this list).
   Previously, bulk keyword matching could jump 5+ steps on a single action.

2. VLM verification prompt tuning: added a "partially_verified" status for cases where the core
   outcome is achieved but with minor deviations (cursor position, formatting). Rewrote all
   verification prompts to be outcome-focused, reducing false negatives from live eval scenarios.
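
A minimal sketch of the single-step advancement rule in item 1; the function name, state shape, and
keyword matching below are illustrative assumptions, not the repository's exact implementation:

```python
def advance_at_most_one_step(current_step: int, plan_steps: list[str], action_text: str) -> int:
    """Return the updated step index, advancing by at most one step per call."""
    next_index = current_step + 1
    if next_index >= len(plan_steps):
        return current_step  # already at the last step; nothing to advance to
    # Compare the action against the *next* step only, never the whole remaining plan,
    # so a single action can never jump several steps ahead.
    action = action_text.lower()
    if any(keyword in action for keyword in plan_steps[next_index].lower().split()):
        return next_index
    return current_step
```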

Adds 68 new tests (8 drift prevention + 21 VLM prompt + 9 false-negative regressions + 30 existing
test updates). All 147 controller tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs(cost): add LLM agent economics analysis

Analyzes the unit economics of the closed-loop controller architecture: Claude agent costs, VLM
verifier costs, scaling projections, and a three-phase strategy from loop-as-product to
trained-model-as-product.

* fix(agent): replace pyautogui.drag() with mouseDown/moveTo/mouseUp

pyautogui.drag() uses relative coordinates that compound with starting-position errors, making it
unreliable for small targets like LibreOffice fill handles (~3x3 pixels). Replace it with an explicit
mouseDown/moveTo/mouseUp sequence with timing delays for reliable drag operations.
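
A minimal sketch of the explicit press/move/release sequence described above; the helper name and
delay values are illustrative, not the repository's exact implementation:

```python
import time

import pyautogui


def drag_absolute(start_x: int, start_y: int, end_x: int, end_y: int, settle: float = 0.3) -> None:
    """Drag between absolute coordinates with explicit press/move/release and settle delays."""
    pyautogui.moveTo(start_x, start_y)            # position exactly on the (small) target first
    time.sleep(settle)
    pyautogui.mouseDown()
    time.sleep(settle)
    pyautogui.moveTo(end_x, end_y, duration=0.5)  # smooth absolute move while the button is held
    time.sleep(settle)
    pyautogui.mouseUp()
```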

Also adds a drag case to _build_pixel_command() for the pixel_action() path.

* fix: prevent heuristic/verifier drift and surface partial steps in goal verification

Three issues addressed:

1. Heuristic/verifier step drift: the agent's keyword-based _advance_plan_steps() heuristic and the
   DemoController's VLM verifier operated on independent state, allowing them to disagree on which
   step was current. Fix: add an _external_step_control flag to the agent that the DemoController
   sets at init, making _advance_plan_steps() a no-op when the controller manages step progression
   via VLM verification.

2. partially_verified invisible to goal verification: when steps were marked partially_verified, the
   final goal verification pass had no visibility into which steps had partial completions. Fix:
   _verify_goal() now builds a step verification summary and augments the goal text with it when
   noteworthy statuses (partially_verified, failed) exist.

3. Missing integration tests: added TestHeuristicVerifierSync (4 tests) and
   TestGoalVerificationContext (5 tests), which verify that the heuristic is properly disabled under
   controller management, that step advancement is driven by VLM verification, and that
   partial/failed step context reaches goal verification. Also added 2 agent-level tests for
   _external_step_control behavior.

* fix: suppress stale agent plan progress under external step control

When DemoController sets _external_step_control=True, the agent's internal plan progress injection
and done-override logic become no-ops. This prevents the agent from sending conflicting
step-tracking signals to the Claude model (agent says "step 1 in progress" while the controller says
"step 3 is current").

Three specific suppressions:

1. _build_initial_messages skips plan progress text injection
2. Follow-up messages skip plan progress / demo re-injection
3. The premature "done" override is left to the controller

Adds integration tests exercising agent+controller interaction:

- Agent suppresses progress under external control
- Agent injects progress normally without external control
- Controller's augmented task instruction reaches the agent
- Done override is handled by the controller, not the agent

* fix(adapter): ensure target app is focused after task setup

After WAA setup (close_all → verify_apps → download → open), the target application may be behind
other windows, still loading, or obscured by notifications, and recovering can waste 6+ agent steps.

Add _ensure_app_focused() with a multi-strategy approach (see the sketch after this list):

- Maps task related_apps to window title patterns
- Uses the WAA /setup/activate_window endpoint (same as WAA postconfig)
- Falls back to Alt+Tab
- Retries 3x with increasing delays (2s, 3s, 5s)
- Verifies the foreground window title via pygetwindow on the VM
- Runs during reset(), so it does NOT count against the agent step budget
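
A minimal sketch of the retry-with-increasing-delays idea behind _ensure_app_focused(); the
callables stand in for the activate_window call (or Alt+Tab fallback) and the pygetwindow title
check, and are assumptions rather than the adapter's actual signature:

```python
import time
from typing import Callable, Sequence


def ensure_app_focused(
    activate: Callable[[], None],
    is_focused: Callable[[], bool],
    delays: Sequence[float] = (2, 3, 5),
) -> bool:
    """Retry window activation until the expected app is in the foreground."""
    for delay in delays:
        activate()           # e.g. WAA /setup/activate_window, falling back to Alt+Tab
        time.sleep(delay)    # give the app time to load and come to the foreground
        if is_focused():     # e.g. compare the foreground window title against expected patterns
            return True
    return False
```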

Also adds an _APP_WINDOW_PATTERNS mapping and _get_expected_window_patterns(),
_check_foreground_matches(), and _normalize_app_name() helpers.

* docs: add systematic failure mode analysis and training strategy

Comprehensive analysis of GUI agent failure modes with a taxonomy, recording system design, training
viability assessment, and a prioritized action plan. Key findings:

- 4-category taxonomy: Environment, Agent Planning, Grounding, Verifier
- The existing ExecutionTraceCollector needs only minor extensions
- SFT on 50-100 corrected trajectories is expected to yield a 10-30pp improvement
- Deterministic infrastructure fixes should come first (Tier 1)

* fix: address PR #97 review comments with clarifying comments and test dep

- Add a comment in reset() explaining why _external_step_control is not reset
- Add a comment on the hasattr guard explaining why MagicMock behavior is acceptable
- Add a docstring note in TestFalseNegativeRegressions about the VLM response limitation
- Add flask to the test optional-dependencies for CI coverage

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

### Features

- Add GPU training automation for verl-agent E2E workflow
  ([#87](https://github.com/OpenAdaptAI/openadapt-evals/pull/87),
  [`da17355`](https://github.com/OpenAdaptAI/openadapt-evals/commit/da173553c138ba6c818485ce377589e8d6241200))

* feat: add GPU training automation for verl-agent E2E workflow

- Add GPU_VM_SIZE_FALLBACKS to azure_vm.py (NC48ads_A100_v4, NC24ads, NC12s_v3)
- Add GPU_INSTANCE_TYPE_FALLBACKS to aws_vm.py (p3.8xlarge, g5.12xlarge, p3.2xlarge)
- Update find_available_size_and_region(gpu=True) on both providers + protocol
- Add scripts/setup_gpu_training.sh: installs conda, vLLM, flash-attn, verl-agent
- Add scripts/train_verl_e2e.py: provisions a GPU VM, uploads setup, launches training
- Add oa-vm gpu-setup and gpu-train CLI commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct verl-agent Hydra config paths and document integration gap

Validated all 17 Hydra config paths against verl-agent's actual schema (ppo_trainer.yaml +
make_envs()). Key fixes:

- env.env_name: use the 'waa_desktop' short name, not a Python import path (verl-agent uses
  hardcoded dispatch, not dynamic imports)
- Remove env.env_kwargs (it doesn't exist); use env.waa.* sub-keys instead
- Add data.train_files/val_files (required parquet, generated via data_preprocess.prepare --mode
  visual)
- Add missing overrides: algorithm.gamma, gpu_memory_utilization, ppo_mini_batch_size,
  filter_overlong_prompts, test_freq
- Add prepare_training_data() and patch_env_manager() steps
- Document the EnvironmentManagerBase integration gap in the decision doc

* fix: replace EnvironmentManagerBase with VAGEN registry-based env integration

The previous implementation incorrectly assumed verl-agent uses an EnvironmentManagerBase ABC with a
hardcoded make_envs() dispatch. Research reveals VAGEN actually uses:

- the GymImageEnv protocol (which WAADesktopEnv already implements)
- a YAML-based env registry (vagen/configs/env_registry.yaml)
- GymAgentLoop for training-time rollout orchestration

Changes:

- Replace patch_env_manager() with register_waa_env() (YAML registry)
- Add register_in_vagen() and generate_env_spec() helpers to verl_env.py
- Update launch_training() to generate a proper VAGEN training config
- Fix the Integration Gap section in the decision doc (no EnvironmentManagerBase)
- Update the training config YAML with an architecture diagram
- Add 5 new tests for the registration helpers (40 total, all passing)
- Export the new helpers from adapters/__init__.py

* fix: correct is_action_valid logic, scroll_direction, stale refs, and DRY violation

Review fixes for the GPU training automation branch:

- Fix is_action_valid: it was inverted (DONE() → invalid, garbage → valid); it now uses a regex
  match on the original action string (see the sketch after this list)
- Fix scroll_direction: SCROLL parsing now populates BenchmarkAction.scroll_direction
- Fix stale repo URLs: mll-lab-nu/VAGEN → RAGEN-AI/VAGEN across vendored files and docs
- Fix a stale branch ref: setup_gpu_training.sh referenced a merged spike branch, now uses main
- Fix a stale repo URL: langfengQ/verl-agent → RAGEN-AI/VAGEN in the setup script
- Add --recurse-submodules to git clone (verl is a VAGEN submodule)
- Remove dead params from register_waa_env() (waa_server, task_id, max_steps)
- Deduplicate the training command: vm_cli.py now delegates to launch_training()
- Update the test count in docs: 21 → 40+
- Add 3 new tests for is_action_valid behavior
- Add a scroll_direction assertion to the existing scroll test
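
A minimal sketch of the corrected validity check from the first bullet; the accepted action grammar
below is an assumption for illustration, not the adapter's actual action set:

```python
import re

# Recognizable action forms are valid (including DONE()); anything else is rejected.
_ACTION_RE = re.compile(r"^\s*(CLICK|DOUBLE_CLICK|TYPE|SCROLL|DRAG|DONE)\s*\(", re.IGNORECASE)


def is_action_valid(action: str) -> bool:
    """Match against the original action string rather than a transformed copy."""
    return bool(_ACTION_RE.match(action))
```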

All 43 tests pass.

* fix: resolve lint errors (undefined use_fast, unused imports, f-strings)

- Remove the undefined `use_fast` guard; always log the tried sizes on failure
- Remove the unused PoolManager import in vm_cli.py
- Remove extraneous f-string prefixes
- Remove the unused boto3 and SSH_OPTS imports in aws_vm.py

* fix: add evaluate_url support and E2E validation test

WAADesktopEnv now correctly separates the two endpoints (see the sketch after this list):

- server_url (port 5000): Windows VM Flask API (/screenshot, /execute_windows)
- evaluate_url (port 5001): evaluate_server.py (/setup, /evaluate, /probe)
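
A minimal sketch of the two-port split; only the routes and ports come from the entry above, and the
request shapes are illustrative assumptions:

```python
import requests

SERVER_URL = "http://localhost:5000"    # WAA Flask API: /screenshot, /execute_windows
EVALUATE_URL = "http://localhost:5001"  # evaluate_server.py: /setup, /evaluate, /probe

# Observations and action execution go to port 5000; setup/evaluation go to port 5001.
screenshot = requests.get(f"{SERVER_URL}/screenshot", timeout=30)
probe = requests.get(f"{EVALUATE_URL}/probe", timeout=30)
```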

Previously, the single server_url default pointed at 5001 (the evaluate server only), which caused
404s for screenshots and action execution.

Also adds scripts/test_verl_env_e2e.py, validated on AWS g5.xlarge (A10G) with a UNIX socket bridge
proxy chain to the Azure WAA VM.

* fix: use Deep Learning AMI for GPU instances and fix setup issues

- Add _find_latest_dl_ami() for GPU VMs (pre-installed NVIDIA drivers + CUDA)
- Add a gpu param to create_vm() to select the DL AMI vs standard Ubuntu
- Reorder GPU_INSTANCE_TYPE_FALLBACKS: prefer g5 (Ampere/A10G) over p3 (Volta/V100), since the OSS
  NVIDIA driver requires GSP (Turing+)
- Make OPENADAPT_EVALS_BRANCH configurable via an env var in the setup script
- Add a conda TOS acceptance step (required since Miniconda 2025)

Validated on AWS g5.xlarge with an NVIDIA A10G 24GB GPU.

* docs: add GPU E2E validation report with artifacts

Documents the successful end-to-end validation of the verl-agent/VAGEN training pipeline on AWS
g5.xlarge (A10G 24GB) connecting to the Azure WAA VM. Includes architecture diagrams, proxy chain
details, raw test output, version listings, and issues discovered during validation.

* fix: resolve port inconsistencies and add missing context in validation docs

- Standardize the evaluate_url port to 5051 (socat bridge) across all docs
- Add an Artifact Stage column to the validation results table, mapping tests to raw output
- Add the docs commit (c2555ef) to the PR #87 commit list
- Clarify the 5050 vs 5051 port mapping in the architecture diagrams and data flow
- Expand e2e_test_output.txt Stage 7/8 with sub-steps matching the README table
- Add an SSH tunnel tip noting that the socat bridge is still required

* fix: clarify uvicorn version discrepancy and complete commit list

- Add a note to gpu_vm_stack_versions.txt explaining that the full pip list is from Stage 5 (vLLM
  install) and that uvicorn was later downgraded by VAGEN
- Add b7efb4f to the commit list in README.md

* fix: guard flash-attn install for Ampere+ GPUs and validate training data

- Check GPU compute capability before installing flash-attn; V100s (sm_70) don't support Flash
  Attention 2 (which requires sm_80+) and would fail at build or runtime (see the sketch after this
  list)
- Add post-preparation validation to prepare_training_data() ensuring the expected parquet files
  exist and are non-empty, rather than silently proceeding with missing data
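
The actual guard lives in the shell setup script; a minimal equivalent compute-capability check,
assuming torch is available on the GPU box, looks like this, and flash-attn would only be installed
when it returns True:

```python
import torch


def supports_flash_attention_2() -> bool:
    """Flash Attention 2 requires compute capability >= 8.0 (Ampere+); V100 is sm_70."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(0)
    return major >= 8
```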

* fix: update test to match server_url default port 5000

The generate_env_spec() default server_url is http://localhost:5000 (the WAA Flask API port), not
5001. The test expectation was stale.

* fix: split server_url/evaluate_url in training config and CLI args

The two-port WAA architecture uses separate endpoints:

- server_url (port 5000): WAA Flask API for screenshots and actions
- evaluate_url (port 5001): evaluate_server for setup and evaluate

Previously, --waa-server defaulted to port 5001 and was assigned to server_url, conflating the two
endpoints. This fixes:

- train_verl_e2e.py: --waa-server defaults to 5000; add --evaluate-server
- vm_cli.py gpu-train: same CLI arg fixes; pass evaluate_url through
- train_waa_vagen.yaml: correct server_url to 5000; add evaluate_url
- Fix nested single quotes in register_waa_env (use a heredoc instead)
- Replace the fragile sys.path.insert with importlib.util (see the sketch after this list)
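
A minimal sketch of loading a module by file path with importlib.util instead of mutating sys.path;
the module name and path below are placeholders, not the repository's actual values:

```python
import importlib.util

spec = importlib.util.spec_from_file_location("verl_env", "/path/to/verl_env.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)  # no sys.path mutation required
```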

* fix: correct stale port in verl_env docstring and SSH tunnel comment

- verl_env.py docstring: server_url example 5001 -> 5000; add evaluate_url
- train_waa_vagen.yaml: SSH tunnel dest 5050 -> 5051 (the socat bridge, not the broken Docker port)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## v0.29.0 (2026-03-03)

### Documentation

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
[project]
name = "openadapt-evals"
-version = "0.29.0"
+version = "0.30.0"
description = "Evaluation infrastructure for GUI agent benchmarks"
readme = "README.md"
requires-python = ">=3.10"

0 commit comments
