Commit a641256

chore: release 0.27.0
Author: semantic-release
1 parent 5c0aa52 commit a641256

2 files changed

Lines changed: 113 additions & 34 deletions

CHANGELOG.md

Lines changed: 112 additions & 33 deletions
@@ -1,6 +1,21 @@
 # CHANGELOG


+## v0.27.0 (2026-03-03)
+
+### Features
+
+- Add observe_pil() convenience method for PIL image output
+([#93](https://github.com/OpenAdaptAI/openadapt-evals/pull/93),
+[`5c0aa52`](https://github.com/OpenAdaptAI/openadapt-evals/commit/5c0aa527fbf6d7a404bb6289f588bf8fdfe32800))
+
+Add observe_pil() to WAALiveAdapter and RLEnvironment for VLM/RL pipelines that work with PIL images
+directly. Also clean up changelog formatting (remove leaked Co-authored-by trailer lines, fix
+collapsed bullet lists).
+
+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+
+
 ## v0.26.0 (2026-03-03)

 ### Documentation
@@ -9,17 +24,24 @@
 ([#90](https://github.com/OpenAdaptAI/openadapt-evals/pull/90),
 [`ca6a936`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ca6a9362556852bd6ad040ba9ac7a5dfe3a7d880))

+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+
 ### Features

 - Add TaskVerifierRegistry for custom task verification
 ([#89](https://github.com/OpenAdaptAI/openadapt-evals/pull/89),
 [`639a6a2`](https://github.com/OpenAdaptAI/openadapt-evals/commit/639a6a2ba2a15e0c7a2a3bd65fa57a38f6966965))
-- TaskVerifierRegistry with decorator and programmatic registration
-- VerificationResult dataclass with success/score/details
-- WAALiveAdapter.run_powershell() for executing PowerShell on the VM
-- Built-in clear_browsing_data reference verifier
-- 33 tests covering registry operations and built-in verifiers
-- Exports from evaluation package and main package __init__
+
+Add a registry pattern for custom task verifiers that can inspect VM state after task execution.
+This enables GoTo IT Autopilot (and other integrators) to register domain-specific verification
+functions without subclassing BenchmarkAdapter.
+
+- TaskVerifierRegistry with decorator and programmatic registration - VerificationResult dataclass
+with success/score/details - WAALiveAdapter.run_powershell() for executing PowerShell on the VM -
+Built-in clear_browsing_data reference verifier - 33 tests covering registry operations and
+built-in verifiers - Exports from evaluation package and main package __init__
+
+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


 ## v0.25.1 (2026-03-03)
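
The v0.26.0 entry in the hunk above describes a registry with decorator and programmatic registration, plus a result dataclass carrying success/score/details. A minimal sketch of that pattern follows. It is not the project's implementation: the method names `register` and `verify`, the `FakeAdapter` class, the placeholder path, and the assumption that `run_powershell()` returns stdout as a string are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class VerificationResult:
    """Outcome of one verifier run; fields follow the changelog's success/score/details."""
    success: bool
    score: float = 0.0
    details: Dict[str, Any] = field(default_factory=dict)


Verifier = Callable[[Any], VerificationResult]


class TaskVerifierRegistry:
    """Maps a task id to a verifier callable (hypothetical API)."""

    def __init__(self) -> None:
        self._verifiers: Dict[str, Verifier] = {}

    def register(self, task_id: str, fn: Optional[Verifier] = None):
        """Call as register(id, fn), or use as a decorator: @register(id)."""
        def _add(func: Verifier) -> Verifier:
            self._verifiers[task_id] = func
            return func
        return _add(fn) if fn is not None else _add

    def verify(self, task_id: str, adapter: Any) -> VerificationResult:
        verifier = self._verifiers.get(task_id)
        if verifier is None:
            return VerificationResult(False, 0.0, {"error": f"no verifier for {task_id!r}"})
        return verifier(adapter)


registry = TaskVerifierRegistry()


@registry.register("clear_browsing_data")
def history_is_gone(adapter: Any) -> VerificationResult:
    # Assumption: run_powershell() returns the command's stdout as a string.
    # The path is a placeholder, not a real browser location.
    out = adapter.run_powershell(r"Test-Path C:\placeholder\History")
    gone = out.strip() == "False"
    return VerificationResult(gone, 1.0 if gone else 0.0, {"stdout": out})


class FakeAdapter:
    """Stand-in for a live VM adapter so the sketch runs offline."""

    def __init__(self, stdout: str) -> None:
        self.stdout = stdout

    def run_powershell(self, command: str) -> str:
        return self.stdout


# Programmatic registration: same method, with the callable passed directly.
registry.register("always_pass", lambda adapter: VerificationResult(True, 1.0))

result = registry.verify("clear_browsing_data", FakeAdapter("False\r\n"))
print(result.success, result.score)
```

Keeping verifiers as plain callables keyed by task id is what lets an integrator add checks without subclassing an adapter, which is the motivation the changelog entry gives.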
@@ -29,15 +51,17 @@
 - Address review findings in verl-agent adapter
 ([#88](https://github.com/OpenAdaptAI/openadapt-evals/pull/88),
 [`879c53c`](https://github.com/OpenAdaptAI/openadapt-evals/commit/879c53c182853104a4a8c5e179e810185180d37a))
-- Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction
-- Fix DRAG parsing to include end_x/end_y coordinates
-- Fix is_action_valid logic: use pattern match instead of inverted condition
-- Fix fractional coord conversion: trust _use_fractional flag
-- Convert drag end coordinates from fractional to pixel
-- Add health_check() method returning ready/busy/needs_recovery/not_initialized
-- Add DRAG to system prompt DSL documentation
-- Fix vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI)
-- Add 12 new tests
+
+- Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction - Fix DRAG parsing to
+include end_x/end_y coordinates - Fix is_action_valid logic: use pattern match instead of inverted
+condition - Fix fractional coord conversion: trust _use_fractional flag instead of checking value
+ranges (0 and 1 are ambiguous between frac and pixel) - Convert drag end coordinates (end_x/end_y)
+from fractional to pixel - Add health_check() method returning
+ready/busy/needs_recovery/not_initialized - Add DRAG to system prompt DSL documentation - Fix
+vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI) - Add 12 new tests: scroll direction, drag
+coords, health_check, is_action_valid
+
+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


 ## v0.25.0 (2026-03-03)
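
The v0.25.1 entry in the hunk above notes that 0 and 1 are ambiguous between fractional and pixel coordinates, so the conversion should trust an explicit flag rather than inspect value ranges. The sketch below illustrates why. Both function names and the screen size are hypothetical, and the rounding and edge handling are assumptions rather than the adapter's actual behavior.

```python
from typing import Tuple


def to_pixels_by_range(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Range heuristic (the fragile approach): treat values in [0, 1] as fractions."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return min(int(x * width), width - 1), min(int(y * height), height - 1)
    return int(x), int(y)


def to_pixels(x: float, y: float, width: int, height: int, use_fractional: bool) -> Tuple[int, int]:
    """Flag-based conversion: the caller states which coordinate system is in use."""
    if use_fractional:
        # Keep the result on screen when the fraction is exactly 1.0.
        return min(int(x * width), width - 1), min(int(y * height), height - 1)
    return int(x), int(y)


WIDTH, HEIGHT = 1920, 1080

# A pixel click at (1, 1) is misread by the heuristic as the bottom-right corner.
print(to_pixels_by_range(1, 1, WIDTH, HEIGHT))
# With the flag, the same input is unambiguous in both modes.
print(to_pixels(1, 1, WIDTH, HEIGHT, use_fractional=False))
print(to_pixels(1, 1, WIDTH, HEIGHT, use_fractional=True))
```

The same flag-based conversion would apply to a drag's end coordinates, which the entry lists as a separate fix.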
@@ -47,31 +71,86 @@
 - **agent**: Replace manual string escaping with repr() and fix CU agent bugs
 ([#83](https://github.com/OpenAdaptAI/openadapt-evals/pull/83),
 [`ffcb41d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ffcb41d9a2dd6cc53eae4bf478d2d3b139d22b84))
-1. Replace `_escape_for_pyautogui()` with `repr()` in `_build_type_commands()`
-2. Fix drag coordinate field names: camelCase to snake_case per Claude computer_use API
-3. Add `_clamp_coord()` to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe
-4. Re-inject demo text at every step to prevent context drift in demo-conditioned evaluation
-5. Add command logging in WAALiveAdapter.step() for debugging
-- Add `scripts/transform_demo_format.py`: transforms rigid demos into adaptive format
-- Fix tests for repr() escaping and coordinate clamping
-- Add eval analysis document with ZS vs DC results
+
+* fix(agent): replace manual string escaping with repr() and fix CU agent bugs
+
+Five reliability fixes for eval runs:
+
+1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class
+of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism
+
+2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) →
+start_coordinate/coordinate (snake_case) per Claude computer_use API
+
+3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to
+click, drag, and mouse_move actions
+
+4. Re-inject demo text at every step in tool_result messages to prevent context drift in
+demo-conditioned evaluation
+
+5. Add command logging in WAALiveAdapter.step() for debugging
+
+Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review
+on demo-conditioning approaches.
+
+Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+
+* feat: add multi-level demo format transform and fix tests
+
+- Add scripts/transform_demo_format.py: transforms rigid {Observation, Intent, Action, Result} demos
+into adaptive {Think, Action, Expect} format with PLAN section (Option D from eval analysis) -
+LLM-assisted mode (default): uses vlm_call() for semantic transform - Rule-based mode (--no-llm):
+free, no API calls needed - Supports --dry-run for preview
+
+- Fix tests for repr() escaping and coordinate clamping: - Remove TestEscapeForPyautogui (tests
+deleted function) - Update TestBuildTypeCommands for repr() output format - Add
+test_all_special_chars_produce_valid_python invariant test - Fix drag test to use snake_case field
+names - Fix coordinate edge test to expect clamped (0.005, 0.005)
+
+- Regenerate uv.lock for consilium package name resolution
+
+* docs: add DC-multilevel eval results to analysis
+
+DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear improvement over DC-rigid:
+agent followed the plan, entered all headers and years, typed correct formula, used drag-fill.
+Still scored 0.0 due to premature task completion (finished 1/3 columns), but qualitatively the
+best behavior across all three conditions.
+
+---------
+
+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

 ### Features

 - Add VAGEN/verl-agent environment adapter for VLM RL training
 ([`0183321`](https://github.com/OpenAdaptAI/openadapt-evals/commit/018332168389b4be74660ecbc754eef5768f6b51))

-WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
+* feat: add VAGEN/verl-agent environment adapter for VLM RL training
+
+WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
 training with verl-agent's multi-turn VLM RL pipeline (GiGPO, GRPO, PPO).
-- Async interface (reset/step/close/system_prompt)
-- Action DSL parsing (CLICK, TYPE, KEY, SCROLL, WAIT, DONE)
-- Fractional coordinate support (0.0-1.0)
-- Lazy adapter initialization
-- 21 tests passing with mock adapter
-- Example VAGEN training config included
-- Comprehensive verl-agent decision document
-- Vendor GymImageEnv base classes from VAGEN
-- Fact-check framework review in verl decision doc
+
+The adapter translates between openadapt-evals BenchmarkObservation (PNG bytes + a11y tree) and
+VAGEN's observation format (obs_str + multi_modal_input with PIL images).
+
+- Async interface (reset/step/close/system_prompt) - Action DSL parsing (CLICK, TYPE, KEY, SCROLL,
+WAIT, DONE) - Fractional coordinate support (0.0-1.0) - Lazy adapter initialization - 21 tests
+passing with mock adapter - Example VAGEN training config included
+
+Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+
+* docs: add comprehensive verl-agent decision document
+
+Records the full reasoning chain for choosing verl-agent/VAGEN: - Framework comparison (TRL,
+standalone, verl-agent, VAGEN, OpenRLHF, Unsloth) - Key insight: per-step verification via GiGPO
+for long-horizon GUI tasks - TRL multi-turn VLM blocker (issues #5119, #5120) - "Environment is
+the moat" strategic framing - Architecture diagram and migration path
+
+* feat: add verl-agent as optional dependency
+
+* feat: vendor GymImageEnv base classes from VAGEN
+
+* docs: fact-check framework review in verl decision doc

 Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from
 thorough review:
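
Two of the v0.25.0 fixes in the hunk above lend themselves to a short illustration: using `repr()` to embed arbitrary text in a generated command, and clamping coordinates away from the screen corner. This is a sketch under stated assumptions, not the project's code. `build_type_command`, `embedded_literal`, and `clamp_coord` are hypothetical stand-ins for the private helpers named in the changelog; the 0.005 margin comes from the changelog's clamped test value, and applying the same margin at the upper edge is an assumption.

```python
import ast


def build_type_command(text: str) -> str:
    """Embed text in a pyautogui.write() call, letting repr() do the escaping."""
    return f"pyautogui.write({text!r})"


def embedded_literal(command: str) -> str:
    """Parse the generated command and recover the string argument."""
    call = ast.parse(command, mode="eval").body
    return ast.literal_eval(call.args[0])


def clamp_coord(value: float, margin: float = 0.005) -> float:
    """Keep a fractional coordinate inside [margin, 1 - margin]."""
    return min(max(value, margin), 1.0 - margin)


tricky = 'line1\nline2\t"double" \'single\' back\\slash'
command = build_type_command(tricky)

# The command is valid Python and the argument round-trips unchanged.
print(embedded_literal(command) == tricky)
# A (0, 0) target is nudged off the corner that triggers the fail-safe.
print(clamp_coord(0.0), clamp_coord(0.0))
```

Round-tripping through `ast` checks the same invariant the changelog's `test_all_special_chars_produce_valid_python` name suggests: whatever the input, the generated command must remain parseable Python.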

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "openadapt-evals"
-version = "0.26.0"
+version = "0.27.0"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.10"
