Commit 03b4f3d

semantic-release committed
chore: release 0.25.0
1 parent 0183321 commit 03b4f3d

2 files changed

Lines changed: 102 additions & 1 deletion

CHANGELOG.md

Lines changed: 101 additions & 0 deletions
@@ -1,6 +1,107 @@
# CHANGELOG

## v0.25.0 (2026-03-03)

### Bug Fixes

- **agent**: Replace manual string escaping with repr() and fix CU agent bugs
  ([#83](https://github.com/OpenAdaptAI/openadapt-evals/pull/83),
  [`ffcb41d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ffcb41d9a2dd6cc53eae4bf478d2d3b139d22b84))

* fix(agent): replace manual string escaping with repr() and fix CU agent bugs

Five reliability fixes for eval runs:

1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class
   of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism

2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) →
   start_coordinate/coordinate (snake_case) per Claude computer_use API

3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to
   click, drag, and mouse_move actions

4. Re-inject demo text at every step in tool_result messages to prevent context drift in
   demo-conditioned evaluation

5. Add command logging in WAALiveAdapter.step() for debugging
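
To illustrate fix 1 above, here is a minimal sketch of why `repr()` is a safe way to embed arbitrary text in generated Python source. The helper name `build_type_command` and the `pyautogui.write(...)` output format are assumptions made for illustration; the changelog does not show what `_build_type_commands()` actually emits.

```python
import ast


def build_type_command(text: str) -> str:
    """Return Python source for a pyautogui call that types `text`.

    Hypothetical stand-in for _build_type_commands(); the real function's
    signature and output format are not shown in this changelog.
    """
    # repr() emits a valid Python string literal, so quotes, backslashes,
    # newlines, tabs, and non-ASCII characters need no hand-written escaping.
    return f"pyautogui.write({text!r})"


def literal_round_trips(text: str) -> bool:
    """Check that the repr() literal parses back to the original string."""
    return ast.literal_eval(repr(text)) == text


tricky = 'He said "hi"\n\tit\'s a C:\\path \u00e9'
command = build_type_command(tricky)

# compile() only checks syntax; it raises SyntaxError if the generated
# source is invalid and does not import or run pyautogui.
compile(command, "<generated>", "exec")
```

The round-trip check is the same kind of invariant that a test such as `test_all_special_chars_produce_valid_python` would plausibly assert, though that test's body is not part of this changelog.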

Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review
on demo-conditioning approaches.
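
Fix 3 in the list above (coordinate clamping) can be sketched as follows. This is an illustration under stated assumptions, not the project's `_clamp_coord()`: it assumes fractional coordinates in the 0.0-1.0 range and a margin of 0.005, which matches the clamped value `(0.005, 0.005)` that the test fix in this release expects. Clamping the upper edge symmetrically is an additional assumption.

```python
def clamp_coord(value: float, margin: float = 0.005) -> float:
    """Keep a fractional coordinate away from the screen edges.

    Hypothetical sketch: the margin default and the symmetric upper bound
    are assumptions, not taken from the project's source.
    """
    return min(max(value, margin), 1.0 - margin)


def clamp_point(x: float, y: float) -> tuple[float, float]:
    """Clamp both axes so a (0, 0) target no longer lands on the corner."""
    return clamp_coord(x), clamp_coord(y)


corner = clamp_point(0.0, 0.0)   # moved just inside the top-left corner
center = clamp_point(0.5, 0.5)   # interior points pass through unchanged
```

Applying the same helper to click, drag, and mouse_move targets keeps the behavior consistent across actions.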

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add multi-level demo format transform and fix tests

- Add scripts/transform_demo_format.py: transforms rigid {Observation, Intent, Action, Result} demos
  into adaptive {Think, Action, Expect} format with PLAN section (Option D from eval analysis)
  - LLM-assisted mode (default): uses vlm_call() for semantic transform
  - Rule-based mode (--no-llm): free, no API calls needed
  - Supports --dry-run for preview

- Fix tests for repr() escaping and coordinate clamping:
  - Remove TestEscapeForPyautogui (tests deleted function)
  - Update TestBuildTypeCommands for repr() output format
  - Add test_all_special_chars_produce_valid_python invariant test
  - Fix drag test to use snake_case field names
  - Fix coordinate edge test to expect clamped (0.005, 0.005)

- Regenerate uv.lock for consilium package name resolution
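
The rule-based demo transform described above might look roughly like the sketch below. The field mapping is an assumption chosen for illustration (Observation and Intent merged into Think, Result reused as Expect, Intents collected into the PLAN); the actual rules in scripts/transform_demo_format.py are not shown in this changelog.

```python
def transform_step(step: dict[str, str]) -> dict[str, str]:
    """Map one rigid demo step to the {Think, Action, Expect} shape.

    Hypothetical mapping: the real script may combine fields differently.
    """
    parts = (step.get("Observation", ""), step.get("Intent", ""))
    think = " ".join(part for part in parts if part)
    return {
        "Think": think,
        "Action": step["Action"],
        "Expect": step.get("Result", ""),
    }


def build_plan(steps: list[dict[str, str]]) -> list[str]:
    """Collect the per-step intents into a numbered PLAN section."""
    intents = [s["Intent"] for s in steps if s.get("Intent")]
    return [f"{i}. {intent}" for i, intent in enumerate(intents, start=1)]


demo = [
    {
        "Observation": "Spreadsheet is empty.",
        "Intent": "Enter the header row.",
        "Action": "TYPE Year",
        "Result": "Cell A1 shows Year.",
    },
]
transformed = [transform_step(s) for s in demo]
plan = build_plan(demo)
```

A rule-based pass like this needs no API calls, which is the trade-off the `--no-llm` mode describes; a semantic rewrite would need the LLM-assisted path.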

* docs: add DC-multilevel eval results to analysis

DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear improvement over DC-rigid:
agent followed the plan, entered all headers and years, typed correct formula, used drag-fill.
Still scored 0.0 due to premature task completion (finished 1/3 columns), but qualitatively the
best behavior across all three conditions.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

### Features

- Add VAGEN/verl-agent environment adapter for VLM RL training
  ([`0183321`](https://github.com/OpenAdaptAI/openadapt-evals/commit/018332168389b4be74660ecbc754eef5768f6b51))

* feat: add VAGEN/verl-agent environment adapter for VLM RL training

WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
training with verl-agent's multi-turn VLM RL pipeline (GiGPO, GRPO, PPO).

The adapter translates between openadapt-evals BenchmarkObservation (PNG bytes + a11y tree) and
VAGEN's observation format (obs_str + multi_modal_input with PIL images).

- Async interface (reset/step/close/system_prompt)
- Action DSL parsing (CLICK, TYPE, KEY, SCROLL, WAIT, DONE)
- Fractional coordinate support (0.0-1.0)
- Lazy adapter initialization
- 21 tests passing with mock adapter
- Example VAGEN training config included
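
As a rough illustration of the action DSL parsing and fractional coordinate support listed above, the sketch below parses a call-style syntax such as `CLICK(0.5, 0.25)`. The concrete syntax, the `Action` dataclass, and both helper names are assumptions; only the six action names and the 0.0-1.0 coordinate range come from this changelog.

```python
import ast
import re
from dataclasses import dataclass

ACTIONS = {"CLICK", "TYPE", "KEY", "SCROLL", "WAIT", "DONE"}
# Assumed syntax: NAME(arg, ...) with Python-literal arguments.
_PATTERN = re.compile(r"^\s*([A-Z]+)\((.*)\)\s*$", re.DOTALL)


@dataclass(frozen=True)
class Action:
    name: str
    args: tuple


def parse_action(text: str) -> Action:
    """Parse one DSL line into an Action, rejecting unknown verbs."""
    match = _PATTERN.match(text)
    if match is None or match.group(1) not in ACTIONS:
        raise ValueError(f"unrecognized action: {text!r}")
    inner = match.group(2).strip()
    # literal_eval accepts only literals, so model output is never executed.
    args = ast.literal_eval(f"({inner},)") if inner else ()
    return Action(match.group(1), args)


def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert fractional (0.0-1.0) coordinates to in-bounds pixel indices."""
    px = min(int(x * width), width - 1)
    py = min(int(y * height), height - 1)
    return px, py


click = parse_action("CLICK(0.5, 0.25)")
target = to_pixels(*click.args, width=1920, height=1080)
```

Fractional coordinates keep the policy's outputs independent of screen resolution; the conversion to pixels happens once, at the environment boundary.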

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add comprehensive verl-agent decision document

Records the full reasoning chain for choosing verl-agent/VAGEN:

- Framework comparison (TRL, standalone, verl-agent, VAGEN, OpenRLHF, Unsloth)
- Key insight: per-step verification via GiGPO for long-horizon GUI tasks
- TRL multi-turn VLM blocker (issues #5119, #5120)
- "Environment is the moat" strategic framing
- Architecture diagram and migration path

* feat: add verl-agent as optional dependency

* feat: vendor GymImageEnv base classes from VAGEN

* docs: fact-check framework review in verl decision doc

Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from
thorough review:

- OpenRLHF: document AgentTrainer multi-turn support and OpenRLHF-M fork
- Unsloth: nuanced assessment - single-turn VLM works, multi-turn text via ART works, but
  multi-turn VLM blocked by rollout_func issue (#3573)
- TRL: add note about OpenEnv/rollout_func for text models (VLM blocked)
- Comparison matrix: add Unsloth column with footnotes

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


## v0.24.0 (2026-03-03)

### Documentation

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "openadapt-evals"
-version = "0.24.0"
+version = "0.25.0"
description = "Evaluation infrastructure for GUI agent benchmarks"
readme = "README.md"
requires-python = ">=3.10"
