99 ([#90](https://github.com/OpenAdaptAI/openadapt-evals/pull/90),
1010 [`ca6a936`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ca6a9362556852bd6ad040ba9ac7a5dfe3a7d880))
1111
12- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
13-
1412### Features
1513
1614- Add TaskVerifierRegistry for custom task verification
1715 ([#89](https://github.com/OpenAdaptAI/openadapt-evals/pull/89),
1816 [`639a6a2`](https://github.com/OpenAdaptAI/openadapt-evals/commit/639a6a2ba2a15e0c7a2a3bd65fa57a38f6966965))
19-
20- Add a registry pattern for custom task verifiers that can inspect VM state after task execution.
21- This enables GoTo IT Autopilot (and other integrators) to register domain-specific verification
22- functions without subclassing BenchmarkAdapter.
23-
24- - TaskVerifierRegistry with decorator and programmatic registration - VerificationResult dataclass
25- with success/score/details - WAALiveAdapter.run_powershell() for executing PowerShell on the VM -
26- Built-in clear_browsing_data reference verifier - 33 tests covering registry operations and
27- built-in verifiers - Exports from evaluation package and main package __init__
28-
29- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
17+ - TaskVerifierRegistry with decorator and programmatic registration
18+ - VerificationResult dataclass with success/score/details
19+ - WAALiveAdapter.run_powershell() for executing PowerShell on the VM
20+ - Built-in clear_browsing_data reference verifier
21+ - 33 tests covering registry operations and built-in verifiers
22+ - Exports from evaluation package and main package __init__
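The registry pattern described in this entry can be sketched roughly as follows. This is a minimal, self-contained illustration and not the package's actual code: only the names `TaskVerifierRegistry` and `VerificationResult` and the `success`/`score`/`details` fields come from the changelog. The `register()` and `verify()` signatures, the `vm_state` dictionary, and the `history_entries` key are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class VerificationResult:
    """Outcome of one verifier run; field names follow the changelog entry."""
    success: bool
    score: float = 0.0
    details: Dict[str, Any] = field(default_factory=dict)


Verifier = Callable[[Any], VerificationResult]


class TaskVerifierRegistry:
    """Hypothetical registry mapping a task id to a verification function."""

    def __init__(self) -> None:
        self._verifiers: Dict[str, Verifier] = {}

    def register(self, task_id: str, fn: Optional[Verifier] = None):
        # Programmatic form: register(task_id, fn)
        if fn is not None:
            self._verifiers[task_id] = fn
            return fn

        # Decorator form: @registry.register(task_id)
        def decorator(func: Verifier) -> Verifier:
            self._verifiers[task_id] = func
            return func

        return decorator

    def verify(self, task_id: str, vm_state: Any) -> VerificationResult:
        if task_id not in self._verifiers:
            raise KeyError(f"no verifier registered for task {task_id!r}")
        return self._verifiers[task_id](vm_state)


registry = TaskVerifierRegistry()


@registry.register("clear_browsing_data")
def verify_cleared(vm_state: Dict[str, Any]) -> VerificationResult:
    # `history_entries` is an invented key standing in for inspected VM state.
    remaining = vm_state.get("history_entries", 0)
    ok = remaining == 0
    return VerificationResult(ok, 1.0 if ok else 0.0, {"remaining": remaining})


# Programmatic registration of a trivial verifier.
registry.register("always_pass", lambda vm_state: VerificationResult(True, 1.0))
```

Supporting both forms lets an integrator attach a verifier at definition time or wire one in from configuration, in either case without subclassing an adapter.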
3023
3124
3225## v0.25.1 (2026-03-03)
@@ -36,17 +29,15 @@ Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
3629- Address review findings in verl-agent adapter
3730 ([#88](https://github.com/OpenAdaptAI/openadapt-evals/pull/88),
3831 [`879c53c`](https://github.com/OpenAdaptAI/openadapt-evals/commit/879c53c182853104a4a8c5e179e810185180d37a))
39-
40- - Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction - Fix DRAG parsing to
41- include end_x/end_y coordinates - Fix is_action_valid logic: use pattern match instead of inverted
42- condition - Fix fractional coord conversion: trust _use_fractional flag instead of checking value
43- ranges (0 and 1 are ambiguous between frac and pixel) - Convert drag end coordinates (end_x/end_y)
44- from fractional to pixel - Add health_check() method returning
45- ready/busy/needs_recovery/not_initialized - Add DRAG to system prompt DSL documentation - Fix
46- vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI) - Add 12 new tests: scroll direction, drag
47- coords, health_check, is_action_valid
48-
49- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
32+ - Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction
33+ - Fix DRAG parsing to include end_x/end_y coordinates
34+ - Fix is_action_valid logic: use pattern match instead of inverted condition
35+ - Fix fractional coord conversion: trust _use_fractional flag
36+ - Convert drag end coordinates from fractional to pixel
37+ - Add health_check() method returning ready/busy/needs_recovery/not_initialized
38+ - Add DRAG to system prompt DSL documentation
39+ - Fix vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI)
40+ - Add 12 new tests
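The fractional-coordinate fix can be illustrated with a small sketch. The values 0 and 1 are valid both as fractions and as pixel positions, so a range check cannot tell the two apart; an explicit flag can. The function name `to_pixel`, its parameters, and the rounding and clamping choices are assumptions for illustration, not the adapter's real code.

```python
from typing import Tuple


def to_pixel(x: float, y: float, width: int, height: int,
             use_fractional: bool) -> Tuple[int, int]:
    """Convert a coordinate pair to pixels, trusting the explicit flag.

    Inferring the mode from the values (for example `0 <= x <= 1`) would
    misread the pixel (1, 1) as the bottom-right corner of the screen.
    """
    if not use_fractional:
        return int(x), int(y)
    # Assumed convention: scale by the screen size, then clamp so that
    # 1.0 maps to the last valid pixel rather than one past the edge.
    px = min(int(x * width), width - 1)
    py = min(int(y * height), height - 1)
    return px, py
```

With a 1920x1080 screen, `to_pixel(1, 1, 1920, 1080, use_fractional=False)` stays at pixel `(1, 1)`, while the same input with `use_fractional=True` maps to the bottom-right pixel.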
5041
5142
5243## v0.25.0 (2026-03-03)
@@ -56,86 +47,31 @@ Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
5647- **agent**: Replace manual string escaping with repr() and fix CU agent bugs
5748 ([#83](https://github.com/OpenAdaptAI/openadapt-evals/pull/83),
5849 [`ffcb41d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ffcb41d9a2dd6cc53eae4bf478d2d3b139d22b84))
59-
60- * fix(agent): replace manual string escaping with repr() and fix CU agent bugs
61-
62- Five reliability fixes for eval runs:
63-
64- 1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class
65- of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism
66-
67- 2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) →
68- start_coordinate/coordinate (snake_case) per Claude computer_use API
69-
70- 3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to
71- click, drag, and mouse_move actions
72-
73- 4. Re-inject demo text at every step in tool_result messages to prevent context drift in
74- demo-conditioned evaluation
75-
76- 5. Add command logging in WAALiveAdapter.step() for debugging
77-
78- Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review
79- on demo-conditioning approaches.
80-
81- Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
82-
83- * feat: add multi-level demo format transform and fix tests
84-
85- - Add scripts/transform_demo_format.py: transforms rigid {Observation, Intent, Action, Result} demos
86- into adaptive {Think, Action, Expect} format with PLAN section (Option D from eval analysis) -
87- LLM-assisted mode (default): uses vlm_call() for semantic transform - Rule-based mode (--no-llm):
88- free, no API calls needed - Supports --dry-run for preview
89-
90- - Fix tests for repr() escaping and coordinate clamping: - Remove TestEscapeForPyautogui (tests
91- deleted function) - Update TestBuildTypeCommands for repr() output format - Add
92- test_all_special_chars_produce_valid_python invariant test - Fix drag test to use snake_case field
93- names - Fix coordinate edge test to expect clamped (0.005, 0.005)
94-
95- - Regenerate uv.lock for consilium package name resolution
96-
97- * docs: add DC-multilevel eval results to analysis
98-
99- DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear improvement over DC-rigid:
100- agent followed the plan, entered all headers and years, typed correct formula, used drag-fill.
101- Still scored 0.0 due to premature task completion (finished 1/3 columns), but qualitatively the
102- best behavior across all three conditions.
103-
104- ---------
105-
106- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
50+ 1. Replace `_escape_for_pyautogui()` with `repr()` in `_build_type_commands()`
51+ 2. Fix drag coordinate field names: camelCase to snake_case per Claude computer_use API
52+ 3. Add `_clamp_coord()` to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe
53+ 4. Re-inject demo text at every step to prevent context drift in demo-conditioned evaluation
54+ 5. Add command logging in WAALiveAdapter.step() for debugging
55+ - Add `scripts/transform_demo_format.py`: transforms rigid demos into adaptive format
56+ - Fix tests for repr() escaping and coordinate clamping
57+ - Add eval analysis document with ZS vs DC results
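Items 1 and 3 above can be sketched as follows. The idea behind item 1 is that `repr()` of a `str` always yields a valid Python string literal, so generated source needs no hand-written escaping. The helper names and signatures below are illustrative stand-ins, not the project's `_build_type_commands()` or `_clamp_coord()`. The 0.005 lower bound matches the clamped value mentioned in the original commit message; the symmetric upper bound is an assumption.

```python
import ast


def build_type_command(text: str) -> str:
    """Embed arbitrary text in generated Python source via repr().

    Newlines, tabs, quotes, and non-ASCII characters are all handled by
    Python's own escaping rather than by manual string replacement.
    """
    return f"pyautogui.write({text!r})"


def clamp_coord(value: float, margin: float = 0.005) -> float:
    """Keep a fractional coordinate away from the screen edges.

    PyAutoGUI's fail-safe fires when the pointer reaches a screen corner
    such as (0, 0), so exact edge values are nudged inward by `margin`.
    """
    return max(margin, min(1.0 - margin, value))


# Round-trip check: the generated command parses as Python, and its
# argument evaluates back to the original text.
tricky = "line1\n\t\"double\" 'single' caf\u00e9 \\ backslash"
command = build_type_command(tricky)
argument = ast.parse(command).body[0].value.args[0]
recovered = ast.literal_eval(argument)
```

The round trip at the end mirrors the kind of invariant the `test_all_special_chars_produce_valid_python` test name suggests: any input string should produce source that parses and evaluates back to the same string.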
10758
10859### Features
10960
11061- Add VAGEN/verl-agent environment adapter for VLM RL training
11162 ([`0183321`](https://github.com/OpenAdaptAI/openadapt-evals/commit/018332168389b4be74660ecbc754eef5768f6b51))
11263
113- * feat: add VAGEN/verl-agent environment adapter for VLM RL training
114-
115- WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
64+ WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation
11665 training with verl-agent's multi-turn VLM RL pipeline (GiGPO, GRPO, PPO).
117-
118- The adapter translates between openadapt-evals BenchmarkObservation (PNG bytes + a11y tree) and
119- VAGEN's observation format (obs_str + multi_modal_input with PIL images).
120-
121- - Async interface (reset/step/close/system_prompt) - Action DSL parsing (CLICK, TYPE, KEY, SCROLL,
122- WAIT, DONE) - Fractional coordinate support (0.0-1.0) - Lazy adapter initialization - 21 tests
123- passing with mock adapter - Example VAGEN training config included
124-
125- Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
126-
127- * docs: add comprehensive verl-agent decision document
128-
129- Records the full reasoning chain for choosing verl-agent/VAGEN: - Framework comparison (TRL,
130- standalone, verl-agent, VAGEN, OpenRLHF, Unsloth) - Key insight: per-step verification via GiGPO
131- for long-horizon GUI tasks - TRL multi-turn VLM blocker (issues #5119, #5120) - "Environment is
132- the moat" strategic framing - Architecture diagram and migration path
133-
134- * feat: add verl-agent as optional dependency
135-
136- * feat: vendor GymImageEnv base classes from VAGEN
137-
138- * docs: fact-check framework review in verl decision doc
66+ - Async interface (reset/step/close/system_prompt)
67+ - Action DSL parsing (CLICK, TYPE, KEY, SCROLL, WAIT, DONE)
68+ - Fractional coordinate support (0.0-1.0)
69+ - Lazy adapter initialization
70+ - 21 tests passing with mock adapter
71+ - Example VAGEN training config included
72+ - Comprehensive verl-agent decision document
73+ - Vendor GymImageEnv base classes from VAGEN
74+ - Fact-check framework review in verl decision doc
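As a rough illustration of action DSL parsing, the sketch below accepts the six action names listed above. The changelog does not show the concrete syntax, so the `NAME(key=value, ...)` form, the `parse_action` function, and its return shape are all assumptions made for this example and may differ from the adapter's real grammar.

```python
import re
from typing import Dict, Optional, Tuple, Union

# Action names taken from the changelog entry.
ACTIONS = {"CLICK", "TYPE", "KEY", "SCROLL", "WAIT", "DONE"}

# Assumed syntax: NAME(key=value, key="quoted value")
_CALL = re.compile(r"^\s*([A-Z]+)\((.*)\)\s*$", re.DOTALL)
_ARG = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|[^,]+)')

ArgValue = Union[float, str]


def parse_action(line: str) -> Optional[Tuple[str, Dict[str, ArgValue]]]:
    """Parse one DSL line into (name, args); return None if it is invalid."""
    match = _CALL.match(line)
    if match is None or match.group(1) not in ACTIONS:
        return None
    name, body = match.groups()
    args: Dict[str, ArgValue] = {}
    for key, raw in _ARG.findall(body):
        raw = raw.strip()
        if raw.startswith('"'):
            # Quoted strings may contain commas; escapes are left as written.
            args[key] = raw[1:-1]
        else:
            try:
                args[key] = float(raw)
            except ValueError:
                args[key] = raw
    return name, args
```

Returning `None` for unrecognized input, instead of raising, lets the caller treat malformed model output as an invalid action without interrupting the episode.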
13975
14076Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from
14177 thorough review: