|
1 | 1 | # CHANGELOG |
2 | 2 |
|
3 | 3 |
|
| 4 | +## v0.27.0 (2026-03-03) |
| 5 | + |
| 6 | +### Features |
| 7 | + |
| 8 | +- Add observe_pil() convenience method for PIL image output |
| 9 | + ([#93](https://github.com/OpenAdaptAI/openadapt-evals/pull/93), |
| 10 | + [`5c0aa52`](https://github.com/OpenAdaptAI/openadapt-evals/commit/5c0aa527fbf6d7a404bb6289f588bf8fdfe32800)) |
| 11 | + |
| 12 | +Add observe_pil() to WAALiveAdapter and RLEnvironment for VLM/RL pipelines that work with PIL images |
| 13 | + directly. Also clean up changelog formatting (remove leaked Co-authored-by trailer lines, fix |
| 14 | + collapsed bullet lists). |
| 15 | + |
| 16 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
| 17 | + |
| 18 | + |
4 | 19 | ## v0.26.0 (2026-03-03) |
5 | 20 |
|
6 | 21 | ### Documentation |
|
9 | 24 | ([#90](https://github.com/OpenAdaptAI/openadapt-evals/pull/90), |
10 | 25 | [`ca6a936`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ca6a9362556852bd6ad040ba9ac7a5dfe3a7d880)) |
11 | 26 |
|
| 27 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
| 28 | + |
12 | 29 | ### Features |
13 | 30 |
|
14 | 31 | - Add TaskVerifierRegistry for custom task verification |
15 | 32 | ([#89](https://github.com/OpenAdaptAI/openadapt-evals/pull/89), |
16 | 33 | [`639a6a2`](https://github.com/OpenAdaptAI/openadapt-evals/commit/639a6a2ba2a15e0c7a2a3bd65fa57a38f6966965)) |
17 | | - - TaskVerifierRegistry with decorator and programmatic registration |
18 | | - - VerificationResult dataclass with success/score/details |
19 | | - - WAALiveAdapter.run_powershell() for executing PowerShell on the VM |
20 | | - - Built-in clear_browsing_data reference verifier |
21 | | - - 33 tests covering registry operations and built-in verifiers |
22 | | - - Exports from evaluation package and main package __init__ |
| 34 | + |
| 35 | +Add a registry pattern for custom task verifiers that can inspect VM state after task execution. |
| 36 | + This enables GoTo IT Autopilot (and other integrators) to register domain-specific verification |
| 37 | + functions without subclassing BenchmarkAdapter. |
| 38 | + |
| 39 | +- TaskVerifierRegistry with decorator and programmatic registration - VerificationResult dataclass |
| 40 | + with success/score/details - WAALiveAdapter.run_powershell() for executing PowerShell on the VM - |
| 41 | + Built-in clear_browsing_data reference verifier - 33 tests covering registry operations and |
| 42 | + built-in verifiers - Exports from evaluation package and main package __init__ |
| 43 | + |
| 44 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
23 | 45 |
|
24 | 46 |
|
25 | 47 | ## v0.25.1 (2026-03-03) |
|
29 | 51 | - Address review findings in verl-agent adapter |
30 | 52 | ([#88](https://github.com/OpenAdaptAI/openadapt-evals/pull/88), |
31 | 53 | [`879c53c`](https://github.com/OpenAdaptAI/openadapt-evals/commit/879c53c182853104a4a8c5e179e810185180d37a)) |
32 | | - - Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction |
33 | | - - Fix DRAG parsing to include end_x/end_y coordinates |
34 | | - - Fix is_action_valid logic: use pattern match instead of inverted condition |
35 | | - - Fix fractional coord conversion: trust _use_fractional flag |
36 | | - - Convert drag end coordinates from fractional to pixel |
37 | | - - Add health_check() method returning ready/busy/needs_recovery/not_initialized |
38 | | - - Add DRAG to system prompt DSL documentation |
39 | | - - Fix vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI) |
40 | | - - Add 12 new tests |
| 54 | + |
| 55 | +- Fix SCROLL direction not forwarded to BenchmarkAction.scroll_direction - Fix DRAG parsing to |
| 56 | + include end_x/end_y coordinates - Fix is_action_valid logic: use pattern match instead of inverted |
| 57 | + condition - Fix fractional coord conversion: trust _use_fractional flag instead of checking value |
| 58 | + ranges (0 and 1 are ambiguous between frac and pixel) - Convert drag end coordinates (end_x/end_y) |
| 59 | + from fractional to pixel - Add health_check() method returning |
| 60 | + ready/busy/needs_recovery/not_initialized - Add DRAG to system prompt DSL documentation - Fix |
| 61 | + vendored VAGEN source URL (mll-lab-nu -> RAGEN-AI) - Add 12 new tests: scroll direction, drag |
| 62 | + coords, health_check, is_action_valid |
| 63 | + |
| 64 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
41 | 65 |
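The fractional-coordinate fix in this entry can be illustrated with a minimal sketch. It assumes a 1920x1080 screen; `to_pixel` and `guess_is_fractional` are hypothetical names chosen for illustration, and only the idea of an explicit `_use_fractional` flag comes from the entry itself. The sketch shows why a value-range check is ambiguous: the pixel coordinate (1, 1) also looks like a valid fraction.

```python
from __future__ import annotations


def to_pixel(x: float, y: float, width: int, height: int,
             use_fractional: bool) -> tuple[int, int]:
    """Convert an action coordinate to pixels.

    `use_fractional` mirrors the `_use_fractional` flag named in the entry;
    the function name and signature are hypothetical.
    """
    if use_fractional:
        # Fractions in [0.0, 1.0] are scaled by the screen size.
        return round(x * width), round(y * height)
    # Pixel coordinates pass through unchanged.
    return round(x), round(y)


def guess_is_fractional(x: float, y: float) -> bool:
    """Range-based heuristic of the kind the fix moves away from (illustrative)."""
    return 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0


# The pixel (1, 1) passes the range check, so a heuristic would treat it as
# the fraction (1.0, 1.0) and scale it to the opposite corner of the screen.
assert guess_is_fractional(1, 1)
assert to_pixel(1, 1, 1920, 1080, use_fractional=True) == (1920, 1080)

# With the explicit flag, the same input stays at pixel (1, 1).
assert to_pixel(1, 1, 1920, 1080, use_fractional=False) == (1, 1)
```

Passing the flag explicitly keeps the decision with the caller, which knows which coordinate system it emitted.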
|
42 | 66 |
|
43 | 67 | ## v0.25.0 (2026-03-03) |
|
47 | 71 | - **agent**: Replace manual string escaping with repr() and fix CU agent bugs |
48 | 72 | ([#83](https://github.com/OpenAdaptAI/openadapt-evals/pull/83), |
49 | 73 | [`ffcb41d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ffcb41d9a2dd6cc53eae4bf478d2d3b139d22b84)) |
50 | | - 1. Replace `_escape_for_pyautogui()` with `repr()` in `_build_type_commands()` |
51 | | - 2. Fix drag coordinate field names: camelCase to snake_case per Claude computer_use API |
52 | | - 3. Add `_clamp_coord()` to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe |
53 | | - 4. Re-inject demo text at every step to prevent context drift in demo-conditioned evaluation |
54 | | - 5. Add command logging in WAALiveAdapter.step() for debugging |
55 | | - - Add `scripts/transform_demo_format.py`: transforms rigid demos into adaptive format |
56 | | - - Fix tests for repr() escaping and coordinate clamping |
57 | | - - Add eval analysis document with ZS vs DC results |
| 74 | + |
| 75 | +* fix(agent): replace manual string escaping with repr() and fix CU agent bugs |
| 76 | + |
| 77 | +Five reliability fixes for eval runs: |
| 78 | + |
| 79 | +1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class |
| 80 | + of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism |
| 81 | + |
| 82 | +2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) → |
| 83 | + start_coordinate/coordinate (snake_case) per Claude computer_use API |
| 84 | + |
| 85 | +3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to |
| 86 | + click, drag, and mouse_move actions |
| 87 | + |
| 88 | +4. Re-inject demo text at every step in tool_result messages to prevent context drift in |
| 89 | + demo-conditioned evaluation |
| 90 | + |
| 91 | +5. Add command logging in WAALiveAdapter.step() for debugging |
| 92 | + |
| 93 | +Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review |
| 94 | + on demo-conditioning approaches. |
| 95 | + |
| 96 | +Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
| 97 | + |
| 98 | +* feat: add multi-level demo format transform and fix tests |
| 99 | + |
| 100 | +- Add scripts/transform_demo_format.py: transforms rigid {Observation, Intent, Action, Result} demos |
| 101 | + into adaptive {Think, Action, Expect} format with PLAN section (Option D from eval analysis) - |
| 102 | + LLM-assisted mode (default): uses vlm_call() for semantic transform - Rule-based mode (--no-llm): |
| 103 | + free, no API calls needed - Supports --dry-run for preview |
| 104 | + |
| 105 | +- Fix tests for repr() escaping and coordinate clamping: - Remove TestEscapeForPyautogui (tests |
| 106 | + deleted function) - Update TestBuildTypeCommands for repr() output format - Add |
| 107 | + test_all_special_chars_produce_valid_python invariant test - Fix drag test to use snake_case field |
| 108 | + names - Fix coordinate edge test to expect clamped (0.005, 0.005) |
| 109 | + |
| 110 | +- Regenerate uv.lock for consilium package name resolution |
| 111 | + |
| 112 | +* docs: add DC-multilevel eval results to analysis |
| 113 | + |
| 114 | +DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear improvement over DC-rigid: |
| 115 | + agent followed the plan, entered all headers and years, typed correct formula, used drag-fill. |
| 116 | + Still scored 0.0 due to premature task completion (finished 1/3 columns), but qualitatively the |
| 117 | + best behavior across all three conditions. |
| 118 | + |
| 119 | +--------- |
| 120 | + |
| 121 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
58 | 122 |
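The `repr()` change in this entry can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `build_type_command` and `extract_literal` are hypothetical helpers, and the generated command is assumed to call `pyautogui.write`. The point is that `repr()` always produces a valid Python string literal, so text embedded in generated source survives a parse round trip.

```python
import ast


def build_type_command(text: str) -> str:
    """Return a line of Python source that types `text` (hypothetical helper)."""
    # repr() emits a valid Python string literal, so quotes, newlines, tabs,
    # and backslashes are escaped by the interpreter's own rules.
    return f"pyautogui.write({text!r})"


def extract_literal(command: str) -> str:
    """Parse the generated source and recover the string argument."""
    call = ast.parse(command, mode="eval").body
    return ast.literal_eval(call.args[0])


# Mixed quotes, a newline, a tab, backslashes, and a non-ASCII character.
tricky = 'He said "hi"\n\tit\'s in C:\\path \u00e9'

# The generated command parses, and the embedded text is unchanged.
assert extract_literal(build_type_command(tricky)) == tricky
```

The command is only parsed here, never executed, so the sketch runs without PyAutoGUI installed.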
|
59 | 123 | ### Features |
60 | 124 |
|
61 | 125 | - Add VAGEN/verl-agent environment adapter for VLM RL training |
62 | 126 | ([`0183321`](https://github.com/OpenAdaptAI/openadapt-evals/commit/018332168389b4be74660ecbc754eef5768f6b51)) |
63 | 127 |
|
64 | | - WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation |
| 128 | +* feat: add VAGEN/verl-agent environment adapter for VLM RL training |
| 129 | + |
| 130 | +WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation |
65 | 131 | training with verl-agent's multi-turn VLM RL pipeline (GiGPO, GRPO, PPO). |
66 | | - - Async interface (reset/step/close/system_prompt) |
67 | | - - Action DSL parsing (CLICK, TYPE, KEY, SCROLL, WAIT, DONE) |
68 | | - - Fractional coordinate support (0.0-1.0) |
69 | | - - Lazy adapter initialization |
70 | | - - 21 tests passing with mock adapter |
71 | | - - Example VAGEN training config included |
72 | | - - Comprehensive verl-agent decision document |
73 | | - - Vendor GymImageEnv base classes from VAGEN |
74 | | - - Fact-check framework review in verl decision doc |
| 132 | + |
| 133 | +The adapter translates between openadapt-evals BenchmarkObservation (PNG bytes + a11y tree) and |
| 134 | + VAGEN's observation format (obs_str + multi_modal_input with PIL images). |
| 135 | + |
| 136 | +- Async interface (reset/step/close/system_prompt) - Action DSL parsing (CLICK, TYPE, KEY, SCROLL, |
| 137 | + WAIT, DONE) - Fractional coordinate support (0.0-1.0) - Lazy adapter initialization - 21 tests |
| 138 | + passing with mock adapter - Example VAGEN training config included |
| 139 | + |
| 140 | +Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
| 141 | + |
| 142 | +* docs: add comprehensive verl-agent decision document |
| 143 | + |
| 144 | +Records the full reasoning chain for choosing verl-agent/VAGEN: - Framework comparison (TRL, |
| 145 | + standalone, verl-agent, VAGEN, OpenRLHF, Unsloth) - Key insight: per-step verification via GiGPO |
| 146 | + for long-horizon GUI tasks - TRL multi-turn VLM blocker (issues #5119, #5120) - "Environment is |
| 147 | + the moat" strategic framing - Architecture diagram and migration path |
| 148 | + |
| 149 | +* feat: add verl-agent as optional dependency |
| 150 | + |
| 151 | +* feat: vendor GymImageEnv base classes from VAGEN |
| 152 | + |
| 153 | +* docs: fact-check framework review in verl decision doc |
75 | 154 |
|
76 | 155 | Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from |
77 | 156 | thorough review: |
|