1 | 1 | # CHANGELOG
2 | 2 |
3 | 3 |
| 4 | +## v0.25.0 (2026-03-03)
| 5 | +
| 6 | +### Bug Fixes
| 7 | + |
| 8 | +- **agent**: Replace manual string escaping with repr() and fix CU agent bugs |
| 9 | + ([#83](https://github.com/OpenAdaptAI/openadapt-evals/pull/83), |
| 10 | + [`ffcb41d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/ffcb41d9a2dd6cc53eae4bf478d2d3b139d22b84)) |
| 11 | + |
| 12 | +* fix(agent): replace manual string escaping with repr() and fix CU agent bugs |
| 13 | + |
| 14 | +Five reliability fixes for eval runs: |
| 15 | + |
| 16 | +1. Replace _escape_for_pyautogui() with repr() in _build_type_commands() - eliminates entire class |
| 17 | + of string-embedding bugs (newlines, tabs, quotes, unicode) using Python's own escaping mechanism |
| 18 | + |
| 19 | +2. Fix drag coordinate field names: startCoordinate/endCoordinate (camelCase) → |
| 20 | + start_coordinate/coordinate (snake_case) per Claude computer_use API |
| 21 | + |
| 22 | +3. Add _clamp_coord() to prevent (0,0) coordinates from triggering PyAutoGUI fail-safe, applied to |
| 23 | + click, drag, and mouse_move actions |
| 24 | + |
| 25 | +4. Re-inject demo text at every step in tool_result messages to prevent context drift in |
| 26 | + demo-conditioned evaluation |
| 27 | + |
| 28 | +5. Add command logging in WAALiveAdapter.step() for debugging |
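Fix 1 above can be illustrated with a minimal sketch. The helper names `build_type_command` and `extract_literal` are hypothetical and are not the project's `_build_type_commands()`; the sketch only assumes that typed text is embedded in a generated line of Python source. Because `repr()` always produces a valid string literal, the embedded text can be parsed back unchanged.

```python
import ast


def build_type_command(text: str) -> str:
    """Hypothetical helper: embed `text` in a generated line of Python source."""
    # repr() emits a valid Python string literal, so newlines, tabs,
    # quotes, and backslashes are escaped without any hand-written rules.
    return f"pyautogui.write({text!r})"


def extract_literal(command: str) -> str:
    """Parse the generated command and recover the embedded string."""
    call = ast.parse(command, mode="eval").body
    return ast.literal_eval(call.args[0])


samples = [
    "plain",
    "line1\nline2",
    "tab\there",
    "it's \"quoted\"",
    "back\\slash",
    "unicode: é ✓",
]

# Every sample should survive the embed-then-parse round trip.
round_trips = all(extract_literal(build_type_command(s)) == s for s in samples)
```

The round-trip check is in the same spirit as the `test_all_special_chars_produce_valid_python` invariant test listed further down, though the actual test body is not shown in this changelog.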
| 29 | + |
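Fix 3 above (coordinate clamping) might look roughly like the sketch below. It assumes fractional coordinates in the 0.0-1.0 range. The lower margin of 0.005 matches the clamped value mentioned in the test notes for this release; the symmetric upper bound and the names `clamp_coord` and `clamp_point` are assumptions, not the project's `_clamp_coord()`.

```python
MARGIN = 0.005  # lower bound taken from the changelog's test note; upper bound is assumed symmetric


def clamp_coord(value: float, margin: float = MARGIN) -> float:
    """Keep a fractional coordinate away from the screen edges."""
    return min(max(value, margin), 1.0 - margin)


def clamp_point(x: float, y: float) -> tuple[float, float]:
    """Clamp both axes so the pointer never lands exactly on a screen corner."""
    return clamp_coord(x), clamp_coord(y)


corner = clamp_point(0.0, 0.0)    # a corner position would trip the fail-safe
inside = clamp_point(0.5, 0.25)   # interior points pass through unchanged
```

PyAutoGUI's fail-safe raises an exception when the pointer reaches a screen corner, so nudging edge coordinates inward keeps a stray (0,0) from aborting an eval run.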
| 30 | +Also adds docs/eval_analysis_2026_03_02.md documenting ZS vs DC eval results and literature review |
| 31 | + on demo-conditioning approaches. |
| 32 | + |
| 33 | +Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
| 34 | + |
| 35 | +* feat: add multi-level demo format transform and fix tests |
| 36 | + |
| 37 | +- Add scripts/transform_demo_format.py: transforms rigid {Observation, Intent, Action, Result} demos |
| 38 | + into adaptive {Think, Action, Expect} format with PLAN section (Option D from eval analysis) - |
| 39 | + LLM-assisted mode (default): uses vlm_call() for semantic transform - Rule-based mode (--no-llm): |
| 40 | + free, no API calls needed - Supports --dry-run for preview |
| 41 | + |
| 42 | +- Fix tests for repr() escaping and coordinate clamping: - Remove TestEscapeForPyautogui (tests |
| 43 | + deleted function) - Update TestBuildTypeCommands for repr() output format - Add |
| 44 | + test_all_special_chars_produce_valid_python invariant test - Fix drag test to use snake_case field |
| 45 | + names - Fix coordinate edge test to expect clamped (0.005, 0.005) |
| 46 | + |
| 47 | +- Regenerate uv.lock for consilium package name resolution |
| 48 | + |
| 49 | +* docs: add DC-multilevel eval results to analysis |
| 50 | + |
| 51 | +DC-multilevel (new {Think, Action, Expect} + PLAN format) showed clear improvement over DC-rigid: |
| 52 | + agent followed the plan, entered all headers and years, typed correct formula, used drag-fill. |
| 53 | + Still scored 0.0 due to premature task completion (finished 1/3 columns), but qualitatively the |
| 54 | + best behavior across all three conditions. |
| 55 | + |
| 56 | +--------- |
| 57 | + |
| 58 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
| 59 | + |
| 60 | +### Features |
| 61 | + |
| 62 | +- Add VAGEN/verl-agent environment adapter for VLM RL training |
| 63 | + ([`0183321`](https://github.com/OpenAdaptAI/openadapt-evals/commit/018332168389b4be74660ecbc754eef5768f6b51)) |
| 64 | + |
| 65 | +* feat: add VAGEN/verl-agent environment adapter for VLM RL training |
| 66 | + |
| 67 | +WAADesktopEnv implements the GymImageEnv protocol from VAGEN, enabling desktop GUI automation |
| 68 | + training with verl-agent's multi-turn VLM RL pipeline (GiGPO, GRPO, PPO). |
| 69 | + |
| 70 | +The adapter translates between openadapt-evals BenchmarkObservation (PNG bytes + a11y tree) and |
| 71 | + VAGEN's observation format (obs_str + multi_modal_input with PIL images). |
| 72 | + |
| 73 | +- Async interface (reset/step/close/system_prompt) - Action DSL parsing (CLICK, TYPE, KEY, SCROLL, |
| 74 | + WAIT, DONE) - Fractional coordinate support (0.0-1.0) - Lazy adapter initialization - 21 tests |
| 75 | + passing with mock adapter - Example VAGEN training config included |
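The action DSL and fractional coordinates listed above could be parsed along these lines. This is only a sketch: the changelog names the actions (CLICK, TYPE, KEY, SCROLL, WAIT, DONE) but not their syntax, so the `NAME(arg, ...)` form, the `Action` type, and `parse_action` are all assumptions made for illustration.

```python
import ast
import re
from typing import NamedTuple

ACTION_NAMES = ("CLICK", "TYPE", "KEY", "SCROLL", "WAIT", "DONE")
# Assumed syntax: an action name followed by parenthesised literal arguments.
_PATTERN = re.compile(r"^\s*(%s)\((.*)\)\s*$" % "|".join(ACTION_NAMES), re.DOTALL)


class Action(NamedTuple):
    name: str
    args: tuple


def parse_action(text: str) -> Action:
    """Parse one action string such as 'CLICK(0.5, 0.25)' (assumed syntax)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised action: {text!r}")
    name, raw_args = match.groups()
    try:
        # Wrap the arguments in a tuple literal so literal_eval handles 0..n args.
        args = ast.literal_eval(f"({raw_args},)") if raw_args.strip() else ()
    except (SyntaxError, ValueError) as exc:
        raise ValueError(f"bad arguments in {text!r}") from exc
    if name == "CLICK":
        # Fractional coordinates: both values must lie in [0.0, 1.0].
        ok = len(args) == 2 and all(
            isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in args
        )
        if not ok:
            raise ValueError(f"CLICK expects two fractions in [0, 1]: {text!r}")
    return Action(name, args)


click = parse_action("CLICK(0.5, 0.25)")
typed = parse_action('TYPE("hello, world")')
done = parse_action("DONE()")
```

Using `ast.literal_eval` for the arguments means only plain literals are accepted, so model output is never executed as code.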
| 76 | + |
| 77 | +Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
| 78 | + |
| 79 | +* docs: add comprehensive verl-agent decision document |
| 80 | + |
| 81 | +Records the full reasoning chain for choosing verl-agent/VAGEN: - Framework comparison (TRL, |
| 82 | + standalone, verl-agent, VAGEN, OpenRLHF, Unsloth) - Key insight: per-step verification via GiGPO |
| 83 | + for long-horizon GUI tasks - TRL multi-turn VLM blocker (issues #5119, #5120) - "Environment is |
| 84 | + the moat" strategic framing - Architecture diagram and migration path |
| 85 | + |
| 86 | +* feat: add verl-agent as optional dependency |
| 87 | + |
| 88 | +* feat: vendor GymImageEnv base classes from VAGEN |
| 89 | + |
| 90 | +* docs: fact-check framework review in verl decision doc |
| 91 | + |
| 92 | +Update Sections E (OpenRLHF), F (Unsloth), TRL, and comparison matrix with accurate details from |
| 93 | + thorough review: |
| 94 | + |
| 95 | +- OpenRLHF: document AgentTrainer multi-turn support and OpenRLHF-M fork - Unsloth: nuanced |
| 96 | + assessment — single-turn VLM works, multi-turn text via ART works, but multi-turn VLM blocked by |
| 97 | + rollout_func issue (#3573) - TRL: add note about OpenEnv/rollout_func for text models (VLM |
| 98 | + blocked) - Comparison matrix: add Unsloth column with footnotes |
| 99 | + |
| 100 | +--------- |
| 101 | + |
| 102 | +Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
| 103 | + |
| 104 | + |
4 | 105 | ## v0.24.0 (2026-03-03) |
5 | 106 |
6 | 107 | ### Documentation |