Commit f7a7199

Committed by semantic-release
chore: release 0.71.3
1 parent 257bc7f commit f7a7199

2 files changed

Lines changed: 47 additions & 1 deletion

CHANGELOG.md

Lines changed: 46 additions & 0 deletions
@@ -1,6 +1,52 @@
 # CHANGELOG
 
 
+## v0.71.3 (2026-03-28)
+
+### Bug Fixes
+
+- Eval infra, forced keyboard override, Outlines constrained decoding
+([#197](https://github.com/OpenAdaptAI/openadapt-evals/pull/197),
+[`257bc7f`](https://github.com/OpenAdaptAI/openadapt-evals/commit/257bc7f2f3595ed9d350ac8a482efdf805302a05))
+
+* fix: per-step milestone tracking, forced keyboard override, eval infra
+
+Evaluation infrastructure: - Per-step milestone high-water mark: milestones checked after each step,
+once passed they stay passed. Fixes transient states (open dialogs) being missed by
+end-of-episode-only evaluation. - evaluate_checks_local() fallback: when /evaluate endpoint is
+down, uses task config's own command/screenshot checks via /execute_windows - iptables retry loop
+in start_with_evaluate.sh: ensures port 5050 exemption persists even if DNAT rule is (re)applied
+later
+
+Anti-loop forced override: - After 6 consecutive identical actions (planner ignoring warnings),
+bypasses planner entirely and emits first keyboard shortcut from demo guidance (e.g.,
+Ctrl+Shift+Delete). This breaks click loops where the grounder places clicks incorrectly.
+
+Task setup fixes: - Chrome popup: registry policies, First Run sentinel, launch flags - Single-line
+PowerShell commands (fixes YAML escaping for /execute_windows) - Redesigned milestones: combined
+settings/dialog check, evidence-based
+
+Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+* fix: remove Alt+F4 from demo, add Outlines constrained decoding
+
+Demo fix: - Remove step 4 (Alt+F4 close Chrome) from clear-browsing-data demo. Alt+F4 on desktop
+triggers Windows Shutdown dialog when Chrome loses focus. The task goal is clearing data, not
+closing Chrome. - Updated step 2 description to include "Delete from this device" button text
+(newer Chrome versions changed the label).
+
+Constrained decoding (GRPO trainer): - Add `constrained_decoding` config flag (default False) - When
+enabled, uses Outlines RegexLogitsProcessor to force model output to match the action format regex
+(CLICK/TYPE/WAIT/DONE). Eliminates 5-15% of rollouts wasted on unparseable output. - Allows
+free-form Thought prefix before the action. - DFA compilation cached after first call (~2s
+one-time cost). - Graceful fallback if outlines not installed. - Added outlines>=0.1.0 to training
+optional dependencies.
+
+---------
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+
 ## v0.71.2 (2026-03-28)
 
 ### Bug Fixes
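
The first fix in this release replaces end-of-episode-only evaluation with a per-step high-water mark: milestones are checked after every step and, once passed, stay passed. A minimal sketch of that idea, assuming milestones can be modeled as named predicates over a per-step observation dict; `MilestoneTracker`, `observe`, `score`, and the observation keys are hypothetical illustrations, not the openadapt-evals API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model: a milestone is a named predicate over one step's observation.
Observation = Dict[str, object]
Check = Callable[[Observation], bool]


@dataclass
class MilestoneTracker:
    """Check milestones after every step and latch them once passed (high-water mark)."""

    checks: Dict[str, Check]
    passed: Dict[str, bool] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        self.passed = {name: False for name in self.checks}

    def observe(self, obs: Observation) -> None:
        # Re-check only milestones not yet passed; a passed milestone never reverts.
        for name, check in self.checks.items():
            if not self.passed[name] and check(obs):
                self.passed[name] = True

    def score(self) -> float:
        # Fraction of milestones reached at any step of the episode.
        if not self.passed:
            return 0.0
        return sum(self.passed.values()) / len(self.passed)


tracker = MilestoneTracker(
    checks={
        "settings_open": lambda obs: obs.get("page") == "settings",
        "dialog_open": lambda obs: bool(obs.get("dialog")),
    }
)
# The dialog is open only at the middle step (a transient state).
episode: List[Observation] = [
    {"page": "home"},
    {"page": "settings", "dialog": True},
    {"page": "settings", "dialog": False},
]
for obs in episode:
    tracker.observe(obs)

# For comparison: evaluating only the final observation misses the dialog.
end_only = {name: check(episode[-1]) for name, check in tracker.checks.items()}

print(tracker.passed)  # {'settings_open': True, 'dialog_open': True}
print(end_only)  # {'settings_open': True, 'dialog_open': False}
print(tracker.score())  # 1.0
```

The `end_only` comparison shows the failure mode the changelog describes: a dialog that was open mid-episode but closed by the last step is invisible to a final-state check.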

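The anti-loop override described in the same entry can be sketched as a counter over consecutive identical actions. This is an illustration under stated assumptions: actions are plain strings, demo guidance is free text, and shortcuts are spelled like `Ctrl+Shift+Delete`. `LoopBreaker`, `SHORTCUT_RE`, and the `CLICK(512, 300)` action string are hypothetical; only the threshold of 6 and the "first keyboard shortcut from demo guidance" rule come from the changelog.

```python
import re
from typing import List, Optional

# Hypothetical: shortcuts in demo text are written like "Ctrl+Shift+Delete".
SHORTCUT_RE = re.compile(r"\b(?:Ctrl|Alt|Shift|Win)(?:\+\w+)+")


class LoopBreaker:
    """Bypass the planner with a demo shortcut after `threshold` identical actions in a row."""

    def __init__(self, demo_guidance: str, threshold: int = 6) -> None:
        self.threshold = threshold
        match = SHORTCUT_RE.search(demo_guidance)
        # First keyboard shortcut mentioned in the demo guidance, if any.
        self.shortcut: Optional[str] = match.group(0) if match else None
        self.last_action: Optional[str] = None
        self.repeats = 0

    def record(self, action: str) -> None:
        # Track how many times in a row the planner emitted exactly this action.
        if action == self.last_action:
            self.repeats += 1
        else:
            self.last_action = action
            self.repeats = 1

    def override(self) -> Optional[str]:
        # Return the forced shortcut once the loop threshold is hit, else None.
        if self.repeats >= self.threshold and self.shortcut is not None:
            self.last_action, self.repeats = None, 0
            return self.shortcut
        return None


guidance = "1. Open Chrome. 2. Press Ctrl+Shift+Delete and click 'Delete from this device'."
breaker = LoopBreaker(guidance)
forced: List[Optional[str]] = []
for _ in range(7):
    breaker.record("CLICK(512, 300)")  # the planner keeps repeating the same click
    forced.append(breaker.override())
print(forced)  # [None, None, None, None, None, 'Ctrl+Shift+Delete', None]
```

Resetting the counter after an override is a design choice of this sketch: it gives the forced shortcut a chance to change the screen before another override can fire.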
pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "openadapt-evals"
-version = "0.71.2"
+version = "0.71.3"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.10"
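
This release commit bumps `version` in pyproject.toml and adds a matching `## v0.71.3` heading to CHANGELOG.md. A hypothetical consistency check, of the kind one might run in CI, could compare the two. The function names and the heading regex are assumptions; `tomllib` is standard library only on Python 3.11+, so on Python 3.10 (which this project allows) the sketch assumes the third-party `tomli` package, which provides the same `loads` function.

```python
import re

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # Python 3.10: assumes the third-party `tomli` package is installed
    import tomli as tomllib

# Assumed heading format, matching entries such as "## v0.71.3 (2026-03-28)".
HEADING_RE = re.compile(r"^## v(\S+)", re.MULTILINE)


def latest_changelog_version(changelog_text: str) -> str:
    # The newest release is the first "## v..." heading in the file.
    match = HEADING_RE.search(changelog_text)
    if match is None:
        raise ValueError("no '## v<version>' heading found")
    return match.group(1)


def project_version(pyproject_text: str) -> str:
    return tomllib.loads(pyproject_text)["project"]["version"]


def release_is_consistent(pyproject_text: str, changelog_text: str) -> bool:
    return project_version(pyproject_text) == latest_changelog_version(changelog_text)


pyproject = """
[project]
name = "openadapt-evals"
version = "0.71.3"
requires-python = ">=3.10"
"""
changelog = """# CHANGELOG


## v0.71.3 (2026-03-28)

### Bug Fixes

## v0.71.2 (2026-03-28)
"""
print(release_is_consistent(pyproject, changelog))  # True
```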

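The CHANGELOG entry also says the GRPO trainer's `constrained_decoding` flag (default False) uses Outlines' RegexLogitsProcessor, caches compilation after the first call, falls back gracefully when outlines is not installed, and adds `outlines>=0.1.0` to the training optional dependencies. The sketch below does not perform constrained decoding; it only illustrates the surrounding wiring with the standard library: an installed-package check that gates the flag, a cached compile of an action-format regex, and post-hoc parsing for the unconstrained path. The grammar (`CLICK`, `TYPE`, `WAIT`, `DONE` with an optional `Thought:` line) and all function names are hypothetical; the project's real regex is not part of this commit.

```python
import functools
import importlib.util
import re
from typing import Optional

# Hypothetical action grammar (the project's real regex is not shown in this commit):
# an optional free-form "Thought: ..." line followed by exactly one action.
ACTION_PATTERN = (
    r"(Thought: [^\n]*\n)?"
    r'(CLICK\(\d+, \d+\)|TYPE\("[^"]*"\)|WAIT\(\)|DONE\(\))'
)


@functools.lru_cache(maxsize=None)
def compiled_action_regex(pattern: str = ACTION_PATTERN) -> "re.Pattern[str]":
    # Compile once and reuse, echoing the changelog's "cached after first call".
    return re.compile(pattern)


def outlines_available() -> bool:
    # find_spec returns None when the package is not installed, without importing it.
    return importlib.util.find_spec("outlines") is not None


def use_constrained_decoding(requested: bool) -> bool:
    # Honor the config flag only when the optional dependency is present.
    return requested and outlines_available()


def parse_action(text: str) -> Optional[str]:
    # Unconstrained path: validate after generation; None marks an unparseable output.
    match = compiled_action_regex().fullmatch(text.strip())
    return match.group(2) if match else None


print(parse_action("Thought: open the menu\nCLICK(120, 45)"))  # CLICK(120, 45)
print(parse_action("I think I should click the button"))  # None
print(use_constrained_decoding(False))  # False
```

On the unconstrained path, outputs like the second example are simply rejected; the changelog attributes 5-15% of wasted rollouts to that case, which constrained decoding is meant to eliminate.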