Commit 2263fa6

chore: release 0.64.1
Author: semantic-release
1 parent e2f0928 commit 2263fa6

2 files changed

Lines changed: 38 additions & 1 deletion

CHANGELOG.md

Lines changed: 37 additions & 0 deletions
@@ -1,6 +1,43 @@
 # CHANGELOG
 
 
+## v0.64.1 (2026-03-23)
+
+### Bug Fixes
+
+- Address flywheel regression bugs (VM reset, demo validation, alignment)
+([#187](https://github.com/OpenAdaptAI/openadapt-evals/pull/187),
+[`e2f0928`](https://github.com/OpenAdaptAI/openadapt-evals/commit/e2f0928b93817fe32a385ead8d27ed596df91378))
+
+Fix five interacting bugs that caused the demo guidance regression (score 1.0 -> 0.5) on the
+notepad-hello task:
+
+1. Add VM reset between phases: Phase 1 artifacts (open windows, typed text) leaked into Phase 3.
+New --reset-between-phases flag (default True) re-runs task setup commands to restore a clean
+desktop.
+
+2. Validate demo quality: Before using a demo, check for placeholder screenshots, identical
+screenshots across steps, and doubled action text (e.g., "Hello WorldHello world") indicating a
+failed run. Warnings are logged and saved to demo_quality_warnings.json.
+
+3. Force sequential alignment for short demos: When the demo has < 5 steps, disable pHash visual
+alignment (which cannot distinguish similar desktop screenshots) and use sequential step index
+instead. New use_visual_alignment parameter threaded through DemoGuidedAgent.
+
+4. Remove step counts from guidance prompt: The "step N/N" prefix in DEMONSTRATION GUIDANCE caused
+the planner to interpret "last step" as "task is done" and prematurely signal DONE. Guidance now
+describes WHAT to do without revealing position in the demo sequence.
+
+5. Evaluate on fresh screenshot: evaluate_dense() now takes a fresh screenshot from the adapter
+instead of using a cached one from a previous step/phase. Falls back to cached on failure.
+
+Also adds task navigational ambiguity analysis identifying tasks where demo guidance should help
+most (high: Chrome clear data, VLC preferences, LibreOffice formatting; low: notepad, desktop
+folder, VS Code replace).
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+
 ## v0.64.0 (2026-03-23)
 
 ### Features
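
Reviewer note on item 2 of the v0.64.1 entry: the demo quality checks it describes could look roughly like the sketch below. This is not the code from PR #187. The step layout (a dict with `screenshot` bytes and `action_text`), the function names, and the "two equal halves, ignoring case" rule for doubled text are all assumptions made for illustration; only the output filename `demo_quality_warnings.json` comes from the changelog. The placeholder-screenshot check is omitted because the changelog does not say how a placeholder is recognized.

```python
import hashlib
import json
from pathlib import Path


def is_doubled_text(text: str) -> bool:
    """Return True if text is one string repeated twice, ignoring case.

    Assumed heuristic: "Hello WorldHello world" splits into two halves
    that match case-insensitively.
    """
    stripped = text.strip()
    half, remainder = divmod(len(stripped), 2)
    if half == 0 or remainder:
        return False
    return stripped[:half].lower() == stripped[half:].lower()


def find_demo_quality_warnings(steps: list[dict]) -> list[str]:
    """Collect warnings for identical screenshots and doubled action text.

    Each step is a hypothetical dict: {"screenshot": bytes, "action_text": str}.
    """
    warnings: list[str] = []
    first_seen: dict[str, int] = {}  # screenshot digest -> first step index
    for index, step in enumerate(steps):
        digest = hashlib.sha256(step["screenshot"]).hexdigest()
        if digest in first_seen:
            warnings.append(
                f"step {index}: screenshot identical to step {first_seen[digest]}"
            )
        else:
            first_seen[digest] = index
        action_text = step.get("action_text", "")
        if is_doubled_text(action_text):
            warnings.append(f"step {index}: doubled action text {action_text!r}")
    return warnings


def save_warnings(warnings: list[str], directory: str) -> Path:
    """Write the warnings next to the demo as demo_quality_warnings.json."""
    path = Path(directory) / "demo_quality_warnings.json"
    path.write_text(json.dumps(warnings, indent=2))
    return path
```

The halves rule also flags short legitimate inputs such as "aa", so a real check would probably want a minimum length or a smarter comparison.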

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "openadapt-evals"
-version = "0.64.0"
+version = "0.64.1"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.10"
