|
1 | 1 | # CHANGELOG |
2 | 2 |
|
3 | 3 |
|
| 4 | +## v0.64.1 (2026-03-23) |
| 5 | + |
| 6 | +### Bug Fixes |
| 7 | + |
| 8 | +- Address flywheel regression bugs (VM reset, demo validation, alignment) |
| 9 | + ([#187](https://github.com/OpenAdaptAI/openadapt-evals/pull/187), |
| 10 | + [`e2f0928`](https://github.com/OpenAdaptAI/openadapt-evals/commit/e2f0928b93817fe32a385ead8d27ed596df91378)) |
| 11 | + |
| 12 | +Fix five interacting bugs that caused the demo guidance regression (score 1.0 -> 0.5) on the |
| 13 | + notepad-hello task: |
| 14 | + |
| 15 | +1. Add VM reset between phases: Phase 1 artifacts (open windows, typed text) leaked into Phase 3. |
| 16 | + The new `--reset-between-phases` flag (default: True) re-runs the task setup commands to restore |
| 17 | + a clean desktop. |
| 18 | + |
| 19 | +2. Validate demo quality: Before using a demo, check for placeholder screenshots, identical |
| 20 | + screenshots across steps, and doubled action text (e.g., "Hello WorldHello world") indicating a |
| 21 | + failed run. Warnings are logged and saved to demo_quality_warnings.json. |
| 22 | + |
| 23 | +3. Force sequential alignment for short demos: When the demo has fewer than 5 steps, disable pHash |
| 24 | + visual alignment (which cannot distinguish similar desktop screenshots) and use the sequential step |
| 25 | + index instead. A new `use_visual_alignment` parameter is threaded through `DemoGuidedAgent`. |
| 26 | + |
| 27 | +4. Remove step counts from guidance prompt: The "step N/N" prefix in DEMONSTRATION GUIDANCE caused |
| 28 | + the planner to interpret "last step" as "task is done" and prematurely signal DONE. Guidance now |
| 29 | + describes WHAT to do without revealing position in the demo sequence. |
| 30 | + |
| 31 | +5. Evaluate on fresh screenshot: `evaluate_dense()` now takes a fresh screenshot from the adapter |
| 32 | + instead of reusing a cached one from a previous step or phase, and falls back to the cached screenshot if the capture fails. |
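The fresh-screenshot fallback in item 5 can be illustrated with a minimal sketch. This is not the project's `evaluate_dense()` implementation: the function name `pick_evaluation_screenshot`, the `capture` callable, and the rule that an empty capture counts as a failure are all assumptions made for illustration.

```python
from typing import Callable, Optional


def pick_evaluation_screenshot(
    capture: Callable[[], bytes],
    cached: Optional[bytes],
) -> Optional[bytes]:
    """Return a freshly captured screenshot, or the cached one if capture fails.

    `capture` stands in for whatever the adapter exposes (hypothetical).
    """
    try:
        fresh = capture()
        if fresh:  # assumption: an empty capture is treated as a failure
            return fresh
    except Exception:
        # Any adapter error falls through to the cached screenshot.
        pass
    return cached


def broken_capture() -> bytes:
    """Simulates an adapter that cannot take a screenshot."""
    raise RuntimeError("adapter unavailable")


if __name__ == "__main__":
    print(pick_evaluation_screenshot(lambda: b"new", b"old"))
    print(pick_evaluation_screenshot(broken_capture, b"old"))
```

Preferring the fresh capture keeps the evaluation tied to the current desktop state, while the fallback means an evaluation still gets whatever screenshot was cached if the adapter is briefly unavailable.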
| 33 | + |
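The demo checks in items 2 and 3 can be sketched as a small preflight step. This is a hedged illustration, not the code from the pull request: `DemoStep`, `has_doubled_text`, `demo_quality_warnings`, and `should_use_visual_alignment` are hypothetical names, and treating an empty screenshot as a placeholder is an assumption. Only the 5-step threshold and the kinds of checks come from the changelog entry.

```python
from dataclasses import dataclass
from typing import List

SHORT_DEMO_THRESHOLD = 5  # from the changelog: demos with < 5 steps skip pHash alignment


@dataclass
class DemoStep:
    """Hypothetical demo step: raw screenshot bytes plus any typed text."""
    screenshot: bytes
    typed_text: str = ""


def has_doubled_text(text: str) -> bool:
    """True if the text is one string typed twice in a row, ignoring case."""
    n = len(text)
    if n < 2 or n % 2:
        return False
    half = n // 2
    return text[:half].lower() == text[half:].lower()


def demo_quality_warnings(steps: List[DemoStep]) -> List[str]:
    """Collect warnings that suggest the demo came from a failed run."""
    warnings = []
    # Assumption: an empty screenshot stands in for a placeholder image.
    for i, step in enumerate(steps):
        if not step.screenshot:
            warnings.append(f"step {i}: placeholder screenshot")
    # Every step showing the same image suggests the screen never changed.
    if len(steps) > 1 and len({s.screenshot for s in steps}) == 1:
        warnings.append("identical screenshots across steps")
    for i, step in enumerate(steps):
        if has_doubled_text(step.typed_text):
            warnings.append(f"step {i}: doubled action text {step.typed_text!r}")
    return warnings


def should_use_visual_alignment(num_steps: int) -> bool:
    """Short demos fall back to sequential step-index alignment."""
    return num_steps >= SHORT_DEMO_THRESHOLD


if __name__ == "__main__":
    demo = [DemoStep(b"img", "Hello WorldHello world"), DemoStep(b"img")]
    for warning in demo_quality_warnings(demo):
        print(warning)
    print("visual alignment:", should_use_visual_alignment(len(demo)))
```

The doubled-text check compares the two halves case-insensitively so that a string such as "Hello WorldHello world" is flagged even though the halves differ in capitalization.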
| 34 | +Also adds a task navigational ambiguity analysis that identifies the tasks where demo guidance should |
| 35 | + help most (high ambiguity: Chrome clear data, VLC preferences, LibreOffice formatting; low ambiguity: |
| 36 | + notepad, desktop folder, VS Code replace). |
| 37 | + |
| 38 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 39 | + |
| 40 | + |
4 | 41 | ## v0.64.0 (2026-03-23) |
5 | 42 |
|
6 | 43 | ### Features |
|