Commit 3528bf9

semantic-release committed
chore: release 0.47.0
1 parent b311e13 commit 3528bf9

2 files changed

Lines changed: 115 additions & 1 deletion

CHANGELOG.md

Lines changed: 114 additions & 0 deletions
@@ -1,6 +1,120 @@
 # CHANGELOG
 
 
+## v0.47.0 (2026-03-19)
+
+### Bug Fixes
+
+- Align trace report test assertions with reset-step behavior
+([#141](https://github.com/OpenAdaptAI/openadapt-evals/pull/141),
+[`8c6b815`](https://github.com/OpenAdaptAI/openadapt-evals/commit/8c6b8155e3dd95b548cb4454efd37c63bb0057b0))
+
+The test_report_with_trajectory test expected trajectory data from step_index=0 to appear in the
+report, but generate_trace_report.py skips trajectory metadata for Step 0 (Reset) by design.
+Updated assertions to match the actual report output: step_index=0 data is absent, while
+step_index=1 and 2 data appears correctly under Steps 1 and 2.
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+- Improve WAA VM infrastructure reliability
+([#145](https://github.com/OpenAdaptAI/openadapt-evals/pull/145),
+[`641e6e5`](https://github.com/OpenAdaptAI/openadapt-evals/commit/641e6e5adeeacf7007b73a8d05a7dc099a6d6f9f))
+
+1. Remote Docker build: add missing files (evaluate_server.py, start_with_evaluate.sh,
+patch_setup_ps1.py) to the SCP file list in _build_remote(). Without these, the Dockerfile COPY
+commands fail because the build context is incomplete.
+
+2. LibreOffice sed patch: replace fragile chained sed commands with a standalone Python patch script
+(patch_setup_ps1.py). The old second sed matched the wrong occurrence of Add-ToEnvPath after the
+first sed inserted text containing the same pattern.
+
+3. Chrome sign-in dialog: add _is_chrome_task() detection and _prepare_chrome_clean_state() to
+suppress the "Sign in to Chrome" modal that blocks automation on fresh VMs. Uses registry policies
+(BrowserSignin=0, SyncDisabled=1, PromotionalTabsEnabled=0) and creates the "First Run" sentinel
+file. Also adds Chrome first-run suppression to _apply_clean_desktop_policy() and to the
+Dockerfile FirstLogonCommands for defense-in-depth.
+
+4. Default CMD: add CMD directive to Dockerfile so containers don't exit immediately if started
+without explicit command arguments.
+
+5. start_with_evaluate.sh: add fallback to /run/entry.sh when no CMD arguments are provided (empty
+$@).
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
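Item 2 of the entry above describes a general hazard: a chained substitution rescans text that an earlier substitution already changed, so the later pattern can hit the inserted text instead of the intended call. The Python sketch below reproduces that failure mode on a made-up script fragment and contrasts it with a single-pass patch. It is not the contents of patch_setup_ps1.py; the fragment, the comment text, and the replacement name `Add-ToEnvPathSafe` are all hypothetical.

```python
# Hypothetical stand-in for a PowerShell setup script; not the real setup.ps1.
ORIGINAL = "\n".join([
    "Install-LibreOffice",
    "Add-ToEnvPath $officePath",
    "Write-Host done",
])

TARGET = "Add-ToEnvPath $officePath"        # the line we actually want to change
NOTE = "# Add-ToEnvPath is guarded below"   # inserted text that repeats the pattern


def chained_patch(text: str) -> str:
    """Mimic two chained substitutions that each rescan the whole text."""
    # Step 1 inserts a comment line that itself contains "Add-ToEnvPath".
    text = text.replace(TARGET, NOTE + "\n" + TARGET, 1)
    # Step 2 rewrites the first "Add-ToEnvPath" it finds, which is now the
    # comment from step 1 rather than the real call.
    return text.replace("Add-ToEnvPath", "Add-ToEnvPathSafe", 1)


def single_pass_patch(text: str) -> str:
    """Apply both changes while walking the original lines exactly once."""
    out = []
    patched = False
    for line in text.split("\n"):
        if not patched and line == TARGET:
            out.append(NOTE)
            out.append(line.replace("Add-ToEnvPath", "Add-ToEnvPathSafe", 1))
            patched = True
        else:
            out.append(line)
    if not patched:
        raise ValueError("target line not found; refusing to patch")
    return "\n".join(out)


chained = chained_patch(ORIGINAL).split("\n")
single = single_pass_patch(ORIGINAL).split("\n")
```

In the chained version the comment is rewritten and the real call is left untouched; the single-pass version only ever matches lines from the original input, and it fails loudly when the target is missing.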
+- Use persistent storage for WAA data instead of ephemeral /mnt
+([#144](https://github.com/OpenAdaptAI/openadapt-evals/pull/144),
+[`bef392d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/bef392dbfd16477fbeed4459991d225715bc08f6))
+
+Replace all /mnt/waa-storage with WAA_STORAGE_DIR constant pointing to /home/azureuser/waa-storage
+(persistent OS disk). Azure /mnt is ephemeral temp storage wiped on every deallocate, causing
+15-20 min cold reinstalls.
+
+Also adds --os-disk-size-gb 128 to single-VM cmd_create path.
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
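The entry above replaces scattered path literals with a single constant. A minimal sketch of that pattern follows, assuming a helper-based layout. Only the name `WAA_STORAGE_DIR` and the two paths come from the changelog; `storage_path`, `is_ephemeral`, and the file name `example.img` are hypothetical and may not match the repository.

```python
from pathlib import PurePosixPath

# Value taken from the changelog entry: persistent OS disk on the Azure VM.
WAA_STORAGE_DIR = PurePosixPath("/home/azureuser/waa-storage")

# Azure's temporary disk mount point, per the changelog entry.
EPHEMERAL_ROOT = PurePosixPath("/mnt")


def is_ephemeral(path) -> bool:
    """Return True if the path is /mnt itself or lives underneath it."""
    p = PurePosixPath(path)
    return p == EPHEMERAL_ROOT or EPHEMERAL_ROOT in p.parents


def storage_path(*parts: str) -> PurePosixPath:
    """Build a path under WAA_STORAGE_DIR (hypothetical helper)."""
    path = WAA_STORAGE_DIR.joinpath(*parts)
    if is_ephemeral(path):
        raise ValueError(f"{path} is on ephemeral storage")
    return path


# Hypothetical file name, used only to show the call shape.
example = storage_path("golden", "example.img")
```

Routing every path through one constant means a future move to a different disk is a one-line change, and a guard like `is_ephemeral` can catch a stray `/mnt` literal in tests.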
+- Use structured planner output to prevent compound instruction drops
+([#146](https://github.com/OpenAdaptAI/openadapt-evals/pull/146),
+[`b311e13`](https://github.com/OpenAdaptAI/openadapt-evals/commit/b311e13de7fd548849e6e97efaa0c1ccd81afeb8))
+
+* fix: use structured planner output to prevent compound instruction drops
+
+The planner prompt now outputs structured action fields (action_type, action_value,
+target_description) instead of free-form instruction text. This fixes the compound instruction
+problem where type X then press Enter would only execute the type, dropping the Enter keypress.
+
+Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+* fix: Chrome task setup dismisses sign-in dialog + compound instruction research
+
+Chrome tasks now launch with --no-first-run --disable-sync and press Escape to dismiss any sign-in
+dialog. Settings task navigates directly to chrome://settings/cookies via CLI arg.
+
+Also adds compound instruction research doc.
+
+---------
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
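The structured-output fix above can be pictured with a small sketch. The field names `action_type`, `action_value`, and `target_description` come from the changelog. The rest is an assumption made for illustration: that the planner emits JSON, that a compound instruction arrives as a list of steps, and that `"type"` and `"key"` are valid action types. The repository's actual schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannerAction:
    """One executable step; field names follow the changelog entry."""
    action_type: str
    action_value: str
    target_description: str


def parse_planner_output(raw: str) -> list[PlannerAction]:
    """Parse planner JSON into discrete actions (hypothetical helper).

    Accepts either a single object or a list of objects, so a compound
    instruction such as "type X then press Enter" stays as two steps.
    """
    steps = json.loads(raw)
    if isinstance(steps, dict):
        steps = [steps]
    return [
        PlannerAction(
            action_type=step["action_type"],  # required: fail loudly if absent
            action_value=step.get("action_value", ""),
            target_description=step.get("target_description", ""),
        )
        for step in steps
    ]


# Assumed planner response for "type hello then press Enter".
raw = json.dumps([
    {"action_type": "type", "action_value": "hello", "target_description": "search box"},
    {"action_type": "key", "action_value": "Enter", "target_description": "search box"},
])
actions = parse_planner_output(raw)
```

With free-form text, an executor has to guess where one action ends and the next begins. With explicit records, each step is a separate object, so the trailing Enter cannot be silently merged into the typing step.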
+### Documentation
+
+- Comprehensive workflow extraction pipeline + AReaL evaluation
+([`e5e848d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/e5e848d8be857a4df7b94b80f761edc429f2cb4c))
+
+Workflow extraction pipeline (1350 lines, self-contained): - 4-pass pipeline: PII scrub → VLM
+transcript → workflow extraction → cosine matching - All Pydantic classes inline (11 classes) -
+Simple cosine similarity threshold (>0.85) instead of HDBSCAN - Full test strategy with synthetic
+data families - Cost analysis, integration points, file layout
+
+AReaL evaluation: recommended as training backend.
+
+Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+### Features
+
+- Add workflow extraction Pydantic models, WAA adapter, and matching pipeline
+([#142](https://github.com/OpenAdaptAI/openadapt-evals/pull/142),
+[`fea40b6`](https://github.com/OpenAdaptAI/openadapt-evals/commit/fea40b63103b01854d74ea5f0b0dfb8cb5304cb3))
+
+Implement Priority 1 of the workflow extraction pipeline: - Pydantic models for RecordingSession,
+Workflow, CanonicalWorkflow, WorkflowLibrary - WAARecordingAdapter to parse WAA meta.json
+recordings into normalized sessions - Cosine similarity matching for grouping workflows into
+canonical workflows - 31 tests with synthetic data families (settings toggles, spreadsheet entry,
+document formatting, file archiving) validating models, adapter, and matching
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
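The cosine-similarity grouping mentioned above, together with the >0.85 threshold cited in the Documentation entry, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the greedy first-match strategy, the function names, and the toy two-dimensional vectors are hypothetical, and the changelog does not say how the repository assigns workflows to canonical groups.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # value quoted in the Documentation entry


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def group_by_similarity(
    embeddings: list[list[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> list[list[int]]:
    """Greedy grouping: each vector joins the first group whose
    representative (its first member) it matches above the threshold,
    otherwise it starts a new group. Returns groups of input indices."""
    groups: list[list[int]] = []
    for i, vec in enumerate(embeddings):
        for group in groups:
            if cosine_similarity(embeddings[group[0]], vec) > threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Toy embeddings: the first two point in nearly the same direction.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
groups = group_by_similarity(vectors)
```

A fixed threshold has one tunable number and deterministic output, which is one plausible reason to prefer it over a density-based clusterer such as HDBSCAN on small synthetic test families. The trade-off is that greedy assignment depends on input order.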
+- Add workflow transcript generation pipeline (Pass 0 + Pass 1)
+([#143](https://github.com/OpenAdaptAI/openadapt-evals/pull/143),
+[`329826e`](https://github.com/OpenAdaptAI/openadapt-evals/commit/329826e1a8a8ce563cf04f7c912c886555c44629))
+
+Pass 0: PII scrubbing wrapper. Pass 1: VLM-based transcript with batched screenshots, robust
+parsing, cost estimation. 14 tests.
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+
 ## v0.46.0 (2026-03-19)
 
 ### Features

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "openadapt-evals"
-version = "0.46.0"
+version = "0.47.0"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.10"
