1 | 1 | # CHANGELOG |
2 | 2 |
3 | 3 |
| 4 | +## v0.47.0 (2026-03-19) |
| 5 | + |
| 6 | +### Bug Fixes |
| 7 | + |
| 8 | +- Align trace report test assertions with reset-step behavior |
| 9 | + ([#141](https://github.com/OpenAdaptAI/openadapt-evals/pull/141), |
| 10 | + [`8c6b815`](https://github.com/OpenAdaptAI/openadapt-evals/commit/8c6b8155e3dd95b548cb4454efd37c63bb0057b0)) |
| 11 | + |
| 12 | +The test_report_with_trajectory test expected trajectory data from step_index=0 to appear in the |
| 13 | + report, but generate_trace_report.py skips trajectory metadata for Step 0 (Reset) by design. |
| 14 | + Updated assertions to match the actual report output: step_index=0 data is absent, while |
| 15 | + step_index=1 and 2 data appears correctly under Steps 1 and 2. |
| 16 | + |
| 17 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 18 | + |
| 19 | +- Improve WAA VM infrastructure reliability |
| 20 | + ([#145](https://github.com/OpenAdaptAI/openadapt-evals/pull/145), |
| 21 | + [`641e6e5`](https://github.com/OpenAdaptAI/openadapt-evals/commit/641e6e5adeeacf7007b73a8d05a7dc099a6d6f9f)) |
| 22 | + |
| 23 | +1. Remote Docker build: add missing files (evaluate_server.py, start_with_evaluate.sh, |
| 24 | + patch_setup_ps1.py) to the SCP file list in _build_remote(). Without these, the Dockerfile COPY |
| 25 | + commands fail because the build context is incomplete. |
| 26 | + |
| 27 | +2. LibreOffice sed patch: replace fragile chained sed commands with a standalone Python patch script |
| 28 | + (patch_setup_ps1.py). The old second sed matched the wrong occurrence of Add-ToEnvPath after the |
| 29 | + first sed inserted text containing the same pattern. |
| 30 | + |
| 31 | +3. Chrome sign-in dialog: add _is_chrome_task() detection and _prepare_chrome_clean_state() to |
| 32 | + suppress the "Sign in to Chrome" modal that blocks automation on fresh VMs. Uses registry policies |
| 33 | + (BrowserSignin=0, SyncDisabled=1, PromotionalTabsEnabled=0) and creates the "First Run" sentinel |
| 34 | + file. Also adds Chrome first-run suppression to _apply_clean_desktop_policy() and to the |
| 35 | + Dockerfile FirstLogonCommands for defense-in-depth. |
| 36 | + |
| 37 | +4. Default CMD: add CMD directive to Dockerfile so containers don't exit immediately if started |
| 38 | + without explicit command arguments. |
| 39 | + |
| 40 | +5. start_with_evaluate.sh: add fallback to /run/entry.sh when no CMD arguments are provided (empty |
| 41 | + $@). |
| 42 | + |
| 43 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 44 | + |
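The failure mode described in item 2 above can be reproduced in a few lines. This is only a sketch under stated assumptions: the script lines, the inserted `Refresh-Env` line, and the helpers `insert_after_first` and `patch_all` are hypothetical and are not taken from patch_setup_ps1.py. It shows why chained first-match edits can hit text inserted by an earlier edit, and one way to avoid that by resolving every anchor against the unpatched text first.

```python
def insert_after_first(lines, needle, new_line):
    """Insert new_line after the first line containing needle (first match wins)."""
    for i, line in enumerate(lines):
        if needle in line:
            return lines[: i + 1] + [new_line] + lines[i + 1 :]
    raise ValueError(f"pattern not found: {needle!r}")


def patch_all(lines, patches):
    """Resolve all anchors on the unpatched text, then insert from the bottom up."""
    resolved = []
    for needle, new_line in patches:
        index = next((i for i, line in enumerate(lines) if needle in line), None)
        if index is None:
            raise ValueError(f"pattern not found: {needle!r}")
        resolved.append((index, new_line))
    out = list(lines)
    # Inserting at the highest index first keeps the lower indices valid.
    for index, new_line in sorted(resolved, key=lambda p: p[0], reverse=True):
        out.insert(index + 1, new_line)
    return out


# Hypothetical stand-in for the setup script; not the real file contents.
ORIGINAL = ["Install-LibreOffice", "Install-Chrome", "Add-ToEnvPath chrome"]
PATCHES = [
    ("Install-LibreOffice", "Add-ToEnvPath libreoffice"),
    ("Add-ToEnvPath", "Refresh-Env"),  # meant for the existing chrome line
]

# Chained edits: the second edit matches the line the first edit just inserted.
chained = ORIGINAL
for needle, new_line in PATCHES:
    chained = insert_after_first(chained, needle, new_line)

# Anchors resolved up front: the second edit lands after the chrome line.
safe = patch_all(ORIGINAL, PATCHES)
```

In the chained result, `Refresh-Env` ends up directly after the newly inserted libreoffice line; in the `patch_all` result it follows the original chrome line as intended.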
| 45 | +- Use persistent storage for WAA data instead of ephemeral /mnt |
| 46 | + ([#144](https://github.com/OpenAdaptAI/openadapt-evals/pull/144), |
| 47 | + [`bef392d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/bef392dbfd16477fbeed4459991d225715bc08f6)) |
| 48 | + |
| 49 | +Replace all /mnt/waa-storage references with a WAA_STORAGE_DIR constant pointing to |
| 50 | + /home/azureuser/waa-storage (persistent OS disk). Azure /mnt is ephemeral temp storage that is |
| 51 | + wiped on every deallocate, causing 15-20 min cold reinstalls. |
| 52 | + |
| 53 | +Also adds --os-disk-size-gb 128 to the single-VM cmd_create path. |
| 54 | + |
| 55 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 56 | + |
| 57 | +- Use structured planner output to prevent compound instruction drops |
| 58 | + ([#146](https://github.com/OpenAdaptAI/openadapt-evals/pull/146), |
| 59 | + [`b311e13`](https://github.com/OpenAdaptAI/openadapt-evals/commit/b311e13de7fd548849e6e97efaa0c1ccd81afeb8)) |
| 60 | + |
| 61 | +* fix: use structured planner output to prevent compound instruction drops |
| 62 | + |
| 63 | +The planner prompt now outputs structured action fields (action_type, action_value, |
| 64 | + target_description) instead of free-form instruction text. This fixes the compound instruction |
| 65 | + problem where "type X then press Enter" would execute only the type action, dropping the Enter keypress. |
| 66 | + |
| 67 | +Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 68 | + |
| 69 | +* fix: Chrome task setup dismisses sign-in dialog + compound instruction research |
| 70 | + |
| 71 | +Chrome tasks now launch with --no-first-run --disable-sync and press Escape to dismiss any sign-in |
| 72 | + dialog. Settings task navigates directly to chrome://settings/cookies via CLI arg. |
| 73 | + |
| 74 | +Also adds compound instruction research doc. |
| 75 | + |
| 76 | +--------- |
| 77 | + |
| 78 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 79 | + |
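To make the structured-output idea concrete, here is a minimal sketch. Only the three field names (action_type, action_value, target_description) come from the entry above. The JSON list shape, the `PlannedAction` class, the `parse_planner_output` helper, and the allowed action types are assumptions for illustration and may not match the planner's actual schema.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set of action types; the real planner may use different names.
ALLOWED_ACTION_TYPES = {"click", "type", "key"}


@dataclass(frozen=True)
class PlannedAction:
    action_type: str
    action_value: Optional[str] = None
    target_description: Optional[str] = None


def parse_planner_output(raw_json: str) -> List[PlannedAction]:
    """Parse a JSON list of action objects into one PlannedAction per step."""
    actions = []
    for item in json.loads(raw_json):
        action = PlannedAction(
            action_type=item["action_type"],
            action_value=item.get("action_value"),
            target_description=item.get("target_description"),
        )
        if action.action_type not in ALLOWED_ACTION_TYPES:
            raise ValueError(f"unknown action_type: {action.action_type!r}")
        actions.append(action)
    return actions


# "Type X then press Enter" expressed as two explicit steps (assumed shape).
RAW = json.dumps([
    {"action_type": "type", "action_value": "X", "target_description": "search box"},
    {"action_type": "key", "action_value": "Enter"},
])
ACTIONS = parse_planner_output(RAW)
```

Because each step is a separate record, an executor that iterates over `ACTIONS` has no free-form sentence to interpret, so the Enter keypress is not lost when the typing step is extracted.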
| 80 | +### Documentation |
| 81 | + |
| 82 | +- Comprehensive workflow extraction pipeline + AReaL evaluation |
| 83 | + ([`e5e848d`](https://github.com/OpenAdaptAI/openadapt-evals/commit/e5e848d8be857a4df7b94b80f761edc429f2cb4c)) |
| 84 | + |
| 85 | +Workflow extraction pipeline (1350 lines, self-contained): a 4-pass pipeline (PII scrub → VLM |
| 86 | + transcript → workflow extraction → cosine matching); all 11 Pydantic classes inline; a simple |
| 87 | + cosine similarity threshold (>0.85) instead of HDBSCAN; a full test strategy with synthetic |
| 88 | + data families; and cost analysis, integration points, and file layout. |
| 89 | + |
| 90 | +AReaL evaluation: recommended as training backend. |
| 91 | + |
| 92 | +Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 93 | + |
| 94 | +### Features |
| 95 | + |
| 96 | +- Add workflow extraction Pydantic models, WAA adapter, and matching pipeline |
| 97 | + ([#142](https://github.com/OpenAdaptAI/openadapt-evals/pull/142), |
| 98 | + [`fea40b6`](https://github.com/OpenAdaptAI/openadapt-evals/commit/fea40b63103b01854d74ea5f0b0dfb8cb5304cb3)) |
| 99 | + |
| 100 | +Implement Priority 1 of the workflow extraction pipeline: Pydantic models for RecordingSession, |
| 101 | + Workflow, CanonicalWorkflow, and WorkflowLibrary; a WAARecordingAdapter that parses WAA meta.json |
| 102 | + recordings into normalized sessions; cosine similarity matching for grouping workflows into |
| 103 | + canonical workflows; and 31 tests with synthetic data families (settings toggles, spreadsheet |
| 104 | + entry, document formatting, file archiving) validating models, adapter, and matching. |
| 105 | + |
| 106 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 107 | + |
| 108 | +- Add workflow transcript generation pipeline (Pass 0 + Pass 1) |
| 109 | + ([#143](https://github.com/OpenAdaptAI/openadapt-evals/pull/143), |
| 110 | + [`329826e`](https://github.com/OpenAdaptAI/openadapt-evals/commit/329826e1a8a8ce563cf04f7c912c886555c44629)) |
| 111 | + |
| 112 | +Pass 0: PII scrubbing wrapper. Pass 1: VLM-based transcript with batched screenshots, robust |
| 113 | + parsing, cost estimation. 14 tests. |
| 114 | + |
| 115 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 116 | + |
| 117 | + |
4 | 118 | ## v0.46.0 (2026-03-19) |
5 | 119 |
6 | 120 | ### Features |