Commit 69ba370

chore: release 0.64.0
Author: semantic-release
1 parent 748534b, commit 69ba370

2 files changed

Lines changed: 31 additions & 1 deletion

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -1,6 +1,36 @@
 # CHANGELOG
 
 
+## v0.64.0 (2026-03-23)
+
+### Features
+
+- Automate full VM lifecycle in correction flywheel script
+([#186](https://github.com/OpenAdaptAI/openadapt-evals/pull/186),
+[`748534b`](https://github.com/OpenAdaptAI/openadapt-evals/commit/748534bffd3b024cd587aec22abc2697f511af6f))
+
+Integrate all manual infrastructure steps so the flywheel runs end-to-end deterministically with a
+single command:
+
+python scripts/run_correction_flywheel.py \ --task-config
+example_tasks/clear-browsing-data-chrome.yaml \ --demo-dir ./demos --manage-vm --setup-tunnels
+
+New infrastructure functions (inline, matching azure_vm.py patterns): - start_vm / get_vm_ip /
+get_vm_state / wait_for_ssh / deallocate_vm - start_container (docker start or docker run with
+correct flags) - apply_iptables_fix (exempt port 5050 from DNAT, idempotent) - setup_tunnels (kill
+stale, create SSH tunnels for 5001/5050/8006) - setup_eval_proxy (socat bridge for evaluate
+server) - wait_for_waa (poll /probe through tunnel)
+
+Design decisions: - --manage-vm flag: opt-in VM start/deallocate lifecycle - --setup-tunnels flag:
+opt-in tunnel setup with port cleanup - --baseline-model / --guided-model: use different planner
+models for Phase 1 vs Phase 3 (e.g., gpt-4o-mini baseline to ensure failure) - VM deallocate in
+try/finally (always runs, even on error) - Phase errors are caught individually; report always
+generated with partial results - All operations are idempotent (safe to re-run) - --mock mode
+unchanged (no VM management needed)
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+
 ## v0.63.0 (2026-03-22)
 
 ### Features
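
Reviewer note: the design decisions listed in the v0.64.0 entry (deallocate in try/finally, phase errors caught individually, a report with partial results, polling until a service is ready) can be illustrated with a minimal Python sketch. This is not the code from scripts/run_correction_flywheel.py. The names `wait_until` and `run_phases`, the report layout, and the phase names are hypothetical, and the VM functions are passed in as plain callables so the sketch runs without any cloud access.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or timeout_s elapses.

    Hypothetical stand-in for a readiness check such as wait_for_waa;
    the real function polls /probe through a tunnel.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def run_phases(
    phases: Iterable[Tuple[str, Callable[[], Any]]],
    start_vm: Callable[[], None],
    deallocate_vm: Callable[[], None],
    manage_vm: bool = False,
) -> Dict[str, Dict[str, Any]]:
    """Run every phase, recording failures instead of aborting the run."""
    report: Dict[str, Dict[str, Any]] = {}
    try:
        if manage_vm:
            start_vm()
        for name, phase in phases:
            try:
                report[name] = {"ok": True, "result": phase()}
            except Exception as exc:  # one failed phase must not stop the rest
                report[name] = {"ok": False, "error": repr(exc)}
    finally:
        # Runs on success and on error; assumes deallocate_vm is idempotent.
        if manage_vm:
            deallocate_vm()
    return report
```

With `manage_vm=False` (the analogue of omitting --manage-vm), neither VM callable is invoked, which matches the opt-in behaviour the entry describes. Because each phase has its own `except`, a failing phase still leaves entries for the phases before and after it in the returned report.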

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "openadapt-evals"
-version = "0.63.0"
+version = "0.64.0"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.10"
