  # CHANGELOG


+ ## v0.64.0 (2026-03-23)
+
+ ### Features
+
+ - Automate full VM lifecycle in correction flywheel script
+   ([#186](https://github.com/OpenAdaptAI/openadapt-evals/pull/186),
+   [`748534b`](https://github.com/OpenAdaptAI/openadapt-evals/commit/748534bffd3b024cd587aec22abc2697f511af6f))
+
+ Integrate all manual infrastructure steps so the flywheel runs end-to-end deterministically with a
+ single command:
+
+     python scripts/run_correction_flywheel.py \
+       --task-config example_tasks/clear-browsing-data-chrome.yaml \
+       --demo-dir ./demos --manage-vm --setup-tunnels
+
+ New infrastructure functions (inline, matching azure_vm.py patterns):
+
+ - start_vm / get_vm_ip / get_vm_state / wait_for_ssh / deallocate_vm
+ - start_container (docker start or docker run with correct flags)
+ - apply_iptables_fix (exempt port 5050 from DNAT, idempotent)
+ - setup_tunnels (kill stale, create SSH tunnels for 5001/5050/8006)
+ - setup_eval_proxy (socat bridge for evaluate server)
+ - wait_for_waa (poll /probe through tunnel)
+
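Reviewer sketch, not part of the diff: helpers such as wait_for_ssh and wait_for_waa in the list above are described as polling until a service is ready. A minimal version of that pattern, assuming a plain TCP reachability check, might look like the following. The name `wait_for_port`, its signature, and its defaults are hypothetical and are not taken from run_correction_flywheel.py.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires.

    Hypothetical helper: name, signature, and defaults are illustrative only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening (e.g. sshd on port 22).
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the service is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Returning a boolean instead of raising leaves the caller free to decide whether an unreachable service should abort the run or only be recorded in the report.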
+ Design decisions:
+
+ - --manage-vm flag: opt-in VM start/deallocate lifecycle
+ - --setup-tunnels flag: opt-in tunnel setup with port cleanup
+ - --baseline-model / --guided-model: use different planner models for Phase 1 vs Phase 3 (e.g.,
+   gpt-4o-mini baseline to ensure failure)
+ - VM deallocate in try/finally (always runs, even on error)
+ - Phase errors are caught individually; report always generated with partial results
+ - All operations are idempotent (safe to re-run)
+ - --mock mode unchanged (no VM management needed)
+
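Reviewer sketch, not part of the diff: two of the design decisions above (deallocate in try/finally, and per-phase error handling with a partial report) combine roughly as shown below. The function `run_phases`, its arguments, and the report shape are assumptions made for illustration; the script's actual structure may differ.

```python
from typing import Callable, Dict, List, Tuple


def run_phases(
    phases: List[Tuple[str, Callable[[], object]]],
    start_vm: Callable[[], None],
    deallocate_vm: Callable[[], None],
    manage_vm: bool = False,
) -> Dict[str, dict]:
    """Run each phase, record its result or error, and always deallocate the VM.

    Hypothetical sketch: the name, arguments, and report shape are assumptions.
    """
    report: Dict[str, dict] = {}
    try:
        if manage_vm:
            start_vm()
        for name, phase in phases:
            try:
                report[name] = {"ok": True, "result": phase()}
            except Exception as exc:
                # A failed phase is recorded and the remaining phases still run.
                report[name] = {"ok": False, "error": str(exc)}
    finally:
        # Runs whether the phases succeeded, failed, or start_vm itself raised.
        if manage_vm:
            deallocate_vm()
    return report
```

With `manage_vm=False` (the opt-in default in this sketch) neither VM callback is invoked, which mirrors the note that --mock mode needs no VM management.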
+ Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+
  ## v0.63.0 (2026-03-22)

  ### Features
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

  [project]
  name = "openadapt-evals"
- version = "0.63.0"
+ version = "0.64.0"
  description = "Evaluation infrastructure for GUI agent benchmarks"
  readme = "README.md"
  requires-python = ">=3.10"