feat: add interactive recording workflow with auto-infrastructure and VM IP detection by abrichr · Pull Request #57 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-01T19:02:00Z

Summary

Adds VM IP auto-detection, screen stability detection, interactive step correction during recording, and automatic infrastructure deployment (--auto flag) for the WAA demo recording workflow.

Commits

5373c19 fix: replace LibreOffice screenshot with full desktop view
- Swaps the README screenshot from a release-hosted crop to a local screenshots/waa_libreoffice_desktop.png showing the full macOS → Chrome → noVNC → Windows 11 → LibreOffice stack
6d0a3fb feat: add VM IP auto-detection and screen stability detection
- New openadapt_evals/infrastructure/vm_ip.py with resolve_vm_ip(): layered fallback (explicit → pool registry → Azure CLI)
- New _compare_screenshots() and _wait_for_stable_screen() for pixel-level screen stability detection (99.5% threshold, 3 consecutive checks)
- run_dc_eval.py and record_waa_demos.py use resolve_vm_ip() instead of hardcoded IPs
- 24 new tests across test_vm_ip.py and test_screen_stability.py
44db6e6 fix: regenerate suggested steps after task restart
- After a QEMU hard reset mid-recording, takes a fresh screenshot and regenerates VLM-suggested steps instead of reusing stale ones
e577823 feat: add interactive step correction during recording
- Users can type feedback at any step to refine remaining steps via VLM
- New functions: _refine_steps(), _refine_remaining_steps(), _interactive_step_review(), _interactive_remaining_review()
- Step parsing/formatting utilities: _parse_step_list(), _format_step_list(), _display_steps(), _display_current_step()
- Interactive commands: [Enter] advance, [d] done, [r] redo, [R] restart, [s] refresh, [x] retry, [u] undo, or type feedback
73473df fix: validate task args before VM IP resolution
- Guards against Fire passing True for --tasks when used without a value, before attempting VM IP resolution that would fail confusingly
26f3766 refactor: extract screen stability into module and recording loop into function
- Moves compare_screenshots and wait_for_stable_screen into openadapt_evals/infrastructure/screen_stability.py
- Removes fragile importlib hack from test_screen_stability.py — tests now import directly
- Extracts per-task recording loop into _record_single_task() for readability
- Fixes pre-existing bug: len(steps) → len(steps_meta) in completion message
05b261c feat: add --auto flag for automatic infrastructure deployment
- New --auto flag (and granular --auto-vm, --auto-tunnel, --auto-container) for record-waa
- Auto-recovery: starts VM, establishes SSH tunnels (prefers autossh), starts Docker container + socat proxy, waits for WAA readiness
- VM cleanup on exit: atexit + signal handlers offer to deallocate if script started the VM
- Checkpoint/resume system: saves recording state after each step, offers to resume on next run
- Pre-fetches task configs before QEMU reset to avoid stale socat bridge issues
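
The checkpoint/resume behaviour listed for the --auto commit can be pictured with a minimal sketch. The function names `save_checkpoint` and `load_checkpoint`, the single-file layout, and the JSON fields are assumptions for illustration only; the PR text does not describe the actual on-disk format.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, task_id: str, completed_steps: list[str]) -> None:
    """Write the recording state atomically (hypothetical format)."""
    state = {"task_id": task_id, "completed_steps": completed_steps}
    # Write to a temp file in the same directory, then rename over the target,
    # so an interrupted run never leaves a half-written checkpoint behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path, task_id: str) -> list[str]:
    """Return the steps already recorded for task_id, or [] if there are none."""
    if not path.exists():
        return []
    state = json.loads(path.read_text())
    if state.get("task_id") != task_id:
        # A checkpoint for a different task is ignored rather than resumed.
        return []
    return state.get("completed_steps", [])
```

Saving after every step keeps the amount of lost work to at most one step when a recording is interrupted.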

Files changed (11 files, +1706/-217)

| File | Change |
| --- | --- |
| `openadapt_evals/infrastructure/vm_ip.py` | New: VM IP auto-detection module |
| `openadapt_evals/infrastructure/screen_stability.py` | New: screen comparison + stability detection |
| `tests/test_vm_ip.py` | New: 14 tests for VM IP resolution |
| `tests/test_screen_stability.py` | New: 10 tests for screen stability |
| `screenshots/waa_libreoffice_desktop.png` | New: full desktop screenshot |
| `scripts/record_waa_demos.py` | Major: +1259/-217 lines (auto-infra, step correction, checkpoint/resume) |
| `scripts/run_dc_eval.py` | Minor: use `resolve_vm_ip()` |
| `openadapt_evals/infrastructure/__init__.py` | Exports new modules |
| `openadapt_evals/infrastructure/qemu_reset.py` | Docstring: remove hardcoded IP |
| `README.md` | Screenshot reference updated |
| `.beads/issues.jsonl` | Bead tracking |

Test plan

- `pytest tests/test_vm_ip.py tests/test_screen_stability.py -v`: 24/24 pass
- Manual: `python scripts/record_waa_demos.py record-waa --auto --tasks=04d9aeaf` to verify the auto-recovery flow
- Manual: verify checkpoint resume (interrupt mid-recording, re-run)
- Manual: verify step correction (type feedback during recording)

🤖 Generated with Claude Code

The previous screenshot showed only the Calc window. The new one shows the full context: macOS Chrome browser with noVNC tab, Windows 11 desktop inside QEMU, LibreOffice Calc welcome dialog, and Windows taskbar. This better demonstrates the VM evaluation infrastructure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add resolve_vm_ip() with layered resolution: explicit arg → pool registry (fast, local) → Azure CLI query (always accurate, ~3s)
- Remove hardcoded 172.173.66.131 defaults from record_waa_demos.py and run_dc_eval.py; --vm-ip is now auto-detected if omitted
- Add _wait_for_stable_screen() that polls QEMU framebuffer (free) until 3 consecutive screenshots match (99.5% similarity threshold), replacing the fixed time.sleep(3) that caused stale screenshots
- Add _compare_screenshots() with numpy-vectorized pixel comparison
- 24 new tests (14 for VM IP, 10 for screen stability)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
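
The layered resolution order (explicit arg, then pool registry, then Azure CLI) can be sketched as below. This is not the code in vm_ip.py: the name `resolve_vm_ip_sketch`, its parameters, and the injected lookup callables are assumptions chosen so the example runs without Azure access. Raising RuntimeError when every layer fails mirrors the behaviour mentioned in a later commit message.

```python
from typing import Callable, Optional


def resolve_vm_ip_sketch(
    explicit_ip: Optional[str] = None,
    lookup_pool_registry: Callable[[], Optional[str]] = lambda: None,
    query_azure_cli: Callable[[], Optional[str]] = lambda: None,
) -> str:
    """Return the first IP found, trying the cheapest source first."""
    # Layer 1: an explicit --vm-ip value always wins.
    if explicit_ip:
        return explicit_ip
    # Layer 2 (fast, local) then layer 3 (slower, authoritative).
    for source in (lookup_pool_registry, query_azure_cli):
        ip = source()
        if ip:
            return ip
    raise RuntimeError("Could not resolve VM IP; pass --vm-ip explicitly")
```

Passing the lookups in as callables keeps the fallback order testable without a pool registry file or the Azure CLI installed.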

When the user presses 'R' to restart a task, the QEMU hard reset produces a new stable screenshot, but the suggested steps were not regenerated. The stale steps from the previous screenshot were displayed. Now _generate_steps() is called again with the fresh screenshot after every restart.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After generating suggested steps from the screenshot, the user can now type corrections (e.g., "step 9 formula should reference Sheet1.B2") and the VLM will regenerate with the feedback. The loop continues until the user presses Enter to accept.
Also refactors _generate_steps into smaller functions:
- _build_setup_desc(): extracts setup description from task config
- _vlm_call(): shared OpenAI API call helper
- _refine_steps(): sends feedback + screenshot for revised steps
- _display_steps(): pretty-prints step box
- _interactive_step_review(): correction loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
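
The correction loop described in this commit might look roughly like the following. The helper names are written without the leading underscore to signal that they are illustrative stand-ins, not the PR's functions; the numbered-list format and the `refine` callback (standing in for the VLM call) are assumptions.

```python
import re
from typing import Callable, List


def parse_step_list(text: str) -> List[str]:
    """Extract steps from lines such as '1. Click cell A1' or '2) Type 5'."""
    steps = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            steps.append(match.group(1))
    return steps


def format_step_list(steps: List[str]) -> str:
    """Render steps back into a 1-based numbered list."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


def review_steps(
    steps: List[str],
    refine: Callable[[List[str], str], List[str]],
    read_line: Callable[[str], str] = input,
) -> List[str]:
    """Ask for feedback until the user presses Enter on an empty line."""
    while True:
        feedback = read_line("Feedback (Enter to accept): ").strip()
        if not feedback:
            return steps
        # In the real workflow this is where the VLM would be re-queried.
        steps = refine(steps, feedback)
```

Injecting `read_line` and `refine` lets the loop be exercised in tests without a terminal or an API key.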

Move the tasks-type guard above resolve_vm_ip() call so that input validation happens before any real work. Fixes CI failure where resolve_vm_ip raises RuntimeError in environments without Azure access.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
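
A guard of the kind this commit describes could be sketched as follows. The function name `normalize_tasks`, the comma-separated string form, and the error wording are assumptions; only the "Fire passes True for a bare --tasks" case comes from the PR description.

```python
from typing import List


def normalize_tasks(tasks: object) -> List[str]:
    """Validate the --tasks value before any expensive work is attempted."""
    # A bare `--tasks` with no value arrives as the boolean True.
    if isinstance(tasks, bool):
        raise ValueError("--tasks requires a value, e.g. --tasks=04d9aeaf")
    if isinstance(tasks, str):
        return [t.strip() for t in tasks.split(",") if t.strip()]
    if isinstance(tasks, (list, tuple)):
        return [str(t) for t in tasks]
    raise ValueError(f"Unsupported --tasks value: {tasks!r}")
```

Calling a check like this first means a malformed command line fails with a clear message instead of a RuntimeError from VM IP resolution.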

…o function
- Move _compare_screenshots and _wait_for_stable_screen from scripts/record_waa_demos.py into openadapt_evals/infrastructure/screen_stability.py as public functions (compare_screenshots, wait_for_stable_screen)
- Script wrappers delegate to the new module, preserving all call sites
- Update tests/test_screen_stability.py to import from the module directly, removing the fragile importlib.util.spec_from_file_location hack
- Extract per-task recording loop from cmd_record_waa() into _record_single_task() for readability and testability
- Fix pre-existing bug: len(steps) -> len(steps_meta) in completion message
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
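
The stability check that this commit moves into its own module can be illustrated with a dependency-free sketch. It is not the module's code: the PR describes a numpy-vectorized comparison, whereas this version compares raw byte strings with the standard library, and the names `compare_frames` and `wait_for_stable`, the polling interval, and the timeout behaviour are all assumptions. The 99.5% threshold and the three consecutive checks come from the PR description.

```python
import time
from typing import Callable


def compare_frames(a: bytes, b: bytes) -> float:
    """Return the fraction of identical bytes in two raw frames."""
    if len(a) != len(b):
        return 0.0  # Different sizes are treated as completely different.
    if not a:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / len(a)


def wait_for_stable(
    capture: Callable[[], bytes],
    threshold: float = 0.995,
    required_matches: int = 3,
    poll_interval: float = 0.5,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll until `required_matches` consecutive comparisons meet `threshold`."""
    deadline = time.monotonic() + timeout
    previous = capture()
    matches = 0
    while time.monotonic() < deadline:
        sleep(poll_interval)
        current = capture()
        if compare_frames(previous, current) >= threshold:
            matches += 1
            if matches >= required_matches:
                return current
        else:
            matches = 0  # Any change restarts the count.
        previous = current
    # Assumed fallback: give up after the timeout and return the last frame.
    return previous
```

Compared with a fixed sleep, polling returns as soon as the screen settles and keeps waiting when it has not.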

…oyment
When the WAA server is not reachable, the script now:
- With --auto: starts VM, establishes SSH tunnels, starts Docker container and socat proxy, then waits for WAA to boot. Confirms with user before starting VM (cost warning). Auto-deallocates VM on exit/signal.
- Without --auto: prints actionable help message showing --auto and granular flags (--auto-vm, --auto-tunnel, --auto-container).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
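
One way to read the relationship between --auto and the granular flags is sketched below. It assumes that --auto simply switches on all three granular behaviours, which the PR text suggests but does not state outright; `AutoPlan` and `plan_auto_steps` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoPlan:
    """Which recovery stages the script is allowed to perform on its own."""
    vm: bool
    tunnel: bool
    container: bool


def plan_auto_steps(
    auto: bool = False,
    auto_vm: bool = False,
    auto_tunnel: bool = False,
    auto_container: bool = False,
) -> AutoPlan:
    """Combine --auto with the granular flags (assumed: --auto implies all)."""
    return AutoPlan(
        vm=auto or auto_vm,
        tunnel=auto or auto_tunnel,
        container=auto or auto_container,
    )
```

With this reading, `--auto-tunnel` alone would re-establish tunnels without ever starting a VM, which avoids the cost prompt.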

New script converts WAA recordings (meta.json + screenshots) to demo text files for eval-suite, with two modes:
- text: instant, free, uses step descriptions from meta.json
- vlm: richer, sends screenshots to VLM for Observation/Intent/Result
Generated both text-only and VLM-enriched demos for task 04d9aeaf (LibreOffice Calc annual changes). No VM or openadapt-ml needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Step 15: VLM described after-state instead of before-state, and referenced C3 instead of C2.
- Step 17: VLM hallucinated "CLICK cell D3"; should be D2 (first data row for OA changes formula).
- Step 18: Cascading fix from step 17.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…, 17-18)" This reverts commit 27b14bb.

- Remove unused _compare_screenshots wrapper in record_waa_demos.py
- Use f.get('path', '?') instead of f['path'] in _build_setup_desc
- Ensure demo .txt files end with trailing newline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The VLM (gpt-4.1-mini) was hallucinating cell references and other details that contradicted the recorded actions from meta.json (e.g., "D3" instead of "D2"). Three improvements to the converter pipeline:
1. Strengthen the VLM prompt to label the recorded action as "GROUND-TRUTH" and explicitly instruct the model not to substitute different cell refs, values, or formulas based on visual interpretation.
2. Add post-hoc validation that extracts cell references, formulas, and quoted text from both the ground-truth step and the VLM's Action field. On mismatch, the Action field is replaced with the ground-truth description while preserving the VLM's Observation/Intent/Result.
3. Upgrade default model from gpt-4.1-mini to gpt-4.1 and lower temperature from 0.1 to 0.0 for more deterministic output. The --model flag allows overriding back to gpt-4.1-mini if cost is a concern.
Regenerated demo for 04d9aeaf with the fixed pipeline: previously hallucinated cell references (steps 15, 17, 18) are now correct.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
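
The post-hoc validation in this commit can be illustrated with a reduced sketch that checks cell references only; the commit also compares formulas and quoted text, which are omitted here. The regex and the names `extract_cell_refs` and `validate_action` are assumptions, not the converter's actual code.

```python
import re
from typing import Set

# Assumed pattern: 1-3 uppercase column letters followed by a row number.
CELL_REF = re.compile(r"\b[A-Z]{1,3}[0-9]{1,7}\b")


def extract_cell_refs(text: str) -> Set[str]:
    """Collect spreadsheet cell references such as 'D2' from a description."""
    return set(CELL_REF.findall(text))


def validate_action(ground_truth: str, vlm_action: str) -> str:
    """Keep the VLM's Action only if its cell refs agree with the recording."""
    if extract_cell_refs(ground_truth) != extract_cell_refs(vlm_action):
        # On mismatch the recorded action is trusted over the VLM's wording.
        return ground_truth
    return vlm_action
```

For the step 17 case described earlier, a VLM action mentioning D3 would be replaced by the recorded "CLICK cell D2" text, while the Observation/Intent/Result fields would be left alone.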

abrichr and others added 10 commits March 1, 2026 13:49

Revert "fix: correct VLM annotation errors in 04d9aeaf demo (steps 15…

b9bff88


abrichr mentioned this pull request Mar 2, 2026

feat: add consilium integration, autossh, checkpoint/resume, and auto-recovery #58

Merged

5 tasks

abrichr and others added 2 commits March 1, 2026 23:35

abrichr merged commit 3463f9d into main Mar 2, 2026
1 check passed

abrichr changed the title ~~feat: add VM IP auto-detection and screen stability detection~~ feat: add interactive recording workflow with auto-infrastructure and VM IP detection Mar 2, 2026

abrichr mentioned this pull request Mar 2, 2026

fix: use systemd-first pattern for socat proxy in record_waa_demos #61

Closed

2 tasks

abrichr merged 12 commits into main from feat/vm-ip-autodetect-screen-stability
