feat: add interactive recording workflow with auto-infrastructure and VM IP detection #57
Merged
The previous screenshot showed only the Calc window. The new one shows the full context: macOS Chrome browser with noVNC tab, Windows 11 desktop inside QEMU, LibreOffice Calc welcome dialog, and Windows taskbar. This better demonstrates the VM evaluation infrastructure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resolve_vm_ip() with layered resolution: explicit arg → pool registry (fast, local) → Azure CLI query (always accurate, ~3s)
- Remove hardcoded 172.173.66.131 defaults from record_waa_demos.py and run_dc_eval.py; --vm-ip is now auto-detected if omitted
- Add _wait_for_stable_screen() that polls the QEMU framebuffer (free) until 3 consecutive screenshots match (99.5% similarity threshold), replacing the fixed time.sleep(3) that caused stale screenshots
- Add _compare_screenshots() with numpy-vectorized pixel comparison
- 24 new tests (14 for VM IP, 10 for screen stability)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
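The layered resolution described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in vm_ip.py: the function name `resolve_ip_layered`, the `Resolver` type, and the two stand-in lookups are hypothetical, and the address is a placeholder.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: each resolver returns an IP string, or None if it has no answer.
Resolver = Callable[[], Optional[str]]


def resolve_ip_layered(explicit: Optional[str], fallbacks: Sequence[Resolver]) -> str:
    """Return the explicit IP if given, else the first non-empty fallback result."""
    if explicit:
        return explicit
    for lookup in fallbacks:
        try:
            ip = lookup()
        except Exception:
            # A failing layer (e.g. an unreadable registry) should not block later layers.
            continue
        if ip:
            return ip
    raise RuntimeError("could not resolve VM IP; pass --vm-ip explicitly")


# Stand-ins for the pool registry and Azure CLI lookups (assumptions for illustration).
def from_registry() -> Optional[str]:
    return None  # pretend the local registry has no entry


def from_azure() -> Optional[str]:
    return "10.0.0.4"  # placeholder address, not a real VM


resolved = resolve_ip_layered(None, [from_registry, from_azure])
```

Ordering the fallbacks from cheapest to most expensive means the slower query only runs when the local lookup has nothing to offer.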
When the user presses 'R' to restart a task, the QEMU hard reset produces a new stable screenshot, but the suggested steps were not regenerated. The stale steps from the previous screenshot were displayed. Now _generate_steps() is called again with the fresh screenshot after every restart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After generating suggested steps from the screenshot, the user can now type corrections (e.g., "step 9 formula should reference Sheet1.B2") and the VLM will regenerate with the feedback. The loop continues until the user presses Enter to accept.

Also refactors _generate_steps into smaller functions:
- _build_setup_desc(): extracts setup description from task config
- _vlm_call(): shared OpenAI API call helper
- _refine_steps(): sends feedback + screenshot for revised steps
- _display_steps(): pretty-prints step box
- _interactive_step_review(): correction loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
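As a rough sketch of the correction loop described here (not the actual _interactive_step_review()), the control flow might look like the following. The names `review_steps`, `refine`, and `read_line` are hypothetical; the real code would call the VLM with the feedback and screenshot where `refine` is invoked.

```python
from typing import Callable, List


def review_steps(
    steps: List[str],
    refine: Callable[[List[str], str], List[str]],
    read_line: Callable[[str], str] = input,
) -> List[str]:
    """Loop until the reviewer accepts the steps by entering an empty line."""
    while True:
        for i, step in enumerate(steps, start=1):
            print(f"{i}. {step}")
        feedback = read_line("Correction (Enter to accept): ").strip()
        if not feedback:
            return steps
        # Assumed hook: the VLM call with feedback (and screenshot) would go here.
        steps = refine(steps, feedback)


# Scripted run: one correction, then Enter to accept.
def fake_refine(steps: List[str], feedback: str) -> List[str]:
    return steps + [f"(revised per: {feedback})"]


answers = iter(["step 2 should reference Sheet1.B2", ""])
final = review_steps(["Open Calc", "Enter formula"], fake_refine, lambda prompt: next(answers))
```

Passing the input reader as a parameter keeps the loop testable without a terminal.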
Move the tasks-type guard above resolve_vm_ip() call so that input validation happens before any real work. Fixes CI failure where resolve_vm_ip raises RuntimeError in environments without Azure access. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
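The ordering fix can be illustrated with a small sketch. It assumes that a bare `--tasks` flag arrives as the boolean `True` and that task IDs are comma-separated; `parse_tasks`, `start_recording`, and `fake_resolver` are hypothetical names, not the script's functions.

```python
from typing import Callable, List, Union


def parse_tasks(tasks: Union[str, bool, None]) -> List[str]:
    """Reject a bare flag (parsed as True) or a missing value; split the rest on commas."""
    if tasks is None or isinstance(tasks, bool):
        raise ValueError("--tasks requires a value, e.g. --tasks=04d9aeaf")
    ids = [t.strip() for t in str(tasks).split(",") if t.strip()]
    if not ids:
        raise ValueError("--tasks must name at least one task")
    return ids


def start_recording(tasks, resolve_ip: Callable[[], str]) -> dict:
    task_ids = parse_tasks(tasks)  # cheap input validation first
    vm_ip = resolve_ip()           # only then the lookup that may need Azure access
    return {"tasks": task_ids, "vm_ip": vm_ip}


calls = []


def fake_resolver() -> str:
    calls.append("resolved")
    return "10.0.0.4"  # placeholder address


try:
    start_recording(True, fake_resolver)
except ValueError as exc:
    error = str(exc)
```

With this ordering, invalid input fails with a clear message and the resolver is never reached, which is what lets the check pass in environments without Azure access.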
…o function

- Move _compare_screenshots and _wait_for_stable_screen from scripts/record_waa_demos.py into openadapt_evals/infrastructure/screen_stability.py as public functions (compare_screenshots, wait_for_stable_screen)
- Script wrappers delegate to the new module, preserving all call sites
- Update tests/test_screen_stability.py to import from the module directly, removing the fragile importlib.util.spec_from_file_location hack
- Extract per-task recording loop from cmd_record_waa() into _record_single_task() for readability and testability
- Fix pre-existing bug: len(steps) -> len(steps_meta) in completion message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
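For readers unfamiliar with the stability check, here is a dependency-free sketch of the idea. It is not the module's implementation: it compares raw bytes in plain Python rather than using the numpy-vectorized pixel comparison, and the names `similarity` and `wait_until_stable`, the polling interval, the timeout, and the return-last-frame-on-timeout behavior are all assumptions. Only the 99.5% threshold and the 3 consecutive matches come from the commits above.

```python
import time
from typing import Callable


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of equal bytes between two same-sized raw frames (0.0 if sizes differ)."""
    if len(a) != len(b) or not a:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / len(a)


def wait_until_stable(
    capture: Callable[[], bytes],
    threshold: float = 0.995,
    required: int = 3,
    interval: float = 0.5,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll capture() until `required` consecutive frames match; return the last frame."""
    deadline = time.monotonic() + timeout
    previous = capture()
    streak = 1  # the first frame counts as one screenshot in the run
    while streak < required and time.monotonic() < deadline:
        sleep(interval)
        current = capture()
        # Extend the run on a match, otherwise start over from the new frame.
        streak = streak + 1 if similarity(previous, current) >= threshold else 1
        previous = current
    return previous


# Scripted frames: the screen changes twice, then settles.
frames = iter([b"\x00" * 200, b"\x01" * 200, b"\x02" * 200, b"\x02" * 200, b"\x02" * 200])
stable = wait_until_stable(lambda: next(frames), sleep=lambda _: None)
```

Unlike a fixed sleep, the loop returns as soon as the screen settles and keeps waiting while it is still changing.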
…oyment

When the WAA server is not reachable, the script now:
- With --auto: starts VM, establishes SSH tunnels, starts Docker container and socat proxy, then waits for WAA to boot. Confirms with user before starting VM (cost warning). Auto-deallocates VM on exit/signal.
- Without --auto: prints actionable help message showing --auto and granular flags (--auto-vm, --auto-tunnel, --auto-container).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New script converts WAA recordings (meta.json + screenshots) to demo text files for eval-suite, with two modes:
- text: instant, free, uses step descriptions from meta.json
- vlm: richer, sends screenshots to VLM for Observation/Intent/Result

Generated both text-only and VLM-enriched demos for task 04d9aeaf (LibreOffice Calc annual changes). No VM or openadapt-ml needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 15: VLM described after-state instead of before-state, and referenced C3 instead of C2. Step 17: VLM hallucinated "CLICK cell D3" — should be D2 (first data row for OA changes formula). Step 18: Cascading fix from step 17. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, 17-18)" This reverts commit 27b14bb.
- Remove unused _compare_screenshots wrapper in record_waa_demos.py
- Use f.get('path', '?') instead of f['path'] in _build_setup_desc
- Ensure demo .txt files end with trailing newline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The VLM (gpt-4.1-mini) was hallucinating cell references and other details that contradicted the recorded actions from meta.json (e.g., "D3" instead of "D2"). Three improvements to the converter pipeline:

1. Strengthen the VLM prompt to label the recorded action as "GROUND-TRUTH" and explicitly instruct the model not to substitute different cell refs, values, or formulas based on visual interpretation.
2. Add post-hoc validation that extracts cell references, formulas, and quoted text from both the ground-truth step and the VLM's Action field. On mismatch, the Action field is replaced with the ground-truth description while preserving the VLM's Observation/Intent/Result.
3. Upgrade default model from gpt-4.1-mini to gpt-4.1 and lower temperature from 0.1 to 0.0 for more deterministic output. The --model flag allows overriding back to gpt-4.1-mini if cost is a concern.

Regenerated demo for 04d9aeaf with the fixed pipeline; previously hallucinated cell references (steps 15, 17, 18) are now correct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
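The post-hoc validation in item 2 might look something like the sketch below. It covers only cell references, not formulas or quoted text; the regex and the names `cell_refs` and `validate_action` are assumptions for illustration rather than the converter's actual code.

```python
import re
from typing import Set

# Assumed pattern: plain spreadsheet refs such as D2 or AB12 (also matches the B2 in Sheet1.B2).
CELL_REF = re.compile(r"\b[A-Z]{1,3}[0-9]+\b")


def cell_refs(text: str) -> Set[str]:
    """Collect the distinct cell references mentioned in a step description."""
    return set(CELL_REF.findall(text))


def validate_action(ground_truth: str, vlm_action: str) -> str:
    """Keep the VLM's action only if it cites the same cell refs as the recorded step."""
    if cell_refs(ground_truth) != cell_refs(vlm_action):
        return ground_truth  # mismatch: fall back to the recorded description
    return vlm_action


fixed = validate_action("CLICK cell D2", "CLICK cell D3 to start the formula")
kept = validate_action("CLICK cell D2", "Click D2, the first data row")
```

Here `fixed` falls back to the recorded step because D3 does not appear in it, while `kept` retains the VLM wording because both mention only D2. The Observation/Intent/Result fields would be left untouched in either case.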
Summary
Adds VM IP auto-detection, screen stability detection, interactive step correction during recording, and automatic infrastructure deployment (`--auto` flag) for the WAA demo recording workflow.

Commits

5373c19 fix: replace LibreOffice screenshot with full desktop view
- `screenshots/waa_libreoffice_desktop.png` showing the full macOS → Chrome → noVNC → Windows 11 → LibreOffice stack

6d0a3fb feat: add VM IP auto-detection and screen stability detection
- `openadapt_evals/infrastructure/vm_ip.py` with `resolve_vm_ip()`: layered fallback (explicit → pool registry → Azure CLI)
- `_compare_screenshots()` and `_wait_for_stable_screen()` for pixel-level screen stability detection (99.5% threshold, 3 consecutive checks)
- `run_dc_eval.py` and `record_waa_demos.py` use `resolve_vm_ip()` instead of hardcoded IPs
- `test_vm_ip.py` and `test_screen_stability.py`

44db6e6 fix: regenerate suggested steps after task restart

e577823 feat: add interactive step correction during recording
- `_refine_steps()`, `_refine_remaining_steps()`, `_interactive_step_review()`, `_interactive_remaining_review()`
- `_parse_step_list()`, `_format_step_list()`, `_display_steps()`, `_display_current_step()`
- `[Enter]` advance, `[d]` done, `[r]` redo, `[R]` restart, `[s]` refresh, `[x]` retry, `[u]` undo, or type feedback

73473df fix: validate task args before VM IP resolution
- `True` for `--tasks` when used without a value, before attempting VM IP resolution that would fail confusingly

26f3766 refactor: extract screen stability into module and recording loop into function
- `compare_screenshots` and `wait_for_stable_screen` into `openadapt_evals/infrastructure/screen_stability.py`
- `importlib` hack from `test_screen_stability.py`; tests now import directly
- `_record_single_task()` for readability
- `len(steps)` → `len(steps_meta)` in completion message

05b261c feat: add --auto flag for automatic infrastructure deployment
- `--auto` flag (and granular `--auto-vm`, `--auto-tunnel`, `--auto-container`) for `record-waa`

Files changed (11 files, +1706/-217)

- `openadapt_evals/infrastructure/vm_ip.py`
- `openadapt_evals/infrastructure/screen_stability.py`
- `tests/test_vm_ip.py`
- `tests/test_screen_stability.py`
- `screenshots/waa_libreoffice_desktop.png`
- `scripts/record_waa_demos.py`
- `scripts/run_dc_eval.py`: `resolve_vm_ip()`
- `openadapt_evals/infrastructure/__init__.py`
- `openadapt_evals/infrastructure/qemu_reset.py`
- `README.md`
- `.beads/issues.jsonl`

Test plan

- `pytest tests/test_vm_ip.py tests/test_screen_stability.py -v`: 24/24 pass
- `python scripts/record_waa_demos.py record-waa --auto --tasks=04d9aeaf`: verify auto-recovery flow

🤖 Generated with Claude Code