feat: add consilium integration, autossh, checkpoint/resume, and auto-recovery #58
Merged
The previous screenshot showed only the Calc window. The new one shows the full context: macOS Chrome browser with noVNC tab, Windows 11 desktop inside QEMU, LibreOffice Calc welcome dialog, and Windows taskbar. This better demonstrates the VM evaluation infrastructure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resolve_vm_ip() with layered resolution: explicit arg → pool registry (fast, local) → Azure CLI query (always accurate, ~3s)
- Remove hardcoded 172.173.66.131 defaults from record_waa_demos.py and run_dc_eval.py; --vm-ip is now auto-detected if omitted
- Add _wait_for_stable_screen() that polls QEMU framebuffer (free) until 3 consecutive screenshots match (99.5% similarity threshold), replacing the fixed time.sleep(3) that caused stale screenshots
- Add _compare_screenshots() with numpy-vectorized pixel comparison
- 24 new tests (14 for VM IP, 10 for screen stability)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
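The screen-stability polling described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the function names echo the commit message, but the signatures, the `capture` callable, the poll interval, the timeout, and the return-last-frame-on-timeout behavior are all hypothetical. The commit says the real comparison is numpy-vectorized; plain Python byte comparison is swapped in here to keep the sketch dependency-free.

```python
import time
from typing import Callable


def compare_screenshots(a: bytes, b: bytes) -> float:
    """Return the fraction of matching bytes between two raw frames."""
    # Assumption: frames are raw byte buffers of equal size.
    if len(a) != len(b) or not a:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / len(a)


def wait_for_stable_screen(
    capture: Callable[[], bytes],
    required_matches: int = 3,
    threshold: float = 0.995,
    poll_interval: float = 0.5,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll until `required_matches` consecutive frames are near-identical."""
    deadline = time.monotonic() + timeout
    previous = capture()
    streak = 1  # number of consecutive frames that look the same
    while streak < required_matches and time.monotonic() < deadline:
        sleep(poll_interval)
        current = capture()
        if compare_screenshots(previous, current) >= threshold:
            streak += 1
        else:
            streak = 1  # the screen changed, so start counting again
        previous = current
    # Hypothetical choice: on timeout, return the most recent frame anyway.
    return previous
```

Compared with a fixed `time.sleep(3)`, a loop like this returns as soon as the screen settles and keeps waiting when it has not.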
When the user presses 'R' to restart a task, the QEMU hard reset produces a new stable screenshot, but the suggested steps were not regenerated. The stale steps from the previous screenshot were displayed. Now _generate_steps() is called again with the fresh screenshot after every restart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After generating suggested steps from the screenshot, the user can now type corrections (e.g., "step 9 formula should reference Sheet1.B2") and the VLM will regenerate with the feedback. Loop continues until the user presses Enter to accept.

Also refactors _generate_steps into smaller functions:
- _build_setup_desc(): extracts setup description from task config
- _vlm_call(): shared OpenAI API call helper
- _refine_steps(): sends feedback + screenshot for revised steps
- _display_steps(): pretty-prints step box
- _interactive_step_review(): correction loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
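A correction loop of this shape might look like the sketch below. The function name mirrors `_interactive_step_review()` from the commit message, but the signature and the injected `refine`, `read_line`, and `show` callables are assumptions made so the loop can be exercised without a VLM or a terminal.

```python
from typing import Callable, List


def interactive_step_review(
    steps: List[str],
    refine: Callable[[List[str], str], List[str]],
    read_line: Callable[[str], str] = input,
    show: Callable[[List[str]], None] = lambda s: print("\n".join(s)),
) -> List[str]:
    """Show the steps and refine them until the user presses Enter."""
    while True:
        show(steps)
        feedback = read_line("Correction (Enter to accept): ").strip()
        if not feedback:
            return steps  # empty input means the user accepts the steps
        # Hypothetical: `refine` stands in for the VLM call that takes feedback.
        steps = refine(steps, feedback)
```

Passing the I/O functions as parameters is a design choice for this sketch only; it makes the loop testable with scripted answers.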
Move the tasks-type guard above resolve_vm_ip() call so that input validation happens before any real work. Fixes CI failure where resolve_vm_ip raises RuntimeError in environments without Azure access. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
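The ordering fix can be illustrated with a small sketch. Everything here is hypothetical: the commit does not show what the guard checks, so a simple `isinstance` test stands in for it, and `resolve_vm_ip` is reduced to an explicit value followed by a list of injected resolver callables (standing in for the pool registry and Azure CLI lookups).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def resolve_vm_ip(
    explicit: Optional[str],
    resolvers: Sequence[Callable[[], Optional[str]]],
) -> str:
    """Return the explicit IP if given, else the first resolver that answers."""
    if explicit:
        return explicit
    for resolver in resolvers:
        ip = resolver()
        if ip:
            return ip
    raise RuntimeError("could not resolve VM IP")


def run(
    tasks: object,
    vm_ip: Optional[str] = None,
    resolvers: Sequence[Callable[[], Optional[str]]] = (),
) -> Tuple[str, List[str]]:
    # Assumed guard: cheap input validation runs first, before any lookup
    # that might need network or cloud access.
    if not isinstance(tasks, list):
        raise TypeError("tasks must be a list")
    ip = resolve_vm_ip(vm_ip, resolvers)
    return ip, tasks
```

With the guard first, a bad argument fails with a validation error even in an environment where no resolver can succeed, which matches the CI scenario the commit describes.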
…-recovery

- Integrate consilium multi-model council for step generation (_vlm_call) with graceful fallback to single-model (gpt-4.1-mini) on failure
- Add efficiency-focused step generation with human/agent target modes
- Fix prompt framing in _refine_steps (remove sycophantic "user says wrong")
- Add grounded reasoning (describe screenshot before listing steps)
- Add checkpoint/resume: save recording state after every step to survive tunnel drops or crashes, with interactive resume on reconnection
- Add --auto/--auto-vm/--auto-tunnel/--auto-container flags for automatic infrastructure recovery (VM start, SSH tunnels, Docker container, socat)
- Prefer autossh over plain ssh for tunnel auto-reconnection
- Add bcdedit recoveryenabled=No to Dockerfile FirstLogonCommands to prevent Windows Automatic Repair loops after dirty shutdown
- Add retry (3x) for task config fetch to handle transient connection aborts
- Add resilience-options.md documenting infrastructure recovery strategies
- Add test_vlm_call.py with 10 tests covering image passing, checkpoint roundtrip, prompt construction, and fallback model validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
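As an illustration of the checkpoint/resume item, a save-after-every-step roundtrip could be sketched like this. The function names, the JSON file format, and the state fields are assumptions; the PR does not show how checkpoints are actually stored.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

Calling `save_checkpoint` after each recorded step means a tunnel drop loses at most the step in progress, and `load_checkpoint` returning `None` gives the caller a simple signal to start fresh.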
The evaluate server (localhost:5050) goes through a socat bridge that can become stale after container/VM restarts. Pre-fetching all task configs before the QEMU reset ensures human-readable instructions are cached in memory even if the bridge dies later. Falls back to live fetch with retry on cache miss. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
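The prefetch-then-fallback behavior can be sketched as below. `TaskConfigCache`, `fetch_with_retry`, and the injected `fetch` callable are hypothetical names for illustration; the real script's fetch function and error types are not shown in this PR page, so `ConnectionError` is an assumption.

```python
import time
from typing import Any, Callable, Dict, Iterable


def fetch_with_retry(fetch, task_id, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(task_id), retrying on connection errors."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(task_id)
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries, let the caller decide
            sleep(delay)


class TaskConfigCache:
    """Keep task configs in memory so a dead bridge does not lose them."""

    def __init__(self, fetch: Callable[[str], Any], sleep=time.sleep):
        self._fetch = fetch
        self._sleep = sleep
        self._cache: Dict[str, Any] = {}

    def prefetch(self, task_ids: Iterable[str]) -> None:
        # Intended to run before the QEMU reset, while the bridge still works.
        for task_id in task_ids:
            try:
                self._cache[task_id] = fetch_with_retry(
                    self._fetch, task_id, sleep=self._sleep
                )
            except ConnectionError:
                continue  # leave a cache miss; get() will try again later

    def get(self, task_id: str) -> Any:
        if task_id not in self._cache:
            # Cache miss: fall back to a live fetch with retry.
            self._cache[task_id] = fetch_with_retry(
                self._fetch, task_id, sleep=self._sleep
            )
        return self._cache[task_id]
```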
Picks up consilium e3619ad which migrates from deprecated google-generativeai to google-genai SDK, eliminating the FutureWarning about the deprecated package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the model's planned steps diverge from the actual UI (e.g. a menu doesn't have the expected option), the user can press 's' to take a fresh screenshot and regenerate all remaining steps from the current screen state — no need to describe what's wrong. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show the next step and prompt user to verify VNC matches expected state before resuming. Default changed to No since fresh start is the safe choice — resume is only valid after tunnel drops, not VM reboots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New key mapping:
  Enter = step done
  d = task done early
  u = undo last step (was 'r', renamed for clarity)
  r = restart task (soft: close apps, re-setup, regenerate steps)
  R = restart task (hard: QEMU reboot)
  s = refresh remaining steps from current screenshot
  text = feedback to correct remaining steps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hard reset at startup was destroying the VM state that checkpoints depend on. Now the script checks for checkpoints BEFORE the reset. If the user wants to resume, the reset is skipped entirely. If not, stale checkpoints are cleaned up automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corrected remaining steps now show as "Step 4 of 10", "Step 5 of 10" etc. instead of restarting from 1. Uses the existing start_num parameter of _format_step_list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
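The numbering fix amounts to passing an offset when formatting the remaining steps. The sketch below is hypothetical: the commit only confirms that `_format_step_list` has a `start_num` parameter, so the `total` parameter and the exact output format are assumptions.

```python
from typing import List, Optional


def format_step_list(
    steps: List[str], start_num: int = 1, total: Optional[int] = None
) -> str:
    """Number steps from start_num so corrected steps keep their position."""
    if total is None:
        # Assumption: without an explicit total, count from the offset.
        total = start_num - 1 + len(steps)
    lines = [
        f"Step {number} of {total}: {text}"
        for number, text in enumerate(steps, start=start_num)
    ]
    return "\n".join(lines)
```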
New prompt layout with clearer descriptions:
  [Enter] next step
  [x] retry step
  [u] undo prev step
  [d] task complete
  [s] refresh steps from screenshot
  [r] restart task
  [R] restart task (reboot VM)
  Or type correction:

[x] retry step: discards the current attempt, takes a fresh before screenshot, and re-displays the same step. Useful when you messed up the action and want to try again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
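One way to dispatch this prompt is a case-sensitive lookup table with free text as the fallback. The `Action` enum, `KEYMAP`, and `parse_command` are hypothetical names for this sketch; only the keys and their meanings come from the commit messages.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    NEXT = "next step"
    RETRY = "retry step"
    UNDO = "undo prev step"
    DONE = "task complete"
    REFRESH = "refresh steps from screenshot"
    SOFT_RESTART = "restart task"
    HARD_RESTART = "restart task (reboot VM)"
    FEEDBACK = "type correction"


# Keys are case-sensitive on purpose: 'r' and 'R' mean different restarts.
KEYMAP = {
    "": Action.NEXT,
    "x": Action.RETRY,
    "u": Action.UNDO,
    "d": Action.DONE,
    "s": Action.REFRESH,
    "r": Action.SOFT_RESTART,
    "R": Action.HARD_RESTART,
}


def parse_command(raw: str) -> Tuple[Action, str]:
    """Map one line of user input to an action plus optional feedback text."""
    text = raw.strip()
    action = KEYMAP.get(text)
    if action is not None:
        return action, ""
    # Anything that is not a known key is treated as a typed correction.
    return Action.FEEDBACK, text
```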
- Remove "draft then review" instructions that caused models to output both draft and final step lists. Now requests only the final numbered steps with no commentary.
- Add 5s delay after _setup_task_env() in soft restart so the task app has time to open before screen stability check begins.
- Increase close_all delay from 2s to 3s for reliability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two-pass LLM analysis pipeline:
- Pass 1 (holistic): sends full task context + sampled screenshots to identify problematic steps
- Pass 2 (per-step): deep-dives each flagged step with before/after screenshots + surrounding context

Interactive review with accept/reject/edit per correction. Saves meta_refined.json + refinement_log.json alongside original meta.json. Supports --auto (non-interactive), --dry-run, --all, --model, and --no-council flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _vlm_call() was only passing the last text block to consilium, losing the system prompt (with JSON constraint) and all step text. Now concatenates system prompt + all text blocks into a single prompt. This fixes the holistic review returning prose instead of JSON. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
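The fix described here is essentially a change in how the prompt is flattened. A minimal sketch follows, assuming OpenAI-style content blocks (dicts with a `type` key, and a `text` key on text blocks); the helper name `build_council_prompt` is hypothetical and consilium's actual call signature is not shown.

```python
from typing import Any, Dict, List


def build_council_prompt(system_prompt: str, content: List[Dict[str, Any]]) -> str:
    """Join the system prompt and every text block into one prompt string."""
    parts = [system_prompt] if system_prompt else []
    # Keep all text blocks in order; skip images and other block types.
    parts.extend(
        block["text"] for block in content if block.get("type") == "text"
    )
    return "\n\n".join(parts)
```

Taking only `content[-1]` would reproduce the bug: the JSON constraint in the system prompt and the earlier step text never reach the model.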
- Replace naive fence-stripping with _extract_json() that handles: preamble text before JSON, ```json fences, trailing commentary, and bare JSON arrays/objects embedded in prose.
- Add openadapt-ml as uv source (path = "../openadapt-ml") so `uv sync` can resolve it for the annotation command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
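One way to get this tolerance is to scan for the first position where a JSON value parses, using the standard library's `json.JSONDecoder.raw_decode`. This is a sketch of that approach, not the PR's `_extract_json()`, whose implementation is not shown; the `ValueError` on failure is an assumption.

```python
import json
from typing import Any


def extract_json(text: str) -> Any:
    """Return the first JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one value starting at `index` and ignores
            # whatever follows it (closing fences, trailing commentary).
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # not valid JSON here, try the next candidate
        return value
    raise ValueError("no JSON object or array found")
```

Because the scan starts at each opening bracket rather than stripping fences first, preamble text, fenced blocks, and bare arrays inside prose are all handled by the same loop.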
The annotate command imports prompt templates, data classes, and VLM provider wrappers from openadapt-ml. Added as dependency with local path source in [tool.uv.sources]. TODO: migrate annotation code into openadapt-evals to eliminate this cross-repo dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the recording script starts a VM via --auto-vm, it now registers atexit and signal handlers to clean up on exit:
- Normal exit: prompts user to deallocate (default Y)
- SIGINT/SIGTERM: auto-deallocates to prevent billing from orphaned VMs
- Only triggers if the script itself started the VM (not pre-running)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
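The three rules above can be sketched with the standard `atexit` and `signal` modules. The `VmCleanup` class, its method names, and the injected `deallocate` and `confirm` callables are hypothetical; the PR's actual handler code and Azure calls are not shown.

```python
import atexit
import signal
import sys
from typing import Callable


class VmCleanup:
    """Deallocate the VM on exit, but only if this script started it."""

    def __init__(self, deallocate: Callable[[], None],
                 confirm: Callable[[], bool] = lambda: True):
        self._deallocate = deallocate
        self._confirm = confirm  # hypothetical prompt, defaulting to yes
        self._started_by_us = False
        self._done = False

    def mark_started(self) -> None:
        self._started_by_us = True

    def run(self, prompt: bool) -> None:
        if not self._started_by_us or self._done:
            return  # pre-running VM, or cleanup already happened
        if prompt and not self._confirm():
            return  # user chose to keep the VM running
        self._done = True
        self._deallocate()

    def install(self) -> None:
        # Normal exit: ask first. Signals: deallocate without asking.
        atexit.register(self.run, True)
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        self.run(prompt=False)
        sys.exit(128 + signum)
```

The `_done` flag matters because `sys.exit` in the signal handler still triggers the `atexit` hook; without it the VM would be deallocated, or the user prompted, a second time.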
Closing in favor of PR #57
PR #57 (
Features already covered by #57
Unique #58 features to carry forward in future PRs
Why closing
Branch preserved
The
Reopening: the plan is to merge #57 first, then rebase this PR onto main to carry forward the unique features (consilium, refine_demo.py, autossh, bcdedit, prompt improvements).
CI was failing because uv.sources references local paths (../openadapt-ml) that don't exist in CI. Use --no-sources flag to fall back to PyPI versions. Also bump requires-python to >=3.11 since consilium 0.3.0 on PyPI requires it, and fix consilium git URL to the renamed OpenAdaptAI/openadapt-consilium repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h-checkpoint

# Conflicts:
#   .beads/issues.jsonl
#   openadapt_evals/infrastructure/__init__.py
#   scripts/record_waa_demos.py
#   tests/test_screen_stability.py
abrichr added a commit that referenced this pull request Mar 3, 2026
Add coverage for RL training environment, end-to-end eval pipeline, annotation pipeline, 4-layer probe diagnostics, demo recording persistence, review artifacts, coordinate clamping, and multi-cloud VMProvider protocol. Update architecture tree with new modules (rl_env.py, probe.py, annotation.py, vlm.py, vm_provider.py, evaluation/) and scripts directory. Add openadapt-consilium to related projects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
--auto): automatically start VM, establish SSH tunnels, start Docker container

Changes
- scripts/record_waa_demos.py
- openadapt_evals/waa_deploy/Dockerfile: `bcdedit /set {default} recoveryenabled No` in FirstLogonCommands
- pyproject.toml + uv.lock
- tests/test_vlm_call.py
- docs/resilience-options.md

Test plan
- `uv run pytest tests/test_vlm_call.py -v`: 10/10 pass
- `uv run pytest tests/ --ignore=tests/test_api_agent_ml.py --ignore=tests/test_council.py`: 488 pass, 7 pre-existing failures (missing demo files)
- `uv sync && uv run python scripts/record_waa_demos.py record-waa` with consilium
- `--auto` flag with deallocated VM

🤖 Generated with Claude Code