feat: add consilium integration, autossh, checkpoint/resume, and auto-recovery by abrichr · Pull Request #58 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-01T22:01:37Z

Summary

Consilium multi-model council for VLM step generation with graceful single-model fallback
Checkpoint/resume: recording state saved after every step; survives tunnel drops, VM reboots
Auto-recovery (--auto): automatically start VM, establish SSH tunnels, start Docker container
Autossh: prefer autossh for tunnel auto-reconnection (falls back to plain ssh)
bcdedit: disable Windows Automatic Repair in Dockerfile golden image
Prompt improvements: efficiency-focused step generation, grounded reasoning, fixed sycophantic framing
Task config retry: 3x retry on transient connection aborts
10 new tests: image passing, checkpoint roundtrip, prompt construction, fallback validation
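The autossh preference in the summary can be illustrated with a small helper. This is a minimal sketch, not the script's actual code: `build_tunnel_cmd`, the `azureuser` login, and the keepalive values are assumptions chosen for illustration. It only builds the argument list, so nothing is executed.

```python
import shutil


def build_tunnel_cmd(host, local_port, remote_port, user="azureuser",
                     which=shutil.which):
    """Return an argv list for a local port forward, preferring autossh.

    `user` and the keepalive values are illustrative assumptions.
    """
    ssh_opts = [
        "-N",  # forwarding only, no remote command
        "-o", "ServerAliveInterval=30",
        "-o", "ServerAliveCountMax=3",
        "-o", "ExitOnForwardFailure=yes",
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]
    if which("autossh"):
        # -M 0 turns off autossh's monitor port; the SSH keepalives
        # above are what detect a dead connection.
        return ["autossh", "-M", "0", *ssh_opts]
    # Plain ssh fallback: same forward, but no automatic reconnection.
    return ["ssh", *ssh_opts]
```

The resulting list can be passed to `subprocess.Popen`. Injecting `which` keeps the fallback path testable on machines where autossh is installed.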

Changes

| Area | What |
| --- | --- |
| `scripts/record_waa_demos.py` | Consilium integration, checkpoint/resume, auto-recovery flags, autossh tunnels, prompt improvements |
| `openadapt_evals/waa_deploy/Dockerfile` | `bcdedit /set {default} recoveryenabled No` in FirstLogonCommands |
| `pyproject.toml` + `uv.lock` | Added consilium dependency with git source |
| `tests/test_vlm_call.py` | 10 tests for VLM call chain and checkpoint roundtrip |
| `docs/resilience-options.md` | Infrastructure resilience strategy document |

Test plan

uv run pytest tests/test_vlm_call.py -v — 10/10 pass
uv run pytest tests/ --ignore=tests/test_api_agent_ml.py --ignore=tests/test_council.py — 488 pass, 7 pre-existing failures (missing demo files)
Manual: uv sync && uv run python scripts/record_waa_demos.py record-waa with consilium
Manual: --auto flag with deallocated VM
Rebuild Docker image to verify bcdedit FirstLogonCommand

🤖 Generated with Claude Code

The previous screenshot showed only the Calc window. The new one shows the full context: macOS Chrome browser with noVNC tab, Windows 11 desktop inside QEMU, LibreOffice Calc welcome dialog, and Windows taskbar. This better demonstrates the VM evaluation infrastructure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add resolve_vm_ip() with layered resolution: explicit arg → pool registry (fast, local) → Azure CLI query (always accurate, ~3s)
- Remove hardcoded 172.173.66.131 defaults from record_waa_demos.py and run_dc_eval.py; --vm-ip is now auto-detected if omitted
- Add _wait_for_stable_screen() that polls QEMU framebuffer (free) until 3 consecutive screenshots match (99.5% similarity threshold), replacing the fixed time.sleep(3) that caused stale screenshots
- Add _compare_screenshots() with numpy-vectorized pixel comparison
- 24 new tests (14 for VM IP, 10 for screen stability)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
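The polling idea behind _wait_for_stable_screen() can be sketched as follows. This is a dependency-free approximation under stated assumptions: the commit describes a numpy-vectorized pixel comparison, whereas this sketch compares raw frame bytes in plain Python, and `capture`, `interval`, and `timeout` are hypothetical names and values.

```python
import time


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of identical bytes between two equally sized raw frames."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def wait_for_stable_screen(capture, required=3, threshold=0.995,
                           interval=0.5, timeout=30.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll capture() until `required` consecutive frames match.

    Returns the last captured frame. On timeout the most recent frame
    is returned as a best effort rather than raising.
    """
    deadline = clock() + timeout
    prev = capture()
    streak = 1  # length of the current run of matching frames
    while streak < required and clock() < deadline:
        sleep(interval)
        cur = capture()
        # A mismatch restarts the run at the new frame.
        streak = streak + 1 if similarity(prev, cur) >= threshold else 1
        prev = cur
    return prev
```

Compared with a fixed `time.sleep(3)`, the loop returns as soon as the screen settles and keeps waiting while it is still changing.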

When the user presses 'R' to restart a task, the QEMU hard reset produces a new stable screenshot, but the suggested steps were not regenerated. The stale steps from the previous screenshot were displayed. Now _generate_steps() is called again with the fresh screenshot after every restart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After generating suggested steps from the screenshot, the user can now type corrections (e.g., "step 9 formula should reference Sheet1.B2") and the VLM will regenerate with the feedback. Loop continues until the user presses Enter to accept.

Also refactors _generate_steps into smaller functions:
- _build_setup_desc(): extracts setup description from task config
- _vlm_call(): shared OpenAI API call helper
- _refine_steps(): sends feedback + screenshot for revised steps
- _display_steps(): pretty-prints step box
- _interactive_step_review(): correction loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
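The correction loop described above might look roughly like this. It is a sketch only: `refine` stands in for the VLM call that revises steps from feedback, and the `ask`/`show` parameters are hypothetical hooks added so the loop can run without a terminal.

```python
def interactive_step_review(steps, refine, ask=input, show=print):
    """Show steps, apply typed corrections, and return once Enter is pressed."""
    while True:
        for i, step in enumerate(steps, start=1):
            show(f"{i}. {step}")
        feedback = ask("Correction (Enter to accept): ").strip()
        if not feedback:
            return steps  # empty input accepts the current steps
        # refine() is a placeholder for the model call that takes the
        # current steps plus the user's feedback and returns new steps.
        steps = refine(steps, feedback)
```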

Move the tasks-type guard above resolve_vm_ip() call so that input validation happens before any real work. Fixes CI failure where resolve_vm_ip raises RuntimeError in environments without Azure access. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
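A sketch of the ordering this commit describes, combined with the layered lookup from the earlier resolve_vm_ip() commit. The guard shown (a non-empty list check) and the `resolvers` parameter are assumptions for illustration; the real guard and resolver functions are not shown in this PR text.

```python
def resolve_vm_ip(explicit=None, resolvers=()):
    """Return the explicit IP if given, else the first resolver result.

    `resolvers` is an ordered sequence of zero-argument callables, e.g.
    a pool-registry lookup followed by an Azure CLI query (hypothetical).
    """
    if explicit:
        return explicit
    for resolver in resolvers:
        ip = resolver()
        if ip:
            return ip
    raise RuntimeError("could not resolve VM IP")


def run(tasks, vm_ip=None, resolvers=()):
    # Validate cheap, local inputs first, so bad arguments fail before
    # any registry or cloud lookup is attempted.
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("tasks must be a non-empty list")
    return resolve_vm_ip(vm_ip, resolvers)
```

With the guard first, a malformed `tasks` argument raises `ValueError` even in environments where every resolver would fail.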

…-recovery

- Integrate consilium multi-model council for step generation (_vlm_call) with graceful fallback to single-model (gpt-4.1-mini) on failure
- Add efficiency-focused step generation with human/agent target modes
- Fix prompt framing in _refine_steps (remove sycophantic "user says wrong")
- Add grounded reasoning (describe screenshot before listing steps)
- Add checkpoint/resume: save recording state after every step to survive tunnel drops or crashes, with interactive resume on reconnection
- Add --auto/--auto-vm/--auto-tunnel/--auto-container flags for automatic infrastructure recovery (VM start, SSH tunnels, Docker container, socat)
- Prefer autossh over plain ssh for tunnel auto-reconnection
- Add bcdedit recoveryenabled=No to Dockerfile FirstLogonCommands to prevent Windows Automatic Repair loops after dirty shutdown
- Add retry (3x) for task config fetch to handle transient connection aborts
- Add resilience-options.md documenting infrastructure recovery strategies
- Add test_vlm_call.py with 10 tests covering image passing, checkpoint roundtrip, prompt construction, and fallback model validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
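Saving state after every step is only safe if a crash mid-write cannot corrupt the checkpoint. One common way to get that property is write-then-rename, sketched below. The function names and the state fields are hypothetical; the actual checkpoint schema is not shown in this PR text.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write `state` as JSON so readers never see a half-written file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the
    # target; os.replace is atomic on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists."""
    path = Path(path)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

A recording loop would call `save_checkpoint` after each completed step and `load_checkpoint` at startup to decide whether to offer a resume.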

The evaluate server (localhost:5050) goes through a socat bridge that can become stale after container/VM restarts. Pre-fetching all task configs before the QEMU reset ensures human-readable instructions are cached in memory even if the bridge dies later. Falls back to live fetch with retry on cache miss. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
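The pre-fetch plus live-fetch-with-retry behavior can be sketched as a small cache. All names here are hypothetical, `fetch` stands in for whatever call retrieves a task config from the evaluate server, and the one-second delay is an assumed value; only the retry count of 3 comes from the PR.

```python
import time


def fetch_with_retry(fetch, task_id, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(task_id), retrying on connection errors."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(task_id)
        except ConnectionError as exc:  # includes ConnectionAbortedError
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error


class TaskConfigCache:
    """Holds task configs in memory so a dead bridge does not lose them."""

    def __init__(self, fetch, sleep=time.sleep):
        self._fetch = fetch
        self._sleep = sleep
        self._configs = {}

    def prefetch(self, task_ids):
        for task_id in task_ids:
            try:
                self._configs[task_id] = fetch_with_retry(
                    self._fetch, task_id, sleep=self._sleep)
            except ConnectionError:
                pass  # leave a cache miss; get() retries later

    def get(self, task_id):
        if task_id not in self._configs:  # cache miss: live fetch
            self._configs[task_id] = fetch_with_retry(
                self._fetch, task_id, sleep=self._sleep)
        return self._configs[task_id]
```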

Picks up consilium e3619ad which migrates from deprecated google-generativeai to google-genai SDK, eliminating the FutureWarning about the deprecated package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the model's planned steps diverge from the actual UI (e.g. a menu doesn't have the expected option), the user can press 's' to take a fresh screenshot and regenerate all remaining steps from the current screen state — no need to describe what's wrong. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Show the next step and prompt user to verify VNC matches expected state before resuming. Default changed to No since fresh start is the safe choice — resume is only valid after tunnel drops, not VM reboots. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

New key mapping:
- Enter = step done
- d = task done early
- u = undo last step (was 'r', renamed for clarity)
- r = restart task (soft: close apps, re-setup, regenerate steps)
- R = restart task (hard: QEMU reboot)
- s = refresh remaining steps from current screenshot
- text = feedback to correct remaining steps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The hard reset at startup was destroying the VM state that checkpoints depend on. Now the script checks for checkpoints BEFORE the reset. If the user wants to resume, the reset is skipped entirely. If not, stale checkpoints are cleaned up automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Corrected remaining steps now show as "Step 4 of 10", "Step 5 of 10" etc. instead of restarting from 1. Uses the existing start_num parameter of _format_step_list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
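A small illustration of numbering from an offset. The commit names a start_num parameter on _format_step_list; the signature below, including `total`, is an assumed stand-in rather than the real function.

```python
def format_step_list(steps, start_num=1, total=None):
    """Label steps 'Step N of TOTAL', counting from start_num."""
    if total is None:
        total = start_num + len(steps) - 1
    return [
        f"Step {n} of {total}: {text}"
        for n, text in enumerate(steps, start=start_num)
    ]
```

For example, `format_step_list(["Click B2", "Type =SUM(A1:A3)"], start_num=4, total=10)` yields labels beginning "Step 4 of 10" and "Step 5 of 10".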

New prompt layout with clearer descriptions:

[Enter] next step
[x] retry step
[u] undo prev step
[d] task complete
[s] refresh steps from screenshot
[r] restart task
[R] restart task (reboot VM)
Or type correction:

[x] retry step: discards the current attempt, takes a fresh before screenshot, and re-displays the same step. Useful when you messed up the action and want to try again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
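The prompt layout above maps naturally onto a lookup table with free text as the fallback. This is a sketch; the action names are invented labels, not identifiers from the script.

```python
# Keys are case-sensitive: 'r' is the soft restart, 'R' the VM reboot.
ACTIONS = {
    "": "next_step",        # bare Enter
    "x": "retry_step",
    "u": "undo_step",
    "d": "task_complete",
    "s": "refresh_steps",
    "r": "soft_restart",
    "R": "hard_restart",
}


def parse_command(raw):
    """Return (action, correction_text) for one line of user input."""
    text = raw.strip()
    if text in ACTIONS:
        return ACTIONS[text], None
    # Anything else is treated as feedback on the remaining steps.
    return "correction", text
```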

- Remove "draft then review" instructions that caused models to output both draft and final step lists. Now requests only the final numbered steps with no commentary.
- Add 5s delay after _setup_task_env() in soft restart so the task app has time to open before screen stability check begins.
- Increase close_all delay from 2s to 3s for reliability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two-pass LLM analysis pipeline:
- Pass 1 (holistic): sends full task context + sampled screenshots to identify problematic steps
- Pass 2 (per-step): deep-dives each flagged step with before/after screenshots + surrounding context

Interactive review with accept/reject/edit per correction. Saves meta_refined.json + refinement_log.json alongside original meta.json. Supports --auto (non-interactive), --dry-run, --all, --model, and --no-council flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
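The control flow of the two passes can be sketched without any model calls. Both reviewers are injected callables standing in for LLM requests, and the one-step context window on each side is an assumption; the PR does not say how much surrounding context is sent.

```python
def refine_demo(steps, holistic_review, step_review, window=1):
    """Run a holistic pass, then a focused pass on each flagged step.

    holistic_review(steps) -> iterable of flagged step indices (pass 1)
    step_review(step, context) -> proposed correction (pass 2)
    """
    corrections = {}
    for i in holistic_review(steps):
        # Surrounding context: `window` steps on each side, clamped
        # to the bounds of the list.
        context = steps[max(0, i - window): i + window + 1]
        corrections[i] = step_review(steps[i], context)
    return corrections
```

Only flagged steps incur the more expensive per-step call, which is the point of splitting the analysis in two.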

The _vlm_call() was only passing the last text block to consilium, losing the system prompt (with JSON constraint) and all step text. Now concatenates system prompt + all text blocks into a single prompt. This fixes the holistic review returning prose instead of JSON. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
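The fix amounts to flattening a chat-style message list into one prompt string. The sketch below assumes OpenAI-style messages (string content, or a list of `text` and `image_url` blocks); how the result is handed to consilium is not shown here because that interface does not appear in this PR text.

```python
def flatten_messages(messages):
    """Join the system prompt and every text block into one prompt.

    Returns (prompt, image_urls). Nothing is dropped: earlier text
    blocks are kept, not just the last one.
    """
    parts, images = [], []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):  # e.g. a plain system prompt
            parts.append(content)
            continue
        for block in content:
            if block.get("type") == "text":
                parts.append(block["text"])
            elif block.get("type") == "image_url":
                images.append(block["image_url"]["url"])
    return "\n\n".join(parts), images
```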

- Replace naive fence-stripping with _extract_json() that handles: preamble text before JSON, ```json fences, trailing commentary, and bare JSON arrays/objects embedded in prose.
- Add openadapt-ml as uv source (path = "../openadapt-ml") so `uv sync` can resolve it for the annotation command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
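One way to get the behavior listed for _extract_json() is to scan for the first position where a JSON value parses, using the standard library's `json.JSONDecoder.raw_decode`. This is a sketch of that approach, not necessarily how the PR implements it.

```python
import json


def extract_json(text):
    """Return the first JSON object or array embedded in `text`.

    Works through preamble, code fences, and trailing commentary
    because raw_decode stops at the end of the first complete value.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "[{":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON here; try the next bracket
    raise ValueError("no JSON object or array found")
```

A bracket in ordinary prose (such as "[note]") fails to decode and is skipped, so the scan continues to the real payload.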

The annotate command imports prompt templates, data classes, and VLM provider wrappers from openadapt-ml. Added as dependency with local path source in [tool.uv.sources]. TODO: migrate annotation code into openadapt-evals to eliminate this cross-repo dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the recording script starts a VM via --auto-vm, it now registers atexit and signal handlers to clean up on exit:
- Normal exit: prompts user to deallocate (default Y)
- SIGINT/SIGTERM: auto-deallocates to prevent billing from orphaned VMs
- Only triggers if the script itself started the VM (not pre-running)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
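The three rules above can be sketched with the standard `atexit` and `signal` modules. `VmCleanup`, `deallocate`, and `confirm` are hypothetical names; `deallocate` stands in for whatever call stops the VM.

```python
import atexit
import signal
import sys


class VmCleanup:
    """Deallocate a VM on exit, but only if this process started it."""

    def __init__(self, deallocate, confirm=lambda: True):
        self._deallocate = deallocate
        self._confirm = confirm      # asks the user; True means "yes"
        self.started_by_us = False   # set True after starting the VM
        self._done = False           # guards against running twice

    def _run(self, interactive):
        if not self.started_by_us or self._done:
            return
        self._done = True
        if not interactive or self._confirm():
            self._deallocate()

    def on_exit(self):
        # Normal exit: ask before deallocating.
        self._run(interactive=True)

    def on_signal(self, signum, frame):
        # SIGINT/SIGTERM: no prompt, deallocate immediately.
        self._run(interactive=False)
        sys.exit(128 + signum)

    def install(self):
        atexit.register(self.on_exit)
        signal.signal(signal.SIGINT, self.on_signal)
        signal.signal(signal.SIGTERM, self.on_signal)
```

Because `sys.exit` in the signal handler also triggers the atexit hook, the `_done` flag prevents a second prompt or a second deallocation.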

abrichr · 2026-03-02T04:34:25Z

Closing in favor of PR #57

PR #57 (feat/vm-ip-autodetect-screen-stability) covers the foundational features with a cleaner architecture and passing CI. This PR's unique features will be cherry-picked into future PRs after #57 merges.

Features already covered by #57

VM IP auto-detection (resolve_vm_ip())
Screen stability detection (extracted as a proper module in openadapt_evals/infrastructure/screen_stability.py)
--auto / --auto-vm / --auto-tunnel / --auto-container infrastructure flags
Checkpoint/resume for recording sessions
Recording-to-demo converter script
Interactive step correction during recording

Unique #58 features to carry forward in future PRs

Consilium multi-model council integration for step generation (_vlm_call with graceful fallback)
autossh preference over plain ssh for tunnel auto-reconnection
refine_demo.py — two-pass LLM demo refinement pipeline (holistic + per-step analysis)
bcdedit recoveryenabled=No Dockerfile fix to prevent Windows Automatic Repair loops
Improved prompt engineering — removed sycophantic "user says wrong" framing, grounded reasoning (describe screenshot before listing steps), removed "draft then review" instructions
test_vlm_call.py — 10 tests covering image passing, checkpoint roundtrip, prompt construction, and fallback model validation
Soft restart (r key) vs hard restart (R key) distinction
Retry step (x key) for re-attempting current step
Screenshot refresh (s key) to regenerate remaining steps mid-recording
Pre-fetch task configs before QEMU reset to survive stale socat bridges
VM auto-deallocate on script exit (atexit + signal handlers)
Robust JSON extraction (_extract_json()) for VLM responses

Why closing

This PR diverged from the architecture of #57 ("feat: add interactive recording workflow with auto-infrastructure and VM IP detection"); for example, it inlined screen_stability instead of keeping it as a module
CI is failing because of the openadapt-ml local path dependency (path = "../openadapt-ml" in [tool.uv.sources]), which doesn't resolve in CI
21 commits with significant overlap make merging both PRs impractical

Branch preserved

The feat/consilium-autossh-checkpoint branch is not being deleted — it remains available as a reference for cherry-picking the features listed above.

abrichr · 2026-03-02T04:34:31Z

Reopening — the plan is to merge #57 first, then rebase this PR onto main to carry forward the unique features (consilium, refine_demo.py, autossh, bcdedit, prompt improvements).

CI was failing because uv.sources references local paths (../openadapt-ml) that don't exist in CI. Use --no-sources flag to fall back to PyPI versions. Also bump requires-python to >=3.11 since consilium 0.3.0 on PyPI requires it, and fix consilium git URL to the renamed OpenAdaptAI/openadapt-consilium repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…h-checkpoint

# Conflicts:
#	.beads/issues.jsonl
#	openadapt_evals/infrastructure/__init__.py
#	scripts/record_waa_demos.py
#	tests/test_screen_stability.py

Add coverage for RL training environment, end-to-end eval pipeline, annotation pipeline, 4-layer probe diagnostics, demo recording persistence, review artifacts, coordinate clamping, and multi-cloud VMProvider protocol. Update architecture tree with new modules (rl_env.py, probe.py, annotation.py, vlm.py, vm_provider.py, evaluation/) and scripts directory. Add openadapt-consilium to related projects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


abrichr and others added 21 commits March 1, 2026 13:49

fix: update lock file for consilium google-genai migration

0fb38ef


fix: remove unused import os in _refine_steps

57a5f8f


fix: number corrected steps from where recording left off

56807c5


abrichr closed this Mar 2, 2026

abrichr reopened this Mar 2, 2026

abrichr and others added 3 commits March 2, 2026 00:10

chore: sync beads state

952e46f

Merge remote-tracking branch 'origin/main' into feat/consilium-autoss…

2dfe5ab


abrichr merged commit 26da34c into main Mar 2, 2026
1 check passed

abrichr mentioned this pull request Mar 3, 2026

docs: update README with recent features #82

Merged

2 tasks

abrichr merged 24 commits into main from feat/consilium-autossh-checkpoint
