feat: add consilium integration, autossh, checkpoint/resume, and auto-recovery#58

Merged: abrichr merged 24 commits into main from feat/consilium-autossh-checkpoint on Mar 2, 2026

@abrichr abrichr commented Mar 1, 2026

Summary

  • Consilium multi-model council for VLM step generation with graceful single-model fallback
  • Checkpoint/resume: recording state saved after every step; survives tunnel drops, VM reboots
  • Auto-recovery (--auto): automatically start VM, establish SSH tunnels, start Docker container
  • Autossh: prefer autossh for tunnel auto-reconnection (falls back to plain ssh)
  • bcdedit: disable Windows Automatic Repair in Dockerfile golden image
  • Prompt improvements: efficiency-focused step generation, grounded reasoning, fixed sycophantic framing
  • Task config retry: 3x retry on transient connection aborts
  • 10 new tests: image passing, checkpoint roundtrip, prompt construction, fallback validation
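
The autossh preference in the summary can be illustrated with a minimal sketch. The function name `build_tunnel_cmd`, its parameters, and the specific keep-alive values are assumptions for illustration, not the code in scripts/record_waa_demos.py; the ssh and autossh options used are standard ones.

```python
import shutil
from typing import Callable, Optional


def build_tunnel_cmd(
    host: str,
    local_port: int,
    remote_port: int,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return an argv list for a local port forward, preferring autossh."""
    ssh_args = [
        "-N",  # no remote command, forward ports only
        "-o", "ServerAliveInterval=30",
        "-o", "ServerAliveCountMax=3",
        "-o", "ExitOnForwardFailure=yes",
        "-L", f"{local_port}:localhost:{remote_port}",
        host,
    ]
    if which("autossh"):
        # -M 0 disables autossh's own monitor port; liveness detection then
        # relies on the ServerAlive* options above.
        return ["autossh", "-M", "0", *ssh_args]
    # autossh is not installed: fall back to plain ssh (no auto-reconnect).
    return ["ssh", *ssh_args]
```

Passing `which` as a parameter keeps the choice testable without autossh installed.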

Changes

| Area | What |
| --- | --- |
| scripts/record_waa_demos.py | Consilium integration, checkpoint/resume, auto-recovery flags, autossh tunnels, prompt improvements |
| openadapt_evals/waa_deploy/Dockerfile | bcdedit /set {default} recoveryenabled No in FirstLogonCommands |
| pyproject.toml + uv.lock | Added consilium dependency with git source |
| tests/test_vlm_call.py | 10 tests for VLM call chain and checkpoint roundtrip |
| docs/resilience-options.md | Infrastructure resilience strategy document |

Test plan

  • uv run pytest tests/test_vlm_call.py -v — 10/10 pass
  • uv run pytest tests/ --ignore=tests/test_api_agent_ml.py --ignore=tests/test_council.py — 488 pass, 7 pre-existing failures (missing demo files)
  • Manual: uv sync && uv run python scripts/record_waa_demos.py record-waa with consilium
  • Manual: --auto flag with deallocated VM
  • Rebuild Docker image to verify bcdedit FirstLogonCommand

🤖 Generated with Claude Code

abrichr and others added 21 commits March 1, 2026 13:49
The previous screenshot showed only the Calc window. The new one shows
the full context: macOS Chrome browser with noVNC tab, Windows 11
desktop inside QEMU, LibreOffice Calc welcome dialog, and Windows
taskbar. This better demonstrates the VM evaluation infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add resolve_vm_ip() with layered resolution: explicit arg → pool
  registry (fast, local) → Azure CLI query (always accurate, ~3s)
- Remove hardcoded 172.173.66.131 defaults from record_waa_demos.py
  and run_dc_eval.py; --vm-ip is now auto-detected if omitted
- Add _wait_for_stable_screen() that polls QEMU framebuffer (free)
  until 3 consecutive screenshots match (99.5% similarity threshold),
  replacing the fixed time.sleep(3) that caused stale screenshots
- Add _compare_screenshots() with numpy-vectorized pixel comparison
- 24 new tests (14 for VM IP, 10 for screen stability)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
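
For readers unfamiliar with the approach, the screen-stability polling described in the commit above could look roughly like the following. This is a sketch, not the repository's `_wait_for_stable_screen()`: the `capture` callable, the parameter names, and the timeout behaviour are assumptions; only the three-consecutive-frames rule and the 99.5% threshold come from the commit message.

```python
import time
from typing import Callable

import numpy as np


def compare_screenshots(a: np.ndarray, b: np.ndarray) -> float:
    """Return the fraction of pixels that are identical in both frames."""
    if a.shape != b.shape:
        return 0.0
    # For colour images a pixel matches only if every channel matches.
    same = np.all(a == b, axis=-1) if a.ndim == 3 else (a == b)
    return float(np.mean(same))


def wait_for_stable_screen(
    capture: Callable[[], np.ndarray],
    required_matches: int = 3,
    threshold: float = 0.995,
    interval: float = 0.5,
    timeout: float = 30.0,
) -> np.ndarray:
    """Poll until `required_matches` consecutive frames are similar.

    Returns the most recent frame. On timeout the last frame is returned
    as well (an assumption of this sketch; raising would also be reasonable).
    """
    deadline = time.monotonic() + timeout
    previous = capture()
    streak = 1  # consecutive similar frames seen so far
    while streak < required_matches and time.monotonic() < deadline:
        time.sleep(interval)
        current = capture()
        if compare_screenshots(previous, current) >= threshold:
            streak += 1
        else:
            streak = 1  # screen changed, start counting again
        previous = current
    return previous
```
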
When the user presses 'R' to restart a task, the QEMU hard reset
produces a new stable screenshot, but the suggested steps were not
regenerated. The stale steps from the previous screenshot were
displayed. Now _generate_steps() is called again with the fresh
screenshot after every restart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After generating suggested steps from the screenshot, the user can now
type corrections (e.g., "step 9 formula should reference Sheet1.B2")
and the VLM will regenerate with the feedback. Loop continues until
the user presses Enter to accept.

Also refactors _generate_steps into smaller functions:
- _build_setup_desc(): extracts setup description from task config
- _vlm_call(): shared OpenAI API call helper
- _refine_steps(): sends feedback + screenshot for revised steps
- _display_steps(): pretty-prints step box
- _interactive_step_review(): correction loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
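
The correction loop described above reduces to a small pattern. The sketch below uses hypothetical names (`interactive_step_review`, `refine`, `read_input`) and does not reproduce the script's actual `_interactive_step_review()`; the VLM call is abstracted behind the `refine` callable.

```python
from typing import Callable


def interactive_step_review(
    steps: list[str],
    refine: Callable[[list[str], str], list[str]],
    read_input: Callable[[str], str] = input,
) -> list[str]:
    """Loop until the user accepts the steps by pressing Enter."""
    while True:
        feedback = read_input("Correction (Enter to accept): ").strip()
        if not feedback:
            return steps  # empty input means the steps are accepted
        # Any other text is treated as feedback and sent back for revision.
        steps = refine(steps, feedback)
```
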
Move the tasks-type guard above resolve_vm_ip() call so that
input validation happens before any real work. Fixes CI failure
where resolve_vm_ip raises RuntimeError in environments without
Azure access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
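
The ordering fix above, combined with the layered IP resolution from the earlier commit, can be sketched as follows. The signatures are assumptions: the real `resolve_vm_ip()` queries a pool registry and the Azure CLI, which are replaced here by injected lookup callables, and the exact form of the tasks-type guard is not shown in this PR.

```python
from typing import Callable, Optional


def resolve_vm_ip(
    explicit_ip: Optional[str],
    lookups: list[Callable[[], Optional[str]]],
) -> str:
    """Return the explicit IP if given, else the first IP a lookup finds."""
    if explicit_ip:
        return explicit_ip
    for lookup in lookups:  # e.g. pool registry first, then a cloud CLI query
        ip = lookup()
        if ip:
            return ip
    raise RuntimeError("could not resolve VM IP")


def start_recording(
    tasks: object,
    explicit_ip: Optional[str],
    lookups: list[Callable[[], Optional[str]]],
) -> str:
    """Hypothetical entry point: validate cheap inputs before any real work."""
    if not isinstance(tasks, list):
        # Fails fast, even where no cloud access is available (such as CI).
        raise TypeError("tasks must be a list")
    return resolve_vm_ip(explicit_ip, lookups)
```
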
…-recovery

- Integrate consilium multi-model council for step generation (_vlm_call)
  with graceful fallback to single-model (gpt-4.1-mini) on failure
- Add efficiency-focused step generation with human/agent target modes
- Fix prompt framing in _refine_steps (remove sycophantic "user says wrong")
- Add grounded reasoning (describe screenshot before listing steps)
- Add checkpoint/resume: save recording state after every step to survive
  tunnel drops or crashes, with interactive resume on reconnection
- Add --auto/--auto-vm/--auto-tunnel/--auto-container flags for automatic
  infrastructure recovery (VM start, SSH tunnels, Docker container, socat)
- Prefer autossh over plain ssh for tunnel auto-reconnection
- Add bcdedit recoveryenabled=No to Dockerfile FirstLogonCommands to prevent
  Windows Automatic Repair loops after dirty shutdown
- Add retry (3x) for task config fetch to handle transient connection aborts
- Add resilience-options.md documenting infrastructure recovery strategies
- Add test_vlm_call.py with 10 tests covering image passing, checkpoint
  roundtrip, prompt construction, and fallback model validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
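
The graceful fallback mentioned in the first bullet follows a common pattern, sketched below. The consilium API is deliberately not shown: `council` and `single_model` are placeholder callables, and treating an empty answer as a failure is an assumption of this sketch.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def call_with_fallback(
    prompt: str,
    council: Callable[[str], str],
    single_model: Callable[[str], str],
) -> str:
    """Try the multi-model council first; degrade to one model on failure."""
    try:
        answer = council(prompt)
        if answer and answer.strip():
            return answer
        logger.warning("council returned an empty answer; falling back")
    except Exception as exc:  # broad on purpose: any council failure degrades
        logger.warning("council failed (%s); falling back", exc)
    return single_model(prompt)
```
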
The evaluate server (localhost:5050) goes through a socat bridge that
can become stale after container/VM restarts. Pre-fetching all task
configs before the QEMU reset ensures human-readable instructions are
cached in memory even if the bridge dies later. Falls back to live
fetch with retry on cache miss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
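
A rough sketch of the prefetch-then-retry behaviour described above. The class name `TaskConfigCache` and the `fetch` callable are hypothetical; the three retries match the commit message, while the delay between attempts is an assumption.

```python
import time
from typing import Callable, Optional


class TaskConfigCache:
    """In-memory cache with a retried live fetch on cache miss."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        retries: int = 3,
        delay: float = 1.0,
    ) -> None:
        self._fetch = fetch
        self._retries = retries
        self._delay = delay
        self._cache: dict[str, dict] = {}

    def prefetch(self, task_ids: list[str]) -> None:
        """Fetch every config up front, while the bridge is known to work."""
        for task_id in task_ids:
            try:
                self._cache[task_id] = self._fetch(task_id)
            except ConnectionError:
                pass  # leave the miss for get() to retry later

    def get(self, task_id: str) -> dict:
        if task_id in self._cache:
            return self._cache[task_id]
        last_error: Optional[Exception] = None
        for attempt in range(self._retries):
            try:
                config = self._fetch(task_id)
            except ConnectionError as exc:  # includes ConnectionAbortedError
                last_error = exc
                if attempt < self._retries - 1:
                    time.sleep(self._delay)
            else:
                self._cache[task_id] = config
                return config
        raise ConnectionError(f"could not fetch config for {task_id}") from last_error
```
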
Picks up consilium e3619ad which migrates from deprecated
google-generativeai to google-genai SDK, eliminating the
FutureWarning about the deprecated package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the model's planned steps diverge from the actual UI (e.g. a menu
doesn't have the expected option), the user can press 's' to take a
fresh screenshot and regenerate all remaining steps from the current
screen state — no need to describe what's wrong.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show the next step and prompt user to verify VNC matches expected
state before resuming. Default changed to No since fresh start is
the safe choice — resume is only valid after tunnel drops, not
VM reboots.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New key mapping:
  Enter = step done
  d     = task done early
  u     = undo last step (was 'r', renamed for clarity)
  r     = restart task (soft — close apps, re-setup, regenerate steps)
  R     = restart task (hard — QEMU reboot)
  s     = refresh remaining steps from current screenshot
  text  = feedback to correct remaining steps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hard reset at startup was destroying the VM state that checkpoints
depend on. Now the script checks for checkpoints BEFORE the reset.
If the user wants to resume, the reset is skipped entirely. If not,
stale checkpoints are cleaned up automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
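
The check-before-reset order can be made concrete with a small sketch. All names here (`save_checkpoint`, `load_checkpoint`, `plan_startup`) are illustrative, and the atomic write via a temporary file is a design choice of this sketch rather than something the PR states.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Callable, Optional


def save_checkpoint(path: Path, state: dict[str, Any]) -> None:
    """Write to a temp file, then atomically replace the checkpoint."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_name, path)  # a crash mid-write never leaves a partial file


def load_checkpoint(path: Path) -> Optional[dict[str, Any]]:
    """Return the saved state, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def plan_startup(
    path: Path, wants_resume: Callable[[dict[str, Any]], bool]
) -> Optional[dict[str, Any]]:
    """Run before any VM reset. Returns state to resume from, else None."""
    state = load_checkpoint(path)
    if state is not None and wants_resume(state):
        return state  # caller must skip the hard reset
    path.unlink(missing_ok=True)  # stale or declined checkpoint: clean up
    return None
```

A caller would perform the hard reset only when `plan_startup()` returns `None`.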
Corrected remaining steps now show as "Step 4 of 10", "Step 5 of 10"
etc. instead of restarting from 1. Uses the existing start_num
parameter of _format_step_list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
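
As an illustration of the numbering fix, a formatter with a `start_num` parameter might look like this. The signature, the `total` parameter, and the output format are assumptions, not the actual `_format_step_list`.

```python
from typing import Optional


def format_step_list(
    steps: list[str], start_num: int = 1, total: Optional[int] = None
) -> str:
    """Number steps from `start_num` so corrected steps keep their position."""
    if total is None:
        total = start_num - 1 + len(steps)
    return "\n".join(
        f"Step {number} of {total}: {text}"
        for number, text in enumerate(steps, start=start_num)
    )
```
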
New prompt layout with clearer descriptions:
  [Enter] next step   [x] retry step   [u] undo prev step
  [d] task complete   [s] refresh steps from screenshot
  [r] restart task    [R] restart task (reboot VM)
  Or type correction:

[x] retry step: discards the current attempt, takes a fresh
before screenshot, and re-displays the same step. Useful when
you messed up the action and want to try again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
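
One way to model the prompt above is a case-sensitive lookup table with free text as the fallthrough. The action names and the `parse_key` function are hypothetical; only the key assignments come from the commit message.

```python
ACTIONS = {
    "": "next_step",      # Enter
    "x": "retry_step",
    "u": "undo_prev_step",
    "d": "task_complete",
    "s": "refresh_steps",
    "r": "soft_restart",
    "R": "hard_restart",  # reboot VM; distinct from lowercase r
}


def parse_key(raw: str) -> tuple[str, str]:
    """Map raw prompt input to (action, payload)."""
    text = raw.strip()
    if text in ACTIONS:  # case-sensitive lookup keeps r and R apart
        return ACTIONS[text], ""
    # Anything else is a typed correction for the remaining steps.
    return "correction", text
```
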
- Remove "draft then review" instructions that caused models to output
  both draft and final step lists. Now requests only the final numbered
  steps with no commentary.
- Add 5s delay after _setup_task_env() in soft restart so the task app
  has time to open before screen stability check begins.
- Increase close_all delay from 2s to 3s for reliability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two-pass LLM analysis pipeline:
- Pass 1 (holistic): sends full task context + sampled screenshots to
  identify problematic steps
- Pass 2 (per-step): deep-dives each flagged step with before/after
  screenshots + surrounding context

Interactive review with accept/reject/edit per correction. Saves
meta_refined.json + refinement_log.json alongside original meta.json.

Supports --auto (non-interactive), --dry-run, --all, --model, and
--no-council flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _vlm_call() was only passing the last text block to consilium,
losing the system prompt (with JSON constraint) and all step text.
Now concatenates system prompt + all text blocks into a single prompt.

This fixes the holistic review returning prose instead of JSON.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
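
The fix above amounts to flattening every text block instead of keeping only the last one. A minimal sketch, assuming OpenAI-style content blocks (dicts with a `type` key); the function name `flatten_prompt` is hypothetical:

```python
def flatten_prompt(system_prompt: str, content_blocks: list[dict]) -> str:
    """Join the system prompt and all text blocks into one prompt string."""
    parts = [system_prompt] if system_prompt else []
    # Keep every text block in order; image blocks are passed separately.
    parts += [
        block["text"] for block in content_blocks if block.get("type") == "text"
    ]
    return "\n\n".join(parts)
```
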
- Replace naive fence-stripping with _extract_json() that handles:
  preamble text before JSON, ```json fences, trailing commentary,
  and bare JSON arrays/objects embedded in prose.
- Add openadapt-ml as uv source (path = "../openadapt-ml") so
  `uv sync` can resolve it for the annotation command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
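
A tolerant extractor of the kind described above can be built on `json.JSONDecoder.raw_decode`, which parses one value starting at a given index and ignores whatever follows. The sketch below is one possible approach, not the repository's `_extract_json()`.

```python
import json
import re
from typing import Any

_TICKS = "`" * 3  # fence marker, built this way to keep the example readable
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def extract_json(text: str) -> Any:
    """Return the first JSON object or array found in `text`."""
    fenced = _FENCE.search(text)
    # Prefer fenced content, then fall back to scanning the whole reply.
    candidates = [fenced.group(1), text] if fenced else [text]
    decoder = json.JSONDecoder()
    for candidate in candidates:
        for match in re.finditer(r"[\[{]", candidate):
            try:
                # raw_decode stops at the end of the first complete value,
                # so trailing commentary is ignored.
                value, _end = decoder.raw_decode(candidate, match.start())
            except json.JSONDecodeError:
                continue  # a bracket in prose, try the next one
            return value
    raise ValueError("no JSON object or array found")
```
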
The annotate command imports prompt templates, data classes, and
VLM provider wrappers from openadapt-ml. Added as dependency with
local path source in [tool.uv.sources].

TODO: migrate annotation code into openadapt-evals to eliminate
this cross-repo dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the recording script starts a VM via --auto-vm, it now registers
atexit and signal handlers to clean up on exit:
- Normal exit: prompts user to deallocate (default Y)
- SIGINT/SIGTERM: auto-deallocates to prevent billing from orphaned VMs
- Only triggers if the script itself started the VM (not pre-running)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
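
The cleanup rules above map onto the standard `atexit` and `signal` modules roughly as follows. The class `VMCleanup` and its `deallocate` and `confirm` callables are hypothetical stand-ins for the script's real Azure calls and its prompt (which defaults to Y).

```python
import atexit
import signal
import sys
from typing import Callable


class VMCleanup:
    """Deallocate a VM on exit, but only if this process started it."""

    def __init__(
        self, deallocate: Callable[[], None], confirm: Callable[[], bool]
    ) -> None:
        self._deallocate = deallocate
        self._confirm = confirm
        self.started_by_us = False  # set to True after starting the VM
        self._done = False

    def install(self) -> None:
        atexit.register(self.on_exit)
        signal.signal(signal.SIGINT, self.on_signal)
        signal.signal(signal.SIGTERM, self.on_signal)

    def _run(self) -> None:
        self._done = True  # guard against a second run from atexit
        self._deallocate()

    def on_exit(self) -> None:
        # Normal exit: ask the user before deallocating.
        if self.started_by_us and not self._done and self._confirm():
            self._run()

    def on_signal(self, signum: int, frame: object) -> None:
        # Interrupted: nobody may be there to answer, so do not prompt.
        if self.started_by_us and not self._done:
            self._run()
        sys.exit(128 + signum)
```

Because `sys.exit()` inside the signal handler still triggers `atexit` callbacks, the `_done` flag keeps the deallocation from running twice.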

abrichr commented Mar 2, 2026

Closing in favor of PR #57

PR #57 (feat/vm-ip-autodetect-screen-stability) covers the foundational features with a cleaner architecture and passing CI. This PR's unique features will be cherry-picked into future PRs after #57 merges.

Features already covered by #57

  • VM IP auto-detection (resolve_vm_ip())
  • Screen stability detection (extracted as a proper module in openadapt_evals/infrastructure/screen_stability.py)
  • --auto / --auto-vm / --auto-tunnel / --auto-container infrastructure flags
  • Checkpoint/resume for recording sessions
  • Recording-to-demo converter script
  • Interactive step correction during recording

Unique #58 features to carry forward in future PRs

  • Consilium multi-model council integration for step generation (_vlm_call with graceful fallback)
  • autossh preference over plain ssh for tunnel auto-reconnection
  • refine_demo.py — two-pass LLM demo refinement pipeline (holistic + per-step analysis)
  • bcdedit recoveryenabled=No Dockerfile fix to prevent Windows Automatic Repair loops
  • Improved prompt engineering — removed sycophantic "user says wrong" framing, grounded reasoning (describe screenshot before listing steps), removed "draft then review" instructions
  • test_vlm_call.py — 10 tests covering image passing, checkpoint roundtrip, prompt construction, and fallback model validation
  • Soft restart (r key) vs hard restart (R key) distinction
  • Retry step (x key) for re-attempting current step
  • Screenshot refresh (s key) to regenerate remaining steps mid-recording
  • Pre-fetch task configs before QEMU reset to survive stale socat bridges
  • VM auto-deallocate on script exit (atexit + signal handlers)
  • Robust JSON extraction (_extract_json()) for VLM responses

Why closing

Branch preserved

The feat/consilium-autossh-checkpoint branch is not being deleted — it remains available as a reference for cherry-picking the features listed above.

@abrichr abrichr closed this Mar 2, 2026

abrichr commented Mar 2, 2026

Reopening — the plan is to merge #57 first, then rebase this PR onto main to carry forward the unique features (consilium, refine_demo.py, autossh, bcdedit, prompt improvements).

@abrichr abrichr reopened this Mar 2, 2026
abrichr and others added 3 commits March 2, 2026 00:10
CI was failing because uv.sources references local paths (../openadapt-ml)
that don't exist in CI. Use --no-sources flag to fall back to PyPI versions.
Also bump requires-python to >=3.11 since consilium 0.3.0 on PyPI requires it,
and fix consilium git URL to the renamed OpenAdaptAI/openadapt-consilium repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h-checkpoint

# Conflicts:
#	.beads/issues.jsonl
#	openadapt_evals/infrastructure/__init__.py
#	scripts/record_waa_demos.py
#	tests/test_screen_stability.py
@abrichr abrichr merged commit 26da34c into main Mar 2, 2026
1 check passed
abrichr added a commit that referenced this pull request Mar 3, 2026
Add coverage for RL training environment, end-to-end eval pipeline,
annotation pipeline, 4-layer probe diagnostics, demo recording
persistence, review artifacts, coordinate clamping, and multi-cloud
VMProvider protocol. Update architecture tree with new modules
(rl_env.py, probe.py, annotation.py, vlm.py, vm_provider.py,
evaluation/) and scripts directory. Add openadapt-consilium to
related projects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>