Commit 42b3847

abrichr and claude authored
feat: add QEMU monitor restart for Windows VM (#55)
* feat: add QEMU monitor restart for Windows VM

  Add QEMUResetManager that sends system_reset via the QEMU monitor telnet interface (port 7100) for reliable Windows hard resets inside the dockur container. This is more reliable than shutdown /r /t 0 via the WAA /execute endpoint, which dies before Windows actually restarts.

  Changes:
  - New module: openadapt_evals/infrastructure/qemu_reset.py
  - CLI command: oa-vm windows-restart --vm-ip <ip> --timeout 300
  - Updated scripts/run_dc_eval.py _restart_container() to use QEMU reset as primary approach, falling back to docker restart if monitor is unreachable
  - 15 unit tests with mocked SSH/HTTP calls

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle QEMU monitor binary output and add recording script improvements

  - Fix UnicodeDecodeError in qemu_reset.py by using bytes mode instead of text=True for subprocess.run (QEMU monitor returns telnet control chars)
  - Add fire dependency to pyproject.toml for recording script CLI
  - Add --vm-ip parameter and QEMU hard reset on script startup for clean state
  - Add 'R' command to restart task from scratch via QEMU reset
  - Add LibreOffice recovery data cleanup and auto-recovery disabling after each hard reset (deletes backup files, removes RecoveryList entries, sets AutoSave=false in registrymodifications.xcu)
  - Add --tasks type guard with clear error message when Fire passes bool
  - Add TestRecordWaaArgParsing tests for argument validation
  - Fix test mocks to use bytes instead of strings for subprocess output

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: only print recovery cleanup success when it actually succeeds

  The "Cleared LibreOffice recovery data." message was outside the try/except block, printing even when the cleanup request failed. Move it inside the success branch and add a warning for non-OK responses.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: extract HARDER_TASK_IDS to shared constants and fix regex

  - Move duplicated HARDER_TASK_IDS list from record_waa_demos.py and run_dc_eval.py into openadapt_evals/constants.py
  - Add re.DOTALL to LibreOffice cleanup regex so it handles multi-line XML entries in registrymodifications.xcu

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove redundant import and move fire to dev dependency

  - Remove duplicate QEMUResetManager import in _hard_reset_task_env (already imported in enclosing cmd_record_waa scope)
  - Move fire from core dependencies to dev extras since it's only used in scripts/__main__ guards, not as a library dependency

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: remove unused pytest import in test_qemu_reset

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent fd7d38d commit 42b3847
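
The UnicodeDecodeError fix described in the commit message can be reproduced in isolation. The sketch below is illustrative only: the byte string is a made-up monitor response (telnet IAC negotiation bytes followed by a banner with an arbitrary version number), and `decode_monitor_output` is a hypothetical helper name. It shows why strict UTF-8 decoding, which `text=True` implies on a UTF-8 locale, fails on such output, while decoding the captured bytes with `errors="replace"` does not.

```python
# Illustrative bytes: telnet IAC sequences (0xFF ...) followed by a banner.
# The exact bytes and the version number are assumptions, not captured output.
RAW_MONITOR_OUTPUT = (
    b"\xff\xfb\x01\xff\xfb\x03"
    b"QEMU 8.2.0 monitor - type 'help' for more information\r\n"
)


def decode_monitor_output(raw: bytes) -> str:
    """Decode monitor output leniently so control bytes cannot raise."""
    return raw.decode("utf-8", errors="replace")


# Strict decoding fails because 0xFF is never valid in UTF-8.
try:
    RAW_MONITOR_OUTPUT.decode("utf-8")
    strict_decoding_ok = True
except UnicodeDecodeError:
    strict_decoding_ok = False

text = decode_monitor_output(RAW_MONITOR_OUTPUT)
print(strict_decoding_ok, "QEMU" in text)
```

Invalid bytes become U+FFFD replacement characters, so a substring check such as `"QEMU" in text` still works on the readable part of the response.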

11 files changed

Lines changed: 757 additions & 47 deletions

.beads/issues.jsonl

Lines changed: 1 addition & 1 deletion
@@ -13,5 +13,5 @@
 {"id":"openadapt-evals-hvm","title":"VL model fix PR #18 ready to merge","notes":"2026-02-08: openadapt-ml PR #18 was already merged on 2026-01-29. VL model fix is done.","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.491938-05:00","created_by":"Richard Abrich","updated_at":"2026-02-08T12:55:19.233249-05:00","closed_at":"2026-02-08T12:55:19.233249-05:00","close_reason":"PR #18 already merged 2026-01-29"}
 {"id":"openadapt-evals-mx8","title":"Analyze evaluation results and publish findings","description":"After demo-conditioned evaluation completes, analyze results: success rates, failure modes, demo impact. Create data-driven roadmap for improvements.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:06.328838-05:00","created_by":"Richard Abrich","updated_at":"2026-02-14T12:23:06.328838-05:00"}
 {"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
-{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"Feb 28: 6 design docs created (code health, marketing, CLI DX, testing, infra, docs). Marketing materials drafted and polished. Prioritization documented in STATUS.md. Tier 1 blockers identified: version fix, PyAutoGUI fail-safe recovery, socat systemd service, auto-open viewer. Next: implement Tier 1 items to unblock reliable eval runs.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-02-28T09:26:40.150172-05:00"}
+{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"Feb 28: 6 design docs created (code health, marketing, CLI DX, testing, infra, docs). Marketing materials drafted and polished. Prioritization documented in STATUS.md. Tier 1 blockers identified: version fix, PyAutoGUI fail-safe recovery, socat systemd service, auto-open viewer. Next: implement Tier 1 items to unblock reliable eval runs.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-02-28T11:25:44.494548-05:00"}
 {"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}

openadapt_evals/benchmarks/vm_cli.py

Lines changed: 45 additions & 0 deletions
@@ -2807,6 +2807,30 @@ def cmd_vm_start(args):
     return 1


+def cmd_windows_restart(args):
+    """Restart Windows inside QEMU via the monitor interface."""
+    from openadapt_evals.infrastructure.qemu_reset import QEMUResetManager
+
+    init_logging()
+
+    ip = args.vm_ip or get_vm_ip()
+    if not ip:
+        log("WIN-RESTART", "ERROR: VM not found. Specify --vm-ip or ensure VM is running.")
+        return 1
+
+    log("WIN-RESTART", f"Restarting Windows on {ip} via QEMU monitor...")
+
+    mgr = QEMUResetManager(
+        vm_ip=ip,
+        ssh_user="azureuser",
+        timeout_seconds=args.timeout,
+    )
+
+    success, message = mgr.restart_windows(server_url=args.server)
+    log("WIN-RESTART", message)
+    return 0 if success else 1
+
+
 def cmd_exec(args):
     """Run command on VM host."""
     ip = get_vm_ip()
@@ -7941,6 +7965,27 @@ def main():
     p_vmstart = subparsers.add_parser("vm-start", help="Start a deallocated VM")
     p_vmstart.set_defaults(func=cmd_vm_start)

+    # windows-restart
+    p_winrestart = subparsers.add_parser(
+        "windows-restart",
+        help="Restart Windows inside QEMU via monitor (hard reset)",
+    )
+    p_winrestart.add_argument(
+        "--vm-ip", default=None, help="VM IP (default: auto-detect from Azure)"
+    )
+    p_winrestart.add_argument(
+        "--server",
+        default="http://localhost:5001",
+        help="WAA server URL for readiness probe (default: http://localhost:5001)",
+    )
+    p_winrestart.add_argument(
+        "--timeout",
+        type=int,
+        default=300,
+        help="Timeout in seconds to wait for WAA server after reset (default: 300)",
+    )
+    p_winrestart.set_defaults(func=cmd_windows_restart)
+
     # logs
     p_logs = subparsers.add_parser("logs", help="Show WAA status and logs")
     p_logs.add_argument(
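
To see how the windows-restart arguments above end up on `args`, here is a small standalone sketch. It rebuilds only that subparser (help strings omitted), takes the `oa-vm` program name from the commit message, and uses `10.0.0.4` as a placeholder IP. It does not call `QEMUResetManager`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild only the windows-restart subcommand from the diff."""
    parser = argparse.ArgumentParser(prog="oa-vm")
    subparsers = parser.add_subparsers(dest="command")
    p_winrestart = subparsers.add_parser("windows-restart")
    p_winrestart.add_argument("--vm-ip", default=None)
    p_winrestart.add_argument("--server", default="http://localhost:5001")
    p_winrestart.add_argument("--timeout", type=int, default=300)
    return parser


# Placeholder IP for illustration; --server is left at its default.
args = build_parser().parse_args(
    ["windows-restart", "--vm-ip", "10.0.0.4", "--timeout", "120"]
)

# argparse exposes --vm-ip as args.vm_ip and converts --timeout to int.
print(args.command, args.vm_ip, args.server, args.timeout)
```

Because `--vm-ip` defaults to `None`, the `args.vm_ip or get_vm_ip()` expression in `cmd_windows_restart` falls through to auto-detection when the flag is omitted.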

openadapt_evals/constants.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+"""Shared constants for openadapt-evals."""
+
+# 12 harder WAA task IDs used for demo-conditioned evaluation
+HARDER_TASK_IDS = [
+    "04d9aeaf-7bed-4024-bedb-e10e6f00eb7f-WOS",
+    "0a0faba3-5580-44df-965d-f562a99b291c-WOS",
+    "0bf05a7d-b28b-44d2-955a-50b41e24012a-WOS",
+    "0e763496-b6bb-4508-a427-fad0b6c3e195-WOS",
+    "4bcb1253-a636-4df4-8cb0-a35c04dfef31-WOS",
+    "70745df8-f2f5-42bd-8074-fbc10334fcc5-2-WOS",
+    "8b1ce5f2-59d2-4dcc-b0b0-666a714b9a14-WOS",
+    "e2b5e914-ffe1-44d2-8e92-58f8c5d92bb2-WOS",
+    "ec71221e-ac43-46f9-89b8-ee7d80f7e1c5-WOS",
+    "fba2c100-79e8-42df-ae74-b592418d54f4-WOS",
+    "INF-0d95d28a-9587-433b-a805-1fbe5467d598-WOS",
+    "INF-5ac2891a-eacd-4954-b339-98abba077adb-WOS",
+]
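
The refactor that introduced constants.py also added `re.DOTALL` to the LibreOffice cleanup regex, but that code is not part of the files shown here. The sketch below illustrates the effect under stated assumptions: the pattern, the sample XML, and `strip_recovery_entries` are hypothetical and not the regex used in the recording script. Without `re.DOTALL`, `.` does not match a newline, so an `<item>` entry that spans several lines is never matched.

```python
import re

# Hypothetical pattern; the real cleanup regex is not shown in this diff.
ITEM_PATTERN = r'<item oor:path="[^"]*RecoveryList[^"]*">.*?</item>'

# Made-up excerpt in the style of registrymodifications.xcu.
SAMPLE_XCU = (
    '<item oor:path="/org.openoffice.Office.Recovery/RecoveryList">\n'
    '  <node oor:name="recovery_item_1" oor:op="replace"/>\n'
    "</item>\n"
    '<item oor:path="/org.openoffice.Office.Common/Save">'
    '<prop oor:name="Keep"/></item>\n'
)


def strip_recovery_entries(xcu_text: str, flags: int = re.DOTALL) -> str:
    """Remove RecoveryList <item> entries, including multi-line ones."""
    return re.sub(ITEM_PATTERN, "", xcu_text, flags=flags)


# Without DOTALL the multi-line entry survives; with it the entry is removed.
without_dotall = strip_recovery_entries(SAMPLE_XCU, flags=0)
with_dotall = strip_recovery_entries(SAMPLE_XCU)
print("RecoveryList" in without_dotall, "RecoveryList" in with_dotall)
```

The non-greedy `.*?` keeps the match from running past the first closing `</item>`, so unrelated entries after the recovery entry are left alone.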

openadapt_evals/infrastructure/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,7 @@
 - VMMonitor: Azure VM status monitoring
 - AzureOpsTracker: Azure operation logging
 - SSHTunnelManager: SSH tunnel management for VNC/API access
+- QEMUResetManager: QEMU monitor-based Windows restart

 Example:
 ```python
@@ -18,12 +19,18 @@
 # Create and manage pools
 pool = PoolManager()
 pool.create(workers=3)
+
+# Restart Windows inside QEMU
+from openadapt_evals.infrastructure import QEMUResetManager
+mgr = QEMUResetManager(vm_ip="172.173.66.131")
+success, msg = mgr.restart_windows()
 ```
 """

 from openadapt_evals.infrastructure.azure_ops_tracker import AzureOpsTracker
 from openadapt_evals.infrastructure.azure_vm import AzureVMManager
 from openadapt_evals.infrastructure.pool import PoolManager, PoolRunResult
+from openadapt_evals.infrastructure.qemu_reset import QEMUResetManager
 from openadapt_evals.infrastructure.ssh_tunnel import SSHTunnelManager, get_tunnel_manager
 from openadapt_evals.infrastructure.vm_monitor import VMMonitor, VMConfig

@@ -32,6 +39,7 @@
     "AzureVMManager",
     "PoolManager",
     "PoolRunResult",
+    "QEMUResetManager",
     "VMMonitor",
     "VMConfig",
     "SSHTunnelManager",
openadapt_evals/infrastructure/qemu_reset.py

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
+"""QEMU Monitor Reset Manager for Windows VMs.
+
+Provides reliable Windows restart inside QEMU (dockur/windows Docker image)
+by sending ``system_reset`` via the QEMU monitor telnet interface on port 7100.
+
+The WAA Flask server running inside Windows dies when you send
+``shutdown /r /t 0`` through the ``/execute`` endpoint, making that approach
+unreliable. Sending ``system_reset`` through the QEMU monitor is a hard
+reset that works regardless of the guest OS state.
+
+Architecture::
+
+    Local machine
+      --> SSH --> Azure VM (Ubuntu host)
+        --> docker exec winarena
+          --> echo "system_reset" | nc -q1 localhost 7100
+              (QEMU monitor telnet on port 7100)
+
+After reset, the container's ``entry_setup.sh`` automatically polls
+``172.30.0.2:5000/probe`` and the WAA Flask server comes back up in
+~90 seconds.
+
+Usage::
+
+    from openadapt_evals.infrastructure.qemu_reset import QEMUResetManager
+
+    mgr = QEMUResetManager(vm_ip="172.173.66.131")
+
+    # Full restart: send reset + wait for WAA server
+    success, message = mgr.restart_windows()
+
+    # Or do each step separately
+    mgr.reset_windows()
+    mgr.wait_for_waa_ready()
+"""
+
+from __future__ import annotations
+
+import logging
+import subprocess
+import time
+
+import requests
+
+logger = logging.getLogger(__name__)
+
+# SSH options consistent with the rest of the codebase
+_SSH_OPTS = [
+    "-o", "StrictHostKeyChecking=no",
+    "-o", "UserKnownHostsFile=/dev/null",
+    "-o", "LogLevel=ERROR",
+    "-o", "ConnectTimeout=10",
+]
+
+
+class QEMUResetManager:
+    """Manage Windows restarts via QEMU monitor inside a Docker container.
+
+    Attributes:
+        vm_ip: IP address of the Azure Ubuntu VM hosting the Docker container.
+        ssh_user: SSH user for the VM (default ``azureuser``).
+        qemu_monitor_port: QEMU monitor telnet port inside the container (default 7100).
+        container_name: Docker container name (default ``winarena``).
+        timeout_seconds: Maximum seconds to wait for the WAA server after reset.
+    """
+
+    def __init__(
+        self,
+        vm_ip: str,
+        ssh_user: str = "azureuser",
+        qemu_monitor_port: int = 7100,
+        container_name: str = "winarena",
+        timeout_seconds: int = 300,
+    ) -> None:
+        self.vm_ip = vm_ip
+        self.ssh_user = ssh_user
+        self.qemu_monitor_port = qemu_monitor_port
+        self.container_name = container_name
+        self.timeout_seconds = timeout_seconds
+
+    def reset_windows(self) -> bool:
+        """Send ``system_reset`` via the QEMU monitor over SSH.
+
+        Executes::
+
+            ssh {user}@{ip} "docker exec {container} bash -c
+                'echo system_reset | nc -q1 localhost {port}'"
+
+        Returns:
+            True if the SSH + docker exec command succeeded (exit code 0).
+        """
+        docker_cmd = (
+            f"docker exec {self.container_name} bash -c "
+            f"'echo system_reset | nc -q1 localhost {self.qemu_monitor_port}'"
+        )
+        ssh_cmd = [
+            "ssh",
+            *_SSH_OPTS,
+            f"{self.ssh_user}@{self.vm_ip}",
+            docker_cmd,
+        ]
+
+        logger.info(
+            "Sending system_reset via QEMU monitor (port %d) on %s",
+            self.qemu_monitor_port,
+            self.vm_ip,
+        )
+
+        try:
+            result = subprocess.run(
+                ssh_cmd,
+                capture_output=True,
+                timeout=30,
+            )
+        except subprocess.TimeoutExpired:
+            logger.error("SSH command timed out sending system_reset")
+            return False
+
+        if result.returncode != 0:
+            stderr = result.stderr.decode("utf-8", errors="replace").strip()
+            logger.error(
+                "QEMU monitor reset failed (rc=%d): %s",
+                result.returncode,
+                stderr,
+            )
+            return False
+
+        logger.info("QEMU system_reset sent successfully")
+        return True
+
+    def wait_for_waa_ready(
+        self,
+        server_url: str = "http://localhost:5001",
+        check_interval: int = 10,
+    ) -> bool:
+        """Poll the WAA ``/probe`` endpoint until it responds or timeout.
+
+        Args:
+            server_url: Base URL of the WAA server (through SSH tunnel).
+            check_interval: Seconds between probe attempts.
+
+        Returns:
+            True if the server responded within ``timeout_seconds``, False on timeout.
+        """
+        probe_url = f"{server_url}/probe"
+        deadline = time.time() + self.timeout_seconds
+        start = time.time()
+
+        logger.info(
+            "Waiting up to %ds for WAA server at %s",
+            self.timeout_seconds,
+            probe_url,
+        )
+
+        while time.time() < deadline:
+            elapsed = int(time.time() - start)
+            try:
+                resp = requests.get(probe_url, timeout=check_interval)
+                if resp.ok:
+                    logger.info("WAA server ready after %ds", elapsed)
+                    return True
+            except (requests.ConnectionError, requests.Timeout):
+                pass
+
+            remaining = int(deadline - time.time())
+            if remaining > 0:
+                logger.info(
+                    "[%ds] WAA not ready yet, retrying in %ds (%ds remaining)...",
+                    elapsed,
+                    check_interval,
+                    remaining,
+                )
+                time.sleep(check_interval)
+
+        elapsed = int(time.time() - start)
+        logger.error("WAA server did not become ready within %ds", elapsed)
+        return False
+
+    def restart_windows(
+        self,
+        server_url: str = "http://localhost:5001",
+    ) -> tuple[bool, str]:
+        """Full restart cycle: send QEMU reset then wait for WAA readiness.
+
+        Args:
+            server_url: Base URL of the WAA server (through SSH tunnel).
+
+        Returns:
+            Tuple of (success, message) where *success* is True if the
+            server came back within the timeout.
+        """
+        if not self.reset_windows():
+            return False, "Failed to send system_reset via QEMU monitor"
+
+        logger.info("Reset sent, waiting for WAA server to come back...")
+
+        if self.wait_for_waa_ready(server_url=server_url):
+            return True, "Windows restarted and WAA server is ready"
+
+        return False, f"WAA server did not come back within {self.timeout_seconds}s"
+
+    def is_qemu_monitor_reachable(self) -> bool:
+        """Check whether the QEMU monitor telnet port is reachable inside the container.
+
+        This can be used to decide whether to fall back to ``docker restart``.
+
+        Returns:
+            True if the QEMU monitor responds.
+        """
+        docker_cmd = (
+            f"docker exec {self.container_name} bash -c "
+            f"'echo info version | nc -q1 localhost {self.qemu_monitor_port}'"
+        )
+        ssh_cmd = [
+            "ssh",
+            *_SSH_OPTS,
+            f"{self.ssh_user}@{self.vm_ip}",
+            docker_cmd,
+        ]
+
+        try:
+            result = subprocess.run(
+                ssh_cmd,
+                capture_output=True,
+                timeout=15,
+            )
+            stdout = result.stdout.decode("utf-8", errors="replace")
+            reachable = result.returncode == 0 and "QEMU" in stdout
+            logger.debug(
+                "QEMU monitor reachable: %s (stdout: %s)",
+                reachable,
+                stdout.strip()[:100],
+            )
+            return reachable
+        except subprocess.TimeoutExpired:
+            logger.debug("QEMU monitor reachability check timed out")
+            return False

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ dev = [
     "ruff>=0.1.0",
     "flask>=3.0.0",
     "requests-toolbelt>=1.0.0",
+    "fire>=0.5.0",
 ]
 waa = [
     # Windows Agent Arena dependencies
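
The fire dependency backs the recording script CLI, and the commit message mentions a `--tasks` type guard for the case where Fire passes a bool. Fire yields `True` for a flag given without a value. The script itself is not in this diff, so the guard below is only a sketch under that assumption: `normalize_tasks` and its error messages are hypothetical. It does not import fire, so it runs standalone.

```python
from __future__ import annotations


def normalize_tasks(tasks: object) -> list[str]:
    """Turn a CLI --tasks value into a list of task IDs.

    Hypothetical helper; the guard in the real script is not shown here.
    """
    if isinstance(tasks, bool):
        # A bare --tasks flag arrives as True rather than as a string.
        raise ValueError("--tasks requires a value, e.g. --tasks=id1,id2")
    if isinstance(tasks, str):
        return [t.strip() for t in tasks.split(",") if t.strip()]
    if isinstance(tasks, (list, tuple)):
        return [str(t) for t in tasks]
    raise ValueError(f"Unsupported --tasks value: {tasks!r}")


print(normalize_tasks("task-a, task-b"))
```

The bool check comes first so that a bare flag fails with a clear message instead of falling through to a generic unsupported-type error.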
