
Commit 4a28653

abrichr and claude authored
fix: detect and dismiss Windows lock screen before each task (#117)
* feat: add correction flywheel (store, capture, parser, controller hooks)

  Implements the correction flywheel MVP:
  - correction_store.py: JSON-file-based correction library with save / find (fuzzy string matching via SequenceMatcher) / load_all
  - correction_capture.py: human correction capture using the openadapt-capture Recorder (primary) with a PIL screenshot fallback
  - correction_parser.py: VLM call to parse before/after screenshots into a PlanStep dict (think/action/expect)
  - demo_controller.py: added correction_store and enable_correction_capture params. On retry exhaustion: check correction store -> inject match, or capture human correction -> parse -> store -> advance
  - cli.py: added --correction-library and --enable-correction-capture flags

  The loop: agent fails at step N -> correction store checked -> if match, inject corrected step -> if no match and capture enabled, human completes step -> Recorder captures -> VLM parses -> correction stored -> next run retrieves it.

  17 tests added, all passing. 54 existing demo_controller tests unaffected.

* fix: mock _has_recorder in correction capture test

  The test was calling the real Recorder, which may not have wait_for_ready in the installed version. Mock it to use the simple fallback path, since this is a unit test.

* fix: detect and dismiss Windows lock screen before each task

  Add _dismiss_lock_screen() to run_dc_eval.py, which checks for the LogonUI.exe process and types the password to unlock if the screen is locked. Called from ensure_waa_ready() after each successful probe.

  This prevents eval failures when the Windows VM has been idle and the lock screen has engaged between tasks or between sessions.

* chore: sync beads state

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent db22f6b commit 4a28653
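The retrieval step of the flywheel hinges on fuzzy matching of failed-step descriptions (the commit message notes SequenceMatcher-based save/find in correction_store.py). A minimal sketch of that lookup follows; `MiniCorrectionStore` and its `threshold` are hypothetical in-memory stand-ins for the real JSON-file-backed store, not its actual API:

```python
# Hypothetical in-memory stand-in for correction_store.py; only the
# fuzzy find() via SequenceMatcher is modeled here.
from __future__ import annotations

from difflib import SequenceMatcher


class MiniCorrectionStore:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self._corrections: dict[str, dict] = {}

    def save(self, failed_action: str, corrected_step: dict) -> None:
        # Key corrections by the failed step's action text
        self._corrections[failed_action] = corrected_step

    def find(self, failed_action: str) -> dict | None:
        # Return the best fuzzy match above the threshold, else None
        best, best_ratio = None, 0.0
        for key, step in self._corrections.items():
            ratio = SequenceMatcher(None, failed_action, key).ratio()
            if ratio > best_ratio:
                best, best_ratio = step, ratio
        return best if best_ratio >= self.threshold else None


store = MiniCorrectionStore()
store.save(
    "Click the Display button",
    {"think": "Sidebar has a Display entry",
     "action": "Click Display in sidebar",
     "expect": "Display pane opens"},
)
match = store.find("Click the Display button.")  # near-identical wording
miss = store.find("Open Bluetooth settings")     # unrelated step
```

On a later run, a hit would be injected as the next plan step instead of exhausting retries; a miss (when capture is enabled) triggers the human-correction path.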

9 files changed: 1,151 additions & 1 deletion

.beads/beads.db

Binary file (0 bytes changed); contents not shown.

.beads/issues.jsonl

Lines changed: 1 addition & 1 deletion
@@ -13,5 +13,5 @@
 {"id":"openadapt-evals-hvm","title":"VL model fix PR #18 ready to merge","notes":"2026-02-08: openadapt-ml PR #18 was already merged on 2026-01-29. VL model fix is done.","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.491938-05:00","created_by":"Richard Abrich","updated_at":"2026-02-08T12:55:19.233249-05:00","closed_at":"2026-02-08T12:55:19.233249-05:00","close_reason":"PR #18 already merged 2026-01-29"}
 {"id":"openadapt-evals-mx8","title":"Analyze evaluation results and publish findings","description":"After demo-conditioned evaluation completes, analyze results: success rates, failure modes, demo impact. Create data-driven roadmap for improvements.","notes":"wright repo (OpenAdaptAI/wright) scaffolding underway. Herald + consilium repos transferred to OpenAdaptAI org. Wright will be the orchestration layer for eval pipeline.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:06.328838-05:00","created_by":"Richard Abrich","updated_at":"2026-03-02T00:08:08.422633-05:00"}
 {"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
-{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"2026-03-06: Core4 Trial 1 launched with --controller --done-gate --max-steps 30 (first ever run with both features). Prior 7 trials showed DC=14% vs ZS=18% — no lift. Root causes: (1) --controller was NEVER used, (2) no done-gate existed. PRs merged this session: #107 (readiness), #109 (core4 lane), #110 (done-gate). Results will be in benchmark_results/repeat_core4_trial1_20260306_154155/","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-03-07T01:44:43.380289-05:00"}
+{"id":"openadapt-evals-vcb","title":"Run demo-conditioned WAA evaluation","description":"Once demos are recorded, run WAA evaluation with demo-conditioned agents (RetrievalAugmentedAgent with real demos). Target: measure improvement over zero-shot baseline. Requires real demos from recording task.","notes":"2026-03-06: Core4 Trial 1 launched with --controller --done-gate --max-steps 30 (first ever run with both features). Prior 7 trials showed DC=14% vs ZS=18% — no lift. Root causes: (1) --controller was NEVER used, (2) no done-gate existed. PRs merged this session: #107 (readiness), #109 (core4 lane), #110 (done-gate). Results will be in benchmark_results/repeat_core4_trial1_20260306_154155/","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-02-14T12:23:04.624305-05:00","created_by":"Richard Abrich","updated_at":"2026-03-08T12:32:50.259805-04:00"}
 {"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}

openadapt_evals/benchmarks/cli.py

Lines changed: 18 additions & 0 deletions
@@ -460,6 +460,18 @@ def cmd_run(args: argparse.Namespace) -> int:
     if use_controller:
         print(f"Using DemoController (max_retries={args.max_retries}, max_replans={args.max_replans})")

+        # Set up correction store if requested
+        correction_store = None
+        enable_correction_capture = getattr(args, "enable_correction_capture", False)
+        correction_library_path = getattr(args, "correction_library", None)
+        if correction_library_path:
+            from openadapt_evals.correction_store import CorrectionStore
+
+            correction_store = CorrectionStore(correction_library_path)
+            print(f"Correction library: {correction_library_path}")
+            if enable_correction_capture:
+                print("Correction capture: ENABLED (will prompt for human corrections on failure)")
+
     # Run evaluation
     if use_controller:
         from openadapt_evals.demo_controller import run_with_controller
@@ -475,6 +487,8 @@ def cmd_run(args: argparse.Namespace) -> int:
                 max_steps=args.max_steps,
                 max_retries=args.max_retries,
                 max_replans=args.max_replans,
+                correction_store=correction_store,
+                enable_correction_capture=enable_correction_capture,
             )
             results.append(result)
         else:
@@ -2432,6 +2446,10 @@ def main() -> int:
     run_parser.add_argument("--focus-check-method", type=str, default="win32",
                             choices=["win32", "a11y", "both"],
                             help="Method for foreground window check: win32 (fast, default), a11y, or both")
+    run_parser.add_argument("--correction-library", type=str, default=None,
+                            help="Path to correction library directory for the correction flywheel")
+    run_parser.add_argument("--enable-correction-capture", action="store_true",
+                            help="Enable HITL correction capture when agent fails (requires --correction-library)")

     # Live evaluation (full control)
     live_parser = subparsers.add_parser("live", help="Run live evaluation against WAA server (full control)")
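The diff reads the new flags via `getattr` with defaults, so a Namespace built before these flags existed still works in `cmd_run`. A standalone sketch of that pattern (the path value here is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
# Same flag shapes as the diff above; argparse maps dashes to underscores
parser.add_argument("--correction-library", type=str, default=None)
parser.add_argument("--enable-correction-capture", action="store_true")

args = parser.parse_args(
    ["--correction-library", "/tmp/corrections", "--enable-correction-capture"]
)

# getattr with a default keeps the call site backward-compatible with
# Namespace objects that lack these attributes entirely
enable_correction_capture = getattr(args, "enable_correction_capture", False)
correction_library_path = getattr(args, "correction_library", None)
```

With no flags on the command line, both `getattr` calls fall through to the argparse defaults (`False` and `None`), and the correction flywheel stays disabled.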
openadapt_evals/correction_capture.py (new file)

Lines changed: 238 additions & 0 deletions

"""Correction capture for the correction flywheel.

Captures a human correction using openadapt-capture's Recorder (primary path)
or falls back to simple periodic screenshots via PIL if openadapt-capture is
not available.

The Recorder provides full input event recording (mouse + keyboard) plus
action-gated screenshots, which gives the VLM parser much richer context
for understanding what the human did.
"""

from __future__ import annotations

import logging
import os
import time
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class CorrectionResult:
    """Result of a correction capture session."""

    screenshots: list[str] = field(default_factory=list)  # paths
    capture_dir: str | None = None  # openadapt-capture directory (if used)
    duration_seconds: float = 0.0
    output_dir: str = ""


def _take_screenshot(output_path: str) -> str | None:
    """Take a screenshot and save to output_path. Returns path or None."""
    try:
        from PIL import ImageGrab

        img = ImageGrab.grab()
        img.save(output_path)
        return output_path
    except Exception as exc:
        logger.warning("Screenshot failed: %s", exc)
        return None


def _has_recorder() -> bool:
    """Check if openadapt-capture Recorder is available."""
    try:
        from openadapt_capture.recorder import Recorder  # noqa: F401

        return True
    except ImportError:
        return False


def _prompt_user(step_desc: str, explanation: str) -> None:
    """Print the correction prompt to the terminal."""
    print("\n" + "=" * 60)
    print("CORRECTION NEEDED")
    print("=" * 60)
    print(f"Failed step: {step_desc}")
    if explanation:
        print(f"Reason: {explanation}")
    print("\nPlease complete this step manually.")
    print("Press Enter when done...")
    print("=" * 60 + "\n")


def _wait_for_enter(timeout_seconds: int) -> None:
    """Block until user presses Enter or timeout expires."""
    try:
        import select
        import sys

        if hasattr(select, "select"):
            remaining = timeout_seconds
            while remaining > 0:
                ready, _, _ = select.select([sys.stdin], [], [], 1.0)
                if ready:
                    sys.stdin.readline()
                    break
                remaining -= 1.0
        else:
            input()
    except EOFError:
        logger.info("stdin closed, stopping capture after timeout")
        time.sleep(min(timeout_seconds, 10))


class CorrectionCapture:
    """Capture a human correction for a failed step."""

    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def capture_correction(
        self,
        failure_context: dict,
        timeout_seconds: int = 120,
        interval_seconds: float = 2.0,
    ) -> CorrectionResult:
        """Capture a human correction.

        Uses openadapt-capture Recorder if available (full input events +
        action-gated screenshots), otherwise falls back to periodic PIL
        screenshots.
        """
        # Save the failure screenshot as "before"
        before_path = os.path.join(self.output_dir, "before.png")
        before_screenshots = []
        if failure_context.get("screenshot_bytes"):
            with open(before_path, "wb") as f:
                f.write(failure_context["screenshot_bytes"])
            before_screenshots.append(before_path)
        elif failure_context.get("screenshot_path"):
            before_screenshots.append(failure_context["screenshot_path"])

        step_desc = failure_context.get("step_action", "this step")
        explanation = failure_context.get("explanation", "")

        _prompt_user(step_desc, explanation)

        if _has_recorder():
            return self._capture_with_recorder(
                before_screenshots, timeout_seconds
            )
        else:
            logger.info("openadapt-capture not available, using simple screenshot capture")
            return self._capture_simple(
                before_screenshots, timeout_seconds, interval_seconds
            )

    def _capture_with_recorder(
        self,
        before_screenshots: list[str],
        timeout_seconds: int,
    ) -> CorrectionResult:
        """Full capture using openadapt-capture Recorder."""
        from openadapt_capture.recorder import Recorder

        capture_dir = os.path.join(self.output_dir, "recording")
        start = time.monotonic()

        with Recorder(
            capture_dir,
            task_description="Human correction for failed agent step",
            capture_video=False,  # screenshots only, faster
            capture_audio=False,
        ) as recorder:
            recorder.wait_for_ready(timeout=30)
            _wait_for_enter(timeout_seconds)
            recorder.stop()

        duration = time.monotonic() - start

        # Extract screenshots from the capture
        screenshot_paths = list(before_screenshots)
        try:
            from openadapt_capture.capture import CaptureSession

            session = CaptureSession.load(capture_dir)
            for i, action in enumerate(session.actions()):
                if action.screenshot is not None:
                    path = os.path.join(self.output_dir, f"action_{i:04d}.png")
                    action.screenshot.save(path)
                    screenshot_paths.append(path)
        except Exception as exc:
            logger.warning("Failed to extract screenshots from capture: %s", exc)
            # Fall back to taking a final screenshot
            after_path = os.path.join(self.output_dir, "after.png")
            taken = _take_screenshot(after_path)
            if taken:
                screenshot_paths.append(taken)

        logger.info(
            "Recorder capture complete: %d screenshots in %.1fs",
            len(screenshot_paths),
            duration,
        )
        return CorrectionResult(
            screenshots=screenshot_paths,
            capture_dir=capture_dir,
            duration_seconds=duration,
            output_dir=self.output_dir,
        )

    def _capture_simple(
        self,
        before_screenshots: list[str],
        timeout_seconds: int,
        interval_seconds: float,
    ) -> CorrectionResult:
        """Fallback: periodic PIL screenshots."""
        import threading

        start = time.monotonic()
        stop_event = threading.Event()
        screenshot_paths: list[str] = []

        def _capture_loop():
            idx = 0
            while not stop_event.is_set():
                stop_event.wait(interval_seconds)
                if stop_event.is_set():
                    break
                path = os.path.join(self.output_dir, f"capture_{idx:04d}.png")
                taken = _take_screenshot(path)
                if taken:
                    screenshot_paths.append(taken)
                idx += 1

        capture_thread = threading.Thread(target=_capture_loop, daemon=True)
        capture_thread.start()

        _wait_for_enter(timeout_seconds)

        stop_event.set()
        capture_thread.join(timeout=5)

        # Final "after" screenshot
        after_path = os.path.join(self.output_dir, "after.png")
        taken = _take_screenshot(after_path)
        if taken:
            screenshot_paths.append(taken)

        all_screenshots = list(before_screenshots) + screenshot_paths
        duration = time.monotonic() - start

        logger.info(
            "Simple capture complete: %d screenshots in %.1fs",
            len(all_screenshots),
            duration,
        )
        return CorrectionResult(
            screenshots=all_screenshots,
            duration_seconds=duration,
            output_dir=self.output_dir,
        )
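The `_capture_simple` fallback above is a standard stop-event polling loop. The pattern can be exercised in isolation with the screenshot call stubbed out; `periodic_capture` and `take_shot` here are illustrative names, and the stub simply returns the path instead of saving an image:

```python
import threading
import time


def periodic_capture(take_shot, interval: float, stop_event: threading.Event) -> list[str]:
    """Collect screenshot paths every `interval` seconds until stop_event is set."""
    paths = []
    idx = 0
    while not stop_event.is_set():
        # Event.wait doubles as an interruptible sleep: it returns early
        # the moment stop_event is set, so shutdown is prompt
        stop_event.wait(interval)
        if stop_event.is_set():
            break
        paths.append(take_shot(f"capture_{idx:04d}.png"))
        idx += 1
    return paths


stop = threading.Event()
results: list[str] = []
worker = threading.Thread(
    target=lambda: results.extend(periodic_capture(lambda p: p, 0.05, stop)),
    daemon=True,
)
worker.start()
time.sleep(0.3)  # let the worker run for a few intervals
stop.set()
worker.join(timeout=1)
```

Using `Event.wait(interval)` instead of `time.sleep(interval)` is what lets the real capture thread stop within one tick of the user pressing Enter.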
openadapt_evals/correction_parser.py (new file)

Lines changed: 86 additions & 0 deletions

"""Parse a human correction capture into a PlanStep.

Uses a VLM call to compare before/after screenshots and describe what
the human did in the same format as a plan step (think/action/expect).
"""

from __future__ import annotations

import json
import logging
import os

from openadapt_evals.vlm import vlm_call

logger = logging.getLogger(__name__)

_PARSE_PROMPT = """\
The agent was trying to perform a step but failed. A human then completed the step manually.

Failed step description: {step_action}
Failure explanation: {failure_explanation}

Compare the BEFORE screenshot (when the agent failed) and the AFTER screenshot \
(after the human completed the step). Describe what the human did to complete the step.

Respond in this exact JSON format:
{{
  "think": "reasoning about what needed to happen and why the agent failed",
  "action": "concrete description of what the human did (e.g., 'Click the Display button in the left sidebar')",
  "expect": "what the screen looks like after the action"
}}

Respond with ONLY the JSON object, no other text."""


def parse_correction(
    step_action: str,
    failure_explanation: str,
    before_screenshot: bytes,
    after_screenshot: bytes,
    model: str = "gpt-4.1-mini",
    provider: str = "openai",
) -> dict:
    """Parse before/after screenshots into a PlanStep dict.

    Returns dict with keys: think, action, expect.
    """
    prompt = _PARSE_PROMPT.format(
        step_action=step_action,
        failure_explanation=failure_explanation,
    )

    response = vlm_call(
        prompt,
        images=[before_screenshot, after_screenshot],
        model=model,
        provider=provider,
        max_tokens=512,
    )

    # Extract JSON from response
    try:
        # Try direct parse first
        result = json.loads(response)
    except json.JSONDecodeError:
        # Try to find JSON in the response
        import re

        match = re.search(r"\{[^}]+\}", response, re.DOTALL)
        if match:
            result = json.loads(match.group())
        else:
            logger.error("Failed to parse VLM response as JSON: %s", response[:200])
            result = {
                "think": f"Human corrected the step: {step_action}",
                "action": step_action,
                "expect": "Step completed successfully",
            }

    # Ensure required keys exist
    for key in ("think", "action", "expect"):
        if key not in result:
            result[key] = ""

    logger.info("Parsed correction: action=%s", result["action"][:80])
    return result
