OpenAdaptAI
diff --git a/docs/waa_execution_parity_plan.md b/docs/waa_execution_parity_plan.md
Lines changed: 72 additions & 0 deletions
diff --git a/openadapt_evals/benchmarks/cli.py b/openadapt_evals/benchmarks/cli.py
Lines changed: 39 additions & 6 deletions
diff --git a/openadapt_evals/benchmarks/runner.py b/openadapt_evals/benchmarks/runner.py
Lines changed: 99 additions & 3 deletions
@@ -0,0 +1,72 @@
+# WAA Execution-Core Parity Plan
+
+Date: 2026-03-06  
+Scope: `openadapt-evals` infrastructure + adapter execution path
+
+## Target Definition
+
+Execution-core parity means our eval loop has the same practical reliability envelope as vanilla WAA for a fixed task set:
+
+- deterministic environment reset
+- stable observation/action/evaluate endpoints
+- no recurring transport timeout cascades during repeated trials
+- reproducible run commands and health gates
+
+## Current State
+
+Not there yet.
+
+What is now in place:
+
+- Multi-sample health gate before runs (`run_dc_eval.py`):
+  - requires repeated success across `probe + screenshot + a11y + execute + evaluate`
+  - defaults: `3` samples, `2` required successes
+- Resume conversion for stale safe-mode checkpoints (`record_waa_demos.py`):
+  - non-safe-mode resume no longer gets stuck in manual placeholder flow
+  - AI step plan is regenerated from the live screen
+- A11y backend adaptation and focus-check correctness fixes already landed in prior stabilization edits.
+
+## Remaining Gaps vs Upstream-Style Reliability
+
+1. Multi-hop topology risk remains (local tunnel + VM host + container).
+2. Screenshot path still relies on Flask `/screenshot` endpoint.
+3. UIA remains intermittently slow/unresponsive in this environment.
+4. No explicit soak gate yet that blocks promotion unless repeated trials are transport-clean.
+
+## Parity Plan (Phased)
+
+### Phase 1: Gate and Detect (in progress)
+
+- Keep strict pre-run health gate enabled by default.
+- Add run-level pass criteria:
+  - no infra fail-fast terminations
+  - no transport timeout cascades in execution logs
+- Artifact: per-run parity summary in `STATUS.md` weekly section.
+
+### Phase 2: Stabilize Substrate
+
+- Add upstream-aligned screenshot fallback path (QMP/`obs_winagent`-based where available).
+- Keep win32 as the default a11y backend except for tasks that explicitly require UIA.
+- Ensure reset path is idempotent under LibreOffice recovery edge cases.
+
+### Phase 3: Prove with Repeats
+
+- Core4 soak: 3 repeated trials (ZS+DC) with zero infra aborts.
+- Then full hard-12 repeated trials (minimum 3/task/condition).
+- Publish north-star delta from only transport-clean runs.
+
+### Phase 4: Decision Gate
+
+- By April 15, 2026:
+  - If quantitative lift is clear and infra is stable, continue.
+  - If lift is unclear, pivot to narrower retrieval-layer productization.
+
+## Acceptance Criteria ("There Yet")
+
+We consider execution-core parity achieved when all of the following hold for two consecutive evaluation days:
+
+1. Core4 repeated trials complete without infra fail-fast aborts.
+2. No repeated transport timeout cascades in task execution logs.
+3. Health gate passes consistently without manual tunnel/container intervention.
+4. Demo recording/resume flows are deterministic (no stale safe-mode/manual-step traps).
+
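The multi-sample health gate described in the plan (defaults: `3` samples, `2` required successes) could look roughly like the sketch below. It is not the code in `run_dc_eval.py`; `health_gate` and the `checks` callables are hypothetical names, and treating any exception as a failed sample is an assumption.

```python
from typing import Callable, Sequence


def health_gate(
    checks: Sequence[Callable[[], bool]],
    samples: int = 3,
    required: int = 2,
) -> bool:
    """Return True once `required` of up to `samples` rounds pass every check.

    Hypothetical helper: each callable stands in for one endpoint probe
    (probe, screenshot, a11y, execute, evaluate).
    """
    if not 0 < required <= samples:
        raise ValueError("required must be between 1 and samples")

    successes = 0
    for _ in range(samples):
        try:
            # A round only counts if every check in it succeeds.
            ok = all(check() for check in checks)
        except Exception:
            # Assumption: a transport error counts as a failed sample.
            ok = False
        if ok:
            successes += 1
        if successes >= required:
            return True
    return False
```

Requiring 2 of 3 rather than 3 of 3 tolerates a single transient failure while still rejecting an endpoint that fails most of the time.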
@@ -271,12 +271,18 @@ def cmd_mock(args: argparse.Namespace) -> int:
         return 1
 
     # Create config for trace collection
+    done_gate = getattr(args, "done_gate", False)
+    done_gate_max_overrides = getattr(args, "done_gate_max_overrides", 3)
+    done_gate_threshold = getattr(args, "done_gate_threshold", 1.0)
     config = None
-    if args.output:
+    if args.output or done_gate:
         config = EvaluationConfig(
-            save_execution_traces=True,
-            output_dir=args.output,
+            save_execution_traces=bool(args.output),
+            output_dir=args.output or "benchmark_results",
             run_name=args.run_name or "mock_eval",
+            done_gate=done_gate,
+            done_gate_max_overrides=done_gate_max_overrides,
+            done_gate_threshold=done_gate_threshold,
         )
 
     # Run evaluation
@@ -441,6 +447,9 @@ def cmd_run(args: argparse.Namespace) -> int:
         save_execution_traces=True,
         output_dir=args.output,
         run_name=args.run_name,
+        done_gate=getattr(args, "done_gate", False),
+        done_gate_max_overrides=getattr(args, "done_gate_max_overrides", 3),
+        done_gate_threshold=getattr(args, "done_gate_threshold", 1.0),
     )
 
     print(f"Running {len(task_ids)} task(s): {', '.join(task_ids)}")
@@ -658,12 +667,18 @@ def cmd_live(args: argparse.Namespace) -> int:
         return 1
 
     # Create config for trace collection
+    done_gate = getattr(args, "done_gate", False)
+    done_gate_max_overrides = getattr(args, "done_gate_max_overrides", 3)
+    done_gate_threshold = getattr(args, "done_gate_threshold", 1.0)
     eval_config = None
-    if args.output:
+    if args.output or done_gate:
         eval_config = EvaluationConfig(
-            save_execution_traces=True,
-            output_dir=args.output,
+            save_execution_traces=bool(args.output),
+            output_dir=args.output or "benchmark_results",
             run_name=args.run_name or "live_eval",
+            done_gate=done_gate,
+            done_gate_max_overrides=done_gate_max_overrides,
+            done_gate_threshold=done_gate_threshold,
         )
 
     # Load tasks
@@ -2357,6 +2372,12 @@ def main() -> int:
     mock_parser.add_argument("--use-a11y-tree", action="store_true", help="Enable accessibility tree grounding for Qwen3VL")
     mock_parser.add_argument("--output", type=str, help="Output directory for traces")
     mock_parser.add_argument("--run-name", type=str, help="Name for this evaluation run")
+    mock_parser.add_argument("--done-gate", action="store_true",
+                            help="Verify task completion before accepting agent's 'done' signal")
+    mock_parser.add_argument("--done-gate-max-overrides", type=int, default=3,
+                            help="Max times to override premature 'done' (default: 3)")
+    mock_parser.add_argument("--done-gate-threshold", type=float, default=1.0,
+                            help="Minimum score to accept 'done' (default: 1.0)")
 
     # Simplified run command (recommended for live evaluation)
     run_parser = subparsers.add_parser(
@@ -2399,6 +2420,12 @@ def main() -> int:
                            help="Force network/audio tray icons visible for stable click-coordinate tasks")
     run_parser.add_argument("--waa-image-version", type=str, default=None,
                            help="Pinned WAA image version label to record in run metadata")
+    run_parser.add_argument("--done-gate", action="store_true",
+                           help="Verify task completion before accepting agent's 'done' signal")
+    run_parser.add_argument("--done-gate-max-overrides", type=int, default=3,
+                           help="Max times to override premature 'done' (default: 3)")
+    run_parser.add_argument("--done-gate-threshold", type=float, default=1.0,
+                           help="Minimum score to accept 'done' (default: 1.0)")
 
     # Live evaluation (full control)
     live_parser = subparsers.add_parser("live", help="Run live evaluation against WAA server (full control)")
@@ -2427,6 +2454,12 @@ def main() -> int:
                             help="Force network/audio tray icons visible for stable click-coordinate tasks")
     live_parser.add_argument("--waa-image-version", type=str, default=None,
                             help="Pinned WAA image version label to record in run metadata")
+    live_parser.add_argument("--done-gate", action="store_true",
+                            help="Verify task completion before accepting agent's 'done' signal")
+    live_parser.add_argument("--done-gate-max-overrides", type=int, default=3,
+                            help="Max times to override premature 'done' (default: 3)")
+    live_parser.add_argument("--done-gate-threshold", type=float, default=1.0,
+                            help="Minimum score to accept 'done' (default: 1.0)")
 
     # Probe server
     probe_parser = subparsers.add_parser("probe", help="Check if WAA server is reachable")
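The three `--done-gate*` flags are registered identically on `mock_parser`, `run_parser`, and `live_parser`, and read back with the same `getattr` defaults in each command. One way to keep them in sync is a shared helper, sketched below. `add_done_gate_args` and `done_gate_kwargs` are hypothetical names that do not exist in this PR; only the flag names, types, defaults, and help text come from the diff.

```python
import argparse


def add_done_gate_args(parser: argparse.ArgumentParser) -> None:
    """Register the done-gate flags on one subparser (hypothetical helper)."""
    parser.add_argument("--done-gate", action="store_true",
                        help="Verify task completion before accepting agent's 'done' signal")
    parser.add_argument("--done-gate-max-overrides", type=int, default=3,
                        help="Max times to override premature 'done' (default: 3)")
    parser.add_argument("--done-gate-threshold", type=float, default=1.0,
                        help="Minimum score to accept 'done' (default: 1.0)")


def done_gate_kwargs(args: argparse.Namespace) -> dict:
    """Collect done-gate settings, falling back to defaults if a flag is absent."""
    return {
        "done_gate": getattr(args, "done_gate", False),
        "done_gate_max_overrides": getattr(args, "done_gate_max_overrides", 3),
        "done_gate_threshold": getattr(args, "done_gate_threshold", 1.0),
    }


# Demo parser with the same three subcommands that receive the flags in the diff.
parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers(dest="command")
for name in ("mock", "run", "live"):
    add_done_gate_args(subparsers.add_parser(name))
```

With such a helper, the resulting dictionary could be splatted into the config constructor (`EvaluationConfig(..., **done_gate_kwargs(args))`), so a changed default would only need to be edited in one place.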
 
@@ -14,6 +14,7 @@
 
 from __future__ import annotations
 
+import copy
 import logging
 import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
@@ -58,6 +59,9 @@ class EvaluationConfig:
         run_name: Name for this evaluation run.
         enable_live_tracking: Whether to enable live evaluation progress tracking.
         live_tracking_file: Path to live tracking JSON file.
+        done_gate: Whether to verify task completion before accepting agent's "done".
+        done_gate_max_overrides: Max times to override a premature "done" (default 3).
+        done_gate_threshold: Minimum score to accept "done" (default 1.0).
     """
 
     max_steps: int = 50
@@ -72,6 +76,9 @@ class EvaluationConfig:
     run_name: str | None = None
     enable_live_tracking: bool = True
     live_tracking_file: str = "benchmark_live.json"
+    done_gate: bool = False
+    done_gate_max_overrides: int = 3
+    done_gate_threshold: float = 1.0
 
 
 def evaluate_agent_on_benchmark(
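The new `EvaluationConfig` fields are accepted without range checks. If validation were wanted, it might look like the sketch below. `DoneGateSettings` is a hypothetical stand-in, not the real `EvaluationConfig`, and the `[0, 1]` bound on the threshold is an assumption inferred from the continuation message formatting the score as a percentage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoneGateSettings:
    """Hypothetical stand-in for the three done-gate fields on EvaluationConfig."""

    done_gate: bool = False
    done_gate_max_overrides: int = 3
    done_gate_threshold: float = 1.0

    def __post_init__(self) -> None:
        if self.done_gate_max_overrides < 0:
            raise ValueError("done_gate_max_overrides must be >= 0")
        # Assumption: evaluator scores lie in [0, 1].
        if not 0.0 <= self.done_gate_threshold <= 1.0:
            raise ValueError("done_gate_threshold must be within [0, 1]")
```

Note that in the runner hunk each rejected "done" also increments `steps`, so a large override budget is still bounded by `max_steps`.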
@@ -319,6 +326,7 @@ def _run_single_task(
         done = False
         steps = 0
         action = None
+        done_gate_overrides = 0
         max_steps = task.time_limit_steps or config.max_steps
 
         while not done and steps < max_steps:
@@ -367,10 +375,98 @@ def _run_single_task(
             if action.type in ("done", "error"):
                 if action.type == "error":
                     logger.error(f"Step {steps}: Agent error: {action.raw_action}")
+                    done = True
+                    break
+
+                # Agent says "done" — apply done-gate if enabled
+                logger.info(f"Step {steps}: Agent signaled task completion")
+
+                if (
+                    config.done_gate
+                    and done_gate_overrides < config.done_gate_max_overrides
+                ):
+                    logger.info(
+                        f"Step {steps}: Done-gate active — evaluating task "
+                        f"(override {done_gate_overrides + 1}/{config.done_gate_max_overrides})"
+                    )
+                    try:
+                        gate_result = adapter.evaluate(task)
+                        gate_score = gate_result.score
+                    except Exception as e:
+                        logger.warning(
+                            f"Step {steps}: Done-gate evaluation failed: {e}. "
+                            "Accepting 'done' to avoid infinite loop."
+                        )
+                        done = True
+                        break
+
+                    if gate_score >= config.done_gate_threshold:
+                        logger.info(
+                            f"Step {steps}: Done-gate PASSED "
+                            f"(score={gate_score:.2f} >= {config.done_gate_threshold:.2f})"
+                        )
+                        done = True
+                        break
+
+                    # Override the premature "done"
+                    done_gate_overrides += 1
+                    logger.warning(
+                        f"Step {steps}: Done-gate REJECTED premature 'done' "
+                        f"(score={gate_score:.2f} < {config.done_gate_threshold:.2f}, "
+                        f"override {done_gate_overrides}/{config.done_gate_max_overrides})"
+                    )
+
+                    # Modify the task instruction to tell the agent to continue.
+                    # Strip any previous done-gate message before appending the new one.
+                    _DONE_GATE_MARKER = "\n\n[SYSTEM: The task is NOT yet complete"
+                    continuation_msg = (
+                        "\n\n[SYSTEM: The task is NOT yet complete based on automated "
+                        "evaluation (score: {score:.0%}). Your previous 'done' signal "
+                        "was overridden ({n}/{max}). Please examine the current screen "
+                        "carefully and continue working on the task. Do NOT declare "
+                        "'done' unless the task is truly finished.]"
+                    ).format(
+                        score=gate_score,
+                        n=done_gate_overrides,
+                        max=config.done_gate_max_overrides,
+                    )
+
+                    # Create a modified task with continuation message
+                    task = copy.copy(task)
+                    # Remove previous done-gate message if present
+                    marker_idx = task.instruction.find(_DONE_GATE_MARKER)
+                    if marker_idx >= 0:
+                        task.instruction = task.instruction[:marker_idx]
+                    task.instruction = task.instruction + continuation_msg
+
+                    # Get a fresh observation for the agent's next step
+                    # Use a no-op key press to trigger a new screenshot
+                    try:
+                        noop_action = BenchmarkAction(type="key", key="")
+                        obs, env_done, _info = adapter.step(noop_action)
+                        if env_done:
+                            logger.info(
+                                f"Step {steps}: Environment signaled done during "
+                                "done-gate screenshot refresh"
+                            )
+                            done = True
+                            break
+                    except Exception as e:
+                        logger.warning(
+                            f"Step {steps}: Failed to get fresh observation "
+                            f"after done-gate override: {e}. Using previous obs."
+                        )
+
+                    steps += 1
+                    continue
                 else:
-                    logger.info(f"Step {steps}: Agent signaled task completion")
-                done = True
-                break
+                    if config.done_gate and done_gate_overrides >= config.done_gate_max_overrides:
+                        logger.warning(
+                            f"Step {steps}: Done-gate max overrides reached "
+                            f"({config.done_gate_max_overrides}). Accepting 'done'."
+                        )
+                    done = True
+                    break
 
             # Execute action
             try:
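The control flow of the done-gate hunk above can be exercised in isolation with a small simulation. This is a simplified sketch under stated assumptions, not the runner itself: `Task`, `with_continuation`, and `run_with_done_gate` are hypothetical names, actions are plain strings, the continuation text is shortened, and the screenshot refresh after an override is omitted.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Tuple

# Same prefix the hunk uses to find and strip an earlier continuation message.
MARKER = "\n\n[SYSTEM: The task is NOT yet complete"


@dataclass
class Task:
    instruction: str


def with_continuation(task: Task, score: float, n: int, max_n: int) -> Task:
    """Return a copy whose instruction carries exactly one continuation note."""
    base = task.instruction
    idx = base.find(MARKER)
    if idx >= 0:
        base = base[:idx]  # drop the note left by a previous override
    note = f"{MARKER} (score: {score:.0%}, override {n}/{max_n}).]"
    return replace(task, instruction=base + note)


def run_with_done_gate(
    actions: Iterable[str],
    evaluate: Callable[[Task], float],
    task: Task,
    max_overrides: int = 3,
    threshold: float = 1.0,
    max_steps: int = 50,
) -> Tuple[Task, int, int]:
    """Replay `actions`, rejecting 'done' while the score is below threshold."""
    overrides = 0
    steps = 0
    for action in actions:
        if steps >= max_steps:
            break
        if action != "done":
            steps += 1  # an ordinary action consumes one step
            continue
        if overrides >= max_overrides:
            break  # override budget exhausted: accept 'done'
        try:
            score = evaluate(task)
        except Exception:
            break  # evaluation failed: accept 'done' rather than loop
        if score >= threshold:
            break  # gate passed
        overrides += 1
        task = with_continuation(task, score, overrides, max_overrides)
        steps += 1  # a rejected 'done' also consumes one step
    return task, overrides, steps
```

Stripping the previous note before appending a new one keeps the instruction from growing with each override, which matches the intent of the `marker_idx` logic in the hunk.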