Commit 65714ad

abrichr and claude authored
feat: add done-gate to prevent premature task completion (#110)
* feat: add done-gate to prevent agents from prematurely declaring task complete

  When enabled via --done-gate, the evaluation runner calls adapter.evaluate() when the agent signals "done" to verify the task is actually complete. If the score is below the threshold (default 1.0), the runner overrides the "done" signal, appends a continuation message to the task instruction, and lets the agent continue. Limited to a configurable max overrides (default 3) to prevent infinite loops.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add core4 trial wrapper, north-star updater, and parity plan doc

  - core4_eval.py: deterministic wrapper for running repeated Core4 trials
  - update_weekly_north_star.py: compute hard-task success rates for STATUS.md
  - waa_execution_parity_plan.md: phased plan for WAA execution reliability

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9de5f39 commit 65714ad

6 files changed

Lines changed: 722 additions & 9 deletions

docs/waa_execution_parity_plan.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+# WAA Execution-Core Parity Plan
+
+Date: 2026-03-06
+Scope: `openadapt-evals` infrastructure + adapter execution path
+
+## Target Definition
+
+Execution-core parity means our eval loop has the same practical reliability envelope as vanilla WAA for a fixed task set:
+
+- deterministic environment reset
+- stable observation/action/evaluate endpoints
+- no recurring transport timeout cascades during repeated trials
+- reproducible run commands and health gates
+
+## Current State
+
+Not there yet.
+
+What is now in place:
+
+- Multi-sample health gate before runs (`run_dc_eval.py`):
+  - requires repeated success across `probe + screenshot + a11y + execute + evaluate`
+  - defaults: `3` samples, `2` required successes
+- Resume conversion for stale safe-mode checkpoints (`record_waa_demos.py`):
+  - non-safe-mode resume no longer gets stuck in manual placeholder flow
+  - AI step plan is regenerated from the live screen
+- A11y backend adaptation and focus-check correctness were already landed in prior stabilization edits.
+
+## Remaining Gaps vs Upstream-Style Reliability
+
+1. Multi-hop topology risk remains (local tunnel + VM host + container).
+2. Screenshot path still relies on Flask `/screenshot` endpoint.
+3. UIA remains intermittently slow/unresponsive in this environment.
+4. No explicit soak gate yet that blocks promotion unless repeated trials are transport-clean.
+
+## Parity Plan (Phased)
+
+### Phase 1: Gate and Detect (in progress)
+
+- Keep strict pre-run health gate enabled by default.
+- Add run-level pass criteria:
+  - no infra fail-fast terminations
+  - no transport timeout cascades in execution logs
+- Artifact: per-run parity summary in `STATUS.md` weekly section.
+
+### Phase 2: Stabilize Substrate
+
+- Add upstream-aligned screenshot fallback path (QMP/`obs_winagent`-based where available).
+- Keep win32 as default a11y backend except tasks that explicitly require UIA.
+- Ensure reset path is idempotent under LibreOffice recovery edge cases.
+
+### Phase 3: Prove with Repeats
+
+- Core4 soak: 3 repeated trials (ZS+DC) with zero infra aborts.
+- Then full hard-12 repeated trials (minimum 3/task/condition).
+- Publish north-star delta from only transport-clean runs.
+
+### Phase 4: Decision Gate
+
+- By April 15, 2026:
+  - If quantitative lift is clear and infra is stable, continue.
+  - If lift is unclear, pivot to narrower retrieval-layer productization.
+
+## Acceptance Criteria ("There Yet")
+
+We consider execution-core parity achieved when all are true for two consecutive evaluation days:
+
+1. Core4 repeated trials complete without infra fail-fast aborts.
+2. No repeated transport timeout cascades in task execution logs.
+3. Health gate passes consistently without manual tunnel/container intervention.
+4. Demo recording/resume flows are deterministic (no stale safe-mode/manual-step traps).
+
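The plan's multi-sample health gate (defaults: `3` samples, `2` required successes) can be illustrated with a minimal sketch. This is not the code in `run_dc_eval.py`, which is not part of this diff; the function name `health_gate` and the idea of passing each probe as a zero-argument callable are assumptions made for illustration only.

```python
from typing import Callable, Sequence


def health_gate(
    checks: Sequence[Callable[[], bool]],
    samples: int = 3,
    required: int = 2,
) -> bool:
    """Return True if at least `required` of `samples` rounds pass every check.

    Hypothetical helper: each entry in `checks` stands in for one probe
    (probe, screenshot, a11y, execute, evaluate). A round counts as a
    success only if all checks return True; an exception counts as a failure.
    """
    successes = 0
    for _ in range(samples):
        try:
            ok = all(check() for check in checks)
        except Exception:
            ok = False
        if ok:
            successes += 1
    return successes >= required
```

Requiring 2 of 3 rounds rather than 3 of 3 tolerates a single transient failure while still rejecting an endpoint that fails most of the time.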

openadapt_evals/benchmarks/cli.py

Lines changed: 39 additions & 6 deletions
@@ -271,12 +271,18 @@ def cmd_mock(args: argparse.Namespace) -> int:
         return 1
 
     # Create config for trace collection
+    done_gate = getattr(args, "done_gate", False)
+    done_gate_max_overrides = getattr(args, "done_gate_max_overrides", 3)
+    done_gate_threshold = getattr(args, "done_gate_threshold", 1.0)
     config = None
-    if args.output:
+    if args.output or done_gate:
         config = EvaluationConfig(
-            save_execution_traces=True,
-            output_dir=args.output,
+            save_execution_traces=bool(args.output),
+            output_dir=args.output or "benchmark_results",
             run_name=args.run_name or "mock_eval",
+            done_gate=done_gate,
+            done_gate_max_overrides=done_gate_max_overrides,
+            done_gate_threshold=done_gate_threshold,
         )
 
     # Run evaluation
@@ -441,6 +447,9 @@ def cmd_run(args: argparse.Namespace) -> int:
         save_execution_traces=True,
         output_dir=args.output,
         run_name=args.run_name,
+        done_gate=getattr(args, "done_gate", False),
+        done_gate_max_overrides=getattr(args, "done_gate_max_overrides", 3),
+        done_gate_threshold=getattr(args, "done_gate_threshold", 1.0),
     )
 
     print(f"Running {len(task_ids)} task(s): {', '.join(task_ids)}")
@@ -658,12 +667,18 @@ def cmd_live(args: argparse.Namespace) -> int:
         return 1
 
     # Create config for trace collection
+    done_gate = getattr(args, "done_gate", False)
+    done_gate_max_overrides = getattr(args, "done_gate_max_overrides", 3)
+    done_gate_threshold = getattr(args, "done_gate_threshold", 1.0)
     eval_config = None
-    if args.output:
+    if args.output or done_gate:
         eval_config = EvaluationConfig(
-            save_execution_traces=True,
-            output_dir=args.output,
+            save_execution_traces=bool(args.output),
+            output_dir=args.output or "benchmark_results",
             run_name=args.run_name or "live_eval",
+            done_gate=done_gate,
+            done_gate_max_overrides=done_gate_max_overrides,
+            done_gate_threshold=done_gate_threshold,
         )
 
     # Load tasks
@@ -2357,6 +2372,12 @@ def main() -> int:
     mock_parser.add_argument("--use-a11y-tree", action="store_true", help="Enable accessibility tree grounding for Qwen3VL")
     mock_parser.add_argument("--output", type=str, help="Output directory for traces")
     mock_parser.add_argument("--run-name", type=str, help="Name for this evaluation run")
+    mock_parser.add_argument("--done-gate", action="store_true",
+                             help="Verify task completion before accepting agent's 'done' signal")
+    mock_parser.add_argument("--done-gate-max-overrides", type=int, default=3,
+                             help="Max times to override premature 'done' (default: 3)")
+    mock_parser.add_argument("--done-gate-threshold", type=float, default=1.0,
+                             help="Minimum score to accept 'done' (default: 1.0)")
 
     # Simplified run command (recommended for live evaluation)
     run_parser = subparsers.add_parser(
@@ -2399,6 +2420,12 @@ def main() -> int:
                             help="Force network/audio tray icons visible for stable click-coordinate tasks")
     run_parser.add_argument("--waa-image-version", type=str, default=None,
                             help="Pinned WAA image version label to record in run metadata")
+    run_parser.add_argument("--done-gate", action="store_true",
+                            help="Verify task completion before accepting agent's 'done' signal")
+    run_parser.add_argument("--done-gate-max-overrides", type=int, default=3,
+                            help="Max times to override premature 'done' (default: 3)")
+    run_parser.add_argument("--done-gate-threshold", type=float, default=1.0,
+                            help="Minimum score to accept 'done' (default: 1.0)")
 
     # Live evaluation (full control)
     live_parser = subparsers.add_parser("live", help="Run live evaluation against WAA server (full control)")
@@ -2427,6 +2454,12 @@ def main() -> int:
                              help="Force network/audio tray icons visible for stable click-coordinate tasks")
     live_parser.add_argument("--waa-image-version", type=str, default=None,
                              help="Pinned WAA image version label to record in run metadata")
+    live_parser.add_argument("--done-gate", action="store_true",
+                             help="Verify task completion before accepting agent's 'done' signal")
+    live_parser.add_argument("--done-gate-max-overrides", type=int, default=3,
+                             help="Max times to override premature 'done' (default: 3)")
+    live_parser.add_argument("--done-gate-threshold", type=float, default=1.0,
+                             help="Minimum score to accept 'done' (default: 1.0)")
 
     # Probe server
     probe_parser = subparsers.add_parser("probe", help="Check if WAA server is reachable")
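The same three `--done-gate*` flags are registered on the `mock`, `run`, and `live` subparsers. As a sketch of how these flags parse, and of one possible way to declare them once, the example below uses a hypothetical helper `add_done_gate_args`; that helper and `build_parser` are illustrative names, not functions in `cli.py`. The flag names, types, and defaults are copied from the diff.

```python
import argparse


def add_done_gate_args(parser: argparse.ArgumentParser) -> None:
    """Register the done-gate flags on one parser (hypothetical helper)."""
    parser.add_argument("--done-gate", action="store_true",
                        help="Verify task completion before accepting agent's 'done' signal")
    parser.add_argument("--done-gate-max-overrides", type=int, default=3,
                        help="Max times to override premature 'done' (default: 3)")
    parser.add_argument("--done-gate-threshold", type=float, default=1.0,
                        help="Minimum score to accept 'done' (default: 1.0)")


def build_parser() -> argparse.ArgumentParser:
    """Build a toy CLI with the three subcommands that accept the flags."""
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers(dest="command")
    for name in ("mock", "run", "live"):
        add_done_gate_args(subparsers.add_parser(name))
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["run", "--done-gate", "--done-gate-threshold", "0.5"])
    # argparse maps "--done-gate-max-overrides" to the attribute done_gate_max_overrides.
    print(args.done_gate, args.done_gate_max_overrides, args.done_gate_threshold)
```

Because argparse always sets these attributes when the flags are registered, the `getattr(args, "done_gate", False)` fallbacks in the diff only matter when a command function is called with a namespace that was built without them.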

openadapt_evals/benchmarks/runner.py

Lines changed: 99 additions & 3 deletions
@@ -14,6 +14,7 @@
 
 from __future__ import annotations
 
+import copy
 import logging
 import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
@@ -58,6 +59,9 @@ class EvaluationConfig:
         run_name: Name for this evaluation run.
         enable_live_tracking: Whether to enable live evaluation progress tracking.
         live_tracking_file: Path to live tracking JSON file.
+        done_gate: Whether to verify task completion before accepting agent's "done".
+        done_gate_max_overrides: Max times to override a premature "done" (default 3).
+        done_gate_threshold: Minimum score to accept "done" (default 1.0).
     """
 
     max_steps: int = 50
@@ -72,6 +76,9 @@ class EvaluationConfig:
     run_name: str | None = None
     enable_live_tracking: bool = True
     live_tracking_file: str = "benchmark_live.json"
+    done_gate: bool = False
+    done_gate_max_overrides: int = 3
+    done_gate_threshold: float = 1.0
 
 
 def evaluate_agent_on_benchmark(
@@ -319,6 +326,7 @@ def _run_single_task(
         done = False
         steps = 0
         action = None
+        done_gate_overrides = 0
         max_steps = task.time_limit_steps or config.max_steps
 
         while not done and steps < max_steps:
@@ -367,10 +375,98 @@ def _run_single_task(
             if action.type in ("done", "error"):
                 if action.type == "error":
                     logger.error(f"Step {steps}: Agent error: {action.raw_action}")
+                    done = True
+                    break
+
+                # Agent says "done" — apply done-gate if enabled
+                logger.info(f"Step {steps}: Agent signaled task completion")
+
+                if (
+                    config.done_gate
+                    and done_gate_overrides < config.done_gate_max_overrides
+                ):
+                    logger.info(
+                        f"Step {steps}: Done-gate active — evaluating task "
+                        f"(override {done_gate_overrides + 1}/{config.done_gate_max_overrides})"
+                    )
+                    try:
+                        gate_result = adapter.evaluate(task)
+                        gate_score = gate_result.score
+                    except Exception as e:
+                        logger.warning(
+                            f"Step {steps}: Done-gate evaluation failed: {e}. "
+                            "Accepting 'done' to avoid infinite loop."
+                        )
+                        done = True
+                        break
+
+                    if gate_score >= config.done_gate_threshold:
+                        logger.info(
+                            f"Step {steps}: Done-gate PASSED "
+                            f"(score={gate_score:.2f} >= {config.done_gate_threshold:.2f})"
+                        )
+                        done = True
+                        break
+
+                    # Override the premature "done"
+                    done_gate_overrides += 1
+                    logger.warning(
+                        f"Step {steps}: Done-gate REJECTED premature 'done' "
+                        f"(score={gate_score:.2f} < {config.done_gate_threshold:.2f}, "
+                        f"override {done_gate_overrides}/{config.done_gate_max_overrides})"
+                    )
+
+                    # Modify the task instruction to tell the agent to continue.
+                    # Strip any previous done-gate message before appending the new one.
+                    _DONE_GATE_MARKER = "\n\n[SYSTEM: The task is NOT yet complete"
+                    continuation_msg = (
+                        "\n\n[SYSTEM: The task is NOT yet complete based on automated "
+                        "evaluation (score: {score:.0%}). Your previous 'done' signal "
+                        "was overridden ({n}/{max}). Please examine the current screen "
+                        "carefully and continue working on the task. Do NOT declare "
+                        "'done' unless the task is truly finished.]"
+                    ).format(
+                        score=gate_score,
+                        n=done_gate_overrides,
+                        max=config.done_gate_max_overrides,
+                    )
+
+                    # Create a modified task with continuation message
+                    task = copy.copy(task)
+                    # Remove previous done-gate message if present
+                    marker_idx = task.instruction.find(_DONE_GATE_MARKER)
+                    if marker_idx >= 0:
+                        task.instruction = task.instruction[:marker_idx]
+                    task.instruction = task.instruction + continuation_msg
+
+                    # Get a fresh observation for the agent's next step
+                    # Use a no-op key press to trigger a new screenshot
+                    try:
+                        noop_action = BenchmarkAction(type="key", key="")
+                        obs, env_done, _info = adapter.step(noop_action)
+                        if env_done:
+                            logger.info(
+                                f"Step {steps}: Environment signaled done during "
+                                "done-gate screenshot refresh"
+                            )
+                            done = True
+                            break
+                    except Exception as e:
+                        logger.warning(
+                            f"Step {steps}: Failed to get fresh observation "
+                            f"after done-gate override: {e}. Using previous obs."
+                        )
+
+                    steps += 1
+                    continue
                 else:
-                    logger.info(f"Step {steps}: Agent signaled task completion")
-                done = True
-                break
+                    if config.done_gate and done_gate_overrides >= config.done_gate_max_overrides:
+                        logger.warning(
+                            f"Step {steps}: Done-gate max overrides reached "
+                            f"({config.done_gate_max_overrides}). Accepting 'done'."
+                        )
+                    done = True
+                    break
 
             # Execute action
             try:
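The override logic in this hunk can be exercised in isolation with a small simulation. The sketch below is not the runner code: `Task`, `with_continuation`, and `run_with_done_gate` are stand-ins invented for illustration, the agent is assumed to signal "done" on every step, and the continuation message is shortened. It keeps the two properties the diff relies on: the task is shallow-copied before its instruction changes, and any earlier done-gate note is stripped at the marker so notes do not pile up across overrides.

```python
import copy
from dataclasses import dataclass

# Same marker prefix as in the diff; the rest of the message is abbreviated here.
MARKER = "\n\n[SYSTEM: The task is NOT yet complete"


@dataclass
class Task:
    """Minimal stand-in for the benchmark task object (assumed shape)."""
    instruction: str


def with_continuation(task: Task, score: float, n: int, max_overrides: int) -> Task:
    """Return a shallow copy whose instruction ends with exactly one done-gate note."""
    new_task = copy.copy(task)
    idx = new_task.instruction.find(MARKER)
    base = new_task.instruction[:idx] if idx >= 0 else new_task.instruction
    new_task.instruction = base + f"{MARKER} (score: {score:.0%}, override {n}/{max_overrides}).]"
    return new_task


def run_with_done_gate(scores, threshold: float = 1.0, max_overrides: int = 3):
    """Simulate an agent that says 'done' every step.

    `scores` supplies the evaluator's score at each gate check and must contain
    enough entries for the run. Returns (overrides_used, final_task).
    """
    task = Task("Open the file.")
    overrides = 0
    score_iter = iter(scores)
    while True:
        if overrides >= max_overrides:
            return overrides, task  # cap reached: accept 'done' without evaluating
        score = next(score_iter)
        if score >= threshold:
            return overrides, task  # gate passed
        overrides += 1
        task = with_continuation(task, score, overrides, max_overrides)
```

For example, `run_with_done_gate([0.0, 0.5, 1.0])` rejects the first two "done" signals and accepts the third, and the returned instruction carries only the most recent note.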
