feat: add done-gate to prevent premature task completion #110

Merged
abrichr merged 2 commits into main from feat/done-gate on Mar 6, 2026

abrichr commented Mar 6, 2026

Summary

  • Adds a done-gate mechanism to the evaluation runner that verifies task completion before accepting the agent's done signal
  • When the agent declares done, the runner calls adapter.evaluate(task) to check the actual score. If the score is below the threshold, the done signal is overridden and the agent receives a continuation message
  • Opt-in via --done-gate flag (default off) to preserve existing behavior
  • Configurable max overrides (--done-gate-max-overrides, default 3) and score threshold (--done-gate-threshold, default 1.0)
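
A minimal sketch of the gate logic described above, not the actual code in runner.py. The names `DoneGateConfig`, `run_steps`, `next_action`, and `evaluate` are hypothetical stand-ins (the real fields live on EvaluationConfig and the real score comes from adapter.evaluate(task)), and the continuation text is invented for illustration. It also assumes that once the override budget is spent, the next done is accepted without another evaluation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DoneGateConfig:
    # Hypothetical field names mirroring the CLI flags and their defaults.
    enabled: bool = False
    max_overrides: int = 3
    threshold: float = 1.0


def run_steps(
    next_action: Callable[[str], str],
    evaluate: Callable[[], float],
    instruction: str,
    config: DoneGateConfig,
    max_steps: int = 15,
) -> dict:
    """Run a simplified step loop and report how the done signal was handled."""
    overrides = 0
    steps = 0
    done_accepted = False
    for _ in range(max_steps):
        steps += 1
        action = next_action(instruction)
        if action != "done":
            continue  # a real runner would execute the action here
        gate_active = config.enabled and overrides < config.max_overrides
        if gate_active and evaluate() < config.threshold:
            # Override the done signal and tell the agent to keep going.
            overrides += 1
            instruction += "\n\nThe task is not complete yet. Please continue."
            continue
        done_accepted = True
        break
    return {
        "steps": steps,
        "overrides": overrides,
        "done_accepted": done_accepted,
        "instruction": instruction,
    }
```

With an agent that always answers done and a score that never reaches the threshold, the loop overrides three times and accepts the fourth done, which is the behavior the max-overrides limit is meant to guarantee.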

Root cause

In the VS Code settings task (70745df8), the agent declares the task already done after 1 step because it misreads the UI. The agent in the font-change task sometimes quits early too. The step loop at line 341 in runner.py breaks immediately when action.type == done, with no verification.

Changes

  • openadapt_evals/benchmarks/runner.py: Done-gate logic in _run_single_task(), new fields on EvaluationConfig
  • openadapt_evals/benchmarks/cli.py: --done-gate, --done-gate-max-overrides, --done-gate-threshold flags on mock, run, and live commands
  • scripts/run_dc_eval.py: Same flags, passed through to the benchmark CLI subprocess
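
As a rough illustration of how the three flags could be registered on each subcommand and forwarded to a subprocess, here is an argparse sketch. The helper names `add_done_gate_args`, `build_parser`, and `passthrough_args` are hypothetical, and the real cli.py and run_dc_eval.py may be structured differently; only the flag names, defaults, and subcommand names come from this PR.

```python
import argparse
from typing import List


def add_done_gate_args(parser: argparse.ArgumentParser) -> None:
    """Register the done-gate flags with the defaults listed in this PR."""
    parser.add_argument("--done-gate", action="store_true",
                        help="verify the score before accepting done")
    parser.add_argument("--done-gate-max-overrides", type=int, default=3,
                        help="maximum number of done overrides per task")
    parser.add_argument("--done-gate-threshold", type=float, default=1.0,
                        help="minimum score required to accept done")


def build_parser() -> argparse.ArgumentParser:
    """Build a toy CLI with the same flags on mock, run, and live."""
    parser = argparse.ArgumentParser(prog="openadapt-evals")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("mock", "run", "live"):
        add_done_gate_args(subparsers.add_parser(name))
    return parser


def passthrough_args(args: argparse.Namespace) -> List[str]:
    """Rebuild the done-gate flags for a child process (empty when disabled)."""
    if not args.done_gate:
        return []
    return [
        "--done-gate",
        "--done-gate-max-overrides", str(args.done_gate_max_overrides),
        "--done-gate-threshold", str(args.done_gate_threshold),
    ]
```

Because `--done-gate` is a `store_true` flag, omitting it leaves the gate off, which matches the opt-in default.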

Test plan

  • Run openadapt-evals mock --tasks 1 (no done-gate) -- verify existing behavior unchanged
  • Run openadapt-evals mock --tasks 1 --done-gate -- verify done-gate activates
  • Run live eval with --done-gate against WAA VM to confirm premature done is overridden
  • Verify max overrides limit prevents infinite loops

… complete

When enabled via --done-gate, the evaluation runner calls adapter.evaluate()
when the agent signals "done" to verify the task is actually complete. If the
score is below the threshold (default 1.0), the runner overrides the "done"
signal, appends a continuation message to the task instruction, and lets the
agent continue. Limited to a configurable max overrides (default 3) to prevent
infinite loops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- core4_eval.py: deterministic wrapper for running repeated Core4 trials
- update_weekly_north_star.py: compute hard-task success rates for STATUS.md
- waa_execution_parity_plan.md: phased plan for WAA execution reliability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abrichr merged commit 65714ad into main on Mar 6, 2026
1 check passed
abrichr added a commit that referenced this pull request Mar 6, 2026
…ore4_eval.py (#111)

The core4_eval.py was passing --transport-error-threshold, --health-samples,
--health-min-success, and --health-sample-delay to run_dc_eval.py, but those
args don't exist in run_dc_eval.py (they were from uncommitted Codex changes).
Also adds --done-gate passthrough to match PR #110.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Mar 6, 2026
* fix: remove stale health-gate args and add done-gate passthrough in core4_eval.py

The core4_eval.py was passing --transport-error-threshold, --health-samples,
--health-min-success, and --health-sample-delay to run_dc_eval.py, but those
args don't exist in run_dc_eval.py (they were from uncommitted Codex changes).
Also adds --done-gate passthrough to match PR #110.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: search all LibreOffice profile dirs for recovery cleanup

The cleanup script only targeted LibreOffice/4/user/backup, but
LibreOffice 26.2 also uses LibreOffice/user/backup. Now scans all
subdirectories under AppData/Roaming/LibreOffice for user profiles.

Also clears .~lock.* files that can block file re-opening, and
removes lock files from common download locations.
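
A sketch of the scan described above, written in Python purely for illustration; the actual cleanup script is not shown here and may be written in another language. The function name `clean_libreoffice_recovery` and its parameters are hypothetical. It assumes a profile is any `user` directory under the LibreOffice root that contains a `backup` folder.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def clean_libreoffice_recovery(
    libreoffice_root: Path, lock_dirs: Iterable[Path] = ()
) -> List[Path]:
    """Empty every user/backup folder under the root and delete .~lock.* files."""
    removed: List[Path] = []
    root = Path(libreoffice_root)
    backups: List[Path] = []
    if root.is_dir():
        # Matches both LibreOffice/4/user/backup and LibreOffice/user/backup.
        backups = [
            path for path in sorted(root.rglob("backup"))
            if path.is_dir() and path.parent.name == "user"
        ]
    for backup in backups:
        for entry in sorted(backup.iterdir()):
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(entry)
    for directory in lock_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue
        # Lock files are named like .~lock.report.odt# next to the document.
        for lock in sorted(directory.glob(".~lock.*")):
            if lock.is_file():
                lock.unlink()
                removed.append(lock)
    return removed
```

The list of backup folders is collected before anything is deleted so that the directory walk is not affected by the removals.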

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>