feat: add done-gate to prevent premature task completion by abrichr · Pull Request #110 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-06T20:32:57Z

Summary

Adds a done-gate mechanism to the evaluation runner that verifies task completion before accepting the agent's done signal
When the agent declares done, the runner calls adapter.evaluate(task) to check the actual score. If the score is below the threshold, the runner overrides the done signal and the agent receives a continuation message
Opt-in via --done-gate flag (default off) to preserve existing behavior
Configurable max overrides (--done-gate-max-overrides, default 3) and score threshold (--done-gate-threshold, default 1.0)
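
The behavior summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: DoneGateConfig, run_with_done_gate, the next_action and evaluate callables, the max_steps value, and the wording of the continuation message are all hypothetical stand-ins. Only the defaults (gate off, 3 overrides, threshold 1.0) and the override-then-continue flow come from the PR description.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DoneGateConfig:
    # Hypothetical field names; the PR only states that new fields
    # were added to EvaluationConfig with these defaults.
    enabled: bool = False
    max_overrides: int = 3
    threshold: float = 1.0


def run_with_done_gate(
    next_action: Callable[[str], str],
    evaluate: Callable[[], float],
    config: DoneGateConfig,
    max_steps: int = 15,
) -> Tuple[List[str], int]:
    """Run a step loop and return (action history, number of overrides)."""
    instruction = "original task instruction"
    overrides = 0
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(instruction)
        history.append(action)
        if action != "done":
            continue
        if not config.enabled:
            break  # pre-PR behavior: accept "done" without verification
        score = evaluate()
        if score >= config.threshold or overrides >= config.max_overrides:
            break  # task verified, or override budget exhausted
        overrides += 1
        # Assumed wording; the PR says only that a continuation
        # message is appended to the task instruction.
        instruction += "\nThe task is not complete yet. Please continue."
    return history, overrides
```

With an agent stub that always answers "done" and an evaluator that always returns 0.0, the loop in this sketch accepts the fourth "done" after three overrides, which is how the max-overrides limit bounds the loop.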

Root cause

In the VS Code settings task (70745df8), the agent misreads the UI and declares the task already done after 1 step. The agent on the font-change task sometimes quits early too. The step loop at line 341 in runner.py breaks immediately when action.type == done, with no verification.

Changes

openadapt_evals/benchmarks/runner.py: Done-gate logic in _run_single_task(), new fields on EvaluationConfig
openadapt_evals/benchmarks/cli.py: --done-gate, --done-gate-max-overrides, --done-gate-threshold flags on mock, run, and live commands
scripts/run_dc_eval.py: Same flags, passed through to the benchmark CLI subprocess
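
One way the three flags could be wired up with argparse is sketched below. The flag names, defaults, and the mock/run/live command names come from this PR; the helper names (add_done_gate_args, build_parser, done_gate_passthrough), the help strings, the int type for --tasks, and the choice to forward the flags only when the gate is enabled are assumptions made for illustration, not the contents of cli.py or run_dc_eval.py.

```python
import argparse
from typing import List


def add_done_gate_args(parser: argparse.ArgumentParser) -> None:
    """Attach the three done-gate flags with the defaults stated in the PR."""
    parser.add_argument("--done-gate", action="store_true",
                        help="verify the score before accepting 'done'")
    parser.add_argument("--done-gate-max-overrides", type=int, default=3,
                        help="maximum number of 'done' signals to override")
    parser.add_argument("--done-gate-threshold", type=float, default=1.0,
                        help="minimum score required to accept 'done'")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="openadapt-evals")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("mock", "run", "live"):
        cmd = sub.add_parser(name)
        # --tasks appears in the test plan; its type here is an assumption.
        cmd.add_argument("--tasks", type=int, default=None)
        add_done_gate_args(cmd)
    return parser


def done_gate_passthrough(args: argparse.Namespace) -> List[str]:
    """Rebuild the done-gate flags for a subprocess command line.

    Hypothetical helper illustrating the pass-through idea only.
    """
    if not args.done_gate:
        return []
    return [
        "--done-gate",
        "--done-gate-max-overrides", str(args.done_gate_max_overrides),
        "--done-gate-threshold", str(args.done_gate_threshold),
    ]
```

Because --done-gate is a store_true flag, omitting it leaves the gate disabled, which matches the opt-in behavior described in the summary.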

Test plan

Run openadapt-evals mock --tasks 1 (no done-gate) -- verify existing behavior unchanged
Run openadapt-evals mock --tasks 1 --done-gate -- verify done-gate activates
Run live eval with --done-gate against WAA VM to confirm premature done is overridden
Verify max overrides limit prevents infinite loops

… complete

When enabled via --done-gate, the evaluation runner calls adapter.evaluate() when the agent signals "done" to verify the task is actually complete. If the score is below the threshold (default 1.0), the runner overrides the "done" signal, appends a continuation message to the task instruction, and lets the agent continue. Limited to a configurable max overrides (default 3) to prevent infinite loops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- core4_eval.py: deterministic wrapper for running repeated Core4 trials
- update_weekly_north_star.py: compute hard-task success rates for STATUS.md
- waa_execution_parity_plan.md: phased plan for WAA execution reliability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ore4_eval.py

The core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but those args don't exist in run_dc_eval.py (they were from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ore4_eval.py (#111)

The core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but those args don't exist in run_dc_eval.py (they were from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove stale health-gate args and add done-gate passthrough in core4_eval.py

The core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but those args don't exist in run_dc_eval.py (they were from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: search all LibreOffice profile dirs for recovery cleanup

The cleanup script only targeted LibreOffice/4/user/backup, but LibreOffice 26.2 also uses LibreOffice/user/backup. Now scans all subdirectories under AppData/Roaming/LibreOffice for user profiles. Also clears .~lock.* files that can block file re-opening, and removes lock files from common download locations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

abrichr force-pushed the feat/done-gate branch from 4add05a to 4c26720 Compare March 6, 2026 20:35

abrichr merged commit 65714ad into main Mar 6, 2026
1 check passed

abrichr mentioned this pull request Mar 6, 2026

fix: remove stale health-gate args and add done-gate passthrough in core4_eval.py #111

Merged

1 task

abrichr merged 2 commits into main from feat/done-gate
