feat: add done-gate to prevent premature task completion #110
Merged
… complete

When enabled via --done-gate, the evaluation runner calls adapter.evaluate() when the agent signals "done" to verify that the task is actually complete. If the score is below the threshold (default 1.0), the runner overrides the "done" signal, appends a continuation message to the task instruction, and lets the agent continue. Overrides are limited to a configurable maximum (default 3) to prevent infinite loops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
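The gate described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: the function name `run_with_done_gate`, the `Action` dataclass, the callback parameters, and the wording of the continuation message are all assumptions made for the example. Only the behaviour (score check on "done", threshold default 1.0, at most 3 overrides) comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    type: str  # "done" or any other action type (hypothetical shape)


def run_with_done_gate(
    next_action: Callable[[str], Action],  # stand-in for the agent
    evaluate: Callable[[], float],         # stand-in for adapter.evaluate()
    instruction: str,
    max_steps: int = 15,
    threshold: float = 1.0,
    max_overrides: int = 3,
) -> Tuple[List[str], int, str]:
    """Step loop that only accepts "done" once the evaluator agrees."""
    overrides = 0
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(instruction)
        history.append(action.type)
        if action.type != "done":
            continue  # a real runner would execute the action here
        score = evaluate()
        # Accept "done" if the task passes, or if the override budget is spent.
        if score >= threshold or overrides >= max_overrides:
            break
        overrides += 1
        # Assumed wording; the actual continuation message is not shown in the PR.
        instruction += "\n\nThe task is not complete yet. Please continue."
    return history, overrides, instruction
```

Capping the number of overrides means an agent that can never satisfy the evaluator still terminates after a bounded number of extra "done" signals, rather than consuming every remaining step.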
- core4_eval.py: deterministic wrapper for running repeated Core4 trials
- update_weekly_north_star.py: compute hard-task success rates for STATUS.md
- waa_execution_parity_plan.md: phased plan for WAA execution reliability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Mar 6, 2026
…ore4_eval.py

core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but run_dc_eval.py does not define those arguments (they came from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
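A passthrough of this kind might look like the sketch below. It is illustrative only: the function name `build_eval_command` and the exact command layout are hypothetical, and treating --done-gate as a boolean flag is an assumption based on the phrase "enabled via --done-gate". The point is that the wrapper forwards only flags the child script actually accepts.

```python
import argparse
from typing import List, Optional


def build_eval_command(argv: Optional[List[str]] = None) -> List[str]:
    """Build the child command line, forwarding --done-gate when it is set."""
    parser = argparse.ArgumentParser(description="wrapper sketch (hypothetical)")
    # Assumed to be a simple on/off switch.
    parser.add_argument("--done-gate", action="store_true")
    args = parser.parse_args(argv)

    cmd = ["python", "run_dc_eval.py"]
    if args.done_gate:
        cmd.append("--done-gate")
    # Flags the child does not define are deliberately not forwarded,
    # since argparse in the child would reject them.
    return cmd
```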
abrichr added a commit that referenced this pull request Mar 6, 2026
…ore4_eval.py (#111)

core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but run_dc_eval.py does not define those arguments (they came from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
abrichr added a commit that referenced this pull request Mar 6, 2026
* fix: remove stale health-gate args and add done-gate passthrough in core4_eval.py

  core4_eval.py was passing --transport-error-threshold, --health-samples, --health-min-success, and --health-sample-delay to run_dc_eval.py, but run_dc_eval.py does not define those arguments (they came from uncommitted Codex changes). Also adds --done-gate passthrough to match PR #110.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: search all LibreOffice profile dirs for recovery cleanup

  The cleanup script only targeted LibreOffice/4/user/backup, but LibreOffice 26.2 also uses LibreOffice/user/backup. It now scans all subdirectories under AppData/Roaming/LibreOffice for user profiles. It also clears .~lock.* files that can block file re-opening, and removes lock files from common download locations.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
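The profile scan in the second fix could be approximated as below. This is a sketch under stated assumptions, not the actual cleanup script: the helper names `find_backup_dirs` and `clear_recovery_files` are hypothetical, and the caller is assumed to supply the LibreOffice root (for example the AppData/Roaming/LibreOffice directory) and the list of directories to check for lock files.

```python
from pathlib import Path
from typing import Iterable, List


def find_backup_dirs(libreoffice_root: Path) -> List[Path]:
    """Return every user/backup directory at any depth under the root.

    This covers both LibreOffice/4/user/backup and LibreOffice/user/backup.
    """
    return sorted(
        p
        for p in libreoffice_root.rglob("backup")
        if p.is_dir() and p.parent.name == "user"
    )


def clear_recovery_files(
    libreoffice_root: Path, lock_dirs: Iterable[Path]
) -> List[Path]:
    """Delete recovery backups and .~lock.* files; return what was removed."""
    removed: List[Path] = []
    for backup in find_backup_dirs(libreoffice_root):
        # Materialise the listing first so deletion does not disturb iteration.
        for entry in list(backup.iterdir()):
            if entry.is_file():
                entry.unlink()
                removed.append(entry)
    for directory in lock_dirs:
        if not directory.is_dir():
            continue
        for lock in list(directory.glob(".~lock.*")):
            lock.unlink()
            removed.append(lock)
    return removed
```

Searching by the `user/backup` suffix rather than a fixed version directory avoids hard-coding the profile layout, which is what broke when a second layout appeared.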
Summary
Root cause
On the VS Code settings task (70745df8), the agent declares the task done after one step because it misreads the UI. The agent on the font-change task sometimes quits early too. The step loop at line 341 in runner.py breaks immediately when action.type == done, with no verification.
Changes
Test plan