Skip to content

fix(controller): prevent plan step drift and reduce VLM false negatives#97

Merged
abrichr merged 8 commits into
mainfrom
fix/controller-improvements
Mar 4, 2026
Merged

fix(controller): prevent plan step drift and reduce VLM false negatives#97
abrichr merged 8 commits into
mainfrom
fix/controller-improvements

Conversation

@abrichr
Copy link
Copy Markdown
Member

@abrichr abrichr commented Mar 3, 2026

Summary

  • Plan step drift prevention: Fixed _advance_plan_steps() to only compare current step vs next step, advancing at most one step per call. Previously, aggressive keyword matching could jump 5+ steps on a single action, causing the controller to skip intermediate steps entirely.
  • VLM verification prompt tuning: Added partially_verified status and effectively_verified property for cases where the core outcome is achieved but with minor deviations (cursor position, formatting). Rewrote all verification prompts (step, progress, goal) to be outcome-focused, reducing false negatives observed in live eval.
  • False-negative regression tests: 9 tests from specific Level 3 eval scenarios (header entry with cursor drift, numeric value semantic disputes, formatting not applied).

Changes

File Change
claude_computer_use_agent.py _advance_plan_steps() limits to single-step advancement
plan_verify.py partially_verified status, effectively_verified property, outcome-focused prompts
demo_controller.py Uses effectively_verified instead of == "verified"
test_claude_computer_use_agent.py +3 drift prevention tests
test_demo_controller.py +5 drift prevention tests
test_plan_verify.py +21 new tests (partially_verified, outcome prompts, false-negative regressions)

Test plan

  • All 147 controller tests pass (55 agent + 41 controller + 51 plan_verify)
  • Run Level 3 live eval with updated prompts to verify reduced false negatives

🤖 Generated with Claude Code

Copy link
Copy Markdown
Member Author

@abrichr abrichr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fixes for PR #97 feedback items

Pushed commit 422fd56 addressing the three feedback items.

1. Heuristic/verifier share no state so steps can drift

Root cause found: The PR already added _external_step_control to disable the agent heuristic when the controller is driving. However, the agent still injected its own stale plan progress text into Claude API messages from _plan_steps (which never gets updated under external control). Claude received two conflicting views: controller says step 3, agent says step 1.

Additionally, the agent done-override logic used stale _plan_steps even though the controller already handles done-override.

Fix applied (claude_computer_use_agent.py):

  • _build_initial_messages(): Skips plan progress injection when _external_step_control=True
  • Follow-up messages: Skips plan progress / demo re-injection under external control
  • _process_response(): Skips premature done override when under external control

2. partially_verified leaves no trace for goal verification

Already well-addressed in the PR. The _build_step_verification_summary() correctly detects partially_verified steps, includes explanations, and augments goal text. No additional changes needed.

3. Regression tests mock only VLM parsing (no integration tests)

Added TestAgentPlanProgressSuppression class with 4 integration tests:

  1. Agent suppresses stale progress under external control (real agent, not mock)
  2. Agent injects progress normally without external control
  3. End-to-end: controller sets flag preventing stale progress
  4. Done override handled by controller not agent

All 162 tests pass (0 failures, 1 skip for missing demo file).

abrichr added a commit that referenced this pull request Mar 4, 2026
… dep

- Add comment in reset() explaining why _external_step_control is not reset
- Add comment on hasattr guard explaining MagicMock behavior is acceptable
- Add docstring note in TestFalseNegativeRegressions about VLM response limitation
- Add flask to test optional-dependencies for CI coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abrichr and others added 8 commits March 3, 2026 23:08
Two improvements to the closed-loop demo-conditioned controller:

1. Plan step tracking drift prevention: _advance_plan_steps() now only
   compares current step vs next step, advancing at most one step per call.
   Previously, bulk keyword matching could jump 5+ steps on a single action.

2. VLM verification prompt tuning: Added "partially_verified" status for
   cases where the core outcome is achieved but with minor deviations
   (cursor position, formatting). Rewrote all verification prompts to be
   outcome-focused, reducing false negatives from live eval scenarios.

Adds 68 new tests (8 drift prevention + 21 VLM prompt + 9 false-negative
regressions + 30 existing test updates). All 147 controller tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Analyzes unit economics of the closed-loop controller architecture:
Claude agent costs, VLM verifier costs, scaling projections, and a
three-phase strategy from loop-as-product to trained-model-as-product.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pyautogui.drag() uses relative coordinates that compound with starting
position errors, making it unreliable for small targets like LibreOffice
fill handles (~3x3 pixels). Replace with explicit mouseDown/moveTo/mouseUp
sequence with timing delays for reliable drag operations.

Also adds drag case to _build_pixel_command() for the pixel_action() path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…al verification

Three issues addressed:

1. Heuristic/verifier step drift: The agent's keyword-based
   _advance_plan_steps() heuristic and the DemoController's VLM verifier
   operated on independent state, allowing them to disagree on which step
   was current. Fix: add _external_step_control flag to the agent that the
   DemoController sets at init, making _advance_plan_steps() a no-op when
   the controller manages step progression via VLM verification.

2. partially_verified invisible to goal verification: When steps were
   marked partially_verified, the final goal verification pass had no
   visibility into which steps had partial completions. Fix: _verify_goal()
   now builds a step verification summary and augments the goal text with
   it when noteworthy statuses (partially_verified, failed) exist.

3. Missing integration tests: Added TestHeuristicVerifierSync (4 tests)
   and TestGoalVerificationContext (5 tests) that verify the heuristic is
   properly disabled under controller management, step advancement is
   driven by VLM verification, and partial/failed step context reaches
   goal verification. Also added 2 agent-level tests for
   _external_step_control behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When DemoController sets _external_step_control=True, the agent's
internal plan progress injection and done-override logic now become
no-ops. This prevents the agent from sending conflicting step-tracking
signals to the Claude model (agent says "step 1 in progress" while
controller says "step 3 is current").

Three specific suppressions:
1. _build_initial_messages skips plan progress text injection
2. Follow-up messages skip plan progress / demo re-injection
3. Premature "done" override is left to the controller

Adds integration tests exercising agent+controller interaction:
- Agent suppresses progress under external control
- Agent injects progress normally without external control
- Controller's augmented task instruction reaches the agent
- Done override handled by controller, not agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After WAA setup (close_all → verify_apps → download → open), the
target application may be behind other windows, still loading, or
obscured by notifications. This wastes 6+ agent steps recovering.

Add _ensure_app_focused() with multi-strategy approach:
- Maps task related_apps to window title patterns
- Uses WAA /setup/activate_window endpoint (same as WAA postconfig)
- Falls back to Alt+Tab
- Retries 3x with increasing delays (2s, 3s, 5s)
- Verifies foreground window title via pygetwindow on VM
- Runs during reset(), does NOT count against agent step budget

Also adds _APP_WINDOW_PATTERNS mapping, _get_expected_window_patterns(),
_check_foreground_matches(), and _normalize_app_name() helpers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive analysis of GUI agent failure modes with taxonomy,
recording system design, training viability assessment, and
prioritized action plan. Key findings:

- 4-category taxonomy: Environment, Agent Planning, Grounding, Verifier
- Existing ExecutionTraceCollector needs only minor extensions
- SFT on 50-100 corrected trajectories expected 10-30pp improvement
- Deterministic infrastructure fixes should come first (Tier 1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… dep

- Add comment in reset() explaining why _external_step_control is not reset
- Add comment on hasattr guard explaining MagicMock behavior is acceptable
- Add docstring note in TestFalseNegativeRegressions about VLM response limitation
- Add flask to test optional-dependencies for CI coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr force-pushed the fix/controller-improvements branch from ded41c8 to fd5e0ae Compare March 4, 2026 04:08
@abrichr abrichr merged commit f1f3870 into main Mar 4, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant