feat(ci): add visual bug-fix agent with live-app access #2146

Draft
kantord wants to merge 2 commits into main from feat/visual-bug-fix-agent

kantord commented Apr 29, 2026

Summary

Adds a parallel "Visual Bug Fix Agent" — a label-gated copy of the production bug-fix agent (_bug-fix-agent.yml) with one structural addition: the project's devcontainer is booted alongside the agent and exposed via docker exec, so the agent can drive the live app (xdotool, screenshots, in-container test runs) for bugs that resist a pure unit-test reproduction.

Production behavior is unchanged. This adds two new files; nothing existing is modified.
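For reviewers who want a concrete picture of the docker exec access described above, here is a minimal sketch of how a command could be wrapped so that it runs inside the devcontainer. The helper name `docker_exec_argv` and the `DISPLAY` value `:0` are illustrative assumptions, not taken from the workflow; only `docker exec`, the `DEVCONTAINER_ID` variable, and `xdotool search --class ToolHive` come from this PR.

```python
from typing import List, Optional, Sequence


def docker_exec_argv(
    container_id: str,
    command: Sequence[str],
    display: Optional[str] = None,
) -> List[str]:
    """Build the argv for running `command` inside a container (hypothetical helper)."""
    argv = ["docker", "exec"]
    if display is not None:
        # Assumption: the live app renders to an X display inside the container.
        argv += ["-e", f"DISPLAY={display}"]
    argv.append(container_id)
    argv.extend(command)
    return argv


# Example: look for the app window. In the workflow the container id would
# come from the DEVCONTAINER_ID environment variable; "abc123" is a placeholder.
example = docker_exec_argv(
    "abc123",
    ["xdotool", "search", "--class", "ToolHive"],
    display=":0",
)
# To actually run it: subprocess.run(example, check=True)
```

Building the argument list without a shell keeps quoting out of the picture, which matters when the agent composes commands from issue text.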

Depends on PR #2120

The agent's prompt references scripts/devcontainer-screenshot.sh and scripts/devcontainer-steal.sh, plus the corrected devcontainer-dev skill. All three land in #2120. #2120 should merge first; otherwise this agent's first run would fail when invoking missing scripts.

What changes

| File | Mirrors | Differences |
| --- | --- | --- |
| _visual-bug-fix-agent.yml | _bug-fix-agent.yml | + buildx setup, node_modules cache + populate, devcontainers/ci build/start, find container, launch entrypoint, readiness gate (process + xdotool search --class ToolHive + 2s settle); + DEVCONTAINER_ID env on each phase; + live-app paragraph in Phase 1 / 2 / 2b prompts; + --allowedTools extended with docker exec/cp/ps and the two helper scripts; + nm-cache dump step; + diagnostics-on-failure step; concurrency visual-bug-fix-*; branch suffix -visual; PR label auto-fix-visual; failure-comment marker Visual Bug Fix Agent; timeout-minutes 45→60; Phase 1 --max-turns 50→75 |
| visual-bug-fix-on-label.yml | bug-fix-on-label.yml | Triggers on label auto-fix-visual (still requires Bug label) |

The structural shape, gating, phases, hard gates, branch/PR creation, and failure handling are byte-equivalent to the production agent modulo the renames required for parallel runs.
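The readiness gate listed in the table (process check, window check, then a 2s settle) can be sketched as a polling loop. This is an illustration of the idea, not the workflow's actual step: the function name, the 120s timeout, and the 1s poll interval are assumptions, and the two probes are passed in as callables so the logic can be exercised without a container.

```python
import time
from typing import Callable


def wait_until_ready(
    process_running: Callable[[], bool],
    window_visible: Callable[[], bool],
    timeout_s: float = 120.0,   # assumed value, not from the workflow
    poll_s: float = 1.0,        # assumed value, not from the workflow
    settle_s: float = 2.0,      # the "2s settle" from the table
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once both probes pass, after a short settle delay."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        # Both conditions must hold: the app process exists and its
        # window is discoverable (e.g. via xdotool search --class ToolHive).
        if process_running() and window_visible():
            sleep(settle_s)  # let the UI finish painting before use
            return True
        sleep(poll_s)
    return False
```

Returning a boolean rather than raising lets the caller decide whether a timeout should fail the job or fall through to the diagnostics-on-failure step.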

Cost expectations

Rough estimates, based on a sample of 2 production runs.

| | Production agent today | + visual variant (current) | + #2136 (prebuilt image) |
| --- | --- | --- | --- |
| Per-fix wall time | ~18 min | ~22 min (+22%) | ~19 min (+5%) |

Phase 1 (5–9 min agent thinking) dominates regardless of substrate. The +4 min is the devcontainer build + boot. Once #2136 lands, that drops to ~30s.
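As a quick check on the percentages, the relative overhead is just the extra minutes divided by the baseline. The sketch below uses the rounded wall times from the table; the function name is illustrative.

```python
def overhead_pct(baseline_min: float, variant_min: float) -> float:
    """Relative wall-time overhead of a variant over the baseline, in percent."""
    return (variant_min - baseline_min) / baseline_min * 100.0


baseline = 18.0   # production agent today, minutes
current = 22.0    # visual variant, minutes
prebuilt = 19.0   # visual variant with the prebuilt image from #2136, minutes

current_overhead = overhead_pct(baseline, current)     # about 22.2
prebuilt_overhead = overhead_pct(baseline, prebuilt)   # about 5.6
```

With only two sampled runs behind the baseline, these figures are order-of-magnitude guidance and consistent with the roughly +22% and +5% quoted in the table.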

Activation

The trigger is the auto-fix-visual label on a Bug-labeled issue. To test:

  1. Merge #2120, "chore: add visual bugfix agent" (helper scripts + skill update).
  2. Merge this PR.
  3. Apply the auto-fix-visual label to a candidate bug. The natural first candidate is #663, "[Bug] When the server deletion finishes it focuses the MCP servers tab", chosen earlier in the experiment for its async-IPC-driven cross-route behavior.

Both labels (auto-fix and auto-fix-visual) can be applied to the same issue if you want a head-to-head comparison.
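The trigger condition amounts to a two-part predicate: the label that was just applied must be auto-fix-visual, and the issue must also carry Bug. A minimal sketch follows, assuming exact, case-sensitive label matching (the real workflow expression may differ); the function name is hypothetical.

```python
from typing import Iterable

VISUAL_LABEL = "auto-fix-visual"
BUG_LABEL = "Bug"


def should_run_visual_agent(issue_labels: Iterable[str], applied_label: str) -> bool:
    """True when the applied label is the visual trigger on a Bug-labeled issue."""
    labels = set(issue_labels)
    # Assumption: matching is exact and case-sensitive.
    return applied_label == VISUAL_LABEL and BUG_LABEL in labels
```

In a head-to-head comparison, applying auto-fix to the same issue would not satisfy this predicate, so the visual agent runs only for its own label.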

What this PR does NOT prove

That live-app access actually improves the agent's success rate. The substrate is proven (in #2120). The agent integration is mechanical. Whether the +22% (or +5% post-#2136) wall-time tax is worth it depends entirely on how Phase 1 performs on bugs that resist unit-test repro — and that can only be measured by labeling real issues and observing outcomes.

Risk surface

  • Production agent is untouched; worst case rollback is git rm of two files.
  • If the agent misbehaves on a labeled issue, the label is trivially removed and the workflow has a "give-up" guard (final comment marker) that prevents repeat retries.
  • claude-code-action's workflow validation means the FIRST real run happens after merge — there is no pre-merge dry run available.
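The give-up guard mentioned in the list above can be pictured as a scan of existing issue comments for the failure marker before starting another attempt. This is a sketch under assumptions: the marker string is the one named in the "What changes" table, but the substring match and the function name are illustrative, not the workflow's implementation.

```python
from typing import Iterable

GIVE_UP_MARKER = "Visual Bug Fix Agent"  # failure-comment marker named in this PR


def already_gave_up(comment_bodies: Iterable[str], marker: str = GIVE_UP_MARKER) -> bool:
    """True if any earlier comment carries the marker, meaning no retry should start."""
    # Assumption: a plain substring match is enough to identify the final comment.
    return any(marker in body for body in comment_bodies)
```

Keying the guard on a marker distinct from the production agent's means the two agents do not block each other when both labels are applied.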
