feat(ci): add visual bug-fix agent with live-app access #2146
Draft
Summary
Adds a parallel "Visual Bug Fix Agent": a label-gated copy of the production bug-fix agent (`_bug-fix-agent.yml`) with one structural addition. The project's devcontainer is booted alongside the agent and exposed via `docker exec`, so the agent can drive the live app (xdotool, screenshots, in-container test runs) for bugs that resist a pure unit-test reproduction.
Production behavior is unchanged. This adds two new files; nothing existing is modified.
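As a rough illustration of what "exposed via `docker exec`" could look like from the agent's side, here is a minimal sketch. The helper name `in_devcontainer` and the `DOCKER_BIN` override are hypothetical (they are not part of this PR); `DEVCONTAINER_ID` is the variable the workflow sets on each phase.

```shell
# Hypothetical helper, not taken from this PR: run a command inside the
# devcontainer through docker exec.
# DOCKER_BIN is an assumption added only so the sketch can be dry-run
# without Docker (e.g. DOCKER_BIN=echo); it defaults to the real docker CLI.
in_devcontainer() {
  # Abort with a message if the workflow did not provide the container ID.
  : "${DEVCONTAINER_ID:?DEVCONTAINER_ID must be set}"
  "${DOCKER_BIN:-docker}" exec "$DEVCONTAINER_ID" "$@"
}

# Example use (assumes xdotool and a display are available in the container):
#   in_devcontainer xdotool search --class ToolHive
```

Keeping the container ID in one place means every phase addresses the same running app instance rather than rediscovering it.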
Depends on PR #2120
The agent's prompt references `scripts/devcontainer-screenshot.sh` and `scripts/devcontainer-steal.sh`, plus the corrected `devcontainer-dev` skill. All three land in #2120. #2120 should merge first; otherwise this agent's first run would fail when invoking missing scripts.
What changes
| New file | Copied from | Differences |
| --- | --- | --- |
| `_visual-bug-fix-agent.yml` | `_bug-fix-agent.yml` | + `node_modules` cache + populate, `devcontainers/ci` build/start, find container, launch entrypoint, readiness gate (process + `xdotool search --class ToolHive` + 2s settle); + `DEVCONTAINER_ID` env on each phase; + live-app paragraph in Phase 1 / 2 / 2b prompts; + `--allowedTools` extended with `docker exec/cp/ps` and the two helper scripts; + nm-cache dump step; + diagnostics-on-failure step; concurrency `visual-bug-fix-*`; branch suffix `-visual`; PR label `auto-fix-visual`; failure-comment marker `Visual Bug Fix Agent`; `timeout-minutes` 45→60; Phase 1 `--max-turns` 50→75 |
| `visual-bug-fix-on-label.yml` | `bug-fix-on-label.yml` | `auto-fix-visual` (still requires `Bug` label) |

The structural shape, gating, phases, hard gates, branch/PR creation, and failure handling are byte-equivalent to the production agent modulo the renames required for parallel runs.
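The readiness gate (process check, then `xdotool search --class ToolHive`, then a 2s settle) can be pictured as a small polling loop. The sketch below is an illustration under assumptions, not the workflow's actual step: the function names, the 120-second timeout, the 2-second poll interval, the use of `pgrep -f`, and the process pattern argument are all hypothetical, and it assumes `xdotool` can reach the app's display from inside the container.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the timeout has passed.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
  return 0
}

# readiness_gate CONTAINER_ID PROCESS_PATTERN
# Hypothetical composition of the three checks named in the table above.
readiness_gate() {
  rg_container=$1
  rg_pattern=$2
  # 1. The app process exists inside the container (pgrep is an assumption).
  wait_until 120 2 docker exec "$rg_container" pgrep -f "$rg_pattern" || return 1
  # 2. A window with class ToolHive is visible to xdotool.
  wait_until 120 2 docker exec "$rg_container" xdotool search --class ToolHive || return 1
  # 3. Short settle so the first screenshot is not taken mid-render.
  sleep 2
}
```

Checking the window class in addition to the process guards against the case where the process is alive but has not yet mapped its window.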
Cost expectations
Phase 1 (5–9 min agent thinking) dominates regardless of substrate. The +4 min is the devcontainer build + boot. Once #2136 lands, that drops to ~30s.
Activation
The trigger is the `auto-fix-visual` label on a `Bug`-labeled issue. To test: add the `auto-fix-visual` label to a candidate bug. The natural first candidate is #663 (server deletion focuses MCP Servers tab), chosen earlier in the experiment for its async-IPC-driven cross-route behavior.
Both labels (`auto-fix` and `auto-fix-visual`) can be applied to the same issue if you want a head-to-head comparison.
What this PR does NOT prove
That live-app access actually improves the agent's success rate. The substrate is proven (in #2120). The agent integration is mechanical. Whether the +22% (or +5% post-#2136) wall-time tax is worth it depends entirely on how Phase 1 performs on bugs that resist unit-test repro — and that can only be measured by labeling real issues and observing outcomes.
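For the labeling step, one option is the GitHub CLI. The sketch below is only an illustration: the `label_issue` helper and the `GH_BIN` override are hypothetical names, and it assumes `gh` is installed and authenticated against the repository. `gh issue edit --add-label` accepts a comma-separated list, which is what allows both labels to be applied in one call for a head-to-head run.

```shell
# label_issue ISSUE_NUMBER LABEL [LABEL...]
# Hypothetical helper: add one or more labels to an issue with the gh CLI.
# GH_BIN is an assumption added only so the sketch can be dry-run
# (e.g. GH_BIN=echo); it defaults to the real gh binary.
label_issue() {
  if [ "$#" -lt 2 ]; then
    echo "usage: label_issue ISSUE_NUMBER LABEL [LABEL...]" >&2
    return 2
  fi
  li_issue=$1
  shift
  # Join the remaining arguments with commas: a b -> a,b
  li_labels=$(IFS=,; echo "$*")
  "${GH_BIN:-gh}" issue edit "$li_issue" --add-label "$li_labels"
}

# Visual agent only:
#   label_issue 663 auto-fix-visual
# Head-to-head comparison on the same issue:
#   label_issue 663 auto-fix auto-fix-visual
```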
Risk surface
`git rm` of two files.
`claude-code-action`'s workflow validation means the FIRST real run happens after merge; there is no pre-merge dry run available.