Skip to content

chore: add visual bugfix agent#2120

Draft
kantord wants to merge 8 commits intomainfrom
experiment/bug-fix-visual
Draft

chore: add visual bugfix agent#2120
kantord wants to merge 8 commits intomainfrom
experiment/bug-fix-visual

Conversation

@kantord
Copy link
Copy Markdown
Member

@kantord kantord commented Apr 27, 2026

TL;DR

Proves the project's devcontainer (Xvfb + fluxbox + Electron + thv) can boot inside a GitHub Actions runner and be driven end-to-end via docker exec — no AI agent involved here, just the substrate. Adds two helper scripts and corrects the devcontainer-dev skill so future agents can use this safely. Steady-state cost is ~4 min/run with the caching landed here, with a clear path to sub-minute via follow-up #2136.

What lands in this PR

  1. .github/workflows/experiment-devcontainer-proof.yml — push-triggered, branch-pinned proof workflow. Builds the devcontainer, boots the app, captures screenshots, drives one xdotool keystroke, runs pnpm run type-check inside the container, uploads screenshots as an artifact.
  2. scripts/devcontainer-screenshot.sh [host-path] — auto-finds the container, runs import inside, streams the PNG out via docker exec cat (works around the tmpfs/docker cp bug — see below). Prints absolute host path on stdout.
  3. scripts/devcontainer-steal.sh <container-id> <container-path> <host-path> — generic file extractor for any tmpfs path. Use for log files, generated artifacts, anything else that lives in /tmp. The screenshot helper calls this internally.
  4. .claude/skills/devcontainer-dev/SKILL.md (and .codex/, .cursor/ copies kept in sync) — replaces the broken docker cp example with the new helpers; documents the tmpfs gotcha; adds an "agent-grade readiness" note.
  5. actions/cache for the named node_modules volume in the proof workflow — drops pnpm install from ~100s to ~9s on warm.

Proof

CI run: https://github.com/stacklok/toolhive-studio/actions/runs/25062176682

Final-state timings (with all fixes applied, attempt #2 = warm cache):

Stage Time
Devcontainer build & start 167s
Wait for app readiness (rendered UI) 25s
Screenshots (×2) + xdotool driving ~4s
Type-check inside container 22s
Total wall time 4m24s

Both screenshots in the artifact are valid 1920×1200 PNGs of the rendered ToolHive Studio UI.

Bugs surfaced and fixed

  • docker cp is silently broken for /tmp paths. /tmp inside the devcontainer is tmpfs; docker cp only traverses overlay layers, so it returns "Could not find the file" even when the file demonstrably exists. Confirmed in CI and locally. Fix: stream via docker exec cat (encapsulated in devcontainer-steal.sh).
  • Process readiness ≠ UI rendered. The original readiness gate only checked process presence — the first run captured a blank screen because Electron was alive but mid-first-paint. Fix: also gate on xdotool search --class ToolHive matching the main window, plus a 2s settling sleep.
  • cacheFrom: type=gha was silently failing with unknown cache importer: gha. Fix: add docker/setup-buildx-action@v3 so BuildKit's gha backend is registered.
  • Layer cache import alone isn't enough because the node_modules named Docker volume doesn't survive across CI jobs, so postCreateCommand reruns pnpm install from scratch. Fix: cache the volume contents via actions/cache, populate before devcontainers/ci runs, dump back on cache miss.
  • devcontainers/ci@v0.3 doesn't actually export the gha cache (only imports). Discovered while measuring the warm rerun — the apt install layer takes a fresh ~35s every run regardless. Not fixed in this PR; tracked in Publish a prebuilt devcontainer image to GHCR #2136.

Cost compared to the production bug-fix agent

Rough estimates from a sample of two production runs. Real distribution is wider — agent thinking time varies a lot per bug.

The current production agent runs without a devcontainer and takes ~18 min per successful fix, dominated by Phase 1 Opus + Phase 2 Sonnet plus two full test-suite runs (#24982779175, #24781689310).

Production agent today + this PR's substrate + #2136 (prebuilt image)
Per-fix wall time ~18 min ~22 min ~19 min
Overhead vs current +22% +5%

Phase 1 (5–9 min) dominates either way — that's the agent's actual thinking, unrelated to substrate. Live-app access mostly augments Phase 1: bugs that currently fall through to Phase 2b ("direct fix without TDD" → lower-quality PRs) might succeed at writing a failing test if the agent could observe the running app. Whether the +22% (or +5% post-#2136) is worth it depends on how much Phase 1's success rate improves on UI/routing/IPC bugs. This PR doesn't answer that question — it only proves the substrate. The actual ROI test is to run the integrated agent on a known-tricky UI bug (#663 is the natural candidate) and compare outcomes.

Follow-ups

  • Publish a prebuilt devcontainer image to GHCR #2136 — Publish a prebuilt devcontainer image to GHCR. Replaces the per-run apt + base-image build with a single docker pull. Combined with this PR's node_modules cache, gets steady-state below 1 minute.
  • Enable Chrome DevTools Protocol in devcontainer for semantic agent control #2129 — Enable Chrome DevTools Protocol in the devcontainer. Lets agents drive the app via accessibility tree / DOM (Playwright CLI) instead of pixel-based xdotool + screenshots. Reduces token spend on the agent side.
  • Visual bug-fix agent integration — separate parallel workflow (_visual-bug-fix-agent.yml + visual-bug-fix-on-label.yml) that wires this substrate into a label-gated copy of the production agent. Coming as a follow-up PR; doesn't modify the production agent.

How to fire a run

git commit --allow-empty -m "trigger experiment" && git push

The workflow is push:-triggered and pinned to this branch. Use [skip ci] to push without firing.


Experimental — not for routine production CI. Intended to be merged so future agent work has a working substrate to build on.

@kantord kantord changed the title experiment(ci): devcontainer-based bug-fix agent (POC) chore: add visual bugfix agent Apr 28, 2026
@kantord
Copy link
Copy Markdown
Member Author

kantord commented Apr 29, 2026

@claude deep review please

@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

Claude finished @kantord's task in 2m 26s —— View job


Deep Review — PR #2120: chore: add visual bugfix agent

  • Gather context and understand request
  • Read changed files in detail
  • Review GitHub Actions workflow
  • Review helper shell scripts
  • Review skill documentation updates
  • Compile and post comprehensive review

Overall Assessment

This is a well-engineered proof-of-concept that isolates a genuinely novel capability (devcontainer-in-CI) from the harder AI-agent problem, which is the right approach. The technical solutions — docker exec cat bypass for tmpfs, xdotool window-mapping gate for UI readiness, named-volume caching via actions/cache — are all correct and well-justified. The PR description is unusually thorough, documenting each bug found and how it was fixed.

Two actionable issues (one security-relevant), several minor quality notes, and a few observations below.


Issues

1. Two action versions are not SHA-pinned (security)

Every other action in the workflow uses a full commit SHA, but these two don't:

# line 53
uses: docker/setup-buildx-action@v3           # ← mutable tag

# line 85
uses: devcontainers/ci@v0.3                   # ← mutable tag

Mutable tags mean the action can be silently updated to include malicious code on the next run. The existing actions/checkout, actions/cache, and actions/upload-artifact are all pinned correctly. These two should match.

Fix this →

2. $GITHUB_OUTPUT is unquoted in run: steps

Throughout the workflow (lines 44, 101–102, 116, 154–155, 191–192, 195–196), the pattern is:

echo "key=value" >> $GITHUB_OUTPUT   # ← unquoted

The GHA-provided path never contains spaces in practice, but the shell convention is to double-quote env vars on the right side of redirections: >> "$GITHUB_OUTPUT". Minor, but affects every timing/output step.

Fix this →


Quality Notes

3. devcontainer-steal.sh hardcodes -u node

# scripts/devcontainer-steal.sh, line 28
docker exec -u node "$CONTAINER" cat "$SRC" > "$DEST"

If a log file or artifact was written by a root process (e.g., something under the DinD daemon), this silently fails with permission denied. For the current use case (screenshot from import running as node) this is fine, but the script is documented as a general-purpose extractor. Either document the limitation explicitly or drop -u node (root can read everything inside the container). The screenshot helper already runs as -u node, so the file will be owned by node — consistent here.

4. Volume name hardcoded

# line 77
VOLUME=toolhive-node-modules-toolhive-studio

This works because the GitHub Actions workspace basename is always toolhive-studio. The devcontainer.json uses ${localWorkspaceFolderBasename} which resolves to the same. Brittle if the repo is ever checked out to a differently-named directory, but that can't happen in GitHub-managed runners, so it's fine for the scope of this experiment. A brief comment noting this assumption would help.

5. Type-check failure is warned, not errored

# lines 197-198
echo "result=fail" >> $GITHUB_OUTPUT
echo "::warning::type-check failed"
# (no exit 1)

Intentional — the job tests the substrate, not code correctness. But a future agent using this workflow will see the job green even when pnpm run type-check fails inside the container. If this workflow ever gets repurposed as a validation gate, this will need revisiting. Fine for a proof-of-concept, worth a comment.

6. Readiness loop timeout is exactly the step's timeout-minutes

The polling loop runs for 120 × sleep 5 = 600s = 10min, and the step has timeout-minutes: 10. These two limits race each other — the step could time out a few seconds before the 120th iteration. The error message would say "timeout" rather than the explicit ::error::App did not become ready. Minor, but reducing the loop to 110 iterations (550s) would ensure the custom error always wins.


Positive Callouts

  • xdotool search --class ToolHive readiness gate is the right fix for the blank-screen problem. Process presence ≠ UI rendered was a real bug and the solution is correct.
  • docker exec cat bypass for tmpfs is the only clean solution to the docker cp limitation. The explanation in both the script header and SKILL.md is thorough — future maintainers will understand why.
  • actions/cache + alpine copy for named volume is clever. The volume doesn't survive across jobs, so this is the only way to avoid a full pnpm install on warm runs.
  • [bracket] trick for pgrep -f is correctly applied at line 143 ([e]lectron, [t]hv serve) to avoid the self-match false-positive documented in the skill.
  • Diagnostic on failure (lines 244–259) is comprehensive — dumping all five log files means a failing run is debuggable without a re-run.
  • SKILL.md update replaces a silently broken docker cp example with the helper scripts and explains the tmpfs gotcha. The "Process running ≠ UI rendered" callout is a good lesson captured.
  • Pin hashes on checkout, cache, and upload-artifact — inconsistency with the two unpinned ones above makes those easier to spot, not harder.

Minor Observation: .claude-pr/ directory

The .claude-pr/ directory is untracked (shows as ?? in git status) and not .gitignore-listed. It looks like a leftover from a Claude Code agent run. Not part of this PR, but worth cleaning up before merge to avoid accidental future inclusion.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant