Conversation
|
@claude deep review please |
|
Claude finished @kantord's task in 2m 26s —— View job Deep Review — PR #2120: chore: add visual bugfix agent
Overall AssessmentThis is a well-engineered proof-of-concept that isolates a genuinely novel capability (devcontainer-in-CI) from the harder AI-agent problem, which is the right approach. The technical solutions — Two actionable issues (one security-relevant), several minor quality notes, and a few observations below. Issues1. Two action versions are not SHA-pinned (security)Every other action in the workflow uses a full commit SHA, but these two don't: # line 53
uses: docker/setup-buildx-action@v3 # ← mutable tag
# line 85
uses: devcontainers/ci@v0.3 # ← mutable tagMutable tags mean the action can be silently updated to include malicious code on the next run. The existing 2.
|
TL;DR
Proves the project's devcontainer (Xvfb + fluxbox + Electron + thv) can boot inside a GitHub Actions runner and be driven end-to-end via
docker exec— no AI agent involved here, just the substrate. Adds two helper scripts and corrects thedevcontainer-devskill so future agents can use this safely. Steady-state cost is ~4 min/run with the caching landed here, with a clear path to sub-minute via follow-up #2136.What lands in this PR
.github/workflows/experiment-devcontainer-proof.yml— push-triggered, branch-pinned proof workflow. Builds the devcontainer, boots the app, captures screenshots, drives one xdotool keystroke, runspnpm run type-checkinside the container, uploads screenshots as an artifact.scripts/devcontainer-screenshot.sh [host-path]— auto-finds the container, runsimportinside, streams the PNG out viadocker exec cat(works around the tmpfs/docker cpbug — see below). Prints absolute host path on stdout.scripts/devcontainer-steal.sh <container-id> <container-path> <host-path>— generic file extractor for any tmpfs path. Use for log files, generated artifacts, anything else that lives in/tmp. The screenshot helper calls this internally..claude/skills/devcontainer-dev/SKILL.md(and.codex/,.cursor/copies kept in sync) — replaces the brokendocker cpexample with the new helpers; documents the tmpfs gotcha; adds an "agent-grade readiness" note.actions/cachefor the namednode_modulesvolume in the proof workflow — dropspnpm installfrom ~100s to ~9s on warm.Proof
CI run: https://github.com/stacklok/toolhive-studio/actions/runs/25062176682
Final-state timings (with all fixes applied, attempt #2 = warm cache):
Both screenshots in the artifact are valid 1920×1200 PNGs of the rendered ToolHive Studio UI.
Bugs surfaced and fixed
docker cpis silently broken for/tmppaths./tmpinside the devcontainer istmpfs;docker cponly traverses overlay layers, so it returns "Could not find the file" even when the file demonstrably exists. Confirmed in CI and locally. Fix: stream viadocker exec cat(encapsulated indevcontainer-steal.sh).xdotool search --class ToolHivematching the main window, plus a 2s settling sleep.cacheFrom: type=ghawas silently failing withunknown cache importer: gha. Fix: adddocker/setup-buildx-action@v3so BuildKit's gha backend is registered.node_modulesnamed Docker volume doesn't survive across CI jobs, sopostCreateCommandrerunspnpm installfrom scratch. Fix: cache the volume contents viaactions/cache, populate beforedevcontainers/ciruns, dump back on cache miss.devcontainers/ci@v0.3doesn't actually export the gha cache (only imports). Discovered while measuring the warm rerun — the apt install layer takes a fresh ~35s every run regardless. Not fixed in this PR; tracked in Publish a prebuilt devcontainer image to GHCR #2136.Cost compared to the production bug-fix agent
The current production agent runs without a devcontainer and takes ~18 min per successful fix, dominated by Phase 1 Opus + Phase 2 Sonnet plus two full test-suite runs (#24982779175, #24781689310).
Phase 1 (5–9 min) dominates either way — that's the agent's actual thinking, unrelated to substrate. Live-app access mostly augments Phase 1: bugs that currently fall through to Phase 2b ("direct fix without TDD" → lower-quality PRs) might succeed at writing a failing test if the agent could observe the running app. Whether the +22% (or +5% post-#2136) is worth it depends on how much Phase 1's success rate improves on UI/routing/IPC bugs. This PR doesn't answer that question — it only proves the substrate. The actual ROI test is to run the integrated agent on a known-tricky UI bug (#663 is the natural candidate) and compare outcomes.
Follow-ups
docker pull. Combined with this PR'snode_modulescache, gets steady-state below 1 minute.xdotool+ screenshots. Reduces token spend on the agent side._visual-bug-fix-agent.yml+visual-bug-fix-on-label.yml) that wires this substrate into a label-gated copy of the production agent. Coming as a follow-up PR; doesn't modify the production agent.How to fire a run
The workflow is
push:-triggered and pinned to this branch. Use[skip ci]to push without firing.Experimental — not for routine production CI. Intended to be merged so future agent work has a working substrate to build on.