running-in-ci: verify external-tool behavior by run or clone, not just hand-test #327
Closes #326 Co-Authored-By: Claude <noreply@anthropic.com>
max-sixty
left a comment
overall decent, but:
- the focus should be on verifying claims, if possible by running code
- it is possible to read a tool's source! tend can clone the repo!
- fine to retain the "don't make overconfident claims" but we should have that elsewhere
…nd-test Review feedback on #327 pointed out the section framed external-tool claims as "defer to a human" when the bot can actually verify by running the tool or cloning its repo. Rewrite the subsection so it opens with the two verification paths (install-and-run, then clone-and-grep) and only falls back to asking a human when both fail. The "don't make overconfident claims" guidance already lives in the preceding subsection.
Good points, rewrote in 6437170:
was this the actual problem in this specific case — that the |
Checked both sources. The mintlify page at The cmux source tells the other half of the story. In Net: docs were wrong, not a hallucination — but the new section's "clone the repo and grep the CLI parser" step would have caught it, since the discrepancy is visible in one |
tend-agent
left a comment
Small correction to my earlier comment for the record: --json is in fact wired into list-workspaces at CLI/cmux.swift:2003-2005 (even at tag v0.63.2), via the shared if jsonOutput block.
The real reason the published recipe broke: cmux parses --json as a global flag in a loop that breaks at the first non-global token (see CLI/cmux.swift:1665-1710). So cmux list-workspaces --json silently drops --json and returns text; the form that works is cmux --json list-workspaces. The mintlify docs show the broken flag ordering, which is what I copied.
Still "docs wrong, not hallucination" — but a naïve grep -- --json CLI/cmux.swift would have confirmed --json works for list-workspaces and shipped the same broken recipe. Path 1 (run the tool) is what would have caught this; Path 2 alone misses flag-placement bugs unless you also read the arg-parser structure. Worth keeping that caveat in mind when this guidance gets applied to other fast-moving CLIs.
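The flag-ordering behavior described above can be illustrated with a small sketch. This is not cmux's code (cmux is written in Swift, and the real parser is at the `CLI/cmux.swift` lines cited above); `parse_args`, `run`, and the `"main"` workspace name are hypothetical stand-ins that only mimic the shape of a "global flags first, stop at the first non-global token" loop.

```python
# Toy model of a CLI that treats --json as a *global* flag.
# All names here are hypothetical; this only mirrors the parsing shape
# described in the comment above, not the actual cmux implementation.

def parse_args(argv):
    """Consume leading global flags, stopping at the first non-global token."""
    json_output = False
    i = 0
    while i < len(argv):
        if argv[i] == "--json":
            json_output = True
            i += 1
        else:
            break  # first non-global token ends global-flag parsing
    command = argv[i] if i < len(argv) else None
    rest = argv[i + 1:]  # everything after the subcommand
    return json_output, command, rest


def run(argv):
    """Dispatch a subcommand; returns the text the tool would print."""
    json_output, command, rest = parse_args(argv)
    if command == "list-workspaces":
        # The handler does honour the shared flag, so a grep for "--json"
        # makes the subcommand look JSON-capable. `rest` is never inspected,
        # which is how a trailing --json is dropped without an error.
        return '["main"]' if json_output else "main"
    raise ValueError(f"unknown command: {command!r}")


if __name__ == "__main__":
    print(run(["--json", "list-workspaces"]))  # flag before subcommand: JSON
    print(run(["list-workspaces", "--json"]))  # flag after subcommand: plain text
```

In this model both invocations exit successfully, so only inspecting the output (or reading the parser loop itself) distinguishes the working ordering from the broken one.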
ok, can you add a comment on the original issue that the |
Done — worktrunk#1907 comment explains that |
## Summary

Update the `running-in-ci` "How to propose" recipe so a bot updating `.claude/skills/` can actually write to disk. The current recipe (`cd .claude && mv /tmp/... SKILL.md`) satisfies the Claude Code harness write-guard ([anthropics/claude-code#37157](anthropics/claude-code#37157)) but fails the read-only bind-mount the sandbox places on `.claude/`. Mirror the pattern `review-runs` already documents: do the edit, commit, and push from a git worktree under `$TMPDIR`. Both restrictions are now explained inline so a future reader knows why the worktree is mandatory.

## Why

`worktrunk-bot` hit this in worktrunk run [24926120125](https://github.com/max-sixty/worktrunk/actions/runs/24926120125) (a `tend-mention` session triggered by an explicit maintainer instruction to update repo skill guidance). The bot followed the bundled recipe, hit `mv: ... Read-only file system`, retried with `cp` (same error), retried with `touch` (same error), then ad-libbed `git hash-object -w` plus `git update-index --cacheinfo` to write the blob and stage it directly to the index, a workaround that left the working tree showing the file as modified after the fact. ~7 wasted bash cycles before producing [worktrunk PR #2415](max-sixty/worktrunk#2415).

`worktrunk-bot` followed the new bundled-skill-defect flow (introduced in #324 yesterday) and opened permission-request issue [max-sixty/worktrunk#2416](max-sixty/worktrunk#2416) on the consumer side. This PR is the corresponding tend-side fix surfaced by `review-reviewers` rather than waiting on the permission round-trip: the diagnosis is unambiguous and the fix is small.

## What changes

The "Draft a minimal edit" step (step 3 of "How to propose"):

- Drops the `cd .claude && mv /tmp/... SKILL.md` recipe that fails on the read-only mount.
- Adds the same prose `review-runs` already carries explaining (a) the read-only bind-mount of `.claude/` and (b) the harness write-guard from claude-code#37157, so a reader sees both barriers and why a worktree clears both.
- Replaces the recipe with the `git worktree add "$TMPDIR/skill-fix"` pattern, parameterised on `<topic>-$GITHUB_RUN_ID` for the branch name.
- Keeps the existing TODO referencing claude-code#37157.

The frontmatter snippet is moved up so the recipe sits at the bottom of step 3 with no interleaving.

## Test plan

- [x] Eaten own dogfood: this PR's branch was authored from a worktree at `$TMPDIR/skill-fix` using exactly the recipe being landed; commit and push worked end-to-end. Worktree under `$TMPDIR` is writable, no `Read-only file system` errors.
- [ ] CI on this PR passes (actionlint, ruff, typos, uv-lock).
- [ ] Visual diff vs `review-runs` SKILL.md prose to confirm the two skills now agree on the recipe shape.

Independent of [#327](#327) (different concern in the same file: external-tool verification, lines ~498; no overlap with this edit at lines ~568).

---------

Co-authored-by: continuous-bot <269947486+continuous-bot@users.noreply.github.com>
Co-authored-by: tend-agent <270458913+tend-agent@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
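For readers who want to see the shape of the worktree pattern, here is a minimal sketch that only builds the git command sequence without running it. The exact recipe text in the skill is not reproduced in this thread, so the step order, the `origin` remote name, the `"local"` fallback run id, and the `worktree_recipe` helper itself are assumptions for illustration; only the `$TMPDIR/skill-fix` location and the `<topic>-$GITHUB_RUN_ID` branch naming come from the description above.

```python
import os
import tempfile
from pathlib import Path


def worktree_recipe(topic, skill_path, message, run_id=None, tmpdir=None):
    """Return the git commands for editing a skill file from a worktree.

    Hypothetical helper: builds commands only and executes nothing.
    `skill_path` is the repo-relative file to stage, supplied by the caller.
    """
    run_id = run_id or os.environ.get("GITHUB_RUN_ID", "local")
    tmpdir = tmpdir or os.environ.get("TMPDIR") or tempfile.gettempdir()
    branch = f"{topic}-{run_id}"                # <topic>-$GITHUB_RUN_ID
    worktree = str(Path(tmpdir) / "skill-fix")  # $TMPDIR/skill-fix
    return [
        # New branch checked out in a writable directory outside the
        # read-only .claude/ mount of the main checkout.
        ["git", "worktree", "add", "-b", branch, worktree],
        # (The file edit itself happens inside `worktree` at this point.)
        ["git", "-C", worktree, "add", skill_path],
        ["git", "-C", worktree, "commit", "-m", message],
        # Remote name "origin" is an assumption.
        ["git", "-C", worktree, "push", "origin", branch],
    ]


if __name__ == "__main__":
    for cmd in worktree_recipe("skill-fix", "path/to/SKILL.md", "fix recipe",
                               run_id="12345", tmpdir="/tmp/demo"):
        print(" ".join(cmd))
```

Keeping the helper side-effect free makes it easy to check the branch and path naming before anything touches a real repository.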
Problem
The bundled `running-in-ci` skill's Grounded Analysis section covers behavioral claims by telling the bot to "run the command yourself" or hedge, but that fallback doesn't obviously apply to external CLIs and APIs that aren't installed in CI and aren't exercised by automated tests. In worktrunk#1907 the bot read upstream mintlify docs describing a `cmux list-workspaces --json` flag, believed it, and committed a recipe that broke for every reader: the installed cmux had no such flag.
Solution
Adds a new `### Verifying external-tool behavior` subsection under Grounded Analysis. It points at two concrete verification paths in order of preference: install and run the tool, or clone its public repo and grep the source. Deferring to a human with the tool installed is the fallback only when both paths fail. The "don't make overconfident claims" framing already lives in the preceding `### User-facing comments require source evidence` subsection.
Bad/good example is drawn from the cmux incident: the good path now shows cloning the upstream repo and checking the CLI parser, not asking a human to confirm.
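As an illustration of the first path (install and run), a check along these lines confirms a documented "JSON output" claim by executing the command and parsing what it prints. The `emits_json` helper is hypothetical and is not part of the skill text; it assumes the tool is already installed and on `PATH`.

```python
import json
import subprocess
import sys


def emits_json(cmd, timeout=30):
    """Run `cmd` (an argv list) and report whether its stdout parses as JSON.

    Hypothetical helper. A non-zero exit status counts as failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        return False
    try:
        json.loads(result.stdout)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without any external tool installed.
    prints_json = [sys.executable, "-c", "print('[1, 2]')"]
    prints_text = [sys.executable, "-c", "print('main')"]
    print(emits_json(prints_json), emits_json(prints_text))
    # With the real tool installed, the same check would be applied to the
    # exact argv a recipe is about to publish, e.g. both flag orderings
    # discussed in the review thread.
```

The check is deliberately narrow: it verifies the exact invocation that will be documented, which is the thing a reader will copy.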
Testing
Skill text only. `pre-commit run` on the modified file passes (typos, trim-whitespace, bang-backtick, end-of-files).
Closes #326 (automated triage)