running-in-ci: verify external-tool behavior by run or clone, not just hand-test by tend-agent · Pull Request #327 · max-sixty/tend

tend-agent · 2026-04-23T19:00:58Z

Problem

The bundled `running-in-ci` skill's Grounded Analysis section covers behavioral claims by telling the bot to "run the command yourself" or hedge, but that fallback doesn't obviously apply to external CLIs and APIs that aren't installed in CI and aren't exercised by automated tests. In worktrunk#1907 the bot read upstream mintlify docs describing a `cmux list-workspaces --json` flag, believed it, and committed a recipe that broke for every reader: the installed `cmux` had no such flag.

Solution

Adds a new `### Verifying external-tool behavior` subsection under Grounded Analysis. It points at two concrete verification paths in order of preference: install and run the tool, or clone its public repo and grep the source. Deferring to a human with the tool installed is the fallback only when both paths fail. The "don't make overconfident claims" framing already lives in the preceding `### User-facing comments require source evidence` subsection.

The bad/good example is drawn from the cmux incident: the good path now shows cloning the upstream repo and checking the CLI parser, not asking a human to confirm.
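As a rough illustration of the clone-and-grep path, the sketch below searches an already-cloned source tree for a flag string. It is not part of the skill text: `find_flag_mentions`, `demo`, and the `CLI/tool.swift` fixture are hypothetical names made up for this example, and the clone step itself is assumed to have happened beforehand.

```python
import tempfile
from pathlib import Path


def find_flag_mentions(source_root: str, flag: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for lines mentioning `flag`."""
    hits = []
    root = Path(source_root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if flag in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


def demo() -> list[tuple[str, int, str]]:
    """Build a tiny fake checkout (hypothetical contents) and search it."""
    with tempfile.TemporaryDirectory() as tmp:
        cli = Path(tmp) / "CLI"
        cli.mkdir()
        (cli / "tool.swift").write_text(
            'let usage = "Usage: tool list"\n'
            'if arg == "--json" { jsonOutput = true }\n',
            encoding="utf-8",
        )
        return find_flag_mentions(tmp, "--json")
```

A text match like this only shows that the flag string appears somewhere in the source; it says nothing about where on the command line the flag is accepted, so running the tool remains the stronger check.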

Testing

Skill text only. `pre-commit run` on the modified file passes (typos, trim-whitespace, bang-backtick, end-of-files).

Closes #326 — automated triage

max-sixty

overall decent, but:

- the focus should be on verifying claims, if possible by running code
- it is possible to read a tool's source! tend can clone the repo!
- fine to retain the "don't make overconfident claims" but we should have that elsewhere

…nd-test Review feedback on #327 pointed out the section framed external-tool claims as "defer to a human" when the bot can actually verify by running the tool or cloning its repo. Rewrite the subsection so it opens with the two verification paths (install-and-run, then clone-and-grep) and only falls back to asking a human when both fail. The "don't make overconfident claims" guidance already lives in the preceding subsection.

tend-agent · 2026-04-24T00:23:45Z

Good points, rewrote in 6437170:

- Section now leads with verifying by running the tool (install-and-run) or reading source (`gh repo clone` then `grep`). Deferring to a human is the fallback only when both fail.
- Retitled to `### Verifying external-tool behavior`: positive framing, since the primary action is verification rather than flagging a gap.
- Dropped the "do not post a confident claim" language; that rule already lives in the preceding `### User-facing comments require source evidence` subsection, so no new home needed.
- Updated the good example to show cloning cmux's source and checking the CLI parser rather than asking a human to confirm.

max-sixty · 2026-04-24T07:17:08Z

was this the actual problem in this specific case — that the cmux docs were wrong? if so, this is fine. if not, and actually we hallucinated something, then we should adjust

tend-agent · 2026-04-24T07:19:31Z

Checked both sources. The mintlify page at manaflow-ai-cmux.mintlify.app/cli/workspaces does describe a `--json` flag for `list-workspaces`, with the exact JSON shape I posted (a `ParamField path="--json"` block plus a sample `{"workspaces": [{"ref": ..., "id": ..., "index": ..., "title": ..., "selected": ...}]}` payload). So I wasn't synthesizing the schema; I quoted what the docs said.

The cmux source tells the other half of the story. In `CLI/cmux.swift` the `list-workspaces` help block is just `Usage: cmux list-workspaces` with no flags; `--json` is only wired up on `tree` (and a few others). So the published docs are ahead of (or diverged from) the shipped CLI parser.

Net: docs were wrong, not a hallucination, but the new section's "clone the repo and grep the CLI parser" step would have caught it, since the discrepancy is visible in one `grep -- --json CLI/cmux.swift`. So the framing in this PR holds. Happy to file a docs issue with manaflow-ai/cmux if you'd like, separately.

tend-agent

Small correction to my earlier comment for the record: `--json` is in fact wired into `list-workspaces` at `CLI/cmux.swift:2003-2005` (even at tag `v0.63.2`), via the shared `if jsonOutput` block.

The real reason the published recipe broke: cmux parses `--json` as a global flag in a loop that breaks at the first non-global token (see `CLI/cmux.swift:1665-1710`). So `cmux list-workspaces --json` silently drops `--json` and returns text; the form that works is `cmux --json list-workspaces`. The mintlify docs show the broken flag ordering, which is what I copied.
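To make the flag-ordering behavior concrete, here is a toy Python model of a parser that consumes global flags only until it meets the first non-global token. It is an illustrative sketch, not cmux's actual Swift code: `GLOBAL_FLAGS`, `parse`, and the returned dictionary keys are all hypothetical.

```python
# Assumption: a single global flag, purely for illustration.
GLOBAL_FLAGS = {"--json"}


def parse(argv: list[str]) -> dict:
    """Toy parser: consume leading global flags, stop at the first other token."""
    json_output = False
    i = 0
    while i < len(argv):
        if argv[i] not in GLOBAL_FLAGS:
            break  # first non-global token ends global-flag parsing
        if argv[i] == "--json":
            json_output = True
        i += 1
    command = argv[i] if i < len(argv) else None
    rest = argv[i + 1:]  # handed to the subcommand, which may ignore it
    return {"command": command, "json": json_output, "rest": rest}
```

In this model, `parse(["--json", "list-workspaces"])` sets `json` to true, while `parse(["list-workspaces", "--json"])` leaves it false and pushes `--json` into `rest`, mirroring the ordering sensitivity described above.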

Still "docs wrong, not hallucination", but a naïve `grep -- --json CLI/cmux.swift` would have confirmed `--json` works for `list-workspaces` and shipped the same broken recipe. Path 1 (run the tool) is what would have caught this; Path 2 alone misses flag-placement bugs unless you also read the arg-parser structure. Worth keeping that caveat in mind when this guidance gets applied to other fast-moving CLIs.

max-sixty · 2026-04-25T15:49:52Z

ok, can you add a comment on the original issue that the `--json` flag is available if expressed differently?

tend-agent · 2026-04-25T15:53:28Z

Done. The worktrunk#1907 comment explains that `--json` does work for `list-workspaces` but only as a global flag (`cmux --json list-workspaces`), with source pointers at `CLI/cmux.swift:1635-1690` (parser) and `:2003-2006` (handler).

## Summary

Update the `running-in-ci` "How to propose" recipe so a bot updating `.claude/skills/` can actually write to disk. The current recipe (`cd .claude && mv /tmp/... SKILL.md`) satisfies the Claude Code harness write-guard ([anthropics/claude-code#37157](anthropics/claude-code#37157)) but fails the read-only bind-mount the sandbox places on `.claude/`. Mirror the pattern `review-runs` already documents: do the edit, commit, and push from a git worktree under `$TMPDIR`. Both restrictions are now explained inline so a future reader knows why the worktree is mandatory.

## Why

`worktrunk-bot` hit this in worktrunk run [24926120125](https://github.com/max-sixty/worktrunk/actions/runs/24926120125) (a `tend-mention` session triggered by an explicit maintainer instruction to update repo skill guidance). The bot followed the bundled recipe, hit `mv: ... Read-only file system`, retried with `cp` (same error), retried with `touch` (same error), then ad-libbed `git hash-object -w` plus `git update-index --cacheinfo` to write the blob and stage it directly to the index, a workaround that left the working tree showing the file as modified after the fact. ~7 wasted bash cycles before producing [worktrunk PR #2415](max-sixty/worktrunk#2415).

`worktrunk-bot` followed the new bundled-skill-defect flow (introduced in #324 yesterday) and opened permission-request issue [max-sixty/worktrunk#2416](max-sixty/worktrunk#2416) on the consumer side. This PR is the corresponding tend-side fix surfaced by `review-reviewers` rather than waiting on the permission round-trip: the diagnosis is unambiguous and the fix is small.

## What changes

The "Draft a minimal edit" step (step 3 of "How to propose"):

- Drops the `cd .claude && mv /tmp/... SKILL.md` recipe that fails on the read-only mount.
- Adds the same prose `review-runs` already carries explaining (a) the read-only bind-mount of `.claude/` and (b) the harness write-guard from claude-code#37157, so a reader sees both barriers and why a worktree clears both.
- Replaces the recipe with the `git worktree add "$TMPDIR/skill-fix"` pattern, parameterised on `<topic>-$GITHUB_RUN_ID` for the branch name.
- Keeps the existing TODO referencing claude-code#37157.

The frontmatter snippet is moved up so the recipe sits at the bottom of step 3 with no interleaving.

## Test plan

- [x] Eaten own dogfood: this PR's branch was authored from a worktree at `$TMPDIR/skill-fix` using exactly the recipe being landed; commit and push worked end-to-end. Worktree under `$TMPDIR` is writable, no `Read-only file system` errors.
- [ ] CI on this PR passes (actionlint, ruff, typos, uv-lock).
- [ ] Visual diff vs `review-runs` SKILL.md prose to confirm the two skills now agree on the recipe shape.

Independent of [#327](#327) (different concern in the same file: external-tool verification, lines ~498, no overlap with this edit at lines ~568).

---------

Co-authored-by: continuous-bot <269947486+continuous-bot@users.noreply.github.com>
Co-authored-by: tend-agent <270458913+tend-agent@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
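The worktree pattern described above can be sketched as a small command builder. This is a guess at the general shape, not the recipe text that landed in the skill: the function name `worktree_commands`, the `-b` branch creation, the `commit -am` and `push -u origin` steps, and the example arguments are all assumptions made for illustration. The function only builds argument lists and does not run git.

```python
def worktree_commands(topic: str, run_id: str, tmpdir: str, message: str) -> list[list[str]]:
    """Build git argv lists for editing a skill from a worktree under tmpdir.

    Assumed shape only: branch is `<topic>-<run_id>`, worktree lives at
    `<tmpdir>/skill-fix`, and the edit is committed and pushed from there.
    """
    branch = f"{topic}-{run_id}"
    path = f"{tmpdir.rstrip('/')}/skill-fix"
    return [
        # Create a new branch checked out in a writable worktree outside .claude/
        ["git", "worktree", "add", "-b", branch, path],
        # After editing files under `path`, commit tracked changes from the worktree
        ["git", "-C", path, "commit", "-am", message],
        # Publish the branch so a PR can be opened from it
        ["git", "-C", path, "push", "-u", "origin", branch],
    ]
```

Each inner list could be passed to `subprocess.run(..., check=True)` in turn, with the actual file edit happening between the first and second command.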

running-in-ci: require hand-test for external-tool behavioral claims

b562b3b

Closes #326 Co-Authored-By: Claude <noreply@anthropic.com>

tend-agent mentioned this pull request Apr 23, 2026

running-in-ci: flag external-tool behavioral claims that need hand-testing #326

Closed

max-sixty reviewed Apr 24, 2026

tend-agent changed the title ~~running-in-ci: require hand-test for external-tool behavioral claims~~ running-in-ci: verify external-tool behavior by run or clone, not just hand-test Apr 24, 2026

max-sixty reviewed Apr 24, 2026

Comment thread plugins/tend-ci-runner/skills/running-in-ci/SKILL.md Outdated

Apply suggestion from @max-sixty

f71f267

tend-agent commented Apr 24, 2026

This was referenced Apr 24, 2026

review-runs-tracking: 2026-04 #138

Closed

running-in-ci: skill-edit recipe should use worktree under $TMPDIR #328

Merged

tend-agent mentioned this pull request Apr 25, 2026

docs: add cmux workspace integration recipe max-sixty/worktrunk#1907

Merged

max-sixty merged commit 586940b into main Apr 25, 2026
5 checks passed

max-sixty deleted the fix/issue-326 branch April 25, 2026 22:46

max-sixty merged 3 commits into main from fix/issue-326
