
running-in-ci: verify external-tool behavior by run or clone, not just hand-test #327

Merged
max-sixty merged 3 commits into main from fix/issue-326 on Apr 25, 2026

@tend-agent
Collaborator

@tend-agent tend-agent commented Apr 23, 2026

Problem

The bundled `running-in-ci` skill's Grounded Analysis section covers behavioral claims by telling the bot to "run the command yourself" or hedge, but that fallback doesn't obviously apply to external CLIs and APIs that aren't installed in CI and aren't exercised by automated tests. In worktrunk#1907 the bot read upstream mintlify docs describing a `cmux list-workspaces --json` flag, believed it, and committed a recipe that broke for every reader: the installed cmux had no such flag.

Solution

Adds a new `### Verifying external-tool behavior` subsection under Grounded Analysis. It points at two concrete verification paths in order of preference: install and run the tool, or clone its public repo and grep the source. Deferring to a human with the tool installed is the fallback only when both paths fail. The "don't make overconfident claims" framing already lives in the preceding `### User-facing comments require source evidence` subsection.

Bad/good example is drawn from the cmux incident: the good path now shows cloning the upstream repo and checking the CLI parser, not asking a human to confirm.
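
The preference order described above can be sketched roughly as follows. This is only an illustration, not the skill's text: `pick_verification_path` and its return labels are hypothetical names, and the sketch only checks whether the tool is already on `PATH` or a local clone already exists, rather than installing or cloning anything itself.

```python
import shutil
from pathlib import Path
from typing import Optional


def pick_verification_path(tool: str, source_dir: Optional[str] = None) -> str:
    """Return the best available way to verify a claim about `tool`.

    Hypothetical helper: the labels "run", "grep-source", and "ask-human"
    are illustrative and do not appear in the skill itself.
    """
    # Path 1: the tool is installed, so its behavior can be observed directly.
    if shutil.which(tool) is not None:
        return "run"
    # Path 2: a local clone of the tool's source exists and can be searched.
    if source_dir is not None and Path(source_dir).is_dir():
        return "grep-source"
    # Fallback: neither path is available, so defer to a human with the tool.
    return "ask-human"
```

A caller would try `pick_verification_path("cmux", "/tmp/cmux")` (both arguments are placeholders) and only ask a human when the result is `"ask-human"`.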

Testing

Skill text only. `pre-commit run` on the modified file passes (typos, trim-whitespace, bang-backtick, end-of-files).


Closes #326 — automated triage

Closes #326

Co-Authored-By: Claude <noreply@anthropic.com>
Owner

@max-sixty max-sixty left a comment

overall decent, but:

  • the focus should be on verifying claims, if possible by running code
  • it is possible to read a tool's source! tend can clone the repo!
  • fine to retain the "don't make overconfident claims" but we should have that elsewhere

…nd-test

Review feedback on #327 pointed out the section framed external-tool claims
as "defer to a human" when the bot can actually verify by running the tool
or cloning its repo. Rewrite the subsection so it opens with the two
verification paths (install-and-run, then clone-and-grep) and only falls
back to asking a human when both fail. The "don't make overconfident
claims" guidance already lives in the preceding subsection.
@tend-agent tend-agent changed the title from "running-in-ci: require hand-test for external-tool behavioral claims" to "running-in-ci: verify external-tool behavior by run or clone, not just hand-test" on Apr 24, 2026
@tend-agent
Collaborator Author

Good points, rewrote in 6437170:

  • Section now leads with verifying by running the tool (install-and-run) or reading source (`gh repo clone`, then `grep`). Deferring to a human is the fallback only when both fail.
  • Retitled to `### Verifying external-tool behavior` for positive framing, since the primary action is verification rather than flagging a gap.
  • Dropped the "do not post a confident claim" language; that rule already lives in the preceding `### User-facing comments require source evidence` subsection, so no new home needed.
  • Updated the good example to show cloning cmux's source and checking the CLI parser rather than asking a human to confirm.

Comment thread plugins/tend-ci-runner/skills/running-in-ci/SKILL.md Outdated
@max-sixty
Owner

was this the actual problem in this specific case — that the cmux docs were wrong? if so, this is fine. if not, and actually we hallucinated something, then we should adjust

@tend-agent
Collaborator Author

Checked both sources. The mintlify page at manaflow-ai-cmux.mintlify.app/cli/workspaces does describe a --json flag for list-workspaces, with the exact JSON shape I posted (a ParamField path="--json" block plus a sample {"workspaces": [{"ref": ..., "id": ..., "index": ..., "title": ..., "selected": ...}]} payload). So I wasn't synthesizing the schema — I quoted what the docs said.

The cmux source tells the other half of the story. In CLI/cmux.swift the list-workspaces help block is just Usage: cmux list-workspaces with no flags; --json is only wired up on tree (and a few others). So the published docs are ahead of (or diverged from) the shipped CLI parser.

Net: docs were wrong, not a hallucination. The new section's "clone the repo and grep the CLI parser" step would still have caught it, since the discrepancy is visible in one `grep -- --json CLI/cmux.swift`. So the framing in this PR holds. Happy to file a docs issue with manaflow-ai/cmux if you'd like, separately.

Collaborator Author

@tend-agent tend-agent left a comment

Small correction to my earlier comment for the record: `--json` is in fact wired into `list-workspaces` at `CLI/cmux.swift:2003-2005` (even at tag `v0.63.2`), via the shared `if jsonOutput` block.

The real reason the published recipe broke: cmux parses `--json` as a global flag in a loop that breaks at the first non-global token (see `CLI/cmux.swift:1665-1710`). So `cmux list-workspaces --json` silently drops `--json` and returns text; the form that works is `cmux --json list-workspaces`. The mintlify docs show the broken flag ordering, which is what I copied.

Still "docs wrong, not hallucination", but a naïve `grep -- --json CLI/cmux.swift` would have appeared to confirm that `--json` works for `list-workspaces`, and the same broken recipe would have shipped. Path 1 (run the tool) is what would have caught this; Path 2 alone misses flag-placement bugs unless you also read the arg-parser structure. Worth keeping that caveat in mind when this guidance gets applied to other fast-moving CLIs.
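
To make the flag-placement failure concrete, here is a minimal Python model of the parsing behavior described above. It is a sketch based only on this comment's description, not a port of the Swift code in `CLI/cmux.swift`: the function name `parse_args`, the `GLOBAL_FLAGS` set, and the returned dictionary shape are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumption: --json is the only global flag modelled in this sketch.
GLOBAL_FLAGS = {"--json"}


def parse_args(argv: List[str]) -> Dict[str, object]:
    """Consume leading global flags, stopping at the first non-global token."""
    json_output = False
    i = 0
    while i < len(argv):
        if argv[i] in GLOBAL_FLAGS:
            json_output = True
            i += 1
        else:
            # First non-global token: stop scanning for global flags.
            break
    rest = argv[i:]
    command = rest[0] if rest else None
    # Anything after the command is never checked for global flags,
    # so a trailing --json is silently ignored in this model.
    return {"command": command, "json": json_output, "ignored": rest[1:]}


# Flag after the subcommand: JSON output is not enabled.
print(parse_args(["list-workspaces", "--json"]))
# Flag before the subcommand: JSON output is enabled.
print(parse_args(["--json", "list-workspaces"]))
```

In this model a search for the string `--json` in the source succeeds in both cases, which is why reading the parser's structure (or running the tool) is needed to see the difference.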

@max-sixty
Owner

ok, can you add a comment on the original issue that the --json flag is available if expressed differently?

@tend-agent
Collaborator Author

Done — worktrunk#1907 comment explains that --json does work for list-workspaces but only as a global flag (cmux --json list-workspaces), with source pointers at CLI/cmux.swift:1635-1690 (parser) and :2003-2006 (handler).

@max-sixty max-sixty merged commit 586940b into main Apr 25, 2026
5 checks passed
@max-sixty max-sixty deleted the fix/issue-326 branch April 25, 2026 22:46
max-sixty pushed a commit that referenced this pull request Apr 25, 2026
)

## Summary

Update the `running-in-ci` "How to propose" recipe so a bot updating
`.claude/skills/` can actually write to disk. The current recipe (`cd
.claude && mv /tmp/... SKILL.md`) satisfies the Claude Code harness
write-guard
(anthropics/claude-code#37157)
but fails the read-only bind-mount the sandbox places on `.claude/`.
Mirror the pattern `review-runs` already documents: do the edit, commit,
and push from a git worktree under `$TMPDIR`. Both restrictions are now
explained inline so a future reader knows why the worktree is mandatory.

## Why

`worktrunk-bot` hit this in worktrunk run
[24926120125](https://github.com/max-sixty/worktrunk/actions/runs/24926120125)
(a `tend-mention` session triggered by an explicit maintainer
instruction to update repo skill guidance). The bot followed the bundled
recipe, hit `mv: ... Read-only file system`, retried with `cp` (same
error), retried with `touch` (same error), then ad-libbed `git
hash-object -w` plus `git update-index --cacheinfo` to write the blob
and stage it directly to the index — a workaround that left the working
tree showing the file as modified after the fact. ~7 wasted bash cycles
before producing worktrunk PR max-sixty/worktrunk#2415.

`worktrunk-bot` followed the new bundled-skill-defect flow (introduced
in #324 yesterday) and opened permission-request issue
max-sixty/worktrunk#2416
on the consumer side. This PR is the corresponding tend-side fix
surfaced by `review-reviewers` rather than waiting on the permission
round-trip — the diagnosis is unambiguous and the fix is small.

## What changes

The "Draft a minimal edit" step (step 3 of "How to propose"):

- Drops the `cd .claude && mv /tmp/... SKILL.md` recipe that fails on
the read-only mount.
- Adds the same prose `review-runs` already carries explaining (a) the
read-only bind-mount of `.claude/` and (b) the harness write-guard from
claude-code#37157, so a reader sees both barriers and why a worktree
clears both.
- Replaces the recipe with the `git worktree add "$TMPDIR/skill-fix"`
pattern, parameterised on `<topic>-$GITHUB_RUN_ID` for the branch name.
- Keeps the existing TODO referencing claude-code#37157.

The frontmatter snippet is moved up so the recipe sits at the bottom of
step 3 with no interleaving.
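
As a rough illustration of the worktree pattern described above (not the recipe's actual text), the commands could be assembled like this. Only the `$TMPDIR/skill-fix` path and the `<topic>-$GITHUB_RUN_ID` branch shape come from the description; the `-b` flag, the staging, commit, and push steps, the commit message, and the `worktree_commands` helper are assumptions for the sketch.

```python
import os
from typing import List


def worktree_commands(topic: str, run_id: str, tmpdir: str) -> List[List[str]]:
    """Build git commands that edit, commit, and push from a worktree.

    Hypothetical helper: it returns argument lists without running them,
    so a caller can inspect them or pass each one to subprocess.run.
    """
    branch = f"{topic}-{run_id}"              # <topic>-$GITHUB_RUN_ID
    path = os.path.join(tmpdir, "skill-fix")  # $TMPDIR/skill-fix
    return [
        # Create a writable checkout outside the read-only .claude/ mount.
        ["git", "worktree", "add", "-b", branch, path],
        # Stage, commit, and push from inside the worktree (assumed steps).
        ["git", "-C", path, "add", "-A"],
        ["git", "-C", path, "commit", "-m", f"skills: {topic}"],
        ["git", "-C", path, "push", "-u", "origin", branch],
    ]


if __name__ == "__main__":
    # Environment variable names follow the description; defaults are placeholders.
    cmds = worktree_commands(
        "skill-fix-example",
        os.environ.get("GITHUB_RUN_ID", "0"),
        os.environ.get("TMPDIR", "/tmp"),
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

The skill edit itself would happen between the first and second command, inside the worktree path.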

## Test plan

- [x] Eaten own dogfood: this PR's branch was authored from a worktree
at `$TMPDIR/skill-fix` using exactly the recipe being landed; commit and
push worked end-to-end. Worktree under `$TMPDIR` is writable, no
`Read-only file system` errors.
- [ ] CI on this PR passes (actionlint, ruff, typos, uv-lock).
- [ ] Visual diff vs `review-runs` SKILL.md prose to confirm the two
skills now agree on the recipe shape.

Independent of #327
(different concern in the same file: external-tool verification, lines
~498 — no overlap with this edit at lines ~568).

---------

Co-authored-by: continuous-bot <269947486+continuous-bot@users.noreply.github.com>
Co-authored-by: tend-agent <270458913+tend-agent@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

running-in-ci: flag external-tool behavioral claims that need hand-testing
