running-in-ci: verify external-tool behavior by run or clone, not just hand-test (#327)

tend-agent · claude · max-sixty · web-flow · commit 586940b887ac · 2026-04-25T15:46:05.000-07:00
## Problem

The bundled `running-in-ci` skill's Grounded Analysis section covers behavioral claims by telling the bot to "run the command yourself" or hedge, but that fallback doesn't obviously apply to external CLIs and APIs that aren't installed in CI and aren't exercised by automated tests. In [worktrunk#1907](max-sixty/worktrunk#1907 (comment)) the bot read upstream mintlify docs describing a `cmux list-workspaces --json` flag, believed it, and committed a recipe that broke for every reader: the installed cmux had no such flag.

## Solution

Adds a new `### Verifying external-tool behavior` subsection under Grounded Analysis. It points at two concrete verification paths in order of preference: install and run the tool, or clone its public repo and grep the source. Deferring to a human with the tool installed is the fallback only when both paths fail. The "don't make overconfident claims" framing already lives in the preceding `### User-facing comments require source evidence` subsection.

The bad/good example is drawn from the cmux incident: the good path now shows cloning the upstream repo and checking the CLI parser, not asking a human to confirm.

## Testing

Skill text only. `pre-commit run` on the modified file passes (typos, trim-whitespace, bang-backtick, end-of-files).

---

Closes #326 (automated triage)

---------

Co-authored-by: tend-agent <270458913+tend-agent@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
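
As a rough illustration of the clone-and-grep path named in the Solution, the check could look something like the sketch below. The `flag_defined` helper and the `/tmp/tool-src` path are hypothetical names introduced here, not part of the skill, and the sketch assumes a `grep` that supports `-r` (GNU and BSD grep both do).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the skill: report whether a flag string
# appears anywhere in an already-cloned source tree.
# Usage: flag_defined <source-dir> <flag>
flag_defined() {
  local src_dir="$1" flag="$2"
  # -r: recurse into the tree; -F: match a fixed string, not a regex;
  # -q: print nothing, signal the result through the exit status;
  # -e: treat the next argument as the pattern even though it starts with "-".
  grep -rFq -e "$flag" "$src_dir"
}

# Example (needs network access and an authenticated gh;
# <owner>/<repo> is a placeholder for the tool's real repository):
#   gh repo clone <owner>/<repo> /tmp/tool-src
#   if flag_defined /tmp/tool-src '--json'; then
#     echo "flag string found in source"
#   else
#     echo "flag string not found in source"
#   fi
```

A text match only shows that the string occurs somewhere in the tree, so a hit still calls for reading the parser code that defines the subcommand. A miss across the whole tree is the stronger signal.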
diff --git a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
@@ -501,6 +501,40 @@ If you can't find source evidence for a specific detail, say so ("I'm not sure o
 syntax") rather than guessing. An honest gap is fixable; a confident hallucination gets
 copy-pasted.
 
+### Verifying external-tool behavior
+
+When a claim turns on how an external CLI, API, or system behaves, verify by running the
+code.
+
+Two paths, in order of preference:
+
+1. **Run the tool.** If it's installable in this environment, install it and invoke the
+   specific command or flag in question. Link the output in your reply.
+2. **Read the source.** Tend can clone any public repo. `gh repo clone <owner>/<repo>`
+   then grep for the flag or behavior. Source doesn't lag itself, and a flag that isn't
+   defined in the parser doesn't exist.
+
+If both paths fail (GUI-only tool, private repo, environment-specific behavior), cite
+what you found, name the remaining gap, and ask a human with the tool installed to
+confirm before shipping a dependent change.
+
+<example>
+<bad reason="Trusted upstream docs for a fast-moving external CLI and shipped a broken recipe">
+
+Bad: Review asked whether `cmux list-workspaces` had structured output. Read a mintlify
+page describing `--json` → rewrote the recipe to `cmux list-workspaces --json | jq ...` →
+committed. The installed cmux had no `--json` flag; every reader hit a broken recipe.
+
+</bad>
+<good reason="Cloned the upstream source and verified the flag before shipping">
+
+Good: Same question. Cloned cmux's source repo → grepped the CLI parser for
+`list-workspaces` → saw no `--json` flag defined → replied with the source link and
+proposed an alternative that matched the actual CLI surface.
+
+</good>
+</example>
+
 ### Rewriting is authoring
 
 Cross-posting, summarizing, or paraphrasing is not copying — any new content you add requires the