Commit 586940b

tend-agent, claude, and max-sixty authored
running-in-ci: verify external-tool behavior by run or clone, not just hand-test (#327)
## Problem

The bundled `running-in-ci` skill's Grounded Analysis section covers behavioral claims by telling the bot to "run the command yourself" or hedge, but that fallback doesn't obviously apply to external CLIs and APIs that aren't installed in CI and aren't exercised by automated tests. In max-sixty/worktrunk#1907 (comment), the bot read upstream mintlify docs describing a `cmux list-workspaces --json` flag, believed it, and committed a recipe that broke for every reader: the installed cmux had no such flag.

## Solution

Adds a new `### Verifying external-tool behavior` subsection under Grounded Analysis. It points at two concrete verification paths in order of preference: install and run the tool, or clone its public repo and grep the source. Deferring to a human with the tool installed is the fallback only when both paths fail. The "don't make overconfident claims" framing already lives in the preceding `### User-facing comments require source evidence` subsection.

Bad/good example is drawn from the cmux incident: the good path now shows cloning the upstream repo and checking the CLI parser, not asking a human to confirm.

## Testing

Skill text only. `pre-commit run` on the modified file passes (typos, trim-whitespace, bang-backtick, end-of-files).

---

Closes #326 (automated triage)

---------

Co-authored-by: tend-agent <270458913+tend-agent@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
1 parent 9bee033 commit 586940b

1 file changed

Lines changed: 34 additions & 0 deletions

plugins/tend-ci-runner/skills/running-in-ci/SKILL.md

@@ -501,6 +501,40 @@ If you can't find source evidence for a specific detail, say so ("I'm not sure o
syntax") rather than guessing. An honest gap is fixable; a confident hallucination gets
copy-pasted.

+### Verifying external-tool behavior
+
+When a claim turns on how an external CLI, API, or system behaves, verify by running the
+code.
+
+Two paths, in order of preference:
+
+1. **Run the tool.** If it's installable in this environment, install it and invoke the
+specific command or flag in question. Link the output in your reply.
+2. **Read the source.** Tend can clone any public repo. `gh repo clone <owner>/<repo>`
+then grep for the flag or behavior. Source doesn't lag itself, and a flag that isn't
+defined in the parser doesn't exist.
+
+If both paths fail (GUI-only tool, private repo, environment-specific behavior), cite
+what you found, name the remaining gap, and ask a human with the tool installed to
+confirm before shipping a dependent change.
+
+<example>
+<bad reason="Trusted upstream docs for a fast-moving external CLI and shipped a broken recipe">
+
+Bad: Review asked whether `cmux list-workspaces` had structured output. Read a mintlify
+page describing `--json` → rewrote the recipe to `cmux list-workspaces --json | jq ...` →
+committed. The installed cmux had no `--json` flag; every reader hit a broken recipe.
+
+</bad>
+<good reason="Cloned the upstream source and verified the flag before shipping">
+
+Good: Same question. Cloned cmux's source repo → grepped the CLI parser for
+`list-workspaces` → saw no `--json` flag defined → replied with the source link and
+proposed an alternative that matched the actual CLI surface.
+
+</good>
+</example>
+

### Rewriting is authoring

Cross-posting, summarizing, or paraphrasing is not copying — any new content you add requires the
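
The two verification paths added in this hunk could be sketched roughly as follows. This is an illustrative sketch, not part of the commit or of tend itself: the function names `try_invocation` and `find_flag_in_source` are hypothetical, and it assumes the repo has already been cloned to a local directory (for example with `gh repo clone <owner>/<repo>`). The source search is a plain literal match, which is an assumption about how the flag is spelled in the parser.

```python
import subprocess
from pathlib import Path


def try_invocation(argv, timeout=30):
    """Path 1 (hypothetical helper): run the exact command in question.

    Returns None when the tool is not installed, otherwise a tuple of
    (exited with status 0, combined stdout and stderr).
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return None  # tool is not on PATH; fall back to reading the source
    return proc.returncode == 0, proc.stdout + proc.stderr


def find_flag_in_source(repo_root, flag):
    """Path 2 (hypothetical helper): literal search of a cloned repo for a flag.

    Returns a list of (relative path, line number, stripped line) hits.
    A literal search can miss flags whose names are built dynamically,
    so an empty result is a strong hint rather than a guarantee.
    """
    root = Path(repo_root)
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or ".git" in rel.parts:
            continue  # skip directories and git metadata
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if flag in line:
                hits.append((str(rel), lineno, line.strip()))
    return hits
```

A caller would try `try_invocation` first and only fall back to `find_flag_in_source` when it returns `None`; if neither gives an answer, the hunk's fallback applies: cite what was found, name the gap, and ask a human with the tool installed.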
