From b562b3b4e5924b517310c67952fd42335819cb57 Mon Sep 17 00:00:00 2001
From: tend-agent <270458913+tend-agent@users.noreply.github.com>
Date: Thu, 23 Apr 2026 19:00:33 +0000
Subject: [PATCH 1/3] running-in-ci: require hand-test for external-tool
 behavioral claims

Closes #326

Co-Authored-By: Claude <noreply@anthropic.com>
---
 .../skills/running-in-ci/SKILL.md             | 35 +++++++++++++++++++
 1 file changed, 35 insertions(+)
diff --git a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
index ee8f5c5f..e7e155ed 100644
--- a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
+++ b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
@@ -498,6 +498,41 @@ If you can't find source evidence for a specific detail, say so ("I'm not sure o
 syntax") rather than guessing. An honest gap is fixable; a confident hallucination gets
 copy-pasted.
 
+### External-tool behavioral claims need a hand-test
+
+When a review or triage turns on how an external CLI, API, or system behaves — and that tool
+is not installed in CI and not exercised by automated tests — the "run the command yourself"
+fallback above doesn't apply. Upstream docs for fast-moving tools lag or describe flags that
+were removed or renamed, and reading the tool's source isn't an option when it lives in a
+separate project. Do not post a confident claim or commit code/doc changes that depend on
+the behavior.
+
+1. Name the gap. Quote the question back and state that the behavior needs hand-testing on
+   a machine with the tool installed.
+2. If you have partial evidence (upstream docs, a commit, a linked issue), cite it and
+   hedge: "According to X's docs, Y — but I haven't verified on a real install. Could
+   someone with X installed confirm before I push a change?"
+3. Don't ship a doc or code change whose correctness depends on the claim until a human
+   confirms. In triage, the same rule applies: if a repro uses a tool not in CI, flag the
+   repro as unverified rather than asserting a fix.
+
+<example>
+<bad reason="Trusted upstream docs for a fast-moving external CLI and shipped a broken recipe">
+
+Bad: Review asked whether `cmux list-workspaces` had structured output. Read a mintlify
+page describing `--json` → rewrote the recipe to `cmux list-workspaces --json | jq ...` →
+committed. The installed cmux had no `--json` flag; every reader hit a broken recipe.
+
+</bad>
+<good reason="Named the verification gap and deferred to a human with the tool installed">
+
+Good: Same question. Read the docs → replied: "The docs describe `--json`, but cmux isn't
+installed in CI so I can't verify against your version. Could you confirm
+`cmux list-workspaces --help` shows `--json` before I push the change?" Waited.
+
+</good>
+</example>
+
 ### Rewriting is authoring
 
 Cross-posting, summarizing, or paraphrasing is not copying — any new content you add requires the

From 643717079e12f7d1ad9ab1488cf07d7166963ca1 Mon Sep 17 00:00:00 2001
From: tend-agent <270458913+tend-agent@users.noreply.github.com>
Date: Fri, 24 Apr 2026 00:23:15 +0000
Subject: [PATCH 2/3] running-in-ci: lead external-tool verification with
 run/clone, not hand-test

Review feedback on #327 pointed out that the section framed external-tool
claims as "defer to a human", even though the bot can verify them itself by
running the tool or cloning its repo. Rewrite the subsection so it opens
with the two verification paths (install-and-run, then clone-and-grep) and
falls back to asking a human only when both fail. The "don't make
overconfident claims" guidance already lives in the preceding subsection.
---
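
Not part of the patch: a minimal sketch of the clone-and-grep path described
above, assuming the upstream repo has already been cloned into a local
directory (for example with `gh repo clone`). `find_flag` is a hypothetical
helper, not part of tend, and a literal substring search is only a heuristic:
it can miss flags that a parser assembles dynamically.

```python
import os
import tempfile


def find_flag(checkout_dir, flag):
    """Return sorted (relative_path, line_number) pairs where `flag` appears literally."""
    hits = []
    for root, dirs, files in os.walk(checkout_dir):
        # Skip VCS metadata so only checked-out source files are scanned.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if flag in line:
                            hits.append((os.path.relpath(path, checkout_dir), number))
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
    return sorted(hits)


if __name__ == "__main__":
    # Stand-in for a real clone: a throwaway directory holding one parser file.
    with tempfile.TemporaryDirectory() as checkout:
        with open(os.path.join(checkout, "cli.py"), "w", encoding="utf-8") as handle:
            handle.write('parser.add_argument("--verbose")\n')
        print(find_flag(checkout, "--json"))     # prints []
        print(find_flag(checkout, "--verbose"))  # prints [('cli.py', 1)]
```

An empty result is evidence that the flag is absent from the checked-out
revision only; the installed version may differ from it, so the reply should
name the commit or tag that was inspected.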
 .../skills/running-in-ci/SKILL.md             | 42 +++++++++----------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
index e7e155ed..fcf3c723 100644
--- a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
+++ b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
@@ -498,23 +498,23 @@ If you can't find source evidence for a specific detail, say so ("I'm not sure o
 syntax") rather than guessing. An honest gap is fixable; a confident hallucination gets
 copy-pasted.
 
-### External-tool behavioral claims need a hand-test
-
-When a review or triage turns on how an external CLI, API, or system behaves — and that tool
-is not installed in CI and not exercised by automated tests — the "run the command yourself"
-fallback above doesn't apply. Upstream docs for fast-moving tools lag or describe flags that
-were removed or renamed, and reading the tool's source isn't an option when it lives in a
-separate project. Do not post a confident claim or commit code/doc changes that depend on
-the behavior.
-
-1. Name the gap. Quote the question back and state that the behavior needs hand-testing on
-   a machine with the tool installed.
-2. If you have partial evidence (upstream docs, a commit, a linked issue), cite it and
-   hedge: "According to X's docs, Y — but I haven't verified on a real install. Could
-   someone with X installed confirm before I push a change?"
-3. Don't ship a doc or code change whose correctness depends on the claim until a human
-   confirms. In triage, the same rule applies: if a repro uses a tool not in CI, flag the
-   repro as unverified rather than asserting a fix.
+### Verifying external-tool behavior
+
+When a claim turns on how an external CLI, API, or system behaves, verify by running the
+code. Upstream docs for fast-moving tools lag or describe flags that were removed or
+renamed — don't treat them as proof on their own.
+
+Two paths, in order of preference:
+
+1. **Run the tool.** If it's installable in this environment, install it and invoke the
+   specific command or flag in question. Quote or link the output in your reply.
+2. **Read the source.** Tend can clone any public repo: run
+   `gh repo clone <owner>/<repo>`, then grep for the flag or behavior. Source can't lag
+   behind itself, and a flag that isn't defined in the parser doesn't exist.
+
+If both paths fail (GUI-only tool, private repo, environment-specific behavior), cite
+what you found, name the remaining gap, and ask a human with the tool installed to
+confirm before shipping a dependent change.
 
 <example>
 <bad reason="Trusted upstream docs for a fast-moving external CLI and shipped a broken recipe">
@@ -524,11 +524,11 @@ page describing `--json` → rewrote the recipe to `cmux list-workspaces --json
 committed. The installed cmux had no `--json` flag; every reader hit a broken recipe.
 
 </bad>
-<good reason="Named the verification gap and deferred to a human with the tool installed">
+<good reason="Cloned the upstream source and verified the flag before shipping">
 
-Good: Same question. Read the docs → replied: "The docs describe `--json`, but cmux isn't
-installed in CI so I can't verify against your version. Could you confirm
-`cmux list-workspaces --help` shows `--json` before I push the change?" Waited.
+Good: Same question. Cloned cmux's source repo → grepped the CLI parser for
+`list-workspaces` → saw no `--json` flag defined → replied with the source link and
+proposed an alternative that matched the actual CLI surface.
 
 </good>
 </example>

From f71f267ab762d78f9ef977d718b4040aec2b0e82 Mon Sep 17 00:00:00 2001
From: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Date: Fri, 24 Apr 2026 00:16:07 -0700
Subject: [PATCH 3/3] Apply suggestion from @max-sixty

---
 plugins/tend-ci-runner/skills/running-in-ci/SKILL.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
index fcf3c723..de519fc5 100644
--- a/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
+++ b/plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
@@ -501,8 +501,7 @@ copy-pasted.
 ### Verifying external-tool behavior
 
 When a claim turns on how an external CLI, API, or system behaves, verify by running the
-code. Upstream docs for fast-moving tools lag or describe flags that were removed or
-renamed — don't treat them as proof on their own.
+code.
 
 Two paths, in order of preference: