garrytan · gregario · Apr 26, 2026 · Apr 26, 2026 · Apr 26, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,38 @@
 # Changelog
 
+## [1.15.4.0] - 2026-04-26
+
+## **`/ship` now calls out tests that the plan promised but the diff doesn't have.**
+
+`/plan-eng-review` produces a Test review section that gets translated into TEST-category plan items. `/ship`'s Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step and the gap only surfaced if you went looking for it. Step 8 now scans the diff for per-language test-file evidence and surfaces a one-line gap in the PR body when promised tests are missing.
+
+### What this means for you
+
+If you run `/plan-eng-review` and it lists three tests, then `/ship` will tell you "promised 3 tests, 2 landed. Missing: E2E test for signup flow". Informational only — no new gate, the existing `deferred` flow still handles the user-facing decision. The point is to make the test gap explicit instead of hiding it inside an aggregate count.
+
+### The numbers that matter
+
+From the issue author's `/retro global 7d` (8 active repos, 2026-04-11 → 2026-04-18):
+
+| Metric | Count | Share |
+|---|---|---|
+| Total commits | 172 | — |
+| `feat:` commits | 67 | 34% |
+| `fix:` commits | 56 | 28% |
+| `test:` commits | 6 | **3%** |
+
+Every plan listed specific tests. Most just didn't land. The Test-Promise Audit doesn't fix that — implementers still drop the test — but it makes the gap visible at the moment ship would otherwise let it through.
+
+### Itemized changes
+
+#### Added
+- Test-Promise Audit step in `scripts/resolvers/review.ts` (`generatePlanCompletionAuditInner`) instructs the audit subagent to cross-reference each TEST-category plan item against the diff, with per-language patterns enumerated for JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, and shell. (#1070)
+- Three new fields on the Step 8 JSON contract: `tests_promised`, `tests_landed`, `tests_missing`. Parent processing in `ship/SKILL.md.tmpl` appends an informational line to the Plan Completion summary when `tests_missing.length > 0`. No gate, no JSONL schema change.
+
+#### For contributors
+- `test/resolver-test-promise-audit.test.ts` pins the resolver output contract (per-language test patterns, JSON field names, the Item Extraction TEST category). Free-tier; runs in <200ms.
+- Resolver change applies to both `ship` and `review` modes — review's whole job is delivery integrity, so it surfaces the same gap.
+
 ## [1.14.0.0] - 2026-04-25
 
 ## **The gstack browser sidebar is now an interactive Claude Code REPL with live tab awareness.**

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-1.14.0.0
+1.15.4.0
diff --git a/review/SKILL.md b/review/SKILL.md
@@ -1200,6 +1200,35 @@ For each extracted plan item, check the diff and classify:
 **Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
 **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
 
+### Test-Promise Audit
+
+For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:
+
+1. **New test files** — `git diff origin/<base>...HEAD --name-only --diff-filter=A` and match against language-specific test patterns:
+   - JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*`
+   - Python: `test_*.py`, `tests/*.py`, `*_test.py`
+   - Go: `*_test.go`
+   - Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb`
+   - Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file
+   - Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/`
+   - Shell: `test/*.bats`, `*.test.sh`
+2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports).
+3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`.
+
+Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.
+
+Classify each TEST item:
+
+- **landed** — clear matching evidence in the diff (cite the file)
+- **missing** — no test file or test function in the diff matches this item
+
+Track these counts for the final JSON output:
+- `tests_promised` = number of TEST-category plan items
+- `tests_landed` = number of TEST items with evidence in the diff
+- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent
+
+**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.
+
 ### Output Format
 
 ```

diff --git a/scripts/resolvers/review.ts b/scripts/resolvers/review.ts
@@ -770,6 +770,43 @@ For each extracted plan item, check the diff and classify:
 **Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
 **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.`);
 
+  // ── Test-promise audit (#1070) ──
+  // /plan-eng-review extracts a Test review section that produces TEST-category
+  // plan items. They were getting silently dropped at implement time without
+  // anyone noticing — an upstream retro found ~3% test commits over a 7-day
+  // window despite every plan listing specific tests. Cross-reference each
+  // TEST item against the diff explicitly so the audit surface includes
+  // "promised N tests, M landed".
+  sections.push(`
+### Test-Promise Audit
+
+For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:
+
+1. **New test files** — \`git diff origin/<base>...HEAD --name-only --diff-filter=A\` and match against language-specific test patterns:
+   - JS/TS: \`*.test.ts\`, \`*.test.tsx\`, \`*.test.js\`, \`*.spec.ts\`, \`__tests__/*\`
+   - Python: \`test_*.py\`, \`tests/*.py\`, \`*_test.py\`
+   - Go: \`*_test.go\`
+   - Ruby: \`*_spec.rb\`, \`spec/*\`, \`test/*_test.rb\`
+   - Rust: \`#[test]\`/\`#[cfg(test)]\` blocks added in any \`.rs\` file
+   - Java/Kotlin: \`*Test.java\`, \`*Test.kt\`, \`*Spec.kt\` under \`src/test/\`
+   - Shell: \`test/*.bats\`, \`*.test.sh\`
+2. **Existing test files modified** — same patterns above, but with \`--diff-filter=M\` and a non-trivial diff (not just whitespace or imports).
+3. **New test functions/cases** — diff hunks adding \`test(...)\`, \`it(...)\`, \`describe(...)\`, \`def test_*\`, \`func Test*\`, \`@Test\`, \`#[test]\`.
+
+Match heuristically — a TEST item that says "Add test for UserService" pairs with a new \`user_service.test.ts\` or new \`it('UserService...')\` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.
+
+Classify each TEST item:
+
+- **landed** — clear matching evidence in the diff (cite the file)
+- **missing** — no test file or test function in the diff matches this item
+
+Track these counts for the final JSON output:
+- \`tests_promised\` = number of TEST-category plan items
+- \`tests_landed\` = number of TEST items with evidence in the diff
+- \`tests_missing\` = list of TEST item texts (verbatim or concise) where evidence was absent
+
+**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.`);
+
   // ── Output format ──
   sections.push(`
 ### Output Format

diff --git a/ship/SKILL.md b/ship/SKILL.md
@@ -1924,6 +1924,35 @@ For each extracted plan item, check the diff and classify:
 **Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
 **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
 
+### Test-Promise Audit
+
+For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:
+
+1. **New test files** — `git diff origin/<base>...HEAD --name-only --diff-filter=A` and match against language-specific test patterns:
+   - JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*`
+   - Python: `test_*.py`, `tests/*.py`, `*_test.py`
+   - Go: `*_test.go`
+   - Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb`
+   - Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file
+   - Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/`
+   - Shell: `test/*.bats`, `*.test.sh`
+2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports).
+3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`.
+
+Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.
+
+Classify each TEST item:
+
+- **landed** — clear matching evidence in the diff (cite the file)
+- **missing** — no test file or test function in the diff matches this item
+
+Track these counts for the final JSON output:
+- `tests_promised` = number of TEST-category plan items
+- `tests_landed` = number of TEST items with evidence in the diff
+- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent
+
+**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.
+
 ### Output Format
 
 ```
@@ -1972,14 +2001,17 @@ After producing the completion checklist:
 **Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary.
 >
 > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":"<markdown checklist for PR body>"}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":"<markdown checklist for PR body>"}`
+>
+> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`.
 
 **Parent processing:**
 
 1. Parse the LAST line of the subagent's output as JSON.
 2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
 3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count.
+5. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
 
 **If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
 

diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl
@@ -268,14 +268,17 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
 > {{PLAN_COMPLETION_AUDIT_SHIP}}
 >
 > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":"<markdown checklist for PR body>"}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":"<markdown checklist for PR body>"}`
+>
+> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`.
 
 **Parent processing:**
 
 1. Parse the LAST line of the subagent's output as JSON.
 2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
 3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count.
+5. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
 
 **If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
 

diff --git a/test/resolver-test-promise-audit.test.ts b/test/resolver-test-promise-audit.test.ts
@@ -0,0 +1,70 @@
+/**
+ * Test-Promise Audit resolver assertions (#1070).
+ *
+ * The Plan Completion Audit subagent has to surface TEST-category items
+ * that were promised but didn't land. The resolver injects per-language
+ * test-file pattern guidance and expands the JSON output contract with
+ * `tests_promised`, `tests_landed`, `tests_missing` fields.
+ *
+ * Pin the format contract here so a future edit to the resolver can't
+ * silently drop the test-promise audit instruction.
+ */
+import { describe, test, expect } from 'bun:test';
+import type { TemplateContext } from '../scripts/resolvers/types';
+import { HOST_PATHS } from '../scripts/resolvers/types';
+import {
+  generatePlanCompletionAuditShip,
+  generatePlanCompletionAuditReview,
+} from '../scripts/resolvers/review';
+
+function makeCtx(skillName: string): TemplateContext {
+  return {
+    skillName,
+    tmplPath: 'test.tmpl',
+    host: 'claude',
+    paths: HOST_PATHS.claude,
+    preambleTier: 4,
+  };
+}
+
+describe('Plan Completion Audit — Test-Promise section (#1070)', () => {
+  test('ship variant includes the Test-Promise Audit instruction', () => {
+    const out = generatePlanCompletionAuditShip(makeCtx('ship'));
+    expect(out).toContain('Test-Promise Audit');
+    // Per-language test patterns must be enumerated so the subagent can
+    // grep deterministically across stacks instead of guessing.
+    expect(out).toContain('*.test.ts');
+    expect(out).toContain('test_*.py');
+    expect(out).toContain('*_test.go');
+    expect(out).toContain('*_spec.rb');
+    // Counters that the parent template parses out of the JSON.
+    expect(out).toContain('tests_promised');
+    expect(out).toContain('tests_landed');
+    expect(out).toContain('tests_missing');
+  });
+
+  test('review variant also includes Test-Promise Audit', () => {
+    // Review-mode runs without /ship's gating but the subagent should still
+    // surface the gap — review's whole job is to call out delivery integrity.
+    const out = generatePlanCompletionAuditReview(makeCtx('review'));
+    expect(out).toContain('Test-Promise Audit');
+    expect(out).toContain('tests_promised');
+  });
+
+  test('audit body still has the Item Extraction step that produces TEST items', () => {
+    // The Test-Promise Audit only works because Item Extraction tags items
+    // by category. Pin the contract: TEST must remain one of the categories.
+    const out = generatePlanCompletionAuditShip(makeCtx('ship'));
+    expect(out).toContain('CODE | TEST | MIGRATION | CONFIG | DOCS');
+  });
+
+  test('Cross-Reference step still classifies DONE/PARTIAL/NOT DONE/CHANGED', () => {
+    // The Test-Promise Audit's `landed`/`missing` classification piggybacks
+    // on the existing diff cross-reference. If the existing classes
+    // disappear, the test-promise output goes with them.
+    const out = generatePlanCompletionAuditShip(makeCtx('ship'));
+    expect(out).toContain('DONE');
+    expect(out).toContain('NOT DONE');
+    expect(out).toContain('PARTIAL');
+  });
+});