diff --git a/CHANGELOG.md b/CHANGELOG.md
index d42dd03552..4fa006a671 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,38 @@
# Changelog
+## [1.15.4.0] - 2026-04-26
+
+## **`/ship` now calls out tests that the plan promised but the diff doesn't have.**
+
+`/plan-eng-review` produces a Test review section that gets translated into TEST-category plan items. `/ship`'s Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step and the gap only surfaced if you went looking for it. Step 8 now scans the diff for per-language test-file evidence and surfaces a one-line gap in the PR body when promised tests are missing.
+
+### What this means for you
+
+If you run `/plan-eng-review` and it lists three tests, then `/ship` will tell you "promised 3 tests, 2 landed. Missing: E2E test for signup flow". Informational only — no new gate, the existing `deferred` flow still handles the user-facing decision. The point is to make the test gap explicit instead of hiding it inside an aggregate count.
+
+### The numbers that matter
+
+From the issue author's `/retro global 7d` (8 active repos, 2026-04-11 → 2026-04-18):
+
+| Metric | Count | Share |
+|---|---|---|
+| Total commits | 172 | — |
+| `feat:` commits | 67 | 34% |
+| `fix:` commits | 56 | 28% |
+| `test:` commits | 6 | **3%** |
+
+Every plan listed specific tests. Most just didn't land. The Test-Promise Audit doesn't fix that — implementers still drop the test — but it makes the gap visible at the moment ship would otherwise let it through.
+
+### Itemized changes
+
+#### Added
+- Test-Promise Audit step in `scripts/resolvers/review.ts` (`generatePlanCompletionAuditInner`) instructs the audit subagent to cross-reference each TEST-category plan item against the diff, with per-language patterns enumerated for JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, and shell. (#1070)
+- Three new fields on the Step 8 JSON contract: `tests_promised`, `tests_landed`, `tests_missing`. Parent processing in `ship/SKILL.md.tmpl` appends an informational line to the Plan Completion summary when `tests_missing.length > 0`. No gate, no JSONL schema change.
+
+#### For contributors
+- `test/resolver-test-promise-audit.test.ts` pins the resolver output contract (per-language test patterns, JSON field names, the Item Extraction TEST category). Free-tier; runs in <200ms.
+- Resolver change applies to both `ship` and `review` modes — review's whole job is delivery integrity, so it surfaces the same gap.
+
## [1.14.0.0] - 2026-04-25
## **The gstack browser sidebar is now an interactive Claude Code REPL with live tab awareness.**
diff --git a/VERSION b/VERSION
index 323a9cbcfa..ac19af6ff9 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.14.0.0
+1.15.4.0
diff --git a/review/SKILL.md b/review/SKILL.md
index a2092af9e1..3df223f940 100644
--- a/review/SKILL.md
+++ b/review/SKILL.md
@@ -1200,6 +1200,35 @@ For each extracted plan item, check the diff and classify:
**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
+### Test-Promise Audit
+
+For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:
+
+1. **New test files** — `git diff origin/...HEAD --name-only --diff-filter=A` and match against language-specific test patterns:
+ - JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*`
+ - Python: `test_*.py`, `tests/*.py`, `*_test.py`
+ - Go: `*_test.go`
+ - Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb`
+ - Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file
+ - Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/`
+ - Shell: `test/*.bats`, `*.test.sh`
+2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports).
+3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`.
+
+Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.
+
+Classify each TEST item:
+
+- **landed** — clear matching evidence in the diff (cite the file)
+- **missing** — no test file or test function in the diff matches this item
+
+Track these counts for the final JSON output:
+- `tests_promised` = number of TEST-category plan items
+- `tests_landed` = number of TEST items with evidence in the diff
+- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent
+
+**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.
+
### Output Format
```
diff --git a/scripts/resolvers/review.ts b/scripts/resolvers/review.ts
index a0f29e1746..e9981379f4 100644
--- a/scripts/resolvers/review.ts
+++ b/scripts/resolvers/review.ts
@@ -770,6 +770,43 @@ For each extracted plan item, check the diff and classify:
**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.`);
+ // ── Test-promise audit (#1070) ──
+ // /plan-eng-review extracts a Test review section that produces TEST-category
+ // plan items. They were getting silently dropped at implement time without
+ // anyone noticing — an upstream retro found ~3% test commits over a 7-day
+ // window despite every plan listing specific tests. Cross-reference each
+ // TEST item against the diff explicitly so the audit surface includes
+ // "promised N tests, M landed".
+ sections.push(`
+### Test-Promise Audit
+
+For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:
+
+1. **New test files** — \`git diff origin/...HEAD --name-only --diff-filter=A\` and match against language-specific test patterns:
+ - JS/TS: \`*.test.ts\`, \`*.test.tsx\`, \`*.test.js\`, \`*.spec.ts\`, \`__tests__/*\`
+ - Python: \`test_*.py\`, \`tests/*.py\`, \`*_test.py\`
+ - Go: \`*_test.go\`
+ - Ruby: \`*_spec.rb\`, \`spec/*\`, \`test/*_test.rb\`
+ - Rust: \`#[test]\`/\`#[cfg(test)]\` blocks added in any \`.rs\` file
+ - Java/Kotlin: \`*Test.java\`, \`*Test.kt\`, \`*Spec.kt\` under \`src/test/\`
+ - Shell: \`test/*.bats\`, \`*.test.sh\`
+2. **Existing test files modified** — same patterns above, but with \`--diff-filter=M\` and a non-trivial diff (not just whitespace or imports).
+3. **New test functions/cases** — diff hunks adding \`test(...)\`, \`it(...)\`, \`describe(...)\`, \`def test_*\`, \`func Test*\`, \`@Test\`, \`#[test]\`.
+
+Match heuristically — a TEST item that says "Add test for UserService" pairs with a new \`user_service.test.ts\` or new \`it('UserService...')\` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.
+
+Classify each TEST item:
+
+- **landed** — clear matching evidence in the diff (cite the file)
+- **missing** — no test file or test function in the diff matches this item
+
+Track these counts for the final JSON output:
+- \`tests_promised\` = number of TEST-category plan items
+- \`tests_landed\` = number of TEST items with evidence in the diff
+- \`tests_missing\` = list of TEST item texts (verbatim or concise) where evidence was absent
+
+**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.`);
+
// ── Output format ──
sections.push(`
### Output Format
diff --git a/ship/SKILL.md b/ship/SKILL.md
index 7548415246..a7292d6baf 100644
--- a/ship/SKILL.md
+++ b/ship/SKILL.md
@@ -1924,6 +1924,35 @@ For each extracted plan item, check the diff and classify:
**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
+### Test-Promise Audit
+
+For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:
+
+1. **New test files** — `git diff origin/...HEAD --name-only --diff-filter=A` and match against language-specific test patterns:
+ - JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*`
+ - Python: `test_*.py`, `tests/*.py`, `*_test.py`
+ - Go: `*_test.go`
+ - Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb`
+ - Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file
+ - Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/`
+ - Shell: `test/*.bats`, `*.test.sh`
+2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports).
+3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`.
+
+Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.
+
+Classify each TEST item:
+
+- **landed** — clear matching evidence in the diff (cite the file)
+- **missing** — no test file or test function in the diff matches this item
+
+Track these counts for the final JSON output:
+- `tests_promised` = number of TEST-category plan items
+- `tests_landed` = number of TEST items with evidence in the diff
+- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent
+
+**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.
+
### Output Format
```
@@ -1972,14 +2001,17 @@ After producing the completion checklist:
**Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary.
>
> After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":""}`
+>
+> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`.
**Parent processing:**
1. Parse the LAST line of the subagent's output as JSON.
2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count.
+5. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl
index b6a19bcbab..ec878a686b 100644
--- a/ship/SKILL.md.tmpl
+++ b/ship/SKILL.md.tmpl
@@ -268,14 +268,17 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
> {{PLAN_COMPLETION_AUDIT_SHIP}}
>
> After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":""}`
+>
+> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`.
**Parent processing:**
1. Parse the LAST line of the subagent's output as JSON.
2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count.
+5. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
diff --git a/test/resolver-test-promise-audit.test.ts b/test/resolver-test-promise-audit.test.ts
new file mode 100644
index 0000000000..c2b05f7c17
--- /dev/null
+++ b/test/resolver-test-promise-audit.test.ts
@@ -0,0 +1,70 @@
+/**
+ * Test-Promise Audit resolver assertions (#1070).
+ *
+ * The Plan Completion Audit subagent has to surface TEST-category items
+ * that were promised but didn't land. The resolver injects per-language
+ * test-file pattern guidance and expands the JSON output contract with
+ * `tests_promised`, `tests_landed`, `tests_missing` fields.
+ *
+ * Pin the format contract here so a future edit to the resolver can't
+ * silently drop the test-promise audit instruction.
+ */
+import { describe, test, expect } from 'bun:test';
+import type { TemplateContext } from '../scripts/resolvers/types';
+import { HOST_PATHS } from '../scripts/resolvers/types';
+import {
+ generatePlanCompletionAuditShip,
+ generatePlanCompletionAuditReview,
+} from '../scripts/resolvers/review';
+
+function makeCtx(skillName: string): TemplateContext {
+ return {
+ skillName,
+ tmplPath: 'test.tmpl',
+ host: 'claude',
+ paths: HOST_PATHS.claude,
+ preambleTier: 4,
+ };
+}
+
+describe('Plan Completion Audit — Test-Promise section (#1070)', () => {
+ test('ship variant includes the Test-Promise Audit instruction', () => {
+ const out = generatePlanCompletionAuditShip(makeCtx('ship'));
+ expect(out).toContain('Test-Promise Audit');
+ // Per-language test patterns must be enumerated so the subagent can
+ // grep deterministically across stacks instead of guessing.
+ expect(out).toContain('*.test.ts');
+ expect(out).toContain('test_*.py');
+ expect(out).toContain('*_test.go');
+ expect(out).toContain('*_spec.rb');
+ // Counters that the parent template parses out of the JSON.
+ expect(out).toContain('tests_promised');
+ expect(out).toContain('tests_landed');
+ expect(out).toContain('tests_missing');
+ });
+
+ test('review variant also includes Test-Promise Audit', () => {
+ // Review-mode runs without /ship's gating but the subagent should still
+ // surface the gap — review's whole job is to call out delivery integrity.
+ const out = generatePlanCompletionAuditReview(makeCtx('review'));
+ expect(out).toContain('Test-Promise Audit');
+ expect(out).toContain('tests_promised');
+ });
+
+ test('audit body still has the Item Extraction step that produces TEST items', () => {
+ // The Test-Promise Audit only works because Item Extraction tags items
+ // by category. Pin the contract: TEST must remain one of the categories.
+ const out = generatePlanCompletionAuditShip(makeCtx('ship'));
+ expect(out).toContain('CODE | TEST | MIGRATION | CONFIG | DOCS');
+ });
+
+ test('Cross-Reference step still classifies DONE/PARTIAL/NOT DONE/CHANGED', () => {
+ // The Test-Promise Audit's `landed`/`missing` classification piggybacks
+ // on the existing diff cross-reference. If the existing classes
+ // disappear, the test-promise output goes with them.
+ const out = generatePlanCompletionAuditShip(makeCtx('ship'));
+ expect(out).toContain('DONE');
+ expect(out).toContain('NOT DONE');
+ expect(out).toContain('PARTIAL');
+ });
+});