diff --git a/CHANGELOG.md b/CHANGELOG.md index d42dd03552..4fa006a671 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,38 @@ # Changelog +## [1.15.4.0] - 2026-04-26 + +## **`/ship` now calls out tests that the plan promised but the diff doesn't have.** + +`/plan-eng-review` produces a Test review section that gets translated into TEST-category plan items. `/ship`'s Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step and the gap only surfaced if you went looking for it. Step 8 now scans the diff for per-language test-file evidence and surfaces a one-line gap in the PR body when promised tests are missing. + +### What this means for you + +If you run `/plan-eng-review` and it lists three tests, then `/ship` will tell you "promised 3 tests, 2 landed. Missing: E2E test for signup flow". Informational only — no new gate, the existing `deferred` flow still handles the user-facing decision. The point is to make the test gap explicit instead of hiding it inside an aggregate count. + +### The numbers that matter + +From the issue author's `/retro global 7d` (8 active repos, 2026-04-11 → 2026-04-18): + +| Metric | Count | Share | +|---|---|---| +| Total commits | 172 | — | +| `feat:` commits | 67 | 34% | +| `fix:` commits | 56 | 28% | +| `test:` commits | 6 | **3%** | + +Every plan listed specific tests. Most just didn't land. The Test-Promise Audit doesn't fix that — implementers still drop the test — but it makes the gap visible at the moment ship would otherwise let it through. + +### Itemized changes + +#### Added +- Test-Promise Audit step in `scripts/resolvers/review.ts` (`generatePlanCompletionAuditInner`) instructs the audit subagent to cross-reference each TEST-category plan item against the diff, with per-language patterns enumerated for JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, and shell. (#1070) +- Three new fields on the Step 8 JSON contract: `tests_promised`, `tests_landed`, `tests_missing`. Parent processing in `ship/SKILL.md.tmpl` appends an informational line to the Plan Completion summary when `tests_missing.length > 0`. No gate, no JSONL schema change. + +#### For contributors +- `test/resolver-test-promise-audit.test.ts` pins the resolver output contract (per-language test patterns, JSON field names, the Item Extraction TEST category). Free-tier; runs in <200ms. +- Resolver change applies to both `ship` and `review` modes — review's whole job is delivery integrity, so it surfaces the same gap. + ## [1.14.0.0] - 2026-04-25 ## **The gstack browser sidebar is now an interactive Claude Code REPL with live tab awareness.** diff --git a/VERSION b/VERSION index 323a9cbcfa..ac19af6ff9 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.14.0.0 +1.15.4.0 diff --git a/review/SKILL.md b/review/SKILL.md index a2092af9e1..3df223f940 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -1200,6 +1200,35 @@ For each extracted plan item, check the diff and classify: **Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed. +### Test-Promise Audit + +For each plan item categorized as TEST, scan the diff for evidence that the test actually landed: + +1. **New test files** — `git diff origin/...HEAD --name-only --diff-filter=A` and match against language-specific test patterns: + - JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*` + - Python: `test_*.py`, `tests/*.py`, `*_test.py` + - Go: `*_test.go` + - Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb` + - Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file + - Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/` + - Shell: `test/*.bats`, `*.test.sh` +2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports). +3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`. + +Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope. + +Classify each TEST item: + +- **landed** — clear matching evidence in the diff (cite the file) +- **missing** — no test file or test function in the diff matches this item + +Track these counts for the final JSON output: +- `tests_promised` = number of TEST-category plan items +- `tests_landed` = number of TEST items with evidence in the diff +- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent + +**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands. + ### Output Format ``` diff --git a/scripts/resolvers/review.ts b/scripts/resolvers/review.ts index a0f29e1746..e9981379f4 100644 --- a/scripts/resolvers/review.ts +++ b/scripts/resolvers/review.ts @@ -770,6 +770,43 @@ For each extracted plan item, check the diff and classify: **Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.`); + // ── Test-promise audit (#1070) ── + // /plan-eng-review extracts a Test review section that produces TEST-category + // plan items. They were getting silently dropped at implement time without + // anyone noticing — an upstream retro found ~3% test commits over a 7-day + // window despite every plan listing specific tests. Cross-reference each + // TEST item against the diff explicitly so the audit surface includes + // "promised N tests, M landed". + sections.push(` +### Test-Promise Audit + +For each plan item categorized as TEST, scan the diff for evidence that the test actually landed: + +1. **New test files** — \`git diff origin/...HEAD --name-only --diff-filter=A\` and match against language-specific test patterns: + - JS/TS: \`*.test.ts\`, \`*.test.tsx\`, \`*.test.js\`, \`*.spec.ts\`, \`__tests__/*\` + - Python: \`test_*.py\`, \`tests/*.py\`, \`*_test.py\` + - Go: \`*_test.go\` + - Ruby: \`*_spec.rb\`, \`spec/*\`, \`test/*_test.rb\` + - Rust: \`#[test]\`/\`#[cfg(test)]\` blocks added in any \`.rs\` file + - Java/Kotlin: \`*Test.java\`, \`*Test.kt\`, \`*Spec.kt\` under \`src/test/\` + - Shell: \`test/*.bats\`, \`*.test.sh\` +2. **Existing test files modified** — same patterns above, but with \`--diff-filter=M\` and a non-trivial diff (not just whitespace or imports). +3. **New test functions/cases** — diff hunks adding \`test(...)\`, \`it(...)\`, \`describe(...)\`, \`def test_*\`, \`func Test*\`, \`@Test\`, \`#[test]\`. + +Match heuristically — a TEST item that says "Add test for UserService" pairs with a new \`user_service.test.ts\` or new \`it('UserService...')\` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope. + +Classify each TEST item: + +- **landed** — clear matching evidence in the diff (cite the file) +- **missing** — no test file or test function in the diff matches this item + +Track these counts for the final JSON output: +- \`tests_promised\` = number of TEST-category plan items +- \`tests_landed\` = number of TEST items with evidence in the diff +- \`tests_missing\` = list of TEST item texts (verbatim or concise) where evidence was absent + +**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.`); + // ── Output format ── sections.push(` ### Output Format diff --git a/ship/SKILL.md b/ship/SKILL.md index 7548415246..a7292d6baf 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -1924,6 +1924,35 @@ For each extracted plan item, check the diff and classify: **Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed. +### Test-Promise Audit + +For each plan item categorized as TEST, scan the diff for evidence that the test actually landed: + +1. **New test files** — `git diff origin/...HEAD --name-only --diff-filter=A` and match against language-specific test patterns: + - JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*` + - Python: `test_*.py`, `tests/*.py`, `*_test.py` + - Go: `*_test.go` + - Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb` + - Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file + - Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/` + - Shell: `test/*.bats`, `*.test.sh` +2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports). +3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`. + +Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope. + +Classify each TEST item: + +- **landed** — clear matching evidence in the diff (cite the file) +- **missing** — no test file or test function in the diff matches this item + +Track these counts for the final JSON output: +- `tests_promised` = number of TEST-category plan items +- `tests_landed` = number of TEST items with evidence in the diff +- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent + +**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands. + ### Output Format ``` @@ -1972,14 +2001,17 @@ After producing the completion checklist: **Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary. > > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it): -> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}` +> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":""}` +> +> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`. **Parent processing:** 1. Parse the LAST line of the subagent's output as JSON. 2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body. 3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing. -4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). +4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count. +5. Embed `summary` in PR body's `## Plan Completion` section (Step 19). **If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure. diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index b6a19bcbab..ec878a686b 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -268,14 +268,17 @@ If multiple suites need to run, run them sequentially (each needs a test lane). > {{PLAN_COMPLETION_AUDIT_SHIP}} > > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it): -> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}` +> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":""}` +> +> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`. **Parent processing:** 1. Parse the LAST line of the subagent's output as JSON. 2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body. 3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing. -4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). +4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count. +5. Embed `summary` in PR body's `## Plan Completion` section (Step 19). **If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure. diff --git a/test/resolver-test-promise-audit.test.ts b/test/resolver-test-promise-audit.test.ts new file mode 100644 index 0000000000..c2b05f7c17 --- /dev/null +++ b/test/resolver-test-promise-audit.test.ts @@ -0,0 +1,70 @@ +/** + * Test-Promise Audit resolver assertions (#1070). + * + * The Plan Completion Audit subagent has to surface TEST-category items + * that were promised but didn't land. The resolver injects per-language + * test-file pattern guidance and expands the JSON output contract with + * `tests_promised`, `tests_landed`, `tests_missing` fields. + * + * Pin the format contract here so a future edit to the resolver can't + * silently drop the test-promise audit instruction. + */ +import { describe, test, expect } from 'bun:test'; +import type { TemplateContext } from '../scripts/resolvers/types'; +import { HOST_PATHS } from '../scripts/resolvers/types'; +import { + generatePlanCompletionAuditShip, + generatePlanCompletionAuditReview, +} from '../scripts/resolvers/review'; + +function makeCtx(skillName: string): TemplateContext { + return { + skillName, + tmplPath: 'test.tmpl', + host: 'claude', + paths: HOST_PATHS.claude, + preambleTier: 4, + }; +} + +describe('Plan Completion Audit — Test-Promise section (#1070)', () => { + test('ship variant includes the Test-Promise Audit instruction', () => { + const out = generatePlanCompletionAuditShip(makeCtx('ship')); + expect(out).toContain('Test-Promise Audit'); + // Per-language test patterns must be enumerated so the subagent can + // grep deterministically across stacks instead of guessing. + expect(out).toContain('*.test.ts'); + expect(out).toContain('test_*.py'); + expect(out).toContain('*_test.go'); + expect(out).toContain('*_spec.rb'); + // Counters that the parent template parses out of the JSON. + expect(out).toContain('tests_promised'); + expect(out).toContain('tests_landed'); + expect(out).toContain('tests_missing'); + }); + + test('review variant also includes Test-Promise Audit', () => { + // Review-mode runs without /ship's gating but the subagent should still + // surface the gap — review's whole job is to call out delivery integrity. + const out = generatePlanCompletionAuditReview(makeCtx('review')); + expect(out).toContain('Test-Promise Audit'); + expect(out).toContain('tests_promised'); + }); + + test('audit body still has the Item Extraction step that produces TEST items', () => { + // The Test-Promise Audit only works because Item Extraction tags items + // by category. Pin the contract: TEST must remain one of the categories. + const out = generatePlanCompletionAuditShip(makeCtx('ship')); + expect(out).toContain('CODE | TEST | MIGRATION | CONFIG | DOCS'); + }); + + test('Cross-Reference step still classifies DONE/PARTIAL/NOT DONE/CHANGED', () => { + // The Test-Promise Audit's `landed`/`missing` classification piggybacks + // on the existing diff cross-reference. If the existing classes + // disappear, the test-promise output goes with them. + const out = generatePlanCompletionAuditShip(makeCtx('ship')); + expect(out).toContain('DONE'); + expect(out).toContain('NOT DONE'); + expect(out).toContain('PARTIAL'); + }); +});