Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,38 @@
# Changelog

## [1.15.4.0] - 2026-04-26

## **`/ship` now calls out tests that the plan promised but the diff doesn't have.**

`/plan-eng-review` produces a Test review section that gets translated into TEST-category plan items. `/ship`'s Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step and the gap only surfaced if you went looking for it. Step 8 now scans the diff for per-language test-file evidence and surfaces a one-line gap in the PR body when promised tests are missing.

### What this means for you

If you run `/plan-eng-review` and it lists three tests, then `/ship` will tell you "promised 3 tests, 2 landed. Missing: E2E test for signup flow". Informational only — no new gate, the existing `deferred` flow still handles the user-facing decision. The point is to make the test gap explicit instead of hiding it inside an aggregate count.

### The numbers that matter

From the issue author's `/retro global 7d` (8 active repos, 2026-04-11 → 2026-04-18):

| Metric | Count | Share |
|---|---|---|
| Total commits | 172 | — |
| `feat:` commits | 67 | 34% |
| `fix:` commits | 56 | 28% |
| `test:` commits | 6 | **3%** |

Every plan listed specific tests. Most just didn't land. The Test-Promise Audit doesn't fix that — implementers still drop the test — but it makes the gap visible at the moment ship would otherwise let it through.

### Itemized changes

#### Added
- Test-Promise Audit step in `scripts/resolvers/review.ts` (`generatePlanCompletionAuditInner`) instructs the audit subagent to cross-reference each TEST-category plan item against the diff, with per-language patterns enumerated for JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, and shell. (#1070)
- Three new fields on the Step 8 JSON contract: `tests_promised`, `tests_landed`, `tests_missing`. Parent processing in `ship/SKILL.md.tmpl` appends an informational line to the Plan Completion summary when `tests_missing.length > 0`. No gate, no JSONL schema change.

#### For contributors
- `test/resolver-test-promise-audit.test.ts` pins the resolver output contract (per-language test patterns, JSON field names, the Item Extraction TEST category). Free-tier; runs in <200ms.
- Resolver change applies to both `ship` and `review` modes — review's whole job is delivery integrity, so it surfaces the same gap.

## [1.14.0.0] - 2026-04-25

## **The gstack browser sidebar is now an interactive Claude Code REPL with live tab awareness.**
Expand Down
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
1.14.0.0
1.15.4.0
29 changes: 29 additions & 0 deletions review/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -1200,6 +1200,35 @@ For each extracted plan item, check the diff and classify:
**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.

### Test-Promise Audit

For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:

1. **New test files** — `git diff origin/<base>...HEAD --name-only --diff-filter=A` and match against language-specific test patterns:
- JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*`
- Python: `test_*.py`, `tests/*.py`, `*_test.py`
- Go: `*_test.go`
- Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb`
- Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file
- Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/`
- Shell: `test/*.bats`, `*.test.sh`
2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports).
3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`.

Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.

Classify each TEST item:

- **landed** — clear matching evidence in the diff (cite the file)
- **missing** — no test file or test function in the diff matches this item

Track these counts for the final JSON output:
- `tests_promised` = number of TEST-category plan items
- `tests_landed` = number of TEST items with evidence in the diff
- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent

**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.

### Output Format

```
Expand Down
37 changes: 37 additions & 0 deletions scripts/resolvers/review.ts
Original file line number Diff line number Diff line change
Expand Up @@ -770,6 +770,43 @@ For each extracted plan item, check the diff and classify:
**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.`);

// ── Test-promise audit (#1070) ──
// /plan-eng-review extracts a Test review section that produces TEST-category
// plan items. They were getting silently dropped at implement time without
// anyone noticing — an upstream retro found ~3% test commits over a 7-day
// window despite every plan listing specific tests. Cross-reference each
// TEST item against the diff explicitly so the audit surface includes
// "promised N tests, M landed".
sections.push(`
### Test-Promise Audit

For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:

1. **New test files** — \`git diff origin/<base>...HEAD --name-only --diff-filter=A\` and match against language-specific test patterns:
- JS/TS: \`*.test.ts\`, \`*.test.tsx\`, \`*.test.js\`, \`*.spec.ts\`, \`__tests__/*\`
- Python: \`test_*.py\`, \`tests/*.py\`, \`*_test.py\`
- Go: \`*_test.go\`
- Ruby: \`*_spec.rb\`, \`spec/*\`, \`test/*_test.rb\`
- Rust: \`#[test]\`/\`#[cfg(test)]\` blocks added in any \`.rs\` file
- Java/Kotlin: \`*Test.java\`, \`*Test.kt\`, \`*Spec.kt\` under \`src/test/\`
- Shell: \`test/*.bats\`, \`*.test.sh\`
2. **Existing test files modified** — same patterns above, but with \`--diff-filter=M\` and a non-trivial diff (not just whitespace or imports).
3. **New test functions/cases** — diff hunks adding \`test(...)\`, \`it(...)\`, \`describe(...)\`, \`def test_*\`, \`func Test*\`, \`@Test\`, \`#[test]\`.

Match heuristically — a TEST item that says "Add test for UserService" pairs with a new \`user_service.test.ts\` or new \`it('UserService...')\` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.

Classify each TEST item:

- **landed** — clear matching evidence in the diff (cite the file)
- **missing** — no test file or test function in the diff matches this item

Track these counts for the final JSON output:
- \`tests_promised\` = number of TEST-category plan items
- \`tests_landed\` = number of TEST items with evidence in the diff
- \`tests_missing\` = list of TEST item texts (verbatim or concise) where evidence was absent

**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.`);

// ── Output format ──
sections.push(`
### Output Format
Expand Down
36 changes: 34 additions & 2 deletions ship/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -1924,6 +1924,35 @@ For each extracted plan item, check the diff and classify:
**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.

### Test-Promise Audit

For each plan item categorized as TEST, scan the diff for evidence that the test actually landed:

1. **New test files** — `git diff origin/<base>...HEAD --name-only --diff-filter=A` and match against language-specific test patterns:
- JS/TS: `*.test.ts`, `*.test.tsx`, `*.test.js`, `*.spec.ts`, `__tests__/*`
- Python: `test_*.py`, `tests/*.py`, `*_test.py`
- Go: `*_test.go`
- Ruby: `*_spec.rb`, `spec/*`, `test/*_test.rb`
- Rust: `#[test]`/`#[cfg(test)]` blocks added in any `.rs` file
- Java/Kotlin: `*Test.java`, `*Test.kt`, `*Spec.kt` under `src/test/`
- Shell: `test/*.bats`, `*.test.sh`
2. **Existing test files modified** — same patterns above, but with `--diff-filter=M` and a non-trivial diff (not just whitespace or imports).
3. **New test functions/cases** — diff hunks adding `test(...)`, `it(...)`, `describe(...)`, `def test_*`, `func Test*`, `@Test`, `#[test]`.

Match heuristically — a TEST item that says "Add test for UserService" pairs with a new `user_service.test.ts` or new `it('UserService...')` block. Don't require literal name matches; the test file just has to plausibly cover the promised scope.

Classify each TEST item:

- **landed** — clear matching evidence in the diff (cite the file)
- **missing** — no test file or test function in the diff matches this item

Track these counts for the final JSON output:
- `tests_promised` = number of TEST-category plan items
- `tests_landed` = number of TEST items with evidence in the diff
- `tests_missing` = list of TEST item texts (verbatim or concise) where evidence was absent

**Why this matters:** Reviewer asked for the test in /plan-eng-review. Plan listed it. Implementer didn't write it. Step 8 used to count it as a NOT DONE generic item — now it gets called out specifically so users can decide whether the gap is acceptable before the PR lands.

### Output Format

```
Expand Down Expand Up @@ -1972,14 +2001,17 @@ After producing the completion checklist:
**Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary.
>
> After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":"<markdown checklist for PR body>"}`
> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":"<markdown checklist for PR body>"}`
>
> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`.

**Parent processing:**

1. Parse the LAST line of the subagent's output as JSON.
2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count.
5. Embed `summary` in PR body's `## Plan Completion` section (Step 19).

**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.

Expand Down
7 changes: 5 additions & 2 deletions ship/SKILL.md.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -268,14 +268,17 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
> {{PLAN_COMPLETION_AUDIT_SHIP}}
>
> After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":"<markdown checklist for PR body>"}`
> `{"total_items":N,"done":N,"changed":N,"deferred":N,"tests_promised":N,"tests_landed":N,"tests_missing":["item text",...],"summary":"<markdown checklist for PR body>"}`
>
> The `tests_*` fields come from the Test-Promise Audit section. If the plan has no TEST-category items, set `tests_promised:0`, `tests_landed:0`, `tests_missing:[]`.

**Parent processing:**

1. Parse the LAST line of the subagent's output as JSON.
2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
4. **Test-promise gap (#1070):** if `tests_missing.length > 0`, append one informational line to the `summary` before embedding it: `> ⚠ Plan promised {tests_promised} tests, {tests_landed} landed. Missing: {tests_missing joined with "; "}`. Informational only — does NOT gate. The user already saw NOT-DONE items via the existing `deferred` flow; this just makes the test gap explicit instead of hiding it inside a generic "NOT DONE" count.
5. Embed `summary` in PR body's `## Plan Completion` section (Step 19).

**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.

Expand Down
70 changes: 70 additions & 0 deletions test/resolver-test-promise-audit.test.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
/**
* Test-Promise Audit resolver assertions (#1070).
*
* The Plan Completion Audit subagent has to surface TEST-category items
* that were promised but didn't land. The resolver injects per-language
* test-file pattern guidance and expands the JSON output contract with
* `tests_promised`, `tests_landed`, `tests_missing` fields.
*
* Pin the format contract here so a future edit to the resolver can't
* silently drop the test-promise audit instruction.
*/
import { describe, test, expect } from 'bun:test';
import type { TemplateContext } from '../scripts/resolvers/types';
import { HOST_PATHS } from '../scripts/resolvers/types';
import {
generatePlanCompletionAuditShip,
generatePlanCompletionAuditReview,
} from '../scripts/resolvers/review';

function makeCtx(skillName: string): TemplateContext {
return {
skillName,
tmplPath: 'test.tmpl',
host: 'claude',
paths: HOST_PATHS.claude,
preambleTier: 4,
};
}

describe('Plan Completion Audit — Test-Promise section (#1070)', () => {
test('ship variant includes the Test-Promise Audit instruction', () => {
const out = generatePlanCompletionAuditShip(makeCtx('ship'));
expect(out).toContain('Test-Promise Audit');
// Per-language test patterns must be enumerated so the subagent can
// grep deterministically across stacks instead of guessing.
expect(out).toContain('*.test.ts');
expect(out).toContain('test_*.py');
expect(out).toContain('*_test.go');
expect(out).toContain('*_spec.rb');
// Counters that the parent template parses out of the JSON.
expect(out).toContain('tests_promised');
expect(out).toContain('tests_landed');
expect(out).toContain('tests_missing');
});

test('review variant also includes Test-Promise Audit', () => {
// Review-mode runs without /ship's gating but the subagent should still
// surface the gap — review's whole job is to call out delivery integrity.
const out = generatePlanCompletionAuditReview(makeCtx('review'));
expect(out).toContain('Test-Promise Audit');
expect(out).toContain('tests_promised');
});

test('audit body still has the Item Extraction step that produces TEST items', () => {
// The Test-Promise Audit only works because Item Extraction tags items
// by category. Pin the contract: TEST must remain one of the categories.
const out = generatePlanCompletionAuditShip(makeCtx('ship'));
expect(out).toContain('CODE | TEST | MIGRATION | CONFIG | DOCS');
});

test('Cross-Reference step still classifies DONE/PARTIAL/NOT DONE/CHANGED', () => {
// The Test-Promise Audit's `landed`/`missing` classification piggybacks
// on the existing diff cross-reference. If the existing classes
// disappear, the test-promise output goes with them.
const out = generatePlanCompletionAuditShip(makeCtx('ship'));
expect(out).toContain('DONE');
expect(out).toContain('NOT DONE');
expect(out).toContain('PARTIAL');
});
});
Loading