feat(code-ops): add specification-implementation test separation rules
Introduce Rule 31 (code.md), Rule 10 (testing.md), and spec test
sections in make_plan.md to enforce physical separation of specification
tests and implementation tests. This prevents tautological testing where
tests mirror the implementation instead of independently verifying
behavior against the specification, allowing bugs to ship undetected.
Key additions:
- Spec vs impl test file naming conventions (.spec.test / .impl.test)
- Immutable oracle rule: spec test failures mean the implementation is
wrong, not the test
- Red-phase verification protocol and escalation procedures
- Mandatory spec test case tables in planning documents with source
traceability to requirements, API contracts, and ambiguity register
docs/code.md: 41 additions & 0 deletions
@@ -402,6 +402,47 @@ These rules are **mandatory** and must be applied **strictly and consistently**
- api.errors.test.go
```

31. **🚨 Specification-Implementation Test Separation (🚨 NON-NEGOTIABLE)**

Every feature MUST have two distinct categories of tests, physically separated into different files. This rule prevents **tautological testing**, where tests mirror the implementation instead of independently verifying it against the specification, causing bugs to ship to production undetected.

**Two mandatory test categories:**

| Category | Source of Truth | File Convention | Purpose |
|----------|-----------------|-----------------|---------|
| **Specification Tests** | Requirements, acceptance criteria, API contracts, RFCs | `[feature].spec.test.[ext]` | Verify the code does what the **specification** says |

- Spec test expectations MUST be derived from specification documents (requirements, acceptance criteria, API contracts), NEVER from reading the implementation code
- Spec tests are **immutable oracles**: if a spec test fails after implementation, the **implementation is wrong**, not the test
- The agent MUST NOT modify spec test expectations to make them match the implementation without explicit user approval
- When writing spec tests, the agent MAY read type definitions and function signatures (public API surface) but MUST NOT read implementation logic (function bodies, internal algorithms)
- Every spec test MUST include a traceability comment linking it to its source requirement or ambiguity register (AR) entry

**File organization:**

```
tests/
├── auth/
│   ├── auth.login.spec.test.[ext]   # Specification tests (from requirements)
```
**Describe block labeling:** Within spec test files, use `describe('Specification: [Feature]', ...)` to make the test category unmistakable in test output.

**🚫 PROHIBITED (the agent MUST NOT):**

- ❌ Write only implementation tests without specification tests
- ❌ Combine spec and impl tests in the same file
- ❌ Derive spec test expectations from running the code and observing output
- ❌ Modify spec test assertions to match a broken implementation
- ❌ Skip, disable, or weaken spec tests that fail after implementation
- ❌ Rationalize spec test failures as "the spec was wrong" without user approval

> **📖 See `testing.md` Rule 10 (Specification-First Testing Protocol)** for the full protocol including the red-phase verification, escalation procedures, and interaction with `make_plan`.
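
The naming convention above lends itself to a mechanical check. The following is a minimal sketch, not part of the rule set: `classifyTestFile` and `featuresMissingSpecTests` are hypothetical helper names, and the regular expression assumes the `[ext]` part contains no dots.

```typescript
// Hypothetical helpers: classify test files by the Rule 31 naming convention.
type TestCategory = "spec" | "impl" | "other";

// Matches "[feature].spec.test.[ext]" and "[feature].impl.test.[ext]".
// Assumption: the extension itself contains no dots.
const TEST_FILE_PATTERN = /^(.+)\.(spec|impl)\.test\.[^.]+$/;

function classifyTestFile(fileName: string): { feature: string; category: TestCategory } {
  const match = TEST_FILE_PATTERN.exec(fileName);
  if (match === null) {
    // Files such as "api.errors.test.go" follow neither convention.
    return { feature: fileName, category: "other" };
  }
  return { feature: match[1], category: match[2] as TestCategory };
}

// Returns the features that have implementation tests but no specification tests,
// which is the first item on the prohibited list.
function featuresMissingSpecTests(fileNames: string[]): string[] {
  const specFeatures = new Set<string>();
  const implFeatures = new Set<string>();
  for (const name of fileNames) {
    const { feature, category } = classifyTestFile(name);
    if (category === "spec") specFeatures.add(feature);
    if (category === "impl") implFeatures.add(feature);
  }
  return Array.from(implFeatures)
    .filter((feature) => !specFeatures.has(feature))
    .sort();
}
```

A check like this could run over a directory listing before review; it only inspects file names, so it cannot tell whether a spec test's expectations were actually derived from the specification.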
- [ ] Spec test tasks reference ST-cases from `07-testing-strategy.md`
- [ ] Spec test and impl test files use separate naming conventions (`*.spec.test.*` and `*.impl.test.*`)
- [ ] Red-phase verification task exists in execution plan (verify spec tests fail before implementation)

**✅ No Dead Code (per `code.md` rule 4)**
- [ ] No unused parameters (except interface contracts, overrides, and framework-required signatures)
- [ ] No unused functions, classes, or modules
@@ -858,6 +921,87 @@ For each task in order:
4. **Techdocs check (after each phase):** If `docs/index.md` exists with `techdocs: true` frontmatter and the just-completed phase introduced architectural changes (new components, data entities, API endpoints, integrations, or infrastructure), perform an incremental techdocs update (see `techdocs.md` Phase 6.1)
5. Continue until all tasks complete OR context window reaches 90%
> When executing implementation tasks for any feature, the agent MUST follow the three-phase task ordering defined below. This is enforced at the execution plan level: every generated `99-execution-plan.md` MUST structure feature phases in this order. See `testing.md` Rule 10 for the full Specification-First Testing Protocol.

**Every feature implementation phase in `99-execution-plan.md` MUST follow this three-phase task structure.** This prevents tautological testing, where tests mirror the implementation instead of independently verifying it against the specification. See `testing.md` Rule 10 and `code.md` Rule 31.
N.3.2 Full verification (project's verify command)
```

### Why This Ordering Is Non-Negotiable

| Step | Rationale |
|------|-----------|
| **Spec tests BEFORE implementation** | Prevents the agent from deriving test expectations from the code it just wrote |
| **Red phase verification** | Proves spec tests are meaningful (they fail because the behavior does not exist yet) |
| **Spec tests PASS after implementation** | Proves the implementation satisfies the specification |
| **Impl tests AFTER implementation** | These tests CAN be derived from the code (edge cases, internals), but spec tests cannot |

### Enforcement Rules

**🚫 PROHIBITED (the agent MUST NOT):**

- ❌ Write implementation code before specification tests exist for that feature
- ❌ Skip the spec test phase ("we'll write tests after")
- ❌ Combine spec tests and implementation in the same task
- ❌ Write spec tests and implementation simultaneously
- ❌ Generate an execution plan where implementation tasks come before spec test tasks for the same feature

**✅ REQUIRED (every generated `99-execution-plan.md` MUST):**

- ✅ Structure each feature phase with the three-session ordering above
- ✅ Include explicit spec test file references (`[feature].spec.test.[ext]`)
- ✅ Include explicit impl test file references (`[feature].impl.test.[ext]`)
- ✅ Reference the ST-cases from `07-testing-strategy.md` in spec test tasks
- ✅ Include red-phase verification as a distinct task

### Adaptation for Small Features

For small features where three separate sessions would be excessive, the agent MAY compress into a single session, but the **task ordering is still mandatory**:

```
Session N.1: [Feature Name]
N.1.1 Write specification tests (from ST-cases)
N.1.2 Verify spec tests fail (red phase)
N.1.3 Implement feature
N.1.4 Verify spec tests pass (green phase)
N.1.5 Write implementation tests
N.1.6 Full verification
```

The order `spec tests → red phase → implement → green phase → impl tests → verify` is NEVER negotiable, regardless of feature size.
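
The mandated order can be checked mechanically against a generated plan. The sketch below is illustrative only: the step identifiers and the `checkTaskOrder` function are hypothetical names, and it assumes each feature's tasks have already been tagged with one of the six steps. It looks only at the first occurrence of each step, which is a simplification.

```typescript
// The six steps from the mandated order, in sequence (identifiers are hypothetical).
const REQUIRED_ORDER = [
  "spec-tests",
  "red-phase",
  "implement",
  "green-phase",
  "impl-tests",
  "verify",
] as const;

type Step = (typeof REQUIRED_ORDER)[number];

// Returns a list of violations; an empty list means the task order is acceptable.
function checkTaskOrder(tasks: Step[]): string[] {
  const violations: string[] = [];
  let lastIndex = -1;
  for (const step of REQUIRED_ORDER) {
    const index = tasks.indexOf(step);
    if (index === -1) {
      violations.push(`missing step: ${step}`);
      continue;
    }
    if (index < lastIndex) {
      // This step appears before a step that must precede it.
      violations.push(`step out of order: ${step}`);
    }
    lastIndex = Math.max(lastIndex, index);
  }
  return violations;
}
```

For example, a plan that lists `implement` before `spec-tests` yields one "step out of order" violation, and a plan with no red-phase task yields a "missing step" violation.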
docs/preflight.md: 27 additions & 0 deletions
@@ -524,6 +524,33 @@ For any finding in dimensions 2, 4, 5, 6, 11, or 13 — and for any finding that
Codebase reconnaissance (Step 2) should be thorough but proportional. For a plan that modifies 3 files, read those 3 files deeply plus their direct dependents. For a requirements document about a new subsystem, understand the overall architecture and the integration points. Do NOT attempt to read the entire codebase for a small, scoped artifact — that wastes context window. Focus on the code that the artifact actually touches or depends on.
**The agent performing preflight MUST explicitly acknowledge and counteract the risk of same-agent bias.** When the same AI model created the artifact and reviews it, systematic blind spots are likely: the agent shares the same training biases, the same knowledge gaps, and the same reasoning patterns. A bug the agent missed during creation is exactly the kind of bug it will miss during review.

**Structural safeguards:**

1. **Fresh context required:** If the agent created the artifact in the CURRENT session, it MUST note this at the top of the preflight report:
   ```
   ⚠️ SAME-SESSION REVIEW: This artifact was created in the current session.
   Same-agent bias risk is elevated. Consider running preflight in a new session
   for maximum review independence.
   ```

2. **Standard-first checking:** For any behavior that must conform to an external standard (RFC, protocol, specification, regulation), the agent MUST verify conformance by **citing the specific standard text**, not by reasoning from memory. If the agent cannot cite the standard, it MUST flag this as a limitation:
   ```
   ⚠️ Unable to verify conformance with [standard] — agent does not have
   access to the full standard text. Flag for human review.
   ```

3. **Adversarial question checklist:** Before concluding the 13-dimension scan, the agent MUST ask itself:
   - "What assumption did I make during creation that I might be unconsciously confirming now?"
   - "What external standard or convention might this violate that I'm not aware of?"
   - "What would a domain expert who disagrees with my approach flag as wrong?"

   If any of these questions surface concerns, add them as 🔵 OBSERVATION findings.

4. **User recommendation:** If the artifact is high-stakes (security-related, compliance-related, or architecturally foundational), the agent SHOULD recommend: *"Consider having a human domain expert review this artifact in addition to the automated preflight."*
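
The first two safeguards amount to conditionally emitting fixed notices, which can be sketched as a small function. This is an illustration under assumptions, not the preflight implementation: `PreflightContext`, its fields, and `buildPreflightNotices` are hypothetical names; only the notice wording comes from the templates above.

```typescript
// Hypothetical description of the review situation.
interface PreflightContext {
  // True when the artifact under review was created in the current session.
  createdInCurrentSession: boolean;
  // Standards whose text the reviewer could not cite directly.
  uncitedStandards: string[];
}

// Builds the notices that safeguards 1 and 2 require in the preflight report.
function buildPreflightNotices(context: PreflightContext): string[] {
  const notices: string[] = [];
  if (context.createdInCurrentSession) {
    notices.push(
      "⚠️ SAME-SESSION REVIEW: This artifact was created in the current session.\n" +
        "Same-agent bias risk is elevated. Consider running preflight in a new session\n" +
        "for maximum review independence."
    );
  }
  for (const standard of context.uncitedStandards) {
    notices.push(
      `⚠️ Unable to verify conformance with ${standard} — agent does not have\n` +
        "access to the full standard text. Flag for human review."
    );
  }
  return notices;
}
```

The same-session notice comes first so that it lands at the top of the report, as safeguard 1 requires.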
**Acceptance criteria MUST be specific enough that a developer who has never spoken to the user can write a correct test from the criterion alone.** This rule prevents the acceptance criteria tautology, where the agent writes vague criteria, then later writes tests that interpret the criteria however the implementation happens to work, creating a self-validating loop.

**Every acceptance criterion MUST meet ALL of these requirements:**

1. **Measurable outcome:** States a concrete, observable result (not "works correctly" or "handles errors properly")
2. **Specific values:** Includes exact numbers, formats, status codes, or field names where applicable
3. **Standard references:** When the behavior must conform to a standard (RFC, protocol, specification), the criterion MUST cite the specific standard and section (e.g., "per RFC 8414 §2" not "follows the OIDC spec")
4. **Boundary conditions:** States what happens at the edges (empty input, maximum length, zero items, expired tokens)
5. **Negative cases:** States what should NOT happen or what should be rejected

**Examples:**

```
❌ BAD:  "The API returns a valid OIDC discovery document"
✅ GOOD: "GET /.well-known/openid-configuration returns a JSON document where
         the 'issuer' field exactly matches the URL used to access the endpoint
         (per RFC 8414 §2), and includes all REQUIRED fields: issuer,
         POST /api/users with an email longer than 254 characters returns 400."
```

**If the user provides vague acceptance criteria** during review (Step 3.5), the agent MUST ask for specifics: *"This criterion says 'handles errors properly'. What specific error conditions should be handled, and what should the response look like for each?"*

**Traceability to tests:** When `make_plan` later derives test cases from these criteria, each spec test expectation MUST map directly to a specific acceptance criterion. If a criterion is too vague to produce a concrete test assertion, the criterion is defective, not the test.
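
A crude first-pass filter for the "measurable outcome" requirement can be sketched as a phrase check. This is a heuristic under stated assumptions, not a substitute for review: `VAGUE_PHRASES` contains only the two phrases quoted in the requirements above, and `findVaguePhrases` is a hypothetical helper name. A criterion that passes this check can still be too vague to test.

```typescript
// Phrases the requirements above name as unmeasurable.
// Assumption: projects would extend this list with their own examples.
const VAGUE_PHRASES = ["works correctly", "handles errors properly"];

// Returns the vague phrases found in a criterion (case-insensitive).
// An empty result does not prove the criterion is specific enough.
function findVaguePhrases(criterion: string): string[] {
  const lowered = criterion.toLowerCase();
  return VAGUE_PHRASES.filter((phrase) => lowered.indexOf(phrase) !== -1);
}
```

Applied to the criterion "The API handles errors properly", the function reports `handles errors properly`; applied to "POST /api/users with an email longer than 254 characters returns 400." it reports nothing.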
### 3.5 Authoring Workflow

Write RDs one at a time, presenting each to the user for review: