blendsdk
diff --git a/docs/code.md b/docs/code.md
Lines changed: 41 additions & 0 deletions
diff --git a/docs/make_plan.md b/docs/make_plan.md
Lines changed: 148 additions & 4 deletions
diff --git a/docs/preflight.md b/docs/preflight.md
Lines changed: 27 additions & 0 deletions
diff --git a/docs/requirements.md b/docs/requirements.md
Lines changed: 38 additions & 0 deletions
@@ -402,6 +402,47 @@ These rules are **mandatory** and must be applied **strictly and consistently**
     - api.errors.test.go
     ```
 
+31. **🚨 Specification-Implementation Test Separation (🚨 NON-NEGOTIABLE)**
+
+    Every feature MUST have two distinct categories of tests, physically separated into different files. This rule prevents **tautological testing** — where tests mirror the implementation instead of independently verifying it against the specification, causing bugs to ship to production undetected.
+
+    **Two mandatory test categories:**
+
+    | Category | Source of Truth | File Convention | Purpose |
+    |----------|----------------|-----------------|---------|
+    | **Specification Tests** | Requirements, acceptance criteria, API contracts, RFCs | `[feature].spec.test.[ext]` | Verify the code does what the **specification** says |
+    | **Implementation Tests** | The code itself | `[feature].impl.test.[ext]` | Verify internals, edge cases, error paths, boundary conditions |
+
+    **Specification test rules:**
+    - Spec test expectations MUST be derived from specification documents (requirements, acceptance criteria, API contracts) — NEVER from reading the implementation code
+    - Spec tests are **immutable oracles** — if a spec test fails after implementation, the **implementation is wrong**, not the test
+    - The agent MUST NOT modify spec test expectations to make them match the implementation without explicit user approval
+    - When writing spec tests, the agent MAY read type definitions and function signatures (public API surface) but MUST NOT read implementation logic (function bodies, internal algorithms)
+    - Every spec test MUST include a traceability comment linking it to its source requirement or AR entry
+
+    **File organization:**
+    ```
+    tests/
+    ├── auth/
+    │   ├── auth.login.spec.test.[ext]       # Specification tests (from requirements)
+    │   └── auth.login.impl.test.[ext]       # Implementation tests (edge cases, internals)
+    └── user/
+        ├── user.creation.spec.test.[ext]    # Specification tests
+        └── user.creation.impl.test.[ext]    # Implementation tests
+    ```
+
+    **Describe block labeling:** Within spec test files, use `describe('Specification: [Feature]', ...)` to make the test category unmistakable in test output.
+
+    **🚫 PROHIBITED — The agent MUST NOT:**
+    - ❌ Write only implementation tests without specification tests
+    - ❌ Combine spec and impl tests in the same file
+    - ❌ Derive spec test expectations from running the code and observing output
+    - ❌ Modify spec test assertions to match a broken implementation
+    - ❌ Skip, disable, or weaken spec tests that fail after implementation
+    - ❌ Rationalize spec test failures as "the spec was wrong" without user approval
+
+    > **📖 See `testing.md` Rule 10 (Specification-First Testing Protocol)** for the full protocol including the red-phase verification, escalation procedures, and interaction with `make_plan`.
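
The file-separation requirement in Rule 31 lends itself to a mechanical check. The following is a minimal sketch, assuming test files follow the `[feature].spec.test.[ext]` / `[feature].impl.test.[ext]` convention exactly; the function name `missing_test_categories` and the sample file names are hypothetical and not part of any existing tooling.

```python
import re
from collections import defaultdict

# Assumed shape of the naming convention: <feature>.<spec|impl>.test.<ext>
TEST_FILE = re.compile(r"^(?P<feature>.+)\.(?P<kind>spec|impl)\.test\.(?P<ext>[^.]+)$")


def missing_test_categories(file_names):
    """Map each feature to the test categories it lacks ('spec' and/or 'impl')."""
    found = defaultdict(set)
    for name in file_names:
        match = TEST_FILE.match(name)
        if match:  # files outside the convention are ignored in this sketch
            found[match["feature"]].add(match["kind"])
    return {
        feature: sorted({"spec", "impl"} - kinds)
        for feature, kinds in found.items()
        if kinds != {"spec", "impl"}
    }


# Hypothetical file listing: user.creation has no specification test file.
files = [
    "auth.login.spec.test.ts",
    "auth.login.impl.test.ts",
    "user.creation.impl.test.ts",
]
print(missing_test_categories(files))
```

A check like this only covers the naming half of the rule. It cannot tell whether spec test expectations were actually derived from the specification rather than from the code.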
+
 ---
 
 ## 10. Security-First Development
 
@@ -662,13 +662,60 @@ Choose based on estimated size — each document should be manageable within AI
 - Integration tests: Key workflows covered
 - E2E tests: Complete feature verification
 
+## 🚨 Specification Test Cases (MANDATORY — NON-NEGOTIABLE)
+
+> **These test cases are derived EXCLUSIVELY from requirements (`01-requirements.md`),
+> component specifications (`03-XX-*.md`), API contracts, RFCs, and the Ambiguity Register
+> (`00-ambiguity-register.md`). They define the expected behavior BEFORE any
+> implementation exists.**
+>
+> **IMMUTABLE ORACLE RULE:** The agent MUST NOT modify these expectations to match the
+> implementation. If the implementation does not match a spec test case, the implementation
+> is wrong — not the test. See `testing.md` Rule 10 for the full protocol.
+>
+> **Every spec test case MUST include a source reference** tracing it to the requirement,
+> spec document, or AR entry that defines the expected behavior.
+
+### [Component/Feature 1]
+
+| # | Input / Scenario | Expected Output / Behavior | Source |
+|---|-----------------|---------------------------|--------|
+| ST-1 | [Concrete input or action] | [Concrete expected output or behavior] | [Req X.X / AR #X / RFC §X] |
+| ST-2 | [Concrete input or action] | [Concrete expected output or behavior] | [Req X.X / AR #X] |
+| ST-3 | [Error/edge scenario] | [Expected error type and message] | [Req X.X / AR #X] |
+
+### [Component/Feature 2]
+
+| # | Input / Scenario | Expected Output / Behavior | Source |
+|---|-----------------|---------------------------|--------|
+| ST-4 | [Concrete input or action] | [Concrete expected output or behavior] | [Req X.X / AR #X] |
+| ST-5 | [Concrete input or action] | [Concrete expected output or behavior] | [Req X.X / AR #X] |
+
+> **⚠️ AUTHORING RULE:** When writing spec test cases, the plan author MUST derive
+> expectations from the specification documents listed above. The author MUST NOT
+> imagine or infer what the implementation will produce. If the expected output cannot
+> be determined from the specification, this is an ambiguity — add it to the Ambiguity
+> Register and resolve with the user before defining the test case.
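
One way to catch ST-cases that were left as template placeholders, or that lack a source reference, is to scan the table rows. This is a rough sketch under stated assumptions: rows use the four-column layout shown above, cells do not contain literal `|` characters, and the helper names (`parse_st_rows`, `incomplete_cases`) and the requirement IDs in the sample are made up for illustration.

```python
import re

# Template text such as "[Concrete input or action]" counts as unfilled.
PLACEHOLDER = re.compile(r"^\[.*\]$")


def parse_st_rows(markdown):
    """Yield (case_id, scenario, expected, source) for each ST-n table row."""
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 4 and re.fullmatch(r"ST-\d+", cells[0]):
            yield tuple(cells)


def incomplete_cases(markdown):
    """Return IDs of ST-cases with an empty or placeholder scenario, expectation, or source."""
    bad = []
    for case_id, scenario, expected, source in parse_st_rows(markdown):
        fields = (scenario, expected, source)
        if any(not field or PLACEHOLDER.match(field) for field in fields):
            bad.append(case_id)
    return bad


# Hypothetical table: ST-2 is still a template row, ST-3 has no source.
table = """
| # | Input / Scenario | Expected Output / Behavior | Source |
|---|------------------|----------------------------|--------|
| ST-1 | POST /auth/reset-password with a valid email | 202 Accepted | Req 2.1 |
| ST-2 | [Concrete input or action] | [Concrete expected output or behavior] | [Req X.X / AR #X] |
| ST-3 | POST /api/users without 'email' | 400 VALIDATION_ERROR | |
"""
print(incomplete_cases(table))
```

The scan confirms only that a source cell is filled in, not that the cited requirement really defines the expected behavior. That judgment still belongs to the plan author and the user.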
+
 ## Test Categories
 
-### Unit Tests
+### Specification Tests (from ST-cases above)
+
+> Written BEFORE implementation. Filed as `[feature].spec.test.[ext]`.
+> See `testing.md` Rule 10 and `code.md` Rule 31.
 
-| Test        | Description      | Priority     |
-| ----------- | ---------------- | ------------ |
-| [Test name] | [What it tests]  | High/Med/Low |
+| Test File | ST Cases Covered | Component |
+| --------- | ---------------- | --------- |
+| `[feature].spec.test.[ext]` | ST-1, ST-2, ST-3 | [Component 1] |
+| `[feature].spec.test.[ext]` | ST-4, ST-5 | [Component 2] |
+
+### Implementation Tests (edge cases, internals)
+
+> Written AFTER implementation. Filed as `[feature].impl.test.[ext]`.
+
+| Test File | Description | Priority |
+| --------- | ----------- | -------- |
+| `[feature].impl.test.[ext]` | [Edge cases, boundary conditions, internal logic] | High/Med/Low |
 
 ### Integration Tests
 
@@ -694,6 +741,12 @@ Choose based on estimated size — each document should be manageable within AI
 
 ## Verification Checklist
 
+- [ ] All specification test cases (ST-*) defined with concrete input/output pairs
+- [ ] Every ST case traces to a requirement, spec doc, or AR entry
+- [ ] Specification tests written BEFORE implementation
+- [ ] Specification tests verified to FAIL before implementation (red phase)
+- [ ] All specification tests pass after implementation (green phase)
+- [ ] Implementation tests written for edge cases and internals
 - [ ] All unit tests pass
 - [ ] All integration tests pass
 - [ ] All E2E tests pass
@@ -729,6 +782,16 @@ Before finalizing plan documents, run this checklist:
 - [ ] E2E tests planned
 - [ ] Test coverage goals defined
 
+**✅ Specification-First Testing (per `testing.md` Rule 10, `code.md` Rule 31) — 🚨 NON-NEGOTIABLE**
+- [ ] `07-testing-strategy.md` contains the `🚨 Specification Test Cases` section with concrete ST-cases
+- [ ] Every ST-case has concrete input → expected output pairs (not just test names/descriptions)
+- [ ] Every ST-case traces to a requirement, spec document, RFC, or AR entry
+- [ ] ST-case expectations are derived from specification documents, NOT from imagined implementation behavior
+- [ ] `99-execution-plan.md` follows the three-phase task ordering: spec tests → implementation → impl tests
+- [ ] Spec test tasks reference ST-cases from `07-testing-strategy.md`
+- [ ] Spec test and impl test files use separate naming conventions (`*.spec.test.*` and `*.impl.test.*`)
+- [ ] Red-phase verification task exists in execution plan (verify spec tests fail before implementation)
+
 **✅ No Dead Code (per `code.md` rule 4)**
 - [ ] No unused parameters (except interface contracts, overrides, and framework-required signatures)
 - [ ] No unused functions, classes, or modules
@@ -858,6 +921,87 @@ For each task in order:
 4. **Techdocs check (after each phase):** If `docs/index.md` exists with `techdocs: true` frontmatter and the just-completed phase introduced architectural changes (new components, data entities, API endpoints, integrations, or infrastructure), perform an incremental techdocs update (see `techdocs.md` Phase 6.1)
 5. Continue until all tasks complete OR context window reaches 90%
 
+> **🚨 SPECIFICATION-FIRST TASK ORDERING — NON-NEGOTIABLE 🚨**
+>
+> When executing implementation tasks for any feature, the agent MUST follow the three-phase task ordering defined below. This is enforced at the execution plan level — every generated `99-execution-plan.md` MUST structure feature phases in this order. See `testing.md` Rule 10 for the full Specification-First Testing Protocol.
+
+---
+
+## **🚨 CRITICAL: Specification-First Task Ordering in Execution Plans (NON-NEGOTIABLE) 🚨**
+
+**Every feature implementation phase in `99-execution-plan.md` MUST follow this three-phase task structure.** This prevents tautological testing — where tests mirror the implementation instead of independently verifying it against the specification. See `testing.md` Rule 10 and `code.md` Rule 31.
+
+### Mandatory Task Ordering Per Feature
+
+```
+Phase N: [Feature Name]
+
+  Session N.1: Specification Tests (BEFORE implementation)
+    N.1.1  Write specification tests from 07-testing-strategy.md ST-cases
+           → File: [feature].spec.test.[ext]
+           → Source: 07-testing-strategy.md ST-1 through ST-X
+           → Agent MUST NOT read implementation logic when writing these tests
+    N.1.2  Run spec tests — verify they FAIL (red phase)
+           → Document any that pass pre-implementation with justification
+
+  Session N.2: Implementation
+    N.2.1  Implement [feature/component] per technical specification
+           → File: [implementation files]
+           → Reference: 03-XX-[component].md
+    N.2.2  Run spec tests — verify they PASS (green phase)
+           → If any spec test fails: STOP, fix implementation (NOT the test)
+
+  Session N.3: Implementation Tests & Hardening
+    N.3.1  Write implementation tests (edge cases, internals, error paths)
+           → File: [feature].impl.test.[ext]
+    N.3.2  Full verification (project's verify command)
+```
+
+### Why This Ordering Is Non-Negotiable
+
+| Step | What It Prevents |
+|------|-----------------|
+| **Spec tests BEFORE implementation** | Prevents agent from deriving test expectations from the code it just wrote |
+| **Red phase verification** | Proves spec tests are meaningful (they test something that doesn't exist yet) |
+| **Spec tests PASS after implementation** | Proves the implementation satisfies the specification |
+| **Impl tests AFTER implementation** | These tests CAN be derived from the code (edge cases, internals) — but spec tests cannot |
+
+### Enforcement Rules
+
+**🚫 PROHIBITED — The agent MUST NOT:**
+
+- ❌ Write implementation code before specification tests exist for that feature
+- ❌ Skip the spec test phase ("we'll write tests after")
+- ❌ Combine spec tests and implementation in the same task
+- ❌ Write spec tests and implementation simultaneously
+- ❌ Generate an execution plan where implementation tasks come before spec test tasks for the same feature
+
+**✅ REQUIRED — Every generated `99-execution-plan.md` MUST:**
+
+- ✅ Structure each feature phase with the three-session ordering above
+- ✅ Include explicit spec test file references (`[feature].spec.test.[ext]`)
+- ✅ Include explicit impl test file references (`[feature].impl.test.[ext]`)
+- ✅ Reference the ST-cases from `07-testing-strategy.md` in spec test tasks
+- ✅ Include red-phase verification as a distinct task
+
+### Adaptation for Small Features
+
+For small features where three separate sessions would be excessive, the agent MAY compress into a single session — but the **task ordering is still mandatory**:
+
+```
+Session N.1: [Feature Name]
+  N.1.1  Write specification tests (from ST-cases)
+  N.1.2  Verify spec tests fail (red phase)
+  N.1.3  Implement feature
+  N.1.4  Verify spec tests pass (green phase)
+  N.1.5  Write implementation tests
+  N.1.6  Full verification
+```
+
+The order `spec tests → red phase → implement → green phase → impl tests → verify` is NEVER negotiable, regardless of feature size.
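
The mandatory order can be expressed as a small validator over one feature's task list. The sketch below assumes each task has already been labeled with one of six kinds; the labels (`spec_tests`, `red_phase`, and so on) and the function `ordering_violations` are illustrative names, not identifiers used by `make_plan`.

```python
# Assumed labels for the six steps, in the required order.
REQUIRED_ORDER = [
    "spec_tests",   # write specification tests from ST-cases
    "red_phase",    # verify spec tests fail
    "implement",    # implement the feature
    "green_phase",  # verify spec tests pass
    "impl_tests",   # write implementation tests
    "verify",       # full verification
]


def ordering_violations(task_kinds):
    """Return human-readable problems for one feature's ordered list of task kinds."""
    positions = {}
    for index, kind in enumerate(task_kinds):
        positions.setdefault(kind, index)  # only the first occurrence is considered

    problems = [f"missing task: {kind}" for kind in REQUIRED_ORDER if kind not in positions]

    # Compare each required step with the next required step that is present.
    present = [kind for kind in REQUIRED_ORDER if kind in positions]
    for earlier, later in zip(present, present[1:]):
        if positions[earlier] > positions[later]:
            problems.append(f"{later} appears before {earlier}")
    return problems


print(ordering_violations(REQUIRED_ORDER))
print(ordering_violations(
    ["implement", "spec_tests", "red_phase", "green_phase", "impl_tests", "verify"]
))
print(ordering_violations(["spec_tests", "implement", "green_phase", "impl_tests", "verify"]))
```

Because the check looks only at the first occurrence of each kind, a plan that repeats a step later (for example, a second implementation task after the green phase) would need a stricter comparison than this sketch provides.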
+
+---
+
 #### Step 3: Session Wrap-Up
 
 1. ✅ Complete current task before stopping
 
@@ -524,6 +524,33 @@ For any finding in dimensions 2, 4, 5, 6, 11, or 13 — and for any finding that
 
 Codebase reconnaissance (Step 2) should be thorough but proportional. For a plan that modifies 3 files, read those 3 files deeply plus their direct dependents. For a requirements document about a new subsystem, understand the overall architecture and the integration points. Do NOT attempt to read the entire codebase for a small, scoped artifact — that wastes context window. Focus on the code that the artifact actually touches or depends on.
 
+### Rule 10: Same-Agent Bias Awareness — 🚨 NON-NEGOTIABLE
+
+**The agent performing preflight MUST explicitly acknowledge and counteract the risk of same-agent bias.** When the same AI model created the artifact and reviews it, systematic blind spots are likely — the agent shares the same training biases, the same knowledge gaps, and the same reasoning patterns. A bug the agent missed during creation is exactly the kind of bug it will miss during review.
+
+**Structural safeguards:**
+
+1. **Fresh context required** — If the agent created the artifact in the CURRENT session, it MUST note this at the top of the preflight report:
+   ```
+   ⚠️ SAME-SESSION REVIEW: This artifact was created in the current session.
+   Same-agent bias risk is elevated. Consider running preflight in a new session
+   for maximum review independence.
+   ```
+
+2. **Standard-first checking** — For any behavior that must conform to an external standard (RFC, protocol, specification, regulation), the agent MUST verify conformance by **citing the specific standard text**, not by reasoning from memory. If the agent cannot cite the standard, it MUST flag this as a limitation:
+   ```
+   ⚠️ Unable to verify conformance with [standard] — agent does not have
+   access to the full standard text. Flag for human review.
+   ```
+
+3. **Adversarial question checklist** — Before concluding the 13-dimension scan, the agent MUST ask itself:
+   - "What assumption did I make during creation that I might be unconsciously confirming now?"
+   - "What external standard or convention might this violate that I'm not aware of?"
+   - "What would a domain expert who disagrees with my approach flag as wrong?"
+   If any of these questions surface concerns, add them as 🔵 OBSERVATION findings.
+
+4. **User recommendation** — If the artifact is high-stakes (security-related, compliance-related, or architecturally foundational), the agent SHOULD recommend: *"Consider having a human domain expert review this artifact in addition to the automated preflight."*
+
 ---
 
 ## **Cross-References**
 
@@ -638,6 +638,44 @@ When writing each RD:
 - **Complexity Estimates**: Tag each requirement section with estimated complexity (S/M/L/XL) to aid planning
 - **Non-Functional RD**: Always create one dedicated RD for non-functional requirements (performance targets, security, scalability, accessibility, availability, backup/recovery). Users frequently forget these.
 
+### 3.4B 🚨 Acceptance Criteria Specificity — NON-NEGOTIABLE
+
+**Acceptance criteria MUST be specific enough that a developer who has never spoken to the user can write a correct test from the criterion alone.** This rule prevents the acceptance criteria tautology — where the agent writes vague criteria, then later writes tests that interpret the criteria however the implementation happens to work, creating a self-validating loop.
+
+**Every acceptance criterion MUST meet ALL of these requirements:**
+
+1. **Measurable outcome** — States a concrete, observable result (not "works correctly" or "handles errors properly")
+2. **Specific values** — Includes exact numbers, formats, status codes, or field names where applicable
+3. **Standard references** — When the behavior must conform to a standard (RFC, protocol, specification), the criterion MUST cite the specific standard and section (e.g., "per RFC 8414 §2" not "follows the OIDC spec")
+4. **Boundary conditions** — States what happens at the edges (empty input, maximum length, zero items, expired tokens)
+5. **Negative cases** — States what should NOT happen or what should be rejected
+
+**Examples:**
+
+```
+❌ BAD: "The API returns a valid OIDC discovery document"
+✅ GOOD: "GET /.well-known/openid-configuration returns a JSON document where
+   the 'issuer' field exactly matches the issuer URL that prefixes the path
+   (per OpenID Connect Discovery 1.0 §4.3), and includes all REQUIRED fields
+   (per §3): issuer, authorization_endpoint, token_endpoint, jwks_uri,
+   response_types_supported, subject_types_supported,
+   id_token_signing_alg_values_supported"
+
+❌ BAD: "Users can reset their password"
+✅ GOOD: "POST /auth/reset-password with a valid email returns 202 Accepted,
+   sends an email with a one-time reset link that expires after 60 minutes,
+   and the link cannot be reused after the password is changed"
+
+❌ BAD: "The system handles invalid input gracefully"
+✅ GOOD: "POST /api/users with a missing 'email' field returns 400 with
+   { error: 'VALIDATION_ERROR', details: [{ field: 'email', message: '...' }] }.
+   POST /api/users with an email longer than 254 characters returns 400."
+```
+
+**If the user provides vague acceptance criteria** during review (Step 3.5), the agent MUST ask for specifics: *"This criterion says 'handles errors properly' — what specific error conditions should be handled, and what should the response look like for each?"*
+
+**Traceability to tests:** When `make_plan` later derives test cases from these criteria, each spec test expectation MUST map directly to a specific acceptance criterion. If a criterion is too vague to produce a concrete test assertion, the criterion is defective — not the test.
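
A first-pass screen for vague criteria can be automated before the user review in Step 3.5. This is a heuristic sketch only: the phrase list is taken from the wording and BAD examples above, the "concrete signal" test (a digit or a URL-like path) is an assumption about what specific criteria tend to contain, and `review_criterion` is a hypothetical helper.

```python
import re

# Phrases named in the rule text and BAD examples as too vague.
VAGUE_PHRASES = ("works correctly", "handles errors properly", "gracefully")

# Assumed signals of specificity: any digit (status code, length, duration) or a path.
CONCRETE_SIGNAL = re.compile(r"\d|/[\w.\-]+")


def review_criterion(text):
    """Return reasons the criterion looks too vague to test (empty list if none found)."""
    reasons = []
    lowered = text.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            reasons.append(f"vague phrase: '{phrase}'")
    if not CONCRETE_SIGNAL.search(text):
        reasons.append("no concrete value (number or path) found")
    return reasons


for criterion in (
    "The system handles invalid input gracefully",
    "Users can reset their password",
    "POST /api/users with an email longer than 254 characters returns 400.",
):
    print(criterion, "->", review_criterion(criterion))
```

Passing this screen does not make a criterion testable. It only flags the most obvious gaps so the agent knows which criteria to raise with the user first.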
+
 ### 3.5 Authoring Workflow
 
 Write RDs one at a time, presenting each to the user for review: