franklywatson
diff --git a/docs/L1-feedback-loops.md b/docs/L1-feedback-loops.md
Lines changed: 7 additions & 7 deletions
diff --git a/docs/L1-patterns/1.1-stack-tests.md b/docs/L1-patterns/1.1-stack-tests.md
Lines changed: 11 additions & 11 deletions
diff --git a/docs/L1-patterns/1.5-no-mock-philosophy.md b/docs/L1-patterns/1.5-no-mock-philosophy.md
Lines changed: 93 additions & 2 deletions
@@ -75,31 +75,31 @@ The following patterns describe the verification mechanisms that close the loop
 
 ### [Pattern 1.1 — Stack Tests](L1-patterns/1.1-stack-tests.md)
 
-Full-system tests running the complete Docker stack with API-level verification, zero mocks for owned services, and high failure diagnosticity. Each test models an atomic user journey — a single, complete interaction from the user's perspective. Includes health endpoint test mode, test fixture bootstrapping, and the "Beyond API Testing" extension to browser-driven verification with Playwright.
+Full-system tests running the complete Docker stack with API-level verification, zero mocks for owned services, and high failure diagnosticity. Each test models an atomic user journey — a single, complete interaction from the user's perspective. Includes health endpoint test mode, test fixture bootstrapping, and the "Beyond API Testing" extension to browser-driven verification with Playwright. **Why it matters:** Stack tests close the feedback loop — the agent gets binary confirmation that the system works end-to-end, not that individual functions return correct values.
 
 ### [Pattern 1.2 — Full-Loop Assertion Layering](L1-patterns/1.2-full-loop-assertions.md)
 
-Three-level assertion structure (primary, second-order, third-order) that provides diagnostic signal at increasing distance from the primary action. Primary verifies the direct response, second-order verifies side effects through a different API endpoint, third-order verifies cross-functional consistency (audit logs, notifications, cross-endpoint agreement).
+Three-level assertion structure (primary, second-order, third-order) that provides diagnostic signal at increasing distance from the primary action. Primary verifies the direct response, second-order verifies side effects through a different API endpoint, third-order verifies cross-functional consistency (audit logs, notifications, cross-endpoint agreement). **Why it matters:** A `200 OK` response proves nothing about side effects. Second-order assertions catch missing persistence; third-order assertions catch broken observability. Each failure mode points to a specific subsystem.
 
 ### [Pattern 1.3 — Sequential / Additive Test Design](L1-patterns/1.3-sequential-design.md)
 
-Tests ordered by dependency so that if test N fails, the agent knows tests 1 through N-1 passed. The sequence acts as a diagnostic ladder narrowing the search space. Stack tests as vertical slices — each test is one atomic user journey that can be run individually during development or as a full suite before completion.
+Tests ordered by dependency so that if test N fails, the agent knows tests 1 through N-1 passed. The sequence acts as a diagnostic ladder narrowing the search space. Stack tests as vertical slices — each test is one atomic user journey that can be run individually during development or as a full suite before completion. **Why it matters:** Without ordering, a checkout failure could mean broken auth, broken database, or broken checkout logic — ordering tells the agent exactly which subsystem to investigate.
 
 ### [Pattern 1.4 — Container Isolation](L1-patterns/1.4-container-isolation.md)
 
-Four mechanisms ensuring tests never interfere: unique container names, dynamic port allocation, transient volumes, and per-test compose files. Aggressive cleanup prevents Docker resource exhaustion during concurrent test execution.
+Four mechanisms ensuring tests never interfere: unique container names, dynamic port allocation, transient volumes, and per-test compose files. Aggressive cleanup prevents Docker resource exhaustion during concurrent test execution. **Why it matters:** Hardcoded ports and shared state produce flaky, non-deterministic test failures that waste agent tokens on false investigations. Isolation makes every test run deterministic.
 
 ### [Pattern 1.5 — Real Dependencies in E2E/Integration and Stack Tests](L1-patterns/1.5-no-mock-philosophy.md)
 
-Stack tests and E2E/integration tests use real everything — real PostgreSQL, real Redis, real message queues. The only acceptable mocks in these tests are external services without test environments. If you own it, run it. If you can run it in Docker, run it in Docker. Mocks are appropriate and encouraged in unit tests, which validate module contracts in isolation.
+Stack tests and E2E/integration tests use real everything — real PostgreSQL, real Redis, real message queues. If you own it, run it. If you can run it in Docker, run it in Docker. This pattern also covers **focused integration tests** — a niche pattern for exhaustive edge-case coverage of complex external dependencies (payment providers, blockchain RPCs) against their real APIs, without running the full stack. **Why it matters:** In these tests, mocks create a fantasy system that passes tests but fails in production. Mocks remain appropriate and encouraged in unit tests, which validate module contracts in isolation.
 
 ### [Pattern 1.6 — Test Integrity Rules](L1-patterns/1.6-test-integrity.md)
 
-Six forbidden patterns that allow tests to silently pass: conditional assertions, catch without rethrow, optional chaining on expect, early returns before assertions, try-catch wrapped expectations, and soft assertions. Every test must either pass or fail explicitly.
+Six forbidden patterns that allow tests to silently pass: conditional assertions, catch without rethrow, optional chaining on expect, early returns before assertions, try-catch wrapped expectations, and soft assertions. Every test must either pass or fail explicitly. **Why it matters:** A test that can silently pass is worse than no test — it gives false confidence. Every test must provide unambiguous signal to the agent.
 
 ### [Testing Infrastructure Is Production Code](L1-patterns/testing-infrastructure.md)
 
-The tooling that enables stack tests — port allocators, compose file generators, container managers — is application code, not scaffolding. Treat it with the same rigor: unit tests, error handling, edge case coverage, and code review.
+The tooling that enables stack tests — port allocators, compose file generators, container managers — is application code, not scaffolding. Treat it with the same rigor: unit tests, error handling, edge case coverage, and code review. **Why it matters:** Brittle test infrastructure is a major source of wasted agent tokens: the agent ends up debugging test tooling instead of application logic.
 
 ---
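
A minimal sketch of why Pattern 1.6 forbids conditional assertions, written without a test framework so it stands alone. The `Order` shape and both helper names are hypothetical illustrations and do not come from the repository.

```typescript
// Hypothetical record returned by some lookup under test.
interface Order {
  id: string;
  status: string;
}

// Forbidden shape: the check only runs when the order exists,
// so a missing order makes the "test" pass without verifying anything.
function silentCheck(order: Order | undefined): void {
  if (order) {
    if (order.status !== 'paid') {
      throw new Error(`expected status "paid", got "${order.status}"`);
    }
  }
}

// Explicit shape: a missing order is itself a failure,
// so every run either passes or fails with a clear message.
function explicitCheck(order: Order | undefined): void {
  if (order === undefined) {
    throw new Error('expected an order, got undefined');
  }
  if (order.status !== 'paid') {
    throw new Error(`expected status "paid", got "${order.status}"`);
  }
}
```

Calling `silentCheck(undefined)` returns normally, while `explicitCheck(undefined)` throws. The second behavior is the unambiguous signal the pattern asks for.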
 
 
@@ -2,7 +2,7 @@
 
 ## Problem
 
-Integration tests occupy a painful middle ground: too slow for rapid iteration, too incomplete for deployment confidence. They mock some components but not others, creating a "fake system" that passes tests but fails in production. Unit tests are fast but test code, not behavior. E2E tests are comprehensive but slow and brittle. We need a testing approach that provides real confidence without sacrificing developer velocity.
+Integration tests can occupy a painful middle ground: too slow for rapid iteration, too incomplete for deployment confidence. Partial-stack integration tests that mock some components but run others create a "fake system" that passes tests but fails in production. The exception is focused integration tests — narrow-scope tests that exercise a single external dependency's real API with exhaustive edge-case coverage (see [Pattern 1.5](1.5-no-mock-philosophy.md#focused-integration-tests-for-external-dependencies)). Unit tests are fast but test code, not behavior. E2E tests are comprehensive but slow and brittle. We need a testing approach that provides real confidence without sacrificing developer velocity.
 
 ## Solution
 
@@ -59,24 +59,24 @@ Run sequentially, each test building confidence in layers. If `01-app-startup` f
 
 ## Comparison: Test Types
 
-| Dimension | Unit Tests | Integration Tests | Stack Tests | E2E Tests |
-|-----------|------------|-------------------|-------------|-----------|
-| Scope | Single function/class | Multiple components, partial stack | Full system + external deps (to fullest possible extent) | Full system + external deps |
+| Dimension | Unit Tests | Focused Integration Tests | Stack Tests | E2E Tests |
+|-----------|------------|--------------------------|-------------|-----------|
+| Scope | Single function/class | Adapter code + one real external dependency | Full system + external deps (to fullest possible extent) | Full system + external deps |
 | Speed | Milliseconds | Seconds | Seconds to minutes | Minutes |
-| Isolation | Complete (in-memory) | Partial (shared fixtures) | Complete (per-test containers) | Usually shared environments |
-| Confidence Level | Low (implementation detail) | Medium (partial system) | High (production-like) | High (but flaky) |
-| Mock Policy | Everything | Some components | Zero mocks | Zero mocks |
-| Failure Diagnosticity | Low (false positives from mocks) | Medium (mock mismatches) | High (real failures) | Low (timing, flakiness) |
-| Typical Use | Algorithm correctness | Component interaction | System behavior, user journeys | Critical paths, smoke tests |
-| Runs Locally | Always | Usually | Must — Docker only | Often requires cloud/staging |
+| Isolation | Complete (in-memory) | Partial (one real external service) | Complete (per-test containers) | Usually shared environments |
+| Confidence Level | Low (implementation detail) | Medium-high (real external behavior) | High (production-like) | High (but flaky) |
+| Mock Policy | Everything | Zero (test against real API) | Zero mocks | Zero mocks |
+| Failure Diagnosticity | Low (false positives from mocks) | High (real dependency behavior) | High (real failures) | Low (timing, flakiness) |
+| Typical Use | Algorithm correctness, module contracts | Exhaustive edge-case coverage of external dependency | System behavior, user journeys | Critical paths, smoke tests |
+| Runs Locally | Always | Usually (testnet/sandbox) | Must — Docker only | Often requires cloud/staging |
 
 ## Beyond API Testing
 
 Stack tests are not limited to driving backend APIs directly. For web applications, the same pattern applies with a browser automation layer like Playwright: spin up the full stack, then drive user journeys through the actual UI — form submissions, page transitions, rendered output — to verify that the combined frontend and backend work correctly end-to-end. The principle remains the same: real system, real dependencies, no mocks. Only the entry point changes from HTTP API calls to browser interactions.
 
 ## Anti-Pattern
 
-**Don't** write "integration tests" that start a few services and mock others. You end up testing your mocks, not your system. Either test at the unit level (fast, isolated) or at the stack level (complete, real). The middle ground gives you the worst of both worlds: slow tests that don't prove anything.
+**Don't** write "integration tests" that start a few services and mock others. You end up testing your mocks, not your system. Either test at the unit level (fast, isolated), use focused integration tests for exhaustive coverage of a single external dependency's edge cases ([Pattern 1.5](1.5-no-mock-philosophy.md#focused-integration-tests-for-external-dependencies)), or test at the stack level (complete, real). Partial-stack tests with mixed mocks give you the worst of both worlds: slow tests that don't prove anything.
 
 **Don't** run stack tests for every code change during development. Use unit tests for rapid iteration. Run stack tests before committing or as a pre-commit hook.
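
The sequential "diagnostic ladder" idea (tests ordered by dependency, so the first failure narrows the search) can be sketched as a small helper. This is an illustration under assumptions: `StepResult`, `Diagnosis`, and `diagnose` are hypothetical names, and only `01-app-startup` is a test name taken from the docs; the other names are invented for the example.

```typescript
// Hypothetical result shape for one test in the ordered suite.
interface StepResult {
  name: string;
  passed: boolean;
}

interface Diagnosis {
  verified: string[];      // steps that passed before any failure
  suspect: string | null;  // first failing step: where to start investigating
  unverified: string[];    // later steps whose results cannot be trusted yet
}

// Walk the results in dependency order and split them at the first failure.
function diagnose(ordered: StepResult[]): Diagnosis {
  const verified: string[] = [];
  const unverified: string[] = [];
  let suspect: string | null = null;

  for (const step of ordered) {
    if (suspect !== null) {
      unverified.push(step.name);
    } else if (step.passed) {
      verified.push(step.name);
    } else {
      suspect = step.name;
    }
  }
  return { verified, suspect, unverified };
}

const report = diagnose([
  { name: '01-app-startup', passed: true },
  { name: '02-auth', passed: true },
  { name: '03-checkout', passed: false },
  { name: '04-order-history', passed: true },
]);
```

Here `report.suspect` is `'03-checkout'`, and the two earlier steps are listed as verified, which is the information an agent needs to rule out startup and auth before looking at checkout logic.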
 
 
@@ -61,6 +61,95 @@ const mockStripe = {
 // including edge cases like declined cards and network errors
 ```
 
+## Focused Integration Tests for External Dependencies
+
+Some external dependencies have complex, nuanced contracts. Payment providers have dozens of error codes, webhook behaviors, idempotency requirements, and rate limiting patterns. Blockchain RPCs have gas estimation edge cases, nonce management, and mempool behavior. Testing each edge case through a full stack test is impractical — each scenario requires spinning up the entire system. But unit tests with mocks test your mocks, not the real dependency's behavior.
+
+**Focused integration tests** exercise a single external dependency's real API (testnet or sandbox) with exhaustive coverage of its edge cases. They run without the full system stack — just your adapter code plus the real external service. No mocks. They exist specifically to validate that your integration layer correctly handles every way the external dependency can respond.
+
+**Characteristics:**
+
+- **Single dependency focus** — One external service per test suite
+- **Real service** — Use testnet, sandbox, or test mode
+- **Exhaustive edge cases** — Every error code, timeout scenario, rate limit behavior your adapter must handle
+- **No mocks** — The whole point is testing against real behavior
+- **Narrower than stack tests** — Don't need the full system, just the adapter and the dependency
+- **More targeted than unit tests** — Test against real API responses, not mock return values
+
+**When to use focused integration tests:**
+
+- External dependencies with complex contracts (payment processors, blockchain RPCs, cloud APIs)
+- Dependencies with many error states that each need individual verification
+- Dependencies whose documentation doesn't capture every nuance, so some behaviors can only be discovered through testing
+
+**When not to use:**
+
+- For internal services you control (those belong in stack tests)
+- As a replacement for stack tests (focused integration tests verify adapter code, not user journeys)
+- With mocks (that defeats the purpose — use unit tests if you need mocks)
+
+```typescript
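+// Focused integration test: exercises the real Stripe API in test mode
+describe('Stripe Payment Adapter — Edge Cases', () => {
+  let adapter: StripePaymentAdapter;
+
+  beforeAll(() => {
+    // Fail loudly if the test key is missing rather than passing `undefined` to the adapter
+    const testKey = process.env.STRIPE_TEST_KEY;
+    if (!testKey) throw new Error('STRIPE_TEST_KEY is not set');
+    adapter = new StripePaymentAdapter(testKey);
+  });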
+
+  it('handles card_declined error correctly', async () => {
+    const result = await adapter.charge({
+      card: '4000000000000002', // Stripe test card: always declines
+      amount: 1000,
+      currency: 'usd',
+    });
+    expect(result.status).toBe('failed');
+    expect(result.errorCode).toBe('card_declined');
+    expect(result.retryable).toBe(false);
+  });
+
+  it('handles insufficient_funds with retryable flag', async () => {
+    const result = await adapter.charge({
+      card: '4000000000009995', // Stripe test card: insufficient funds
+      amount: 1000,
+      currency: 'usd',
+    });
+    expect(result.status).toBe('failed');
+    expect(result.errorCode).toBe('insufficient_funds');
+    expect(result.retryable).toBe(true); // Can retry with lower amount
+  });
+
+  it('respects idempotency key — no double charge', async () => {
+    const idempotencyKey = `test_${Date.now()}`;
+    const charge1 = await adapter.charge({
+      card: '4242424242424242', // Stripe test card: succeeds
+      amount: 1000,
+      currency: 'usd',
+      idempotencyKey,
+    });
+    const charge2 = await adapter.charge({
+      card: '4242424242424242',
+      amount: 1000,
+      currency: 'usd',
+      idempotencyKey, // Same key
+    });
+    expect(charge2.transactionId).toBe(charge1.transactionId); // Same charge, not a new one
+  });
+});
+```
+
+Each test exercises the real Stripe API in test mode with a specific scenario. The adapter code is tested against actual Stripe behavior: not what the docs say should happen, but what actually happens. Edge cases that would be impractical to enumerate in stack tests (one full system spin-up per error code) become fast, focused tests.
+
+**Relationship to other test tiers:**
+
+| Tier | Scope | Mocks | What It Proves |
+|------|-------|-------|----------------|
+| Unit test | Single module | Yes (appropriate) | Module contract — logic correctness, edge case handling |
+| Focused integration test | Adapter + real dependency | No | Adapter handles the dependency's actual behavior — error codes, timeouts, rate limits |
+| Stack test | Full system user journey | No | End-to-end behavior — the system delivers value through the API |
+
+All three tiers serve distinct purposes. Stack tests verify the system works as a whole. Focused integration tests verify your adapter code handles a specific dependency's real behavior exhaustively. Unit tests verify module logic in isolation. A codebase benefits from all three — they answer different questions.
+
 ## Anti-Pattern
 
 **Don't** mock databases in stack tests "because they're slow." PostgreSQL in Docker adds ~2 seconds to startup. That's the cost of real confidence.
@@ -71,10 +160,12 @@ const mockStripe = {
 
 **Don't** avoid mocks in unit tests out of misplaced consistency. Unit tests and stack tests serve different purposes — mocks provide isolation in unit tests, real dependencies provide confidence in stack tests.
 
+**Don't** dismiss focused integration tests as "partial stack tests." They are not a half-measure between unit and stack tests; they serve a distinct purpose: exhaustive edge-case coverage of a single dependency's contract.
+
 ## Cross-References
 
-- **[Pattern 1.1 — Stack Tests](1.1-stack-tests.md)**: Real dependencies is core to stack test philosophy
-- **[L0 — Unit Tests as Contract](../L0-foundation.md#pattern-05--unit-tests-as-contract)**: Unit tests validate individual module contracts — mocks are appropriate for isolation
+- **[Pattern 1.1 — Stack Tests](1.1-stack-tests.md)**: Real dependencies are core to the stack test philosophy; focused integration tests complement stack tests for external dependency edge cases
+- **[L0 — Unit Tests as Contract](../L0-foundation.md#pattern-05--unit-tests-as-contract)**: Unit tests validate individual module contracts, focused integration tests validate adapter correctness, and stack tests validate system behavior
 - **[L2 — Constitutional Rules](../L2-behavioral-guardrails.md#pattern-24--constitutional-rules)**: Constitutional rule enforces real dependencies in stack tests specifically
 
 ---
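
The Stripe adapter tests added in this change assert a `retryable` flag per error code. One way the adapter side of that contract might look is sketched below. Everything here is an assumption for illustration: `FailedCharge`, `RETRYABLE_CODES`, and `toFailedCharge` are hypothetical names, and the real `StripePaymentAdapter` is not shown in the diff. The only policy taken from the tests is that `insufficient_funds` is retryable and `card_declined` is not.

```typescript
// Hypothetical failure shape, mirroring the fields the tests assert on.
interface FailedCharge {
  status: 'failed';
  errorCode: string;
  retryable: boolean;
}

// Assumed policy: only codes listed here are retryable.
// Unknown codes default to non-retryable so new failure modes are surfaced, not retried blindly.
const RETRYABLE_CODES: ReadonlySet<string> = new Set(['insufficient_funds']);

// Map a provider error code onto the adapter's failure result.
function toFailedCharge(errorCode: string): FailedCharge {
  return {
    status: 'failed',
    errorCode,
    retryable: RETRYABLE_CODES.has(errorCode),
  };
}
```

Keeping the policy in one table means each focused integration test pins down one row of it against the provider's real responses.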