docs: introduce focused integration tests as L1 testing tier
Acknowledge a niche but valuable role for focused integration tests —
exhaustive edge-case coverage of a single external dependency's real
API (payment providers, blockchain RPCs) without running the full
system stack. Updates the three-tier testing model (unit / focused
integration / stack) across L1 patterns, anti-patterns, adoption
guide, and glossary. Enriches L1 pattern index with "why it matters"
summaries for each section.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-Full-system tests running the complete Docker stack with API-level verification, zero mocks for owned services, and high failure diagnosticity. Each test models an atomic user journey — a single, complete interaction from the user's perspective. Includes health endpoint test mode, test fixture bootstrapping, and the "Beyond API Testing" extension to browser-driven verification with Playwright.
+Full-system tests running the complete Docker stack with API-level verification, zero mocks for owned services, and high failure diagnosticity. Each test models an atomic user journey — a single, complete interaction from the user's perspective. Includes health endpoint test mode, test fixture bootstrapping, and the "Beyond API Testing" extension to browser-driven verification with Playwright. **Why it matters:** Stack tests close the feedback loop — the agent gets binary confirmation that the system works end-to-end, not that individual functions return correct values.
-Three-level assertion structure (primary, second-order, third-order) that provides diagnostic signal at increasing distance from the primary action. Primary verifies the direct response, second-order verifies side effects through a different API endpoint, third-order verifies cross-functional consistency (audit logs, notifications, cross-endpoint agreement).
+Three-level assertion structure (primary, second-order, third-order) that provides diagnostic signal at increasing distance from the primary action. Primary verifies the direct response, second-order verifies side effects through a different API endpoint, third-order verifies cross-functional consistency (audit logs, notifications, cross-endpoint agreement). **Why it matters:** A `200 OK` response proves nothing about side effects. Second-order assertions catch missing persistence; third-order assertions catch broken observability. Each failure mode points to a specific subsystem.
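As a rough illustration of the three assertion levels, the sketch below runs them against a hypothetical in-memory stand-in. The names `FakeStack`, `createOrder`, `getOrder`, and `getAuditLog` are invented for this example and do not come from the pattern docs. In a real stack test each call would be an HTTP request against the running Docker stack; the stand-in exists only so the sketch runs on its own.

```typescript
import * as assert from "node:assert";

interface Order { id: string; item: string; }
interface AuditEntry { action: string; orderId: string; }

// Hypothetical stand-in for a running stack (assumption, for illustration only).
export class FakeStack {
  private orders = new Map<string, Order>();
  private audit: AuditEntry[] = [];
  private nextId = 1;

  createOrder(item: string): { status: number; body: Order } {
    const order: Order = { id: `order-${this.nextId++}`, item };
    this.orders.set(order.id, order);
    this.audit.push({ action: "order.created", orderId: order.id });
    return { status: 201, body: order };
  }

  getOrder(id: string): { status: number; body?: Order } {
    const order = this.orders.get(id);
    return order ? { status: 200, body: order } : { status: 404 };
  }

  getAuditLog(): AuditEntry[] {
    return [...this.audit];
  }
}

// Runs the three levels in order and returns the names of the levels that passed.
export function runLayeredAssertions(stack: FakeStack): string[] {
  const passed: string[] = [];

  // Primary: the direct response of the action under test.
  const created = stack.createOrder("widget");
  assert.strictEqual(created.status, 201);
  passed.push("primary");

  // Second-order: the side effect, read back through a different endpoint.
  const fetched = stack.getOrder(created.body.id);
  assert.strictEqual(fetched.status, 200);
  assert.deepStrictEqual(fetched.body, created.body);
  passed.push("second-order");

  // Third-order: cross-functional consistency, here an audit log entry.
  const entries = stack.getAuditLog().filter((e) => e.orderId === created.body.id);
  assert.strictEqual(entries.length, 1);
  assert.strictEqual(entries[0].action, "order.created");
  passed.push("third-order");

  return passed;
}
```

If the primary assertion passes and the second-order one throws, persistence is the first place to look; a third-order failure points at the audit path instead.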

### [Pattern 1.3 — Sequential / Additive Test Design](L1-patterns/1.3-sequential-design.md)

-Tests ordered by dependency so that if test N fails, the agent knows tests 1 through N-1 passed. The sequence acts as a diagnostic ladder narrowing the search space. Stack tests as vertical slices — each test is one atomic user journey that can be run individually during development or as a full suite before completion.
+Tests ordered by dependency so that if test N fails, the agent knows tests 1 through N-1 passed. The sequence acts as a diagnostic ladder narrowing the search space. Stack tests as vertical slices — each test is one atomic user journey that can be run individually during development or as a full suite before completion. **Why it matters:** Without ordering, a checkout failure could mean broken auth, broken database, or broken checkout logic — ordering tells the agent exactly which subsystem to investigate.
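The diagnostic ladder can be sketched as a small runner that executes steps in order and stops at the first failure. This is a minimal sketch under assumptions: `runLadder`, `Step`, and `LadderResult` are hypothetical names, and a real suite would rely on its test framework's ordering rather than a hand-rolled loop.

```typescript
export interface Step {
  name: string;
  run: () => void;
}

export interface LadderResult {
  passed: string[];        // steps that ran and succeeded, in order
  failedAt: string | null; // first failing step, or null if all passed
  error: string | null;    // message of the recorded failure
  skipped: string[];       // steps never run because an earlier one failed
}

// Runs steps in dependency order and stops at the first failure,
// so everything listed in `passed` is known to work.
export function runLadder(steps: readonly Step[]): LadderResult {
  const passed: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    try {
      step.run();
    } catch (error) {
      // The failure is recorded and reported, not swallowed.
      return {
        passed,
        failedAt: step.name,
        error: String(error),
        skipped: steps.slice(i + 1).map((s) => s.name),
      };
    }
    passed.push(step.name);
  }
  return { passed, failedAt: null, error: null, skipped: [] };
}
```

With steps named `01-app-startup`, `02-auth`, and `03-checkout` (the last two are made-up names), a failure at `02-auth` reports that startup passed and that checkout was never attempted.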
-Four mechanisms ensuring tests never interfere: unique container names, dynamic port allocation, transient volumes, and per-test compose files. Aggressive cleanup prevents Docker resource exhaustion during concurrent test execution.
+Four mechanisms ensuring tests never interfere: unique container names, dynamic port allocation, transient volumes, and per-test compose files. Aggressive cleanup prevents Docker resource exhaustion during concurrent test execution. **Why it matters:** Hardcoded ports and shared state produce flaky, non-deterministic test failures that waste agent tokens on false investigations. Isolation makes every test run deterministic.
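Two of the four mechanisms, dynamic port allocation and unique container names, can be sketched with Node's standard library. The helper names `allocateFreePort` and `uniqueContainerName` are assumptions made for this example. Asking the OS for port 0 yields a currently free port, but another process could claim it between release and reuse, so real tooling may need a retry.

```typescript
import * as net from "node:net";
import { randomUUID } from "node:crypto";

// Asks the OS for a free TCP port by binding to port 0, then releases it.
export function allocateFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", reject);
    server.listen(0, "127.0.0.1", () => {
      const address = server.address();
      if (address === null || typeof address === "string") {
        server.close(() => reject(new Error("could not determine allocated port")));
        return;
      }
      const { port } = address;
      server.close(() => resolve(port));
    });
  });
}

// Builds a container name that will not collide with concurrent test runs.
export function uniqueContainerName(prefix: string): string {
  return `${prefix}-${randomUUID().slice(0, 8)}`;
}
```

A per-test compose file could then be generated from these two values, so that no test depends on a hardcoded port or a shared container name.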

### [Pattern 1.5 — Real Dependencies in E2E/Integration and Stack Tests](L1-patterns/1.5-no-mock-philosophy.md)

-Stack tests and E2E/integration tests use real everything — real PostgreSQL, real Redis, real message queues. The only acceptable mocks in these tests are external services without test environments. If you own it, run it. If you can run it in Docker, run it in Docker. Mocks are appropriate and encouraged in unit tests, which validate module contracts in isolation.
+Stack tests and E2E/integration tests use real everything — real PostgreSQL, real Redis, real message queues. If you own it, run it. If you can run it in Docker, run it in Docker. Also covers **focused integration tests** — a niche pattern for exhaustive edge-case coverage of complex external dependencies (payment providers, blockchain RPCs) against their real APIs, without running the full stack. **Why it matters:** Mocks create a fantasy system that passes tests but fails in production. Mocks are appropriate and encouraged in unit tests, which validate module contracts in isolation.

### [Pattern 1.6 — Test Integrity Rules](L1-patterns/1.6-test-integrity.md)

-Six forbidden patterns that allow tests to silently pass: conditional assertions, catch without rethrow, optional chaining on expect, early returns before assertions, try-catch wrapped expectations, and soft assertions. Every test must either pass or fail explicitly.
+Six forbidden patterns that allow tests to silently pass: conditional assertions, catch without rethrow, optional chaining on expect, early returns before assertions, try-catch wrapped expectations, and soft assertions. Every test must either pass or fail explicitly. **Why it matters:** A test that can silently pass is worse than no test — it gives false confidence. Every test must provide unambiguous signal to the agent.
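To make the first forbidden pattern concrete, the sketch below contrasts a conditional assertion with an explicit one. The `User` shape and both function names are hypothetical; the point is only that the first version checks nothing when the field is missing, while the second fails loudly.

```typescript
import * as assert from "node:assert";

interface User {
  id: string;
  email?: string;
}

// Forbidden: a conditional assertion. If `email` is missing,
// nothing is checked and the test passes silently.
export function silentCheck(user: User): void {
  if (user.email) {
    assert.strictEqual(user.email.includes("@"), true);
  }
}

// Explicit: a missing field is itself a failure, so the test
// either passes for the right reason or fails.
export function explicitCheck(user: User): void {
  assert.notStrictEqual(user.email, undefined, "user.email must be present");
  assert.strictEqual(String(user.email).includes("@"), true, "user.email must contain '@'");
}
```

Given a user without an email, `silentCheck` returns normally while `explicitCheck` throws, which is the unambiguous signal the rule asks for.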

### [Testing Infrastructure Is Production Code](L1-patterns/testing-infrastructure.md)

-The tooling that enables stack tests — port allocators, compose file generators, container managers — is application code, not scaffolding. Treat it with the same rigor: unit tests, error handling, edge case coverage, and code review.
+The tooling that enables stack tests — port allocators, compose file generators, container managers — is application code, not scaffolding. Treat it with the same rigor: unit tests, error handling, edge case coverage, and code review. **Why it matters:** Brittle test infrastructure wastes more agent tokens than any other source — the agent debugs test tooling instead of application logic.
docs/L1-patterns/1.1-stack-tests.md
Lines changed: 11 additions & 11 deletions
@@ -2,7 +2,7 @@

## Problem

-Integration tests occupy a painful middle ground: too slow for rapid iteration, too incomplete for deployment confidence. They mock some components but not others, creating a "fake system" that passes tests but fails in production. Unit tests are fast but test code, not behavior. E2E tests are comprehensive but slow and brittle. We need a testing approach that provides real confidence without sacrificing developer velocity.
+Integration tests can occupy a painful middle ground: too slow for rapid iteration, too incomplete for deployment confidence. Partial-stack integration tests that mock some components but run others create a "fake system" that passes tests but fails in production. The exception is focused integration tests — narrow-scope tests that exercise a single external dependency's real API with exhaustive edge-case coverage (see [Pattern 1.5](1.5-no-mock-philosophy.md#focused-integration-tests-for-external-dependencies)). Unit tests are fast but test code, not behavior. E2E tests are comprehensive but slow and brittle. We need a testing approach that provides real confidence without sacrificing developer velocity.

## Solution

@@ -59,24 +59,24 @@ Run sequentially, each test building confidence in layers. If `01-app-startup` f
-| Scope | Single function/class | Multiple components, partial stack | Full system + external deps (to fullest possible extent) | Full system + external deps |
+| Scope | Single function/class | Adapter code + one real external dependency | Full system + external deps (to fullest possible extent) | Full system + external deps |
+| Typical Use | Algorithm correctness, module contracts | Exhaustive edge-case coverage of external dependency | System behavior, user journeys | Critical paths, smoke tests |
+| Runs Locally | Always | Usually (testnet/sandbox) | Must — Docker only | Often requires cloud/staging |

## Beyond API Testing

Stack tests are not limited to driving backend APIs directly. For web applications, the same pattern applies with a browser automation layer like Playwright: spin up the full stack, then drive user journeys through the actual UI — form submissions, page transitions, rendered output — to verify that the combined frontend and backend work correctly end-to-end. The principle remains the same: real system, real dependencies, no mocks. Only the entry point changes from HTTP API calls to browser interactions.

## Anti-Pattern

-**Don't** write "integration tests" that start a few services and mock others. You end up testing your mocks, not your system. Either test at the unit level (fast, isolated) or at the stack level (complete, real). The middle ground gives you the worst of both worlds: slow tests that don't prove anything.
+**Don't** write "integration tests" that start a few services and mock others. You end up testing your mocks, not your system. Either test at the unit level (fast, isolated), use focused integration tests for exhaustive coverage of a single external dependency's edge cases ([Pattern 1.5](1.5-no-mock-philosophy.md#focused-integration-tests-for-external-dependencies)), or test at the stack level (complete, real). Partial-stack tests with mixed mocks give you the worst of both worlds: slow tests that don't prove anything.

**Don't** run stack tests for every code change during development. Use unit tests for rapid iteration. Run stack tests before committing or as a pre-commit hook.
docs/L1-patterns/1.5-no-mock-philosophy.md
Lines changed: 93 additions & 2 deletions
@@ -61,6 +61,95 @@ const mockStripe = {
// including edge cases like declined cards and network errors
```

+## Focused Integration Tests for External Dependencies
+
+Some external dependencies have complex, nuanced contracts. Payment providers have dozens of error codes, webhook behaviors, idempotency requirements, and rate limiting patterns. Blockchain RPCs have gas estimation edge cases, nonce management, and mempool behavior. Testing each edge case through a full stack test is impractical — each scenario requires spinning up the entire system. But unit tests with mocks test your mocks, not the real dependency's behavior.
+
+**Focused integration tests** exercise a single external dependency's real API (testnet or sandbox) with exhaustive coverage of its edge cases. They run without the full system stack — just your adapter code plus the real external service. No mocks. They exist specifically to validate that your integration layer correctly handles every way the external dependency can respond.
+
+**Characteristics:**
+
+- **Single dependency focus** — One external service per test suite
+- **Real service** — Use testnet, sandbox, or test mode
+- **Exhaustive edge cases** — Every error code, timeout scenario, rate limit behavior your adapter must handle
+- **No mocks** — The whole point is testing against real behavior
+- **Narrower than stack tests** — Don't need the full system, just the adapter and the dependency
+- **More targeted than unit tests** — Test against real API responses, not mock return values
+
+**When to use focused integration tests:**
+
+- External dependencies with complex contracts (payment processors, blockchain RPCs, cloud APIs with many error states)
+- When the dependency has many error states that need individual verification
+- When the dependency's documentation doesn't capture all nuances — you discover behaviors through testing
+
+**When not to use:**
+
+- For internal services you control (those belong in stack tests)
+- As a replacement for stack tests (focused integration tests verify adapter code, not user journeys)
+- With mocks (that defeats the purpose — use unit tests if you need mocks)
+
+```typescript
+// Focused integration test — exercises real Stripe testnet
+      card: '4242424242424242', // Stripe test card: succeeds
+      amount: 1000,
+      currency: 'usd',
+      idempotencyKey,
+    });
+    const charge2 = await adapter.charge({
+      card: '4242424242424242',
+      amount: 1000,
+      currency: 'usd',
+      idempotencyKey, // Same key
+    });
+    expect(charge2.transactionId).toBe(charge1.transactionId); // Same charge, not a new one
+  });
+});
+```
+
+Each test exercises the real Stripe testnet with a specific scenario. The adapter code is tested against actual Stripe behavior — not what the docs say should happen, but what actually happens. Edge cases that would be impractical to enumerate in stack tests (one full system spin-up per error code) become fast, focused tests.
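One way an adapter might keep its per-error-code handling enumerable is an exhaustive mapping that the compiler checks. The sketch below is an assumption, not the adapter from this commit: `mapDecline` and `AdapterOutcome` are hypothetical, the three decline codes are only an illustrative subset, and the chosen outcomes are invented. A focused integration suite would then drive one real sandbox call per listed code and assert the mapped outcome.

```typescript
// Illustrative subset of decline codes; a real adapter would list every code it must handle.
export type DeclineCode = "insufficient_funds" | "expired_card" | "generic_decline";

// Hypothetical outcomes the adapter exposes to the rest of the system.
export type AdapterOutcome = "retry_later" | "ask_for_new_card" | "fail_permanently";

// Maps each decline code to an outcome. The `never` check makes the compiler
// reject this function if a code is added to DeclineCode without a case here.
export function mapDecline(code: DeclineCode): AdapterOutcome {
  switch (code) {
    case "insufficient_funds":
      return "retry_later";
    case "expired_card":
      return "ask_for_new_card";
    case "generic_decline":
      return "fail_permanently";
    default: {
      const unhandled: never = code;
      throw new Error(`Unhandled decline code: ${String(unhandled)}`);
    }
  }
}

// One row per scenario the focused suite should exercise against the sandbox.
export const declineCases: ReadonlyArray<[DeclineCode, AdapterOutcome]> = [
  ["insufficient_funds", "retry_later"],
  ["expired_card", "ask_for_new_card"],
  ["generic_decline", "fail_permanently"],
];
```

The table keeps the list of scenarios in one place, so a missing edge case shows up as a missing row rather than as an untested branch.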
+
+**Relationship to other test tiers:**
+
+| Tier | Scope | Mocks | What It Proves |
+|------|-------|-------|----------------|
+| Unit test | Single module | Yes (appropriate) | Module contract — logic correctness, edge case handling |
+| Focused integration test | Adapter + real dependency | No | Adapter handles the dependency's actual behavior — error codes, timeouts, rate limits |
+| Stack test | Full system user journey | No | End-to-end behavior — the system delivers value through the API |
+
+All three tiers serve distinct purposes. Stack tests verify the system works as a whole. Focused integration tests verify your adapter code handles a specific dependency's real behavior exhaustively. Unit tests verify module logic in isolation. A codebase benefits from all three — they answer different questions.
+

## Anti-Pattern

**Don't** mock databases in stack tests "because they're slow." PostgreSQL in Docker adds ~2 seconds to startup. That's the cost of real confidence.
@@ -71,10 +160,12 @@ const mockStripe = {
**Don't** avoid mocks in unit tests out of misplaced consistency. Unit tests and stack tests serve different purposes — mocks provide isolation in unit tests, real dependencies provide confidence in stack tests.

+**Don't** dismiss focused integration tests as "partial stack tests." They serve a different purpose — exhaustive edge-case coverage of a single dependency's contract — not a half-measure between unit and stack tests.
+

## Cross-References

-- **[Pattern 1.1 — Stack Tests](1.1-stack-tests.md)**: Real dependencies is core to stack test philosophy
-- **[L0 — Unit Tests as Contract](../L0-foundation.md#pattern-05--unit-tests-as-contract)**: Unit tests validate individual module contracts — mocks are appropriate for isolation
+- **[Pattern 1.1 — Stack Tests](1.1-stack-tests.md)**: Real dependencies is core to stack test philosophy; focused integration tests complement stack tests for external dependency edge cases
+- **[L0 — Unit Tests as Contract](../L0-foundation.md#pattern-05--unit-tests-as-contract)**: Unit tests validate individual module contracts — focused integration tests validate adapter correctness — stack tests validate system behavior
 - **[L2 — Constitutional Rules](../L2-behavioral-guardrails.md#pattern-24--constitutional-rules)**: Constitutional rule enforces real dependencies in stack tests specifically