Commit d15b674

franklywatson and claude committed
docs: introduce focused integration tests as L1 testing tier
Acknowledge a niche but valuable role for focused integration tests — exhaustive edge-case coverage of a single external dependency's real API (payment providers, blockchain RPCs) without running the full system stack. Updates the three-tier testing model (unit / focused integration / stack) across L1 patterns, anti-patterns, adoption guide, and glossary. Enriches L1 pattern index with "why it matters" summaries for each section. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ae7d68a commit d15b674

6 files changed

Lines changed: 132 additions & 28 deletions

docs/L1-feedback-loops.md

Lines changed: 7 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -75,31 +75,31 @@ The following patterns describe the verification mechanisms that close the loop
7575

7676
### [Pattern 1.1 — Stack Tests](L1-patterns/1.1-stack-tests.md)
7777

78-
Full-system tests running the complete Docker stack with API-level verification, zero mocks for owned services, and high failure diagnosticity. Each test models an atomic user journey — a single, complete interaction from the user's perspective. Includes health endpoint test mode, test fixture bootstrapping, and the "Beyond API Testing" extension to browser-driven verification with Playwright.
78+
Full-system tests running the complete Docker stack with API-level verification, zero mocks for owned services, and high failure diagnosticity. Each test models an atomic user journey — a single, complete interaction from the user's perspective. Includes health endpoint test mode, test fixture bootstrapping, and the "Beyond API Testing" extension to browser-driven verification with Playwright. **Why it matters:** Stack tests close the feedback loop — the agent gets binary confirmation that the system works end-to-end, not that individual functions return correct values.
7979

8080
### [Pattern 1.2 — Full-Loop Assertion Layering](L1-patterns/1.2-full-loop-assertions.md)
8181

82-
Three-level assertion structure (primary, second-order, third-order) that provides diagnostic signal at increasing distance from the primary action. Primary verifies the direct response, second-order verifies side effects through a different API endpoint, third-order verifies cross-functional consistency (audit logs, notifications, cross-endpoint agreement).
82+
Three-level assertion structure (primary, second-order, third-order) that provides diagnostic signal at increasing distance from the primary action. Primary verifies the direct response, second-order verifies side effects through a different API endpoint, third-order verifies cross-functional consistency (audit logs, notifications, cross-endpoint agreement). **Why it matters:** A `200 OK` response proves nothing about side effects. Second-order assertions catch missing persistence; third-order assertions catch broken observability. Each failure mode points to a specific subsystem.
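
As a rough illustration of the three levels, the sketch below checks an order-creation journey and reports which level failed. The `OrderApi` interface, its method names, the audit entry format, and the in-memory `createInMemoryApi` stand-in are all hypothetical; the stand-in exists only so the example runs on its own. In a real stack test each call would be an HTTP request against the running stack.

```typescript
// Hypothetical API surface; in a real stack test each method is an HTTP call.
interface OrderApi {
  createOrder(item: string): { status: number; id: string };
  getOrder(id: string): { status: number; item?: string };
  listAuditLog(): string[];
}

// Runs the three assertion levels in order and names the level that failed.
function verifyOrderJourney(api: OrderApi, item: string): string {
  // Primary: the direct response to the action.
  const created = api.createOrder(item);
  if (created.status !== 200) return "primary: create did not return 200";

  // Second-order: the side effect, read back through a different endpoint.
  const fetched = api.getOrder(created.id);
  if (fetched.status !== 200 || fetched.item !== item) {
    return "second-order: order was not persisted";
  }

  // Third-order: cross-functional consistency, here an audit log entry.
  if (api.listAuditLog().indexOf(`order.created:${created.id}`) === -1) {
    return "third-order: audit log entry missing";
  }
  return "ok";
}

// In-memory stand-in so the sketch is self-contained (an assumption, not a real service).
// The flags simulate a broken persistence layer or a broken audit pipeline.
function createInMemoryApi(opts: { persist: boolean; audit: boolean }): OrderApi {
  const orders: Record<string, string> = {};
  const audit: string[] = [];
  let nextId = 1;
  return {
    createOrder(item) {
      const id = `ord_${nextId++}`;
      if (opts.persist) orders[id] = item;
      if (opts.audit) audit.push(`order.created:${id}`);
      return { status: 200, id };
    },
    getOrder(id) {
      return id in orders ? { status: 200, item: orders[id] } : { status: 404 };
    },
    listAuditLog() {
      return audit.slice();
    },
  };
}
```

With `persist: false` the create call still returns `200`, and only the second-order check exposes the missing write, which is the failure the pattern is meant to catch.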
8383

8484
### [Pattern 1.3 — Sequential / Additive Test Design](L1-patterns/1.3-sequential-design.md)
8585

86-
Tests ordered by dependency so that if test N fails, the agent knows tests 1 through N-1 passed. The sequence acts as a diagnostic ladder narrowing the search space. Stack tests as vertical slices — each test is one atomic user journey that can be run individually during development or as a full suite before completion.
86+
Tests ordered by dependency so that if test N fails, the agent knows tests 1 through N-1 passed. The sequence acts as a diagnostic ladder narrowing the search space. Stack tests as vertical slices — each test is one atomic user journey that can be run individually during development or as a full suite before completion. **Why it matters:** Without ordering, a checkout failure could mean broken auth, broken database, or broken checkout logic — ordering tells the agent exactly which subsystem to investigate.
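
The diagnostic-ladder idea can be sketched as a small runner that executes steps in dependency order and stops at the first failure. `Step`, `LadderResult`, and `runLadder` are hypothetical names for illustration, not the project's actual tooling; a real suite would normally rely on its test framework's ordering and bail options.

```typescript
// A named step in the ordered suite; `run` throws on failure.
interface Step {
  name: string;
  run: () => void;
}

interface LadderResult {
  passed: string[];      // steps that ran and succeeded, in order
  failed: string | null; // first failing step, or null if all passed
  skipped: string[];     // steps not attempted after the failure
}

// Runs steps in dependency order and stops at the first failure, so
// everything listed in `passed` is known-good when `failed` is reported.
function runLadder(steps: Step[]): LadderResult {
  const passed: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    try {
      step.run();
      passed.push(step.name);
    } catch {
      const skipped = steps.slice(i + 1).map((s) => s.name);
      return { passed, failed: step.name, skipped };
    }
  }
  return { passed, failed: null, skipped: [] };
}
```

If a step named `03-checkout` fails after `01-app-startup` and `02-auth` passed, the result already rules out startup and auth as causes.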
8787

8888
### [Pattern 1.4 — Container Isolation](L1-patterns/1.4-container-isolation.md)
8989

90-
Four mechanisms ensuring tests never interfere: unique container names, dynamic port allocation, transient volumes, and per-test compose files. Aggressive cleanup prevents Docker resource exhaustion during concurrent test execution.
90+
Four mechanisms ensuring tests never interfere: unique container names, dynamic port allocation, transient volumes, and per-test compose files. Aggressive cleanup prevents Docker resource exhaustion during concurrent test execution. **Why it matters:** Hardcoded ports and shared state produce flaky, non-deterministic test failures that waste agent tokens on false investigations. Isolation makes every test run deterministic.
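
One possible way to get unique names, free ports, and full cleanup is to lean on Docker Compose project names. The sketch below only builds the argument lists for the `docker` CLI. It assumes Compose v2 (`docker compose`), a compose file that publishes container ports without fixing the host port so Docker picks a free one, and no explicit `container_name` entries. The function names and the `stacktest-` prefix are hypothetical.

```typescript
// Short random suffix; collisions are unlikely but not impossible (sketch only).
function randomSuffix(): string {
  return Math.random().toString(36).slice(2, 10);
}

// Unique Compose project name per test run. By default Compose derives
// container, network, and volume names from the project name.
function projectName(testId: string, suffix: string = randomSuffix()): string {
  return `stacktest-${testId}-${suffix}`.toLowerCase();
}

// Arguments for `docker` to start the stack in the background.
function upArgs(project: string, composeFile: string): string[] {
  return ["compose", "-p", project, "-f", composeFile, "up", "-d"];
}

// Arguments to ask Compose which host port was mapped to a container port.
function portArgs(project: string, composeFile: string, service: string, containerPort: number): string[] {
  return ["compose", "-p", project, "-f", composeFile, "port", service, String(containerPort)];
}

// Arguments to tear the stack down, including its volumes, after the test.
function downArgs(project: string, composeFile: string): string[] {
  return ["compose", "-p", project, "-f", composeFile, "down", "--volumes", "--remove-orphans"];
}
```

Each list would be passed to a process spawner with `docker` as the command. Running `downArgs` in a `finally` block or an after-hook keeps failed tests from leaking containers.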
9191

9292
### [Pattern 1.5 — Real Dependencies in E2E/Integration and Stack Tests](L1-patterns/1.5-no-mock-philosophy.md)
9393

94-
Stack tests and E2E/integration tests use real everything — real PostgreSQL, real Redis, real message queues. The only acceptable mocks in these tests are external services without test environments. If you own it, run it. If you can run it in Docker, run it in Docker. Mocks are appropriate and encouraged in unit tests, which validate module contracts in isolation.
94+
Stack tests and E2E/integration tests use real everything — real PostgreSQL, real Redis, real message queues. If you own it, run it. If you can run it in Docker, run it in Docker. Also covers **focused integration tests** — a niche pattern for exhaustive edge-case coverage of complex external dependencies (payment providers, blockchain RPCs) against their real APIs, without running the full stack. **Why it matters:** Mocks create a fantasy system that passes tests but fails in production. Mocks are appropriate and encouraged in unit tests, which validate module contracts in isolation.
9595

9696
### [Pattern 1.6 — Test Integrity Rules](L1-patterns/1.6-test-integrity.md)
9797

98-
Six forbidden patterns that allow tests to silently pass: conditional assertions, catch without rethrow, optional chaining on expect, early returns before assertions, try-catch wrapped expectations, and soft assertions. Every test must either pass or fail explicitly.
98+
Six forbidden patterns that allow tests to silently pass: conditional assertions, catch without rethrow, optional chaining on expect, early returns before assertions, try-catch wrapped expectations, and soft assertions. Every test must either pass or fail explicitly. **Why it matters:** A test that can silently pass is worse than no test — it gives false confidence. Every test must provide unambiguous signal to the agent.
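
Two of the forbidden patterns, next to an explicit alternative, might look like the sketch below. `expectEqual` is a minimal stand-in for a test framework's assertion, and the `User` type and function names are hypothetical.

```typescript
// Minimal assertion helper so the sketch has no test-framework dependency.
function expectEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new Error(`expected ${String(expected)}, got ${String(actual)}`);
  }
}

interface User {
  role: string;
}

// Forbidden: conditional assertion. If `user` is missing, nothing is
// checked and the test passes without verifying anything.
function conditionalAssertion(user: User | undefined): void {
  if (user) {
    expectEqual(user.role, "admin");
  }
}

// Forbidden: catch without rethrow. The assertion fails, but the
// failure is swallowed and the test still passes.
function swallowedFailure(user: User | undefined): void {
  try {
    expectEqual(user?.role, "admin");
  } catch {
    // failure silently dropped
  }
}

// Explicit: a missing user and a wrong role both fail the test.
function explicitAssertion(user: User | undefined): void {
  if (user === undefined) {
    throw new Error("expected a user, got undefined");
  }
  expectEqual(user.role, "admin");
}
```

Calling the first two functions with `undefined` returns normally, which is the silent pass the rules forbid; the explicit version throws.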
9999

100100
### [Testing Infrastructure Is Production Code](L1-patterns/testing-infrastructure.md)
101101

102-
The tooling that enables stack tests — port allocators, compose file generators, container managers — is application code, not scaffolding. Treat it with the same rigor: unit tests, error handling, edge case coverage, and code review.
102+
The tooling that enables stack tests — port allocators, compose file generators, container managers — is application code, not scaffolding. Treat it with the same rigor: unit tests, error handling, edge case coverage, and code review. **Why it matters:** Brittle test infrastructure wastes more agent tokens than any other source — the agent debugs test tooling instead of application logic.
103103

104104
---
105105

docs/L1-patterns/1.1-stack-tests.md

Lines changed: 11 additions & 11 deletions
@@ -2,7 +2,7 @@
22

33
## Problem
44

5-
Integration tests occupy a painful middle ground: too slow for rapid iteration, too incomplete for deployment confidence. They mock some components but not others, creating a "fake system" that passes tests but fails in production. Unit tests are fast but test code, not behavior. E2E tests are comprehensive but slow and brittle. We need a testing approach that provides real confidence without sacrificing developer velocity.
5+
Integration tests can occupy a painful middle ground: too slow for rapid iteration, too incomplete for deployment confidence. Partial-stack integration tests that mock some components but run others create a "fake system" that passes tests but fails in production. The exception is focused integration tests — narrow-scope tests that exercise a single external dependency's real API with exhaustive edge-case coverage (see [Pattern 1.5](1.5-no-mock-philosophy.md#focused-integration-tests-for-external-dependencies)). Unit tests are fast but test code, not behavior. E2E tests are comprehensive but slow and brittle. We need a testing approach that provides real confidence without sacrificing developer velocity.
66

77
## Solution
88

@@ -59,24 +59,24 @@ Run sequentially, each test building confidence in layers. If `01-app-startup` f
5959

6060
## Comparison: Test Types
6161

62-
| Dimension | Unit Tests | Integration Tests | Stack Tests | E2E Tests |
63-
|-----------|------------|-------------------|-------------|-----------|
64-
| Scope | Single function/class | Multiple components, partial stack | Full system + external deps (to fullest possible extent) | Full system + external deps |
62+
| Dimension | Unit Tests | Focused Integration Tests | Stack Tests | E2E Tests |
63+
|-----------|------------|--------------------------|-------------|-----------|
64+
| Scope | Single function/class | Adapter code + one real external dependency | Full system + external deps (to fullest possible extent) | Full system + external deps |
6565
| Speed | Milliseconds | Seconds | Seconds to minutes | Minutes |
66-
| Isolation | Complete (in-memory) | Partial (shared fixtures) | Complete (per-test containers) | Usually shared environments |
67-
| Confidence Level | Low (implementation detail) | Medium (partial system) | High (production-like) | High (but flaky) |
68-
| Mock Policy | Everything | Some components | Zero mocks | Zero mocks |
69-
| Failure Diagnosticity | Low (false positives from mocks) | Medium (mock mismatches) | High (real failures) | Low (timing, flakiness) |
70-
| Typical Use | Algorithm correctness | Component interaction | System behavior, user journeys | Critical paths, smoke tests |
71-
| Runs Locally | Always | Usually | Must — Docker only | Often requires cloud/staging |
66+
| Isolation | Complete (in-memory) | Partial (one real external service) | Complete (per-test containers) | Usually shared environments |
67+
| Confidence Level | Low (implementation detail) | Medium-high (real external behavior) | High (production-like) | High (but flaky) |
68+
| Mock Policy | Everything | Zero (test against real API) | Zero mocks | Zero mocks |
69+
| Failure Diagnosticity | Low (false positives from mocks) | High (real dependency behavior) | High (real failures) | Low (timing, flakiness) |
70+
| Typical Use | Algorithm correctness, module contracts | Exhaustive edge-case coverage of external dependency | System behavior, user journeys | Critical paths, smoke tests |
71+
| Runs Locally | Always | Usually (testnet/sandbox) | Must — Docker only | Often requires cloud/staging |
7272

7373
## Beyond API Testing
7474

7575
Stack tests are not limited to driving backend APIs directly. For web applications, the same pattern applies with a browser automation layer like Playwright: spin up the full stack, then drive user journeys through the actual UI — form submissions, page transitions, rendered output — to verify that the combined frontend and backend work correctly end-to-end. The principle remains the same: real system, real dependencies, no mocks. Only the entry point changes from HTTP API calls to browser interactions.
7676

7777
## Anti-Pattern
7878

79-
**Don't** write "integration tests" that start a few services and mock others. You end up testing your mocks, not your system. Either test at the unit level (fast, isolated) or at the stack level (complete, real). The middle ground gives you the worst of both worlds: slow tests that don't prove anything.
79+
**Don't** write "integration tests" that start a few services and mock others. You end up testing your mocks, not your system. Either test at the unit level (fast, isolated), use focused integration tests for exhaustive coverage of a single external dependency's edge cases ([Pattern 1.5](1.5-no-mock-philosophy.md#focused-integration-tests-for-external-dependencies)), or test at the stack level (complete, real). Partial-stack tests with mixed mocks give you the worst of both worlds: slow tests that don't prove anything.
8080

8181
**Don't** run stack tests for every code change during development. Use unit tests for rapid iteration. Run stack tests before committing or as a pre-commit hook.
8282

docs/L1-patterns/1.5-no-mock-philosophy.md

Lines changed: 93 additions & 2 deletions
@@ -61,6 +61,95 @@ const mockStripe = {
6161
// including edge cases like declined cards and network errors
6262
```
6363

## Focused Integration Tests for External Dependencies

Some external dependencies have complex, nuanced contracts. Payment providers have dozens of error codes, webhook behaviors, idempotency requirements, and rate limiting patterns. Blockchain RPCs have gas estimation edge cases, nonce management, and mempool behavior. Testing each edge case through a full stack test is impractical — each scenario requires spinning up the entire system. But unit tests with mocks test your mocks, not the real dependency's behavior.

**Focused integration tests** exercise a single external dependency's real API (testnet or sandbox) with exhaustive coverage of its edge cases. They run without the full system stack — just your adapter code plus the real external service. No mocks. They exist specifically to validate that your integration layer correctly handles every way the external dependency can respond.

**Characteristics:**

- **Single dependency focus** — One external service per test suite
- **Real service** — Use testnet, sandbox, or test mode
- **Exhaustive edge cases** — Every error code, timeout scenario, rate limit behavior your adapter must handle
- **No mocks** — The whole point is testing against real behavior
- **Narrower than stack tests** — Don't need the full system, just the adapter and the dependency
- **More targeted than unit tests** — Test against real API responses, not mock return values

**When to use focused integration tests:**

- External dependencies with complex contracts (payment processors, blockchain RPCs, cloud APIs with many error states)
- When the dependency has many error states that need individual verification
- When the dependency's documentation doesn't capture all nuances — you discover behaviors through testing

**When not to use:**

- For internal services you control (those belong in stack tests)
- As a replacement for stack tests (focused integration tests verify adapter code, not user journeys)
- With mocks (that defeats the purpose — use unit tests if you need mocks)

```typescript
// Focused integration test: exercises Stripe in test mode (sandbox), not a mock
describe('Stripe Payment Adapter — Edge Cases', () => {
  let adapter: StripePaymentAdapter;

  beforeAll(() => {
    // Test-mode API key, read from the environment
    adapter = new StripePaymentAdapter(process.env.STRIPE_TEST_KEY);
  });

  it('handles card_declined error correctly', async () => {
    const result = await adapter.charge({
      card: '4000000000000002', // Stripe test card: always declines
      amount: 1000,
      currency: 'usd',
    });
    expect(result.status).toBe('failed');
    expect(result.errorCode).toBe('card_declined');
    expect(result.retryable).toBe(false);
  });

  it('handles insufficient_funds with retryable flag', async () => {
    const result = await adapter.charge({
      card: '4000000000009995', // Stripe test card: insufficient funds
      amount: 1000,
      currency: 'usd',
    });
    expect(result.status).toBe('failed');
    expect(result.errorCode).toBe('insufficient_funds');
    expect(result.retryable).toBe(true); // Can retry with lower amount
  });

  it('respects idempotency key — no double charge', async () => {
    const idempotencyKey = `test_${Date.now()}`;
    const charge1 = await adapter.charge({
      card: '4242424242424242', // Stripe test card: succeeds
      amount: 1000,
      currency: 'usd',
      idempotencyKey,
    });
    const charge2 = await adapter.charge({
      card: '4242424242424242',
      amount: 1000,
      currency: 'usd',
      idempotencyKey, // Same key
    });
    expect(charge2.transactionId).toBe(charge1.transactionId); // Same charge, not a new one
  });
});
```

Each test exercises Stripe's test mode with a specific scenario. The adapter code is tested against actual Stripe behavior: what actually happens, not what the docs say should happen. Edge cases that would be impractical to enumerate in stack tests (one full system spin-up per error code) become fast, focused tests.
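
The diff does not show the adapter itself. As a minimal sketch of the kind of logic those tests pin down, the function below maps a provider decline code to the normalized result the tests assert on. `ChargeResult`, `RETRYABLE_DECLINES`, and `toChargeResult` are hypothetical names, and the retry policy simply mirrors the two expectations in the tests; a real adapter would also call the Stripe API and handle timeouts and rate limits.

```typescript
// Normalized result shape, mirroring the fields asserted in the tests.
interface ChargeResult {
  status: "succeeded" | "failed";
  errorCode?: string;
  retryable: boolean;
}

// Hypothetical retry policy: which decline codes a caller may retry.
const RETRYABLE_DECLINES: Record<string, boolean> = {
  card_declined: false,
  insufficient_funds: true,
};

// Maps a provider decline code (or none, on success) to the adapter's result.
function toChargeResult(declineCode?: string): ChargeResult {
  if (declineCode === undefined) {
    return { status: "succeeded", retryable: false };
  }
  // Unknown codes default to non-retryable until a focused test
  // establishes how the provider really behaves for them.
  const retryable = RETRYABLE_DECLINES[declineCode] === true;
  return { status: "failed", errorCode: declineCode, retryable };
}
```

Keeping the mapping in one table means each newly discovered provider behavior becomes one table entry plus one focused test.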

**Relationship to other test tiers:**

| Tier | Scope | Mocks | What It Proves |
|------|-------|-------|----------------|
| Unit test | Single module | Yes (appropriate) | Module contract — logic correctness, edge case handling |
| Focused integration test | Adapter + real dependency | No | Adapter handles the dependency's actual behavior — error codes, timeouts, rate limits |
| Stack test | Full system user journey | No | End-to-end behavior — the system delivers value through the API |

All three tiers serve distinct purposes. Stack tests verify the system works as a whole. Focused integration tests verify your adapter code handles a specific dependency's real behavior exhaustively. Unit tests verify module logic in isolation. A codebase benefits from all three — they answer different questions.

64153
## Anti-Pattern
65154

66155
**Don't** mock databases in stack tests "because they're slow." PostgreSQL in Docker adds ~2 seconds to startup. That's the cost of real confidence.
@@ -71,10 +160,12 @@ const mockStripe = {
71160

72161
**Don't** avoid mocks in unit tests out of misplaced consistency. Unit tests and stack tests serve different purposes — mocks provide isolation in unit tests, real dependencies provide confidence in stack tests.
73162

163+
**Don't** dismiss focused integration tests as "partial stack tests." They are not a half-measure between unit and stack tests; they serve a different purpose: exhaustive edge-case coverage of a single dependency's contract.
164+
74165
## Cross-References
75166

76-
- **[Pattern 1.1 — Stack Tests](1.1-stack-tests.md)**: Real dependencies is core to stack test philosophy
77-
- **[L0 — Unit Tests as Contract](../L0-foundation.md#pattern-05--unit-tests-as-contract)**: Unit tests validate individual module contracts — mocks are appropriate for isolation
167+
- **[Pattern 1.1 — Stack Tests](1.1-stack-tests.md)**: Real dependencies are core to stack test philosophy; focused integration tests complement stack tests for external dependency edge cases
168+
- **[L0 — Unit Tests as Contract](../L0-foundation.md#pattern-05--unit-tests-as-contract)**: Unit tests validate individual module contracts, focused integration tests validate adapter correctness, and stack tests validate system behavior
78169
- **[L2 — Constitutional Rules](../L2-behavioral-guardrails.md#pattern-24--constitutional-rules)**: Constitutional rule enforces real dependencies in stack tests specifically
79170

80171
---
