| 1 | +# Testing the Datum edge |
| 2 | + |
| 3 | +This is the map for how we test the network-services-operator (NSO) — what we |
| 4 | +prove, and why it's built the way it is. It's written for anyone who wants to |
| 5 | +understand the safety net, not just the people who maintain it. |
| 6 | + |
| 7 | +## Why this exists |
| 8 | + |
| 9 | +NSO programs the edge: when a customer creates a Gateway, a route, a web-app |
| 10 | +firewall policy, or a connector, NSO turns that intent into live configuration |
| 11 | +on the Envoy proxies that actually serve customer traffic. The hard part isn't |
| 12 | +producing the configuration — it's making sure the configuration that lands on |
| 13 | +the running proxy *does what the customer asked*. |
| 14 | + |
| 15 | +Almost every production incident in this system has shared one shape: |
| 16 | + |
| 17 | +> **The configuration was logically correct, the platform reported success, but |
| 18 | +> the running proxy behaved differently than intended** — a firewall rule that |
| 19 | +> protected nothing, an offline backend that still returned a blank error, a |
| 20 | +> branded error page that never appeared, a single bad certificate that froze a |
| 21 | +> whole shared listener. |
| 22 | + |
| 23 | +These failures are invisible to ordinary tests. The Kubernetes resources report |
| 24 | +"Programmed = True," unit tests pass, and the gap only shows up when real |
| 25 | +traffic — often a real *attack* — arrives at the edge. So our testing is built |
| 26 | +around one principle: |
| 27 | + |
| 28 | +**Prove behavior at the edge with real traffic, against an environment that |
| 29 | +looks like production — not against the platform's own report of success.** |
| 30 | + |
| 31 | +## The two ideas everything rests on |
| 32 | + |
| 33 | +**1. Production fidelity.** The recent incidents lived precisely on the axes |
| 34 | +where our old test environment differed from production: the proxy version, the |
| 35 | +"fail closed" availability coupling, the firewall data plane, and multi-cluster |
| 36 | +replication. So the test environment stands up the *real* shape of the edge — |
| 37 | +the same proxy version, the same extension server that rewrites configuration, |
| 38 | +the same firewall image, and the same federation mechanism that fans |
| 39 | +configuration out to edge clusters. If a test passes here, it passes against |
| 40 | +something production actually runs. The environment is brought online with a |
| 41 | +single command (`task test-infra:up`); see |
| 42 | +[`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml). |
| 43 | + |
| 44 | +**2. Traffic-first, with a tie-breaker.** Every test's verdict is the |
| 45 | +*observed behavior* of real traffic through the real proxy — a blocked attack, |
| 46 | +a 503 for an offline backend, the branded page in the response body. We never |
| 47 | +let "the platform says it worked" stand in for "the edge did the right thing." |
| 48 | + |
| 49 | +Real traffic alone has one blind spot, though: a firewall that protects nothing |
| 50 | +and a firewall that's simply not being attacked produce the *same* successful |
| 51 | +response. So traffic is backed by a **parity check** — a comparison of what the |
| 52 | +edge was told to do against what the running proxy is actually doing — which |
| 53 | +turns a surprising result into a diagnosis instead of a guess. Traffic is always |
| 54 | +the verdict; parity is the tie-breaker that catches the silently-inert case. |
| 55 | +See [`test/parity/README.md`](../../test/parity/README.md). |
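The tie-breaker idea can be illustrated with a minimal sketch. It is not the code in `test/parity`; the names `parity_diff` and `diagnose`, the flat key/value shape of the configuration, and the wording of the diagnoses are all assumptions made for illustration. Traffic stays the verdict, and the comparison only explains or qualifies it.

```python
from dataclasses import dataclass

# Sentinel so that a key absent from the proxy is distinguishable from a None value.
_MISSING = "<missing>"


@dataclass(frozen=True)
class ParityFinding:
    key: str
    intended: object
    observed: object


def parity_diff(intended: dict, observed: dict) -> list:
    """Return one finding per intended key that the proxy does not reflect."""
    findings = []
    for key in sorted(intended):
        actual = observed.get(key, _MISSING)
        if actual != intended[key]:
            findings.append(ParityFinding(key, intended[key], actual))
    return findings


def diagnose(traffic_passed: bool, findings: list) -> str:
    """Combine the traffic verdict with parity findings (hypothetical wording)."""
    if not traffic_passed:
        if findings:
            return "fail: intended configuration is missing or different on the proxy"
        return "fail: configuration matches, so the proxy itself misbehaves"
    if findings:
        # Passing traffic plus drift is the silently-inert case.
        return "suspect: traffic passed, but the configuration may be silently inert"
    return "pass"


# Hypothetical example: the firewall mode never reached the running proxy.
intended = {"waf.enabled": True, "waf.mode": "block"}
observed = {"waf.enabled": True}
print(diagnose(True, parity_diff(intended, observed)))
```

In this sketch a benign request would succeed either way, so only the comparison reveals that the blocking mode is absent from the proxy.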
| 56 | + |
| 57 | +## What we guarantee |
| 58 | + |
| 59 | +The end-to-end suite ([`test/e2e-edge/README.md`](../../test/e2e-edge/README.md)) turns |
| 60 | +each past incident into a standing guarantee, checked against real traffic: |
| 61 | + |
| 62 | +- **A web-app firewall actually blocks attacks** — a malicious request is |
| 63 | + refused while a legitimate one still succeeds. |
| 64 | +- **An offline backend fails cleanly** — the customer path returns a real 503, |
| 65 | + not a hang or a blank error. |
| 66 | +- **Branded error pages reach the customer** — the styled page appears in the |
| 67 | + actual response, not just in configuration. |
| 68 | +- **One bad certificate can't take down its neighbors** — an invalid listener |
| 69 | + is isolated while the healthy listeners on the same proxy keep serving. |
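As a rough illustration of what "checked against real traffic" means for the first two guarantees, here is a minimal sketch. It is not taken from `test/e2e-edge`: the function names, the request paths, and the use of 403 as the blocking status are assumptions, and the `send` callable stands in for a real HTTP request through the edge proxy so that the example runs without a cluster.

```python
from typing import Callable

# A sender takes a request path and returns the HTTP status the edge answered with.
Sender = Callable[[str], int]


def waf_blocks_attacks(send: Sender) -> bool:
    """The guarantee holds only if the attack is refused AND benign traffic succeeds."""
    benign_status = send("/search?q=shoes")
    attack_status = send("/search?q=%27%20OR%201%3D1%20--")  # URL-encoded SQL injection probe
    return benign_status == 200 and attack_status == 403  # 403 is an assumed block status


def offline_backend_fails_cleanly(send: Sender) -> bool:
    """An offline backend must surface as a real 503 on the customer path."""
    return send("/") == 503


# Stub edges for illustration only; a real run would send requests to the proxy.
def enforcing_edge(path: str) -> int:
    return 403 if "OR" in path else 200


def inert_edge(path: str) -> int:
    return 200  # firewall configured but protecting nothing


print(waf_blocks_attacks(enforcing_edge), waf_blocks_attacks(inert_edge))
```

Checking the benign and malicious requests together matters: a proxy that refuses everything would pass an attack-only check, and an inert firewall would pass a benign-only check.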
| 70 | + |
| 71 | +And the federation layer ([`config/federation/README.md`](../../config/federation/README.md)) |
| 72 | +proves that configuration created in the control plane genuinely arrives at the |
| 73 | +edge clusters that serve traffic. |
| 74 | + |
| 75 | +## Where things live |
| 76 | + |
| 77 | +| Area | What it covers | |
| 78 | +|---|---| |
| 79 | +| [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) | Brings the production-fidelity edge online and runs the suites | |
| 80 | +| [`test/e2e-edge/README.md`](../../test/e2e-edge/README.md) | The real-traffic guarantees, scenario by scenario | |
| 81 | +| [`test/parity/README.md`](../../test/parity/README.md) | The parity check that catches silently-inert configuration | |
| 82 | +| [`config/federation/README.md`](../../config/federation/README.md) | Fanning configuration out to edge clusters | |
| 83 | + |
| 84 | +> The design rationale that led here — the original audit of test-vs-production |
| 85 | +> gaps and the proposals for closing them — lives in the pull requests that |
| 86 | +> introduced this work, not in the repository, because it describes a plan |
| 87 | +> rather than the system as it stands today. |