Commit bc04aa9

scotwells and claude committed
test(e2e): prove the edge's core guarantees with real traffic
Adds four end-to-end scenarios that send real traffic through the edge and confirm the promises customers depend on: the firewall enforces, an offline origin fails cleanly, the branded error page shows, and one bad certificate can't break its neighbors. Each asserts on a real response, not just that the control plane wrote the config. Includes a plain-language guide to how we test the edge and what we can't yet guarantee.

These scenarios need the two-cluster production-fidelity environment, so they live under test/e2e-edge (separate from the single-stack test/e2e suite the default CI runs) and execute via `task test-infra:e2e` against that environment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JbCy8vy66RdNYzGSgqH6P6
1 parent 2a12d5d commit bc04aa9

20 files changed

Lines changed: 3585 additions & 0 deletions

docs/testing/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Testing the Datum edge

This is the map for how we test the network-services-operator (NSO): what we
prove, and why it's built the way it is. It's written for anyone who wants to
understand the safety net, not just the people who maintain it.

## Why this exists

NSO programs the edge: when a customer creates a Gateway, a route, a web-app
firewall policy, or a connector, NSO turns that intent into live configuration
on the Envoy proxies that actually serve customer traffic. The hard part isn't
producing the configuration; it's making sure the configuration that lands on
the running proxy *does what the customer asked*.

Almost every production incident in this system has shared one shape:

> **The configuration was logically correct, the platform reported success, but
> the running proxy behaved differently than intended**: a firewall rule that
> protected nothing, an offline backend that still returned a blank error, a
> branded error page that never appeared, a single bad certificate that froze a
> whole shared listener.

These failures are invisible to ordinary tests. The Kubernetes resources report
"Programmed = True," unit tests pass, and the gap only shows up when real
traffic (often a real *attack*) arrives at the edge. So our testing is built
around one principle:

**Prove behavior at the edge with real traffic, against an environment that
looks like production, not against the platform's own report of success.**

## The two ideas everything rests on

**1. Production fidelity.** The recent incidents lived precisely on the axes
where our old test environment differed from production: the proxy version, the
"fail closed" availability coupling, the firewall data plane, and multi-cluster
replication. So the test environment stands up the *real* shape of the edge:
the same proxy version, the same extension server that rewrites configuration,
the same firewall image, and the same federation mechanism that fans
configuration out to edge clusters. If a test passes here, it passes against
something production actually runs. This environment is brought online with a
single command (`task test-infra:up`); see
[`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml).

**2. Traffic-first, with a tie-breaker.** Every test's verdict is the
*observed behavior* of real traffic through the real proxy: a blocked attack,
a 503 for an offline backend, the branded page in the response body. We never
let "the platform says it worked" stand in for "the edge did the right thing."

Real traffic alone has one blind spot, though: a firewall that protects nothing
and a firewall that's simply not being attacked produce the *same* successful
response. So traffic is backed by a **parity check**, a comparison of what the
edge was told to do against what the running proxy is actually doing, which
turns a surprising result into a diagnosis instead of a guess. Traffic is always
the verdict; parity is the tie-breaker that catches the silently-inert case.
See [`test/parity/README.md`](../../test/parity/README.md).
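
As a rough illustration of the parity idea, the comparison can be thought of as a set difference between what the edge was told to carry and what the running proxy reports. This is only a sketch, not the implementation in `test/parity`: the function name `parityDiff` and the rule names are hypothetical, and the real check may compare much richer data than names.

```go
package main

import (
	"fmt"
	"sort"
)

// parityDiff compares the configuration the edge was told to carry (desired)
// with what the running proxy reports (actual). Both inputs are hypothetical
// lists of rule or listener names used purely for illustration.
func parityDiff(desired, actual []string) (missing, unexpected []string) {
	want := make(map[string]bool, len(desired))
	for _, name := range desired {
		want[name] = true
	}
	have := make(map[string]bool, len(actual))
	for _, name := range actual {
		have[name] = true
	}
	// Told to carry it, but the proxy does not: the silently-inert case.
	for name := range want {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	// The proxy carries it, but nothing asked for it: likely stale config.
	for name := range have {
		if !want[name] {
			unexpected = append(unexpected, name)
		}
	}
	sort.Strings(missing)
	sort.Strings(unexpected)
	return missing, unexpected
}

func main() {
	desired := []string{"waf-block-sqli", "branded-error-page"}
	actual := []string{"branded-error-page"}

	missing, unexpected := parityDiff(desired, actual)
	fmt.Println("missing from proxy:", missing)
	fmt.Println("unexpected on proxy:", unexpected)
}
```

A non-empty `missing` list is what turns "the attack was not blocked" into a diagnosis: the rule never reached the proxy.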

## What we guarantee

The end-to-end suite ([`test/e2e-edge/README.md`](../../test/e2e-edge/README.md)) turns
each past incident into a standing guarantee, checked against real traffic:

- **A web-app firewall actually blocks attacks**: a malicious request is
  refused while a legitimate one still succeeds.
- **An offline backend fails cleanly**: the customer path returns a real 503,
  not a hang or a blank response.
- **Branded error pages reach the customer**: the styled page appears in the
  actual response, not just in configuration.
- **One bad certificate can't take down its neighbors**: an invalid listener
  is isolated while the healthy listeners on the same proxy keep serving.

And the federation layer ([`config/federation/README.md`](../../config/federation/README.md))
proves that configuration created in the control plane genuinely arrives at the
edge clusters that serve traffic.

## Where things live

| Area | What it covers |
|---|---|
| [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) | Brings the production-fidelity edge online and runs the suites |
| [`test/e2e-edge/README.md`](../../test/e2e-edge/README.md) | The real-traffic guarantees, scenario by scenario |
| [`test/parity/README.md`](../../test/parity/README.md) | The parity check that catches silently-inert configuration |
| [`config/federation/README.md`](../../config/federation/README.md) | Fanning configuration out to edge clusters |

> The design rationale that led here (the original audit of test-vs-production
> gaps and the proposals for closing them) lives in the pull requests that
> introduced this work, not in the repository, because it describes a plan
> rather than the system as it stands today.

test/e2e-edge/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# End-to-end edge guarantees

These tests prove that the Datum edge *behaves* the way customers expect, by
sending real traffic through the real proxy and checking what actually comes
back. They are the standing guarantees described in
[`docs/testing/README.md`](../../docs/testing/README.md).

## How a guarantee is checked

Every scenario follows the same three-part check, in priority order:

1. **The traffic verdict (always decisive).** The test makes a real request and
   asserts on the real response: a blocked attack, a 503, a branded page, a
   200 from a healthy listener. If the edge behaves wrongly, this fails. A test
   is never satisfied by the platform merely *reporting* success.
2. **The configuration is genuinely present.** A
   [parity check](../parity/README.md) confirms the running proxy actually
   carries the configuration it was told to carry, closing the blind spot where
   a successful-looking response hides a rule that protects nothing.
3. **It's the right configuration serving the request.** A build marker
   confirms the response came from the configuration under test, not from stale
   config left over from a previous state, so a pass can't be a timing fluke.

The first is the point; the second and third exist so a surprising result
becomes a diagnosis rather than a mystery.

## The scenarios

Each one corresponds to a past production incident, now held in place.

### Web-app firewall enforcement
A malicious request (matching the firewall's attack rules) must be refused,
while a legitimate request to the same endpoint still succeeds. The test also
flips the policy into observe-only mode and confirms the same attack is then
*allowed*, proving the block is genuinely driven by the customer's firewall
policy and not by some unrelated default. This guards the customer's actual
protection, and guards against the risk that a single bad firewall rule wedges
the whole listener.

### Offline backend returns a clean 503
When a backend connector is offline, the customer-facing path must return a
real 503; it must not hang or serve a blank response. This is checked as an
observed response on the user path, because that's what a customer experiences.

### Branded error page reaches the customer
When the edge serves an error, the customer must receive Datum's styled page,
confirmed by finding the page's content in the actual response body, not by
trusting that the configuration was applied. Production once needed manual
restarts to make this take effect, exactly the kind of inert-configuration gap
this scenario now catches.

### One bad certificate can't break its neighbors
A single invalid certificate must not take down the other, healthy listeners
sharing the same proxy. The test introduces a genuinely bad certificate and
confirms its listener is isolated while sibling listeners keep serving real
traffic: the "one bad resource freezes everything" failure mode, contained.

## Running them

The scenarios run against the production-fidelity environment:

```
task test-infra:up      # bring the edge online (proxy + extension server + firewall)
task test-infra:smoke   # quick confidence check: real traffic serves
task test-infra:e2e     # the full guarantee suite above
```

See [`Taskfile.test-infra.yml`](../../Taskfile.test-infra.yml) for the
environment these assume.

## Layout

- Scenario folders (`waf-enforcement/`, `connector-offline-503/`,
  `branded-error-page/`, `atomic-reject-isolation/`): one guarantee each.
- `_steps/`: shared, reusable checks (send-a-request, confirm-configuration,
  capture-the-build-marker) so every scenario asserts behavior the same way.
- `_fixtures/`: the supporting pieces a scenario needs (a sample backend, an
  attack corpus, a pre-made bad certificate, the offline-backend stand-in).

## A note on honesty

Where the edge genuinely cannot yet do something, the suite says so rather than
papering over it. The "offline backend recovers" path is deliberately held back
because the edge today does not reliably re-apply configuration when a backend
comes *back* online, and the test proves that gap exists instead of pretending
it's closed. A guarantee we can't keep is documented as a gap, not asserted as
a pass.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# GENERATED by gen-certs.sh. Do not edit by hand; re-run the script to refresh.
#
# A pre-minted EXPIRED certificate for bad.e2e.env.datum.net. Its validity window
# (20260621000000Z .. 20260622000000Z) is in the past, so the extension server
# drops the listener that references it while sibling listeners keep serving.
apiVersion: v1
kind: Secret
metadata:
  name: expired-leaf-tls
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJ0ekNDQVY2Z0F3SUJBZ0lVTW5lMFBkKzczaDFPcFh0YWo1eDlIbzRxR2Fjd0NnWUlLb1pJemowRUF3SXcKSURFZU1Cd0dBMVVFQXd3VlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQjRYRFRJMk1EWXlNVEF3TURBdwpNRm9YRFRJMk1EWXlNakF3TURBd01Gb3dJREVlTUJ3R0ExVUVBd3dWWW1Ga0xtVXlaUzVsYm5ZdVpHRjBkVzB1CmJtVjBNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVKWEYyV0U1UmZzN2lJbkpmVWRWZHNRa0IKWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbHA5L2tMNUYweEkvSWpYeXpFWlF1YU9vMVNwWCtYd0pWS3FOMgpNSFF3SUFZRFZSMFJCQmt3RjRJVlltRmtMbVV5WlM1bGJuWXVaR0YwZFcwdWJtVjBNQXdHQTFVZEV3RUIvd1FDCk1BQXdEZ1lEVlIwUEFRSC9CQVFEQWdXZ01CTUdBMVVkSlFRTU1Bb0dDQ3NHQVFVRkJ3TUJNQjBHQTFVZERnUVcKQkJTaTRpOTB2VkEwdmlQU0k2aW9WdFFEdXFhOE5qQUtCZ2dxaGtqT1BRUURBZ05IQURCRUFpQTJmalMvbGVmNQpTRHFEMEt5UWJRV1hENlltMHg4NDJnU2VPMS9XZ041MHpnSWdYVHd4MWZwSTNVc0U1UXI4bkNXWk0wd044b0cyCkU4a0ttZUo1a3BDL3lFST0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
  tls.key: LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0tCk1IY0NBUUVFSUM3Ri9qbkVFSTBYQk12emg4K0lCaVVycG5MMnVGeEpFSUlMQWkyd1Q4eXlvQW9HQ0NxR1NNNDkKQXdFSG9VUURRZ0FFSlhGMldFNVJmczdpSW5KZlVkVmRzUWtCWU9aRWk1TmV3cmZ5T3hkRWdiZ2ViNnRGMkpUbApwOS9rTDVGMHhJL0lqWHl6RVpRdWFPbzFTcFgrWHdKVktnPT0KLS0tLS1FTkQgRUMgUFJJVkFURSBLRVktLS0tLQo=
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
#!/usr/bin/env bash
# Regenerate the pre-minted EXPIRED TLS certificate the invalid-certificate test
# relies on.
#
# The test needs a certificate that is already expired at test time so the
# extension server drops the listener that references it while sibling listeners
# keep serving. cert-manager cannot help here: it renews before expiry, so it
# cannot hold a certificate in the expired state. We instead mint a self-signed
# certificate whose validity window is in the past.
#
# This is a listener certificate for the customer hostname under test, entirely
# separate from the certificate authority securing the extension server's own
# connection to the gateway; do not wire it to that authority.
#
# Output: expired-cert-secret.yaml, a Secret with the expired certificate and
# key inline. The committed Secret is what the test consumes, so a live run needs
# no openssl; re-run this script only to refresh it.
#
# Usage: ./gen-certs.sh [HOSTNAME] [SECRET_NAME] [SECRET_NAMESPACE]
set -euo pipefail

HOST="${1:-bad.e2e.env.datum.net}"
SECRET_NAME="${2:-expired-leaf-tls}"
SECRET_NS="${3:-default}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WORK="$(mktemp -d)"
trap 'rm -rf "${WORK}"' EXIT

# ECDSA P-256 key.
openssl ecparam -name prime256v1 -genkey -noout -out "${WORK}/tls.key"

# A self-signed certificate whose validity window is entirely in the past. We set
# the exact start and end times with -not_before/-not_after (OpenSSL 3.4+) so the
# certificate is already expired by the time the test runs.
NOT_BEFORE="20260621000000Z"
NOT_AFTER="20260622000000Z"

cat > "${WORK}/leaf.cnf" <<EOF
[req]
distinguished_name = dn
prompt = no
x509_extensions = v3
[dn]
CN = ${HOST}
[v3]
subjectAltName = DNS:${HOST}
basicConstraints = critical, CA:FALSE
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
EOF

openssl req -new -x509 \
  -key "${WORK}/tls.key" \
  -out "${WORK}/tls.crt" \
  -config "${WORK}/leaf.cnf" \
  -not_before "${NOT_BEFORE}" \
  -not_after "${NOT_AFTER}" \
  -sha256

echo "minted leaf for ${HOST}:"
openssl x509 -in "${WORK}/tls.crt" -noout -subject -dates

CRT_B64="$(base64 < "${WORK}/tls.crt" | tr -d '\n')"
KEY_B64="$(base64 < "${WORK}/tls.key" | tr -d '\n')"

cat > "${SCRIPT_DIR}/expired-cert-secret.yaml" <<EOF
# GENERATED by gen-certs.sh. Do not edit by hand; re-run the script to refresh.
#
# A pre-minted EXPIRED certificate for ${HOST}. Its validity window
# (${NOT_BEFORE} .. ${NOT_AFTER}) is in the past, so the extension server drops
# the listener that references it while sibling listeners keep serving.
apiVersion: v1
kind: Secret
metadata:
  name: ${SECRET_NAME}
  namespace: ${SECRET_NS}
type: kubernetes.io/tls
data:
  tls.crt: ${CRT_B64}
  tls.key: ${KEY_B64}
EOF

echo "wrote ${SCRIPT_DIR}/expired-cert-secret.yaml"
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
# CONNECT-proxy stand-in for the connector tunnel.
# Single static binary, no dependencies beyond the standard library.
FROM golang:1.23-alpine AS build
WORKDIR /src
COPY go.mod .
COPY main.go .
# Standard library only; go.mod just gives the build a module context.
RUN CGO_ENABLED=0 go build -o /connect-proxy .

FROM gcr.io/distroless/static:nonroot
COPY --from=build /connect-proxy /connect-proxy
USER nonroot:nonroot
ENTRYPOINT ["/connect-proxy"]
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
module connect-proxy

go 1.23
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
// connect-proxy is a stand-in for the real connector tunnel. It exercises the
// proxy-side CONNECT wiring and the 503<->200 liveness swap, but not real tunnel
// establishment or NAT traversal.
//
// When a connector is online, the extension server points its backend at a path
// that issues an HTTP CONNECT toward the connector's target, which in the test
// is this pod's Service. On a CONNECT, this proxy dials the configured upstream
// (the echo backend) and blindly splices bytes both ways, acting as a minimal
// forward proxy.
//
// Liveness is driven entirely by the connector's annotation on the control plane
// (see _steps/flip-connector-liveness.yaml); this proxy is always up. It exists
// so an online request can only succeed via the tunnel, never via a direct
// fallback route, which is what makes the 503->200 transition meaningful.
package main

import (
	"io"
	"log"
	"net"
	"net/http"
	"os"
	"time"
)

func main() {
	listen := envOr("LISTEN_ADDR", ":8080")
	// Where CONNECT requests are forwarded. Default to the echo backend Service.
	// The proxy ignores the CONNECT target host and always dials this upstream,
	// so the test controls the destination via UPSTREAM_ADDR rather than via the
	// target host named in the CONNECT request.
	upstream := envOr("UPSTREAM_ADDR", "echo-backend.default.svc.cluster.local:8080")

	srv := &http.Server{
		Addr:        listen,
		ReadTimeout: 0, // tunnels are long-lived; no read deadline on the hijacked conn
		Handler:     &proxy{upstream: upstream},
	}
	log.Printf("connect-proxy listening on %s, forwarding CONNECT -> %s", listen, upstream)
	if err := srv.ListenAndServe(); err != nil {
		log.Fatalf("server exited: %v", err)
	}
}

type proxy struct {
	upstream string
}

func (p *proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodConnect {
		// A plain GET is handy as a liveness/readiness probe.
		w.WriteHeader(http.StatusOK)
		_, _ = io.WriteString(w, "connect-proxy ready\n")
		return
	}

	dst, err := net.DialTimeout("tcp", p.upstream, 10*time.Second)
	if err != nil {
		log.Printf("CONNECT %s: dial upstream %s failed: %v", r.Host, p.upstream, err)
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
		return
	}
	defer dst.Close()

	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking unsupported", http.StatusInternalServerError)
		return
	}
	client, buf, err := hj.Hijack() // buf may hold bytes the client already sent
	if err != nil {
		log.Printf("CONNECT %s: hijack failed: %v", r.Host, err)
		return
	}
	defer client.Close()

	// Tell the client the tunnel is established, then splice bytes both ways.
	if _, err := client.Write([]byte("HTTP/1.1 200 Connection Established\r\n\r\n")); err != nil {
		log.Printf("CONNECT %s: write 200 failed: %v", r.Host, err)
		return
	}

	done := make(chan struct{}, 2)
	go func() { _, _ = io.Copy(dst, buf.Reader); done <- struct{}{} }()
	go func() { _, _ = io.Copy(client, dst); done <- struct{}{} }()
	<-done
}

func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}
