# Performance characterisation

Two runs on file: a single-worker baseline at `5e0b6ae` and a multi-worker re-baseline at `18e1908` after #68/#71 shipped (`PC2NUTS_WORKERS=2`, Redis-backed shared rate-limit storage via a sidecar container).

| | Single-worker baseline | Multi-worker (current) |
|---|---|---|
| Date | 2026-04-30 | 2026-05-01 |
| Commit | `5e0b6ae` | `18e1908` |
| Target | production deployment, single edge region (DE), single container | production deployment, single edge region (DE), single pod (`api` + Redis sidecar) |
| uvicorn workers | 1 | 2 |
| Rate-limit backend | per-process in-memory | Redis sidecar (`redis://localhost:6379/0`), shared across workers |
| Test client | BE residential → DE PoP, single source IP, authenticated via a labeled trusted token (revoked after each run) | (same) |
| Tools | `bombardier` v1.2.6, `vegeta` v12.12.0 | (same) |
| Reproduction | `scripts/perf_test.sh` (parameterised on `PC2NUTS_TARGET` and `PC2NUTS_TOKEN`) | (same) |

---

## Headline

> **Multi-worker plateau under realistic random-corpus load (Scenario B): ~35-38 RPS** before queue saturation. Hot-key load over persistent connections (Scenario A) sustains **~50 RPS**. The single-worker baseline plateaued at ~30 RPS in both.
>
> **Recommended operating point: unchanged at 27 RPS.** The 3-minute sustained run holds 100% success with a p99 of 162 ms — well inside the SLO. Multi-worker raises *headroom* above the operating point from ~10% to ~30-40%, not the operating point itself.
>
> **Rate-limit shared storage verified.** 130 sequential anonymous requests from a single source IP yielded exactly 120 × `200` + 10 × `429`. The Redis sidecar is reachable from both workers and the cap is enforced globally, not per-worker.

The aggregate ceiling scales roughly **1.5×** with two workers, not 2×. Likely contributors: GIL contention on Pydantic serialisation, fresh-TLS overhead on every request in vegeta's connection pattern, and shared edge-layer TLS termination in front of the pod. Scenario A's higher ceiling (50 RPS over persistent connections) implies the per-request TLS handshake is part of the cap, not just per-request CPU.

---

## Latency curve (Scenario B — random valid lookups across 5 countries)

This is the realistic-input scenario and the basis for the headline numbers.

| Offered RPS | Achieved RPS | Success | p50 | p90 | p95 | p99 | Max |
|------------:|-------------:|--------:|----:|----:|----:|----:|----:|
| 10 | 10.0 | 100% | 57 ms | 68 ms | 71 ms | 102 ms | 120 ms |
| 20 | 20.0 | 100% | 60 ms | 70 ms | 83 ms | 111 ms | 211 ms |
| 25 | 25.1 | 100% | 56 ms | 71 ms | 79 ms | 112 ms | 181 ms |
| **30** | **30.0** | **100%** | **62 ms** | **75 ms** | **95 ms** | **122 ms** | **210 ms** |
| 35 | 34.8 | 100% | 63 ms | 97 ms | 110 ms | 150 ms | 170 ms |
| 40 | 38.3 | 100% | 1.71 s | 3.14 s | 3.61 s | 4.24 s | 4.60 s |
| 50 | 36.3 | 89.3% | 3.85 s | 6.63 s | 8.75 s | 9.86 s | 10.7 s |
| 60 | 52.2 | 100% | 1.60 s | 2.44 s | 2.65 s | 2.98 s | 3.14 s |

**The new knee sits between 35 and 40 RPS.** From 35 → 40, throughput barely moves (38.3 vs 34.8) but tail latencies jump roughly 30×. Beyond that, behaviour is bimodal: the 50 RPS run hit transient platform back-pressure (107 × `503`), while the 60 RPS run pushed through cleanly at higher achieved throughput than 50 — the platform edge's overload mode is non-monotonic.

Compared to the single-worker baseline:

| Offered RPS | p99 (single-worker) | p99 (multi-worker) | Δ |
|------------:|--------------------:|-------------------:|---:|
| 10 | 74 ms | 102 ms | +28 ms (within noise) |
| 20 | 96 ms | 111 ms | +15 ms |
| 25 | 151 ms | 112 ms | **−39 ms** |
| 30 | 193 ms | 122 ms | **−71 ms** |
| 35 | 4.5 s | 150 ms | **−4.3 s — single-worker collapsed here** |

At and below the operating point the curves are similar; the win shows up at and beyond the old knee, where the new system absorbs ~20% more sustained throughput before breaking down.
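
For anyone re-running Scenario B against a future build, the random-corpus load reduces to a vegeta targets file sampled across the five countries. A minimal sketch in Python, where the base URL, the `corpus/<CC>.txt` layout, the country list, and the `/lookup` query shape are all illustrative assumptions rather than the documented interface:

```python
# Sketch only: generate a vegeta targets file of random valid lookups.
# Assumes one plain-text postcode sample per country under corpus/ and a
# GET /lookup?country=..&postcode=.. query shape (both assumptions).
import random

BASE = "https://pc2nuts.example.org"          # assumption: deployment URL
COUNTRIES = ["BE", "DE", "FR", "NL", "LU"]    # assumption: the 5 corpus countries
N_TARGETS = 10_000

corpus: dict[str, list[str]] = {}
for cc in COUNTRIES:
    with open(f"corpus/{cc}.txt") as fh:      # assumption: one postcode per line
        corpus[cc] = fh.read().split()

with open("targets_scenario_b.txt", "w") as out:
    for _ in range(N_TARGETS):
        cc = random.choice(COUNTRIES)
        out.write(f"GET {BASE}/lookup?country={cc}&postcode={random.choice(corpus[cc])}\n")

# Then, e.g.:
#   vegeta attack -rate=30 -duration=60s -targets=targets_scenario_b.txt | vegeta report
```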

## Saturation discovery (Scenario A — hot single key, BE 3080)

Throughput plateaus regardless of client concurrency, confirming the bottleneck is per-request work on the server, not concurrency exhaustion on the client. The plateau is roughly **1.6× the single-worker baseline**.

| Connections | Reqs/sec (single) | Reqs/sec (multi) | p99 (single) | p99 (multi) |
|------------:|------------------:|-----------------:|-------------:|------------:|
| 5 | 29.6 | **46.9** | 267 ms | 186 ms |
| 10 | 31.0 | **50.8** | 479 ms | 338 ms |
| 20 | 31.8 | **47.6** | 1.00 s | 746 ms |
| 40 | 30.9 | **51.4** | 2.31 s | 1.20 s |
| 80 | 30.4 | **47.9** | 6.92 s | 2.78 s |

Throughput is bounded around 50 RPS; concurrency just queues. **Tail latency at saturation is ~2.5× lower** under multi-worker — at c=80 the single-worker setup pushed p99 to 6.9 s, while multi-worker holds it under 2.8 s.

**At c≥100 the platform pushes back** (unchanged from the single-worker baseline: widespread `5xx`, dial timeouts, and TLS-handshake timeouts). Stay below c=80 in any scripted test against this deployment.

## Fallback-path cost (Scenario C — 50/50 hit/miss at 25 RPS)

Compared to Scenario B at the same rate: p50 62 ms vs 56 ms; p99 115 ms vs 112 ms. The Tier 3 prefix-approximation path (taken on every "miss") imposes **no measurable latency cost** at this load, matching the single-worker conclusion.

## FastAPI/uvicorn floor (Scenario D — `/health` at 25 RPS)

| Endpoint | p50 | p95 | p99 | Max |
|---|---:|---:|---:|---:|
| `/health` (multi-worker) | 18 ms | 37 ms | 63 ms | 91 ms |
| `/health` (single-worker baseline) | 15 ms | 19 ms | 27 ms | 62 ms |
| `/lookup` (Scenario B at 25 RPS, multi-worker) | 56 ms | 79 ms | 112 ms | 181 ms |

`/health` p99 is roughly 2× higher under multi-worker (63 ms vs 27 ms) — a small absolute number, but the only place where the second worker is visibly *worse*. The probable cause is process-scheduling jitter as the OS load-balances incoming connections across two workers instead of one. Worth re-measuring if `/health` ever becomes a hot path; it is not material for the `/lookup` ceiling.

## Stability (Scenario E — sustained 27 RPS for 3 minutes)

| Metric | Single-worker | Multi-worker |
|---|---|---|
| Total requests | 4,860 | 4,860 |
| Achieved rate | 27.0/s | 27.0/s |
| Success | 100.0% | 100.0% |
| p50 / p95 / p99 / max | 46 / 89 / 132 / 324 ms | 63 / 111 / 162 / 391 ms |
| <50 ms | 73.0% | 13.8% |
| <100 ms | 97.4% | 93.2% |
| <200 ms | 99.8% | 99.6% |
| 5xx | 0 | 0 |
| 429 | 0 | 0 |

No drift over the 3-minute window; p99 stayed under 200 ms throughout. The distribution sits lower at the median under single-worker (far more requests finish under 50 ms), but the >100 ms tail is only slightly fatter under multi-worker, so the net p99 is ~30 ms higher. Within the SLO either way.
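
The bucket rows (<50/100/200 ms) can be recomputed from the raw results rather than read off `vegeta report`. A minimal sketch, assuming the run was exported with `vegeta encode` to newline-delimited JSON (per-record `code` and `latency` in nanoseconds; check the field names against your vegeta version):

```python
# Sketch: recompute success rate and latency buckets from a vegeta NDJSON export,
# e.g. `vegeta encode results.bin > results.ndjson`. Latencies are nanoseconds.
import json

latencies_ms, codes = [], []
with open("results.ndjson") as fh:
    for line in fh:
        rec = json.loads(line)
        codes.append(rec["code"])
        latencies_ms.append(rec["latency"] / 1e6)

total = len(codes)
print(f"success: {sum(c == 200 for c in codes) / total:.1%}")
for cut_ms in (50, 100, 200):
    print(f"<{cut_ms} ms: {sum(l < cut_ms for l in latencies_ms) / total:.1%}")
```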

## Rate-limit shared-storage verification

A separate probe with **no `Authorization` header** was used to exercise the per-IP cap (the trusted-token bypass turns the cap off, so the perf scenarios above can't observe it): 130 sequential requests from a single source IP, against the published cap of `120/minute`. A re-runnable sketch of the probe follows the numbered conclusions below.

| Outcome | Count |
|---|---:|
| `200` | 120 |
| `429` | 10 |

The result is exact, not approximate. If both workers had been using per-process in-memory storage (the failure mode the startup validator at `app/config.py:42-50` exists to prevent), the effective cap would have been 240, and 130 requests from one IP would have produced 130 × `200` and zero `429`s. The `120 + 10` split is conclusive evidence that:

1. `PC2NUTS_RATE_LIMIT_STORAGE_URI=redis://localhost:6379/0` is being read.
2. The Redis sidecar (`library/redis@sha256:84b07a33…5cf5b27`) is reachable from both workers via the shared pod network namespace.
3. slowapi's shared-counter increments are synchronised across workers.
4. The `120/minute` cap is honoured globally, not per-worker.
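
The probe itself is trivial to re-run; a minimal sketch using `httpx`, where the base URL and the `/lookup` query shape are assumptions, and the request deliberately carries no `Authorization` header so the per-IP cap applies:

```python
# Sketch: 130 unauthenticated lookups from one source IP, tallying status codes.
# With shared Redis storage the expected tally is {200: 120, 429: 10}; with
# per-worker in-memory storage it would be {200: 130}.
from collections import Counter

import httpx

BASE = "https://pc2nuts.example.org"                 # assumption: deployment URL
URL = f"{BASE}/lookup?country=BE&postcode=3080"      # assumption: query shape

tally = Counter()
with httpx.Client(timeout=10.0) as client:           # persistent client, no auth header
    for _ in range(130):
        tally[client.get(URL).status_code] += 1

print(dict(tally))
```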

---

## Methodology notes

- **Tools, methodology, and corpus are unchanged from the single-worker baseline** — same `bombardier`/`vegeta` versions, same scenarios, same target file format. Numbers are directly comparable.
- **Cooldown between runs.** A 10 s pause between scenarios, as before; without it, residual queueing from one run pollutes the next.
- **Single source IP test client.** The aggregate ceiling reflects a single TCP/TLS termination path; distributed traffic from many IPs would push the ceiling up to wherever the per-pod work becomes the bottleneck — but the recommended operating point is set by single-client latency, so this is the realistic measurement.
- **Multi-worker container topology.** The deployment is now a single pod with two co-located containers: `api`, running uvicorn with two workers, and `redis:7-alpine`, started with `--save "" --appendonly no` so the rate-limit counters stay purely in memory. Both share the pod network namespace, so `redis://localhost:6379/0` is the api-to-redis URI. Counters reset every minute, so no persistence is needed. (A sketch of the corresponding limiter wiring follows this list.)
- **Why 1.6× and not 2×.** Two workers don't double throughput. Likely contributors, in rough order: shared edge-layer TLS termination in front of the single pod; Pydantic serialisation contending under the GIL when both workers are CPU-bound on JSON; and vegeta's fresh-connection-per-request pattern at higher rates putting more weight on TLS than on the lookup itself. Scenario A (persistent connections) sustains 50 RPS while Scenario B (fresh per-request connections) plateaus at 35-38 RPS — the difference is largely TLS handshake cost, which a third worker won't help with.
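
As referenced in the topology bullet above, here is a minimal sketch of the kind of limiter wiring and startup guard this setup implies. It is not the project's actual `app/config.py` / `app/main.py`; every name, default, and structural choice below is an illustrative assumption:

```python
# Sketch only: a slowapi limiter that shares per-IP counters across uvicorn
# workers via Redis, plus a guard that refuses multi-worker + in-memory storage
# (the failure mode that would silently double the effective per-IP cap).
import os

from fastapi import FastAPI
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

WORKERS = int(os.environ.get("PC2NUTS_WORKERS", "1"))
STORAGE_URI = os.environ.get("PC2NUTS_RATE_LIMIT_STORAGE_URI", "memory://")

if WORKERS > 1 and STORAGE_URI.startswith("memory"):
    raise RuntimeError(
        "PC2NUTS_WORKERS > 1 needs a shared rate-limit backend, e.g. "
        "PC2NUTS_RATE_LIMIT_STORAGE_URI=redis://localhost:6379/0"
    )

limiter = Limiter(
    key_func=get_remote_address,      # keyed per source IP
    default_limits=["120/minute"],    # the published anonymous cap
    storage_uri=STORAGE_URI,          # memory:// locally, redis:// in the pod
)

app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)  # applies default_limits to all routes
```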

---

## Recommendations

1. **Recommended operating point unchanged at 27 RPS.** Scenario E meets the p99 ≤ 200 ms SLO with 100% success at this rate. The multi-worker headroom buys a wider safety margin above that operating point (~30-40% vs ~10%) — useful for absorbing bursts without rewriting the recommendation.

2. **Per-IP cap unchanged at `120/minute` (2 RPS per IP).** With the aggregate ceiling now ~50 RPS, this is roughly 1/25 of the ceiling — a comfortable margin that supports up to ~25 simultaneous full-rate anonymous clients before the aggregate degrades. Bumping the cap is reasonable if dashboards show consistent under-utilisation, but not required.

3. **Don't push above `PC2NUTS_WORKERS=2` yet.** The remaining gap between Scenario A (50 RPS) and Scenario B (35-38 RPS) suggests the bottleneck has shifted from pure compute to TLS and connection setup. Adding a third worker would help only if the platform's TLS termination scales with it — an empirical question, but the cheapest first investigation is reusing connections client-side, not adding more workers.

4. **Re-baseline if the topology changes.** Specifically:
   - **Adding a second pod replica** (raising `autoScaling.max` above 1) — this would multiply both ceilings. The rate-limit counters live in a per-pod Redis today and would need to move to a cross-pod shared service when scaling out.
   - **#7 (UK NSPL, +1.79M postcodes)** — should not change per-request latency materially (still a dict lookup) but doubles in-memory state per worker. Re-run to confirm.

5. **Don't run unattended high-concurrency tests.** Bombardier at c≥100 from a single source still triggers platform-level connection refusal (`5xx`, dial timeouts) and risks short-term throttling. The 50 RPS Scenario B result here (107 × `503`) is a milder form of the same edge back-pressure. Keep scripted load below c=80 and below 50 RPS in Scenario B-style sweeps.

---