
Commit da3a73f

bk86a and claude committed
docs(perf): re-baseline under multi-worker (#68)
PR #71 shipped multi-worker uvicorn behind a shared rate-limit backend. This commit captures the re-run of `scripts/perf_test.sh` against the post-#68 deployment, so the open AC items on #68 ("memory headroom" and "verify approximately N× headroom on /lookup") have measured numbers rather than estimates.

Headlines:

- Realistic-corpus knee (Scenario B) moved from 30 → 35-38 RPS. Single-worker collapsed at 35 (p99 4.47 s); multi-worker absorbs 35 cleanly (p99 150 ms) and only saturates between 35 and 40.
- Hot-key plateau (Scenario A, persistent connections) rose ~1.6×: ~30 → ~50 RPS, with p99 at saturation 2.5× lower.
- Recommended operating point unchanged at 27 RPS — Scenario E (3-min sustained) still meets the p99 ≤ 200 ms SLO. The win is headroom (~10% → ~30-40%), not the operating point itself.

The 1.6× rather than 2× scaling is consistent with shared-edge TLS termination and Pydantic GIL contention being part of the cap, not just per-worker compute. Documented in the methodology notes.

Also adds a new "Rate-limit shared-storage verification" subsection: 130 anonymous requests against the published 120/minute cap from a single source IP yielded exactly 120 × 200 + 10 × 429 — conclusive evidence that the Redis sidecar is reachable from both workers and that the cap is enforced globally rather than per-worker (the failure mode the startup validator at app/config.py:42-50 exists to prevent).

The CHANGELOG entry under [Unreleased] summarises both the re-baseline and the perf_test.sh fix from the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d003102 commit da3a73f

2 files changed

Lines changed: 110 additions & 54 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -6,8 +6,13 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/).

 ## [Unreleased]

+### Documentation
+
+- **Performance re-baseline under multi-worker** (#68): `docs/performance.md` updated with the post-#68 numbers and a new rate-limit shared-storage verification subsection. Realistic-corpus knee at 35-40 RPS (vs ~30 single-worker), hot-key plateau at ~50 RPS, p99 at the old knee dropped from 4.5 s to 150 ms. Recommended operating point unchanged at 27 RPS — the win is headroom, not the operating point itself. The Redis sidecar shared-storage path is verified end-to-end: 130 anonymous requests against the published `120/minute` cap produced exactly 120 × `200` + 10 × `429`, ruling out per-worker counter divergence.
+
 ### Fixed

+- **`scripts/perf_test.sh` `run_warm`**: indexing the vegeta target file by raw line number landed on a blank line half the time, crashing the script under `set -e`. Now extracts only the GET URLs into an array first.
 - **`__version__` was stale at `0.14.0`** since the v0.14 release; openapi.json and FastAPI's `version` field have been reporting the wrong number for every release since then. Bumped to `0.18.0`. Future releases need to update `app/__init__.py` alongside the CHANGELOG until version derivation is automated.

 ## [0.18.0] - 2026-05-01

docs/performance.md

Lines changed: 105 additions & 54 deletions
@@ -1,106 +1,157 @@
 # Performance characterisation

-**Date:** 2026-04-30
-**Commit:** `5e0b6ae`
-**Target:** production deployment (single edge region, single uvicorn worker, single container).
-**Test client:** Belgian residential connection → DE PoP, single source IP, authenticated via a labeled trusted token (revoked after the run).
-**Tools:** `bombardier` v1.2.6, `vegeta` v12.12.0.
-**Reproduction:** `scripts/perf_test.sh` (parameterised on `PC2NUTS_TARGET` and `PC2NUTS_TOKEN`).
+Two runs on file: a single-worker baseline at `5e0b6ae` and a multi-worker
+re-baseline at `18e1908` after #68/#71 shipped (`PC2NUTS_WORKERS=2`,
+Redis-backed shared rate-limit storage via a sidecar container).
+
+| | Single-worker baseline | Multi-worker (current) |
+|---|---|---|
+| Date | 2026-04-30 | 2026-05-01 |
+| Commit | `5e0b6ae` | `18e1908` |
+| uvicorn workers | 1 | 2 |
+| Rate-limit backend | per-process in-memory | Redis sidecar (`redis://localhost:6379/0`), shared across workers |
+| Test client | BE residential → DE PoP, single source IP, authenticated via labeled trusted token (revoked after each run) | (same) |
+| Tools | `bombardier` v1.2.6, `vegeta` v12.12.0 | (same) |
+| Reproduction | `scripts/perf_test.sh` | (same) |

 ---

 ## Headline

-> **Sustained throughput ceiling: ~30 requests/second (~1,800 requests/minute).**
+> **Multi-worker plateau under realistic random-corpus load (Scenario B): ~35-38 RPS** before queue saturation. Hot-key with persistent connections (Scenario A) sustains **~50 RPS**. The single-worker baseline plateaued at ~30 RPS in both.
 >
-> **Recommended operating point: 27 RPS (~1,620/min), p99 < 200 ms.**
+> **Recommended operating point: unchanged at 27 RPS.** The 3-minute sustained run holds 100% success, p99 162 ms — well inside the SLO. Multi-worker raises *headroom* above the operating point from ~10% to ~30-40%, not the operating point itself.
+>
+> **Rate-limit shared-storage verified.** 130 sequential anonymous requests from a single source IP yielded exactly 120 × `200` + 10 × `429`. The Redis sidecar is reachable from both workers and the cap is enforced globally, not per-worker.

-The per-IP cap is therefore not the system bottleneck — the deployment can serve roughly **15× the default `120/minute` cap in aggregate** before throughput plateaus. A single client could in principle be permitted up to ~1,500/minute (25 RPS) without affecting overall headroom; the per-IP cap is set well below the aggregate ceiling so that ~15 simultaneous full-rate clients can coexist without degradation.
+The aggregate ceiling scales roughly **1.6×** with two workers (not 2×), in line with the hot-key plateau moving from ~30 to ~50 RPS. Likely contributors: GIL contention on Pydantic serialisation, fresh-TLS overhead per request in vegeta's connection pattern, and shared edge-layer TLS termination in front of the pod. Scenario A's higher ceiling (50 RPS with persistent connections) implies the per-request TLS handshake is part of the cap, not just per-request CPU.

 ---

 ## Latency curve (Scenario B — random valid lookups across 5 countries)

-This is the realistic-input scenario and the basis for the headline number.
+This is the realistic-input scenario and the basis for the headline numbers.

 | Offered RPS | Achieved RPS | Success | p50 | p90 | p95 | p99 | Max |
 |------------:|-------------:|--------:|----:|----:|----:|----:|----:|
-| 10 | 10.0 | 100% | 46 ms | 53 ms | 63 ms | 74 ms | 104 ms |
-| 20 | 20.0 | 100% | 45 ms | 54 ms | 60 ms | 96 ms | 136 ms |
-| 25 | 25.1 | 100% | 46 ms | 54 ms | 73 ms | 151 ms | 228 ms |
-| **30** | **30.0** | **100%** | **48 ms** | **109 ms** | **137 ms** | **193 ms** | **222 ms** |
-| 35 | 32.2 | 100% | 2.27 s | 3.65 s | 4.07 s | 4.47 s | 5.62 s |
+| 10 | 10.0 | 100% | 57 ms | 68 ms | 71 ms | 102 ms | 120 ms |
+| 20 | 20.0 | 100% | 60 ms | 70 ms | 83 ms | 111 ms | 211 ms |
+| 25 | 25.1 | 100% | 56 ms | 71 ms | 79 ms | 112 ms | 181 ms |
+| **30** | **30.0** | **100%** | **62 ms** | **75 ms** | **95 ms** | **122 ms** | **210 ms** |
+| 35 | 34.8 | 100% | 63 ms | 97 ms | 110 ms | 150 ms | 170 ms |
+| 40 | 38.3 | 100% | 1.71 s | 3.14 s | 3.61 s | 4.24 s | 4.60 s |
+| 50 | 36.3 | 89.3% | 3.85 s | 6.63 s | 8.75 s | 9.86 s | 10.7 s |
+| 60 | 52.2 | 100% | 1.60 s | 2.44 s | 2.65 s | 2.98 s | 3.14 s |
+
+**The new knee sits between 35 and 40 RPS.** From 35 → 40, achieved throughput barely moves (38 vs 35) but tail latencies jump 30×. Beyond that, behaviour is bimodal: 50 RPS hit transient platform back-pressure (107 × 503), while 60 RPS pushed through cleanly at higher achieved throughput than 50 — the platform-edge layer's overload mode is non-monotonic.
+
+Compared to the single-worker baseline:

-The **knee is at 30 RPS**. From 30 → 35 the throughput barely moves (32.2 vs 30.0) but tail latencies jump 12-30×. Beyond the knee, queue depth grows without bound — the curve is sharp, not gradual.
+| Offered RPS | p99 (single-worker) | p99 (multi-worker) | Δ |
+|------------:|--------------------:|-------------------:|---:|
+| 10 | 74 ms | 102 ms | +28 ms (within noise) |
+| 20 | 96 ms | 111 ms | +15 ms |
+| 25 | 151 ms | 112 ms | **−39 ms** |
+| 30 | 193 ms | 122 ms | **−71 ms** |
+| 35 | 4.5 s | 150 ms | **−4.3 s — single-worker collapsed here** |
+
+At and below the operating point the curves are similar; the win shows up at and beyond the old knee, where the new system absorbs ~20% more sustained throughput before breaking down.

 ## Saturation discovery (Scenario A — hot single key, BE 3080)

-Throughput plateaus regardless of client concurrency, confirming the bottleneck is per-request work on the server (single event loop / single worker), not concurrency exhaustion on the client.
+Throughput plateaus regardless of client concurrency, confirming the bottleneck is per-request work, not concurrency exhaustion on the client. The plateau is roughly **1.6× the single-worker baseline**.

-| Connections | Reqs/sec | p50 | p95 | p99 |
-|------------:|---------:|----:|----:|----:|
-| 5 | 29.6 | 169 ms | 225 ms | 267 ms |
-| 10 | 31.0 | 325 ms | 443 ms | 479 ms |
-| 20 | 31.8 | 617 ms | 795 ms | 1.00 s |
-| 40 | 30.9 | 1.21 s | 1.63 s | 2.31 s |
-| 80 | 30.4 | 2.30 s | 3.92 s | 6.92 s |
+| Connections | Reqs/sec (single) | Reqs/sec (multi) | p99 (single) | p99 (multi) |
+|------------:|------------------:|-----------------:|-------------:|------------:|
+| 5 | 29.6 | **46.9** | 267 ms | 186 ms |
+| 10 | 31.0 | **50.8** | 479 ms | 338 ms |
+| 20 | 31.8 | **47.6** | 1.00 s | 746 ms |
+| 40 | 30.9 | **51.4** | 2.31 s | 1.20 s |
+| 80 | 30.4 | **47.9** | 6.92 s | 2.78 s |

-Throughput is bounded; concurrency just queues.
+Throughput is bounded around 50 RPS; concurrency just queues. **Tail latency at saturation is ~2.5× lower** under multi-worker — at c=80 the single-worker setup pushed p99 to 6.9 s, while multi-worker holds it under 2.8 s.

-**At c≥100 the platform pushes back.** An exploratory pre-run at c=100, 200, 400, 800 produced widespread `5xx`, `dial tcp … connection timed out`, and `tls handshake timed out` errors — i.e. the edge platform aggressively refuses connections at very high concurrency from a single source. Stay well below c=100 in any scripted test against this deployment.
+**At c≥100 the platform pushes back** (unchanged from the single-worker baseline). Keep any scripted test against this deployment at or below c=80.

 ## Fallback-path cost (Scenario C — 50/50 hit/miss at 25 RPS)

-Compared to Scenario B at the same rate (25/s), the 50/50 mix is statistically indistinguishable: p50 45 ms vs 46 ms; p99 136 ms vs 151 ms. The Tier 3 prefix-approximation path (taken on every "miss") imposes **no measurable latency cost** at this load. The hard work is per-request HTTP/TLS framing and JSON serialisation, not the lookup itself.
+Compared to Scenario B at the same rate (25 RPS): p50 62 ms vs 56 ms; p99 115 ms vs 112 ms. The Tier 3 prefix-approximation path imposes **no measurable latency cost** at this load, matching the single-worker conclusion.

 ## FastAPI/uvicorn floor (Scenario D — `/health` at 25 RPS)

 | Endpoint | p50 | p95 | p99 | Max |
 |---|---:|---:|---:|---:|
-| `/health` | **15 ms** | 19 ms | 27 ms | 62 ms |
-| `/lookup` (Scenario B at 25/s) | 46 ms | 73 ms | 151 ms | 228 ms |
+| `/health` (multi-worker) | 18 ms | 37 ms | 63 ms | 91 ms |
+| `/health` (single-worker baseline) | 15 ms | 19 ms | 27 ms | 62 ms |
+| `/lookup` (Scenario B at 25 RPS, multi-worker) | 56 ms | 79 ms | 112 ms | 181 ms |

-`/health` is roughly **3× faster** than `/lookup`. About 15 ms of every request is the platform/network/TLS/uvicorn floor; the additional ~30 ms on `/lookup` is the endpoint logic plus Pydantic response serialisation. **Optimisation candidates** if a higher ceiling is needed: response serialisation (the dict access itself is microseconds), reducing JSON envelope size, or moving to multi-worker.
+`/health` p99 is ~2× higher under multi-worker (63 ms vs 27 ms) — a small absolute number, but the only place where the second worker is visibly *worse*. The probable cause is process-scheduling jitter as the OS load-balances incoming connections across two workers rather than one. Worth re-measuring if `/health` ever becomes a hot path; not material for the `/lookup` ceiling.

 ## Stability (Scenario E — sustained 27 RPS for 3 minutes)

-| Metric | Value |
-|---|---|
-| Total requests | 4,860 |
-| Achieved rate | 27.0/s |
-| Success | 100.0% (200:4860) |
-| p50 / p95 / p99 / max | 46 / 89 / 132 / 324 ms |
-| <50 ms | 73.0% |
-| <100 ms | 97.4% |
-| <200 ms | 99.8% |
-| 5xx | 0 |
-| 429 | 0 |
-
-No drift over the 3-minute window. p99 stayed well under 200 ms throughout.
+| Metric | Single-worker | Multi-worker |
+|---|---|---|
+| Total requests | 4,860 | 4,860 |
+| Achieved rate | 27.0/s | 27.0/s |
+| Success | 100.0% | 100.0% |
+| p50 / p95 / p99 / max | 46 / 89 / 132 / 324 ms | 63 / 111 / 162 / 391 ms |
+| <50 ms | 73.0% | 13.8% |
+| <100 ms | 97.4% | 93.2% |
+| <200 ms | 99.8% | 99.6% |
+| 5xx | 0 | 0 |
+| 429 | 0 | 0 |
+
+No drift over the 3-minute window; p99 stayed under 200 ms throughout. The single-worker run is faster at the median (73% of requests under 50 ms vs 13.8%), while the multi-worker >100 ms tail is slightly fatter — net p99 is ~30 ms higher, still comfortably within the SLO.
+
+## Rate-limit shared-storage verification
+
+A separate probe with **no `Authorization` header** was used to exercise the
+per-IP cap (the trusted-token bypass turns the cap off, so the perf scenarios
+can't observe it). 130 sequential requests from a single source IP, against
+the published cap of `120/minute`:
+
+| Outcome | Count |
+|---|---:|
+| `200` | 120 |
+| `429` | 10 |
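A probe of roughly this shape reproduces the check. This is a sketch, not the script the report used; the base URL, endpoint path, and query parameters are placeholders:

```python
# Hypothetical re-run of the cap probe. httpx is an assumed client and
# BASE_URL a placeholder; only the shape matters: 130 sequential requests
# with no Authorization header, so each takes the anonymous rate-limited path.
from collections import Counter

import httpx

BASE_URL = "https://pc2nuts.example"  # placeholder

counts: Counter[int] = Counter()
with httpx.Client(base_url=BASE_URL, timeout=10.0) as client:
    for _ in range(130):
        r = client.get("/lookup", params={"postcode": "3080", "country": "BE"})
        counts[r.status_code] += 1

# Expected with shared Redis storage: {200: 120, 429: 10}.
# Expected with diverged per-worker counters: {200: 130}.
print(dict(counts))
```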
+
+Result is exact, not approximate. If both workers had used per-process
+in-memory storage (the failure mode the startup validator at
+`app/config.py:42-50` exists to prevent), the effective cap would have been
+240 — and 130 requests from one IP would have produced 130 × `200` and zero
+`429`s. The `120 + 10` split is conclusive evidence that:
+
+1. `PC2NUTS_RATE_LIMIT_STORAGE_URI=redis://localhost:6379/0` is being read.
+2. The Redis sidecar (`library/redis@sha256:84b07a33…5cf5b27`) is reachable
+   from both workers via the shared pod network namespace.
+3. slowapi's shared-counter increments are synchronised across workers.
+4. The `120/minute` cap is honoured globally, not per-worker.
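For reference, the guard in question has roughly this shape. A minimal sketch, assuming the env-var names used in this report; the actual code at `app/config.py:42-50` may differ:

```python
# Minimal sketch of the startup guard (assumed shape; the real implementation
# lives at app/config.py:42-50). It refuses to start multiple workers with
# per-process in-memory rate-limit storage, where counters would diverge.
import os


def validate_rate_limit_storage() -> None:
    workers = int(os.environ.get("PC2NUTS_WORKERS", "1"))
    storage_uri = os.environ.get("PC2NUTS_RATE_LIMIT_STORAGE_URI", "memory://")
    if workers > 1 and storage_uri.startswith("memory://"):
        raise RuntimeError(
            "PC2NUTS_WORKERS > 1 with in-memory rate-limit storage: each "
            "worker would enforce its own 120/minute cap. Point "
            "PC2NUTS_RATE_LIMIT_STORAGE_URI at a shared backend, e.g. "
            "redis://localhost:6379/0."
        )
```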

 ---

 ## Methodology notes

-- **Cooldown between runs.** A short pause (10 s) between scenarios is needed; without it, residual queueing from the previous run pollutes the next.
-- **Bombardier default 2 s timeout is too aggressive** here — runs at near-saturation see legitimate 1-2 s tail latencies. Use `--timeout 30s` to avoid spurious "timeout" classifications.
-- **Single-region edge means single-PoP measurements.** The platform allocates the deployment to one region (DE). Latency from clients elsewhere will differ accordingly, but the throughput ceiling is unaffected — every request still hits the same one container.
-- **Single source IP test client.** Distributed traffic from many IPs would not change the aggregate ceiling (the bottleneck is the container) but would change the per-IP rate-limit behaviour, since slowapi keys per source.
-- **No CDN cache between client and `/lookup`.** Verified by inspecting response headers — no `Cache-Status`, no `CDN-Cache-Status`, every request reaches the container.
+- **Tools, methodology, and corpus are unchanged from the single-worker baseline** — same `bombardier`/`vegeta` versions, same scenarios, same target-file format. The numbers are directly comparable.
+- **Cooldown between runs.** Same 10 s pause between scenarios as before, needed to keep residual queueing from one run from polluting the next.
+- **Single source IP test client.** The aggregate ceiling reflects a single TCP/TLS termination path; distributed traffic from many IPs would push the ceiling up to wherever the per-pod work itself becomes the bottleneck. But the recommended operating point is set by single-client latency, so this is the realistic measurement.
+- **Multi-worker container topology.** The deployment is now a single pod with two co-located containers: `api` running uvicorn with two workers, and `redis:7-alpine` started with `--save "" --appendonly no` for in-memory rate-limit counters (see the wiring sketch after this list). Both share the pod network namespace, so `redis://localhost:6379/0` is the api-to-redis URI. Rate-limit counters reset every minute, so no persistence is needed.
+- **Why 1.6× and not 2×.** Two workers don't double throughput. Likely contributors, in rough order: shared edge-layer TLS termination in front of the single pod; Pydantic serialisation contending under the GIL when both workers are CPU-bound on JSON; and vegeta's fresh-connection-per-request pattern at higher rates putting more weight on TLS than on the lookup itself. Scenario A (persistent connections) sustains 50 RPS while Scenario B (fresh per-request connections) plateaus at 35-38 — the difference is TLS handshake cost, which a third worker won't help with.
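The wiring this topology implies is small. A sketch of slowapi pointed at the sidecar, assuming the env var named above; the app's actual setup may differ:

```python
# Sketch of the rate limiter wired to the shared Redis sidecar (assumed shape;
# the app's real wiring may differ). A shared storage_uri is what makes the
# 120/minute cap global instead of per-worker.
import os

from slowapi import Limiter
from slowapi.util import get_remote_address  # keys the cap by source IP

limiter = Limiter(
    key_func=get_remote_address,
    storage_uri=os.environ.get(
        "PC2NUTS_RATE_LIMIT_STORAGE_URI", "redis://localhost:6379/0"
    ),
    default_limits=["120/minute"],
)
```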

 ---

 ## Recommendations

-1. **Per-IP cap set to `120/minute` (2 RPS per IP).** Chosen as 1/15 of the aggregate ceiling — up to 15 simultaneous full-rate anonymous clients can sustain themselves before the aggregate degrades. Friendlier UX for casual users (a small country's worth of postcodes finishes in roughly half the time it took at `60/minute`) while still tight enough that batch users feel the pressure to request a trusted token. Revisit when multi-worker (#68) ships and the aggregate ceiling rises.
+1. **Recommended operating point unchanged at 27 RPS.** Scenario E meets the p99 ≤ 200 ms SLO with 100% success at this rate. The multi-worker headroom buys a wider safety margin above that operating point (~30-40% vs ~10%) — useful for absorbing bursts without rewriting the recommendation.
+
+2. **Per-IP cap unchanged at `120/minute` (2 RPS per IP).** With the aggregate ceiling now ~50 RPS, this is roughly 1/25 of the ceiling — a comfortable margin, supporting up to ~25 simultaneous full-rate anonymous clients before the aggregate degrades. Bumping the cap is reasonable if dashboards show consistent under-utilisation, but not required.

-2. **Pick `p99 ≤ 200 ms` as the SLO** at the recommended 27 RPS operating point. The full 3-minute sustained run met this.
+3. **Don't push above `PC2NUTS_WORKERS=2` yet.** The remaining gap between Scenario A (50 RPS) and Scenario B (35 RPS) suggests the bottleneck has shifted from pure compute to TLS and connection setup. A third worker would help only if the platform's TLS termination scales with it — an empirical question, but the cheapest first investigation is reusing connections client-side rather than adding workers (see the sketch after this list).

-3. **Re-baseline after issue #7 or any worker-count change lands.** Specifically:
-   - **#7 (UK NSPL, +1.79M postcodes)** — should not change per-request latency materially (still a dict lookup) but doubles in-memory state. Re-run to confirm.
-   - **Switching from single-worker to multi-worker** — likely the easiest large win. Each additional worker should approximately add another 30 RPS of headroom up to the container's CPU count.
+4. **Re-baseline if the topology changes.** Specifically:
+   - **Adding a second pod replica** (raising `autoScaling.max` above 1) — would multiply both ceilings, and the rate-limit storage already supports it in principle (Redis is shared per pod today; scaling out would mean moving it to a cross-pod shared service).
+   - **#7 (UK NSPL, +1.79M postcodes)** — should not change per-request latency materially (still a dict lookup) but doubles in-memory state per worker. Re-run to confirm.

-4. **Don't run unattended high-concurrency tests.** Bombardier at c≥100 from a single source triggers platform-level connection refusal (`5xx`, dial timeouts) and risks short-term throttling. Keep scripted load below c=80.
+5. **Don't run unattended high-concurrency tests.** Bombardier at c≥100 from a single source still triggers platform-level connection refusal, and the Scenario B 50 RPS result (107 × 503) is a milder version of the same edge back-pressure. Keep scripted load at or below c=80 and below 50 RPS in B-style sweeps.
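On recommendation 3, "reusing connections client-side" just means holding one session open for a batch instead of paying a TLS handshake per request. An illustrative comparison, with a placeholder base URL:

```python
# Illustrative only: a persistent client (one TLS handshake, Scenario A's
# pattern) versus a fresh connection per request (Scenario B's pattern).
import httpx

BASE_URL = "https://pc2nuts.example"  # placeholder

# Persistent: the pool keeps the TLS session alive across requests.
with httpx.Client(base_url=BASE_URL) as client:
    for code in ("3080", "3001", "3500"):
        client.get("/lookup", params={"postcode": code, "country": "BE"})

# Fresh per request: pays TCP + TLS setup every time.
for code in ("3080", "3001", "3500"):
    httpx.get(f"{BASE_URL}/lookup", params={"postcode": code, "country": "BE"})
```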

 ---