Update benchmark numbers to 8-node reference cluster

SamBarker · SamBarker · commit 18b1c3108746 · 2026-05-21T16:28:43.000+12:00
Applies accurate numbers from the distributed 8-node cluster (5 workers,
3 masters) across all three files, replacing figures from the original
co-located cluster:

- Cluster description: 6-node → 8-node (5 workers, 3 masters)
- RF=3 throughput ceiling: 37.2k→14,600 msg/s (encryption),
  50-52k→19,400 msg/s (baseline), 26%→25% reduction
- Coefficient: 12.5 mc/MB/s → 9.7 measured / 10 mc/MB/s operator formula
- Formula: expose general form (10 × total proxy MB/s) with fan-out
  explanation; 20 × produce MB/s remains the 1:1 shorthand
- 1-core RF=1: ~40k ceiling replaced with safe at 80k (91ms p99),
  saturating at ~126k
- 4-core validation: 447ms→247ms at 160k; catastrophic→elevated at 321k
  (1,706ms); saturation above 321k
- 2-core: comfortable at 80k (850ms), sustaining at 160k (720ms) —
  saturation not yet measured, consistent with model
- Netty aside corrected: thread count scales with availableProcessors()
  (CPU limit), not fixed at 4

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md
@@ -25,12 +25,12 @@ We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchma
 
 ## Test environment
 
-No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
+No, we didn't run this on a laptop. It's a realistic deployment: an 8-node OpenShift cluster (5 workers, 3 masters) on Fyre, IBM's internal cloud platform, which gives us a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
 
 | Component | Details |
 |-----------|---------|
 | CPU | AMD EPYC-Rome, 2 GHz |
-| Cluster | 6-node OpenShift, RHCOS 9.6 |
+| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
 | Kafka | 3-broker Strimzi cluster, replication factor 3 |
 | Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
 | KMS | HashiCorp Vault (in-cluster) |
@@ -104,19 +104,25 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l
 
 We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results:
 
-- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster)
-- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating
-- **Cost: approximately 26% fewer messages per second per partition**
+- **Baseline**: sustained up to ~19,400 msg/s (the ceiling at RF=3 on our test cluster)
+- **Encryption**: sustained up to **~14,600 msg/s**, then started intermittently saturating
+- **Cost: approximately 25% fewer messages per second per partition**
 
-The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
+The transition wasn't a clean cliff edge — the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
 
 ### The ceiling scales with CPU budget
 
 The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
 
-Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
+The single-producer ceiling at RF=3 is Kafka-limited, not proxy-limited — the ISR replication round-trip caps single-partition throughput regardless of how much CPU the proxy has. The proxy still had meaningful headroom: we ran four producers and aggregate throughput climbed higher, while proxy CPU sat at 570m/1000m. The proxy wasn't the constraint.
 
-Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
+To find the proxy's real ceiling, you need a workload that doesn't hit the Kafka partition limit first: RF=1, spread across multiple topics. With that workload, the ceiling is squarely in the proxy — and it scales linearly with CPU. The mechanism: CPU limit controls `availableProcessors()`, which controls how many Netty event loop threads the proxy creates. More threads, more concurrent connections handled in parallel, higher aggregate ceiling.
+
+| CPU limit | Comfortable at (measured) | Saturation point |
+|-----------|--------------------|--------------------|
+| 1000m | ~80k msg/s | ~126k msg/s |
+| 2000m | ~80k msg/s | above 160k msg/s |
+| 4000m | ~160k msg/s | above 321k msg/s |
 
 **The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.
 
@@ -132,11 +138,13 @@ Numbers without guidance aren't very useful, so here's how to translate these re
 
 1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula:
 
-   > **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`**
+   > **`proxy CPU (millicores) = 10 × total proxy throughput (MB/s)`**
+   >
+   > where *total* = produce MB/s + the consume MB/s of every consumer group, each counted separately
 
-   Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep.
+   For a single produce:consume pair this simplifies to `20 × produce MB/s`. Fan-out multiplies: 100 MB/s produce to 3 consumer groups = 100 + 300 = 400 MB/s total → 4,000m. Add ×1.3 headroom for GC pauses and burst. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware using the rate sweep.
 
-   Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).
+   Worked example: 100k msg/s at 1 KB, 1 consumer group = 100 MB/s produce + 100 MB/s consume = 200 MB/s × 10 = 2,000m, plus headroom → ~2,600m (~2.6 cores).
 
 2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.
 
diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md
@@ -7,7 +7,7 @@ author_url: "https://github.com/SamBarker"
 categories: benchmarking performance engineering
 ---
 
-How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, a six-node cluster, and a much more nuanced answer.
+How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, an eight-node cluster, and a much more nuanced answer.
 
 Harder than expected. More interesting too.
 
@@ -142,39 +142,36 @@ RF=1, 10 topics. With no replication hops, the round-trip drops to producer→le
 
 ### How much more?
 
-The initial RF=1 run at 1000m CPU gave us a ceiling: ~40k msg/s. From that one measurement we could derive the coefficient:
+The RF=1 10-topic workload spread load across partitions. At 1000m, the run tells us: safe at 80k msg/s (91 ms p99), saturating at around 126k. The coefficient comes from JFR CPU data across the non-saturated probes:
 
 ```
-40k msg/s × 1 KB = 40 MB/s produce
-Matched consumer load: 40 MB/s encrypt + 40 MB/s decrypt = 80 MB/s bidirectional
-1000m / 80 MB/s ≈ 12.5 mc per MB/s bidirectional
-→ operator formula: ~20 mc per MB/s of produce throughput (conservative margin between mid-load and saturation)
+Measured: 9.7 mc per MB/s of total proxy traffic (±6.6 stdev, n=4 non-saturated probes)
+→ operator formula: 10 mc per MB/s of total proxy traffic
+→ for 1:1 produce:consume at 1 KB: 20 mc per MB/s of produce throughput
 ```
 
-If the ceiling scales linearly with CPU, a 4-core pod should give ~160k msg/s. We ran it.
+The mechanism: `cpu: 1000m → availableProcessors()=1 → one Netty event loop thread`. At 4000m that's four threads, each handling its share of connections in parallel. If the ceiling scales linearly with thread count, a 4-core pod should handle roughly four times as much. We ran it.
 
-| CPU limit | Encryption ceiling |
-|-----------|-------------------|
-| 1000m     | ~40k msg/s        |
-| 4000m     | ~160k msg/s       |
+| CPU limit | Rate | p99 | Verdict |
+|-----------|------|-----|---------|
+| 1000m | 80k msg/s | 91 ms | Comfortable |
+| 1000m | ~126k msg/s | — | Saturating |
+| 4000m | 160k msg/s | 247 ms | Comfortable |
+| 4000m | 321k msg/s | 1,706 ms | Elevated |
+| 4000m | above 321k | — | Saturated |
 
-Linear. At 4000m: comfortable at 160k (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
-
-*(The proxy ran 4 Netty event loop threads regardless of CPU limit. Thread count doesn't change — what changes is the CPU time budget available to those threads. Empirically linear, even if the thread-scheduling mechanics are more subtle.)*
+At 4000m: comfortable at 160k (p99: 247 ms), elevated at 321k (p99: 1,706 ms). Pushing beyond that, 64 producers achieved no more throughput than 32 producers did, so the ceiling had been reached. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
 
 ### The prediction
 
-One validated data point isn't a sizing model. We used the coefficient to make a falsifiable prediction: a 2-core pod should saturate at ~80k msg/s.
-
-The 2-core sweep:
+One validated scaling point isn't a sizing model. The coefficient predicts that 2-core should sustain well past 80k msg/s and not saturate until well above 160k. We ran 2-core next.
 
-| Rate       | p99        | Verdict                               |
-|------------|------------|---------------------------------------|
-| 40k msg/s  | 626 ms     | Comfortable                           |
-| 80k msg/s  | 1,660 ms   | Elevated — right at predicted ceiling |
-| 160k msg/s | 175,277 ms | Catastrophic                          |
+| Rate       | p99     | Verdict                                          |
+|------------|---------|--------------------------------------------------|
+| 80k msg/s  | 850 ms  | Comfortable                                      |
+| 160k msg/s | 720 ms  | Sustaining — not yet saturated                   |
 
-Held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
+At 160k across 10 partitions, each partition carries 16k msg/s — well within the budget of a single Netty thread. The 2-core saturation point sits above 160k; the model is consistent.
 
 Setting `requests` equal to `limits` makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. Fix the CPU budget; fix the ceiling.
 
diff --git a/performance.markdown b/performance.markdown
@@ -5,14 +5,14 @@ permalink: /performance/
 toc: true
 ---
 
-This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment.
+This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop. It's a realistic deployment: an 8-node OpenShift cluster (5 workers, 3 masters) on Fyre, IBM's internal cloud platform, which gives us a controlled environment.
 
 ## Test environment
 
 | Component | Details |
 |-----------|---------|
 | CPU | AMD EPYC-Rome, 2 GHz |
-| Cluster | 6-node OpenShift, RHCOS 9.6 |
+| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
 | Kafka | 3-broker Strimzi cluster, replication factor 3 |
 | Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
 | KMS | HashiCorp Vault (in-cluster) |
@@ -65,9 +65,9 @@ Encryption adds measurable but predictable overhead. The cost scales with produc
 
 | Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) |
 |----------|------------------------------------------------|
-| Baseline (direct Kafka) | ~50,000–52,000 msg/s |
-| Encryption (proxy + AES-256-GCM) | ~37,200 msg/s |
-| **Cost** | **~26% fewer messages per second per partition** |
+| Baseline (direct Kafka) | ~19,400 msg/s |
+| Encryption (proxy + AES-256-GCM) | ~14,600 msg/s |
+| **Cost** | **~25% fewer messages per second per partition** |
 
 ---
 
@@ -79,7 +79,7 @@ Numbers without guidance aren't very useful, so here's how to translate these re
 
 **With record encryption:**
 
-- **Throughput**: use `proxy CPU (millicores) = 20 × produce throughput (MB/s)` as a planning formula, then add ×1.3 headroom. Assumes matched consumer load and AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB = 100 MB/s produce → 2000m + headroom → ~2600m.
+- **Throughput**: use `CPU (mc) = 10 × total proxy throughput (MB/s)` where total = produce MB/s + each consumer group's consume MB/s. For 1:1 produce:consume this simplifies to `20 × produce MB/s`. Add ×1.3 headroom. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB, 1 consumer group = 200 MB/s total → 2000m + headroom → ~2600m.
 - **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate
 - **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget — and therefore the throughput ceiling — deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy.
 - **KMS**: DEK caching means the KMS is not on the hot path. In testing, each benchmark run triggered only 5–19 DEK generation calls — the KMS is not a bottleneck
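
Review note: the sizing formula introduced in this commit (10 mc per MB/s of total proxy traffic, then ×1.3 headroom) can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic in the posts, not part of Kroxylicious. The function name and parameters are hypothetical, and it assumes every consumer group reads the full produce stream.

```python
def proxy_cpu_millicores(produce_mb_s, consumer_groups=1,
                         mc_per_mb_s=10.0, headroom=1.3):
    """Estimate the proxy CPU budget in millicores.

    Hypothetical helper illustrating the planning formula from the posts:
    CPU (mc) = mc_per_mb_s * total proxy MB/s, multiplied by a headroom factor.
    Assumption: each consumer group consumes the full produce throughput,
    so total = produce + consumer_groups * produce.
    """
    if produce_mb_s < 0 or consumer_groups < 0:
        raise ValueError("throughput and consumer group count must be non-negative")
    total_mb_s = produce_mb_s * (1 + consumer_groups)
    return total_mb_s * mc_per_mb_s * headroom


if __name__ == "__main__":
    # Worked example from the post: 100k msg/s at 1 KB, one consumer group.
    print(round(proxy_cpu_millicores(100, consumer_groups=1)))
    # Fan-out example from the post, before headroom: 100 MB/s to 3 groups.
    print(round(proxy_cpu_millicores(100, consumer_groups=3, headroom=1.0)))
```

The coefficient and headroom are left as parameters because the posts say to calibrate on your own hardware; the defaults are the values measured on AMD EPYC-Rome 2 GHz with AES-NI.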