
Commit 18b1c31

Update benchmark numbers to 8-node reference cluster
Applies accurate numbers from the distributed 8-node cluster (5 workers, 3 masters) across all three files, replacing figures from the original co-located cluster:

- Cluster description: 6-node → 8-node (5 workers, 3 masters)
- RF=3 throughput ceiling: 37.2k→14,600 msg/s (encryption), 50-52k→19,400 msg/s (baseline), 26%→25% reduction
- Coefficient: 12.5 mc/MB/s → 9.7 measured / 10 mc/MB/s operator formula
- Formula: expose general form (10 × total proxy MB/s) with fan-out explanation; 20 × produce MB/s remains the 1:1 shorthand
- 1-core RF=1: ~40k ceiling replaced with safe at 80k (91ms p99), saturating at ~126k
- 4-core validation: 447ms→247ms at 160k; catastrophic→elevated at 321k (1,706ms); saturation above 321k
- 2-core: comfortable at 80k (850ms), sustaining at 160k (720ms); saturation not yet measured, consistent with model
- Netty aside corrected: thread count scales with availableProcessors() (CPU limit), not fixed at 4

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
1 parent c7a7fbf commit 18b1c31

3 files changed

Lines changed: 45 additions & 40 deletions

_posts/2026-05-21-benchmarking-the-proxy.md

Lines changed: 19 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -25,12 +25,12 @@ We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchma
2525

2626
## Test environment
2727

28-
No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
28+
No, we didn't run this on a laptop — it's a realistic deployment: an 8-node OpenShift cluster on Fyre (5 workers, 3 masters), IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
2929

3030
| Component | Details |
3131
|-----------|---------|
3232
| CPU | AMD EPYC-Rome, 2 GHz |
33-
| Cluster | 6-node OpenShift, RHCOS 9.6 |
33+
| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
3434
| Kafka | 3-broker Strimzi cluster, replication factor 3 |
3535
| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
3636
| KMS | HashiCorp Vault (in-cluster) |
@@ -104,19 +104,25 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l
104104

105105
We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results:
106106

107-
- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster)
108-
- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating
109-
- **Cost: approximately 26% fewer messages per second per partition**
107+
- **Baseline**: sustained up to ~19,400 msg/s (the ceiling at RF=3 on our test cluster)
108+
- **Encryption**: sustained up to **~14,600 msg/s**, then started intermittently saturating
109+
- **Cost: approximately 25% fewer messages per second per partition**
110110

111-
The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
111+
The transition wasn't a clean cliff edge — the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
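As a sanity check on the cost figure, the percentage follows directly from the two RF=3 ceilings quoted in the added lines. A minimal Python sketch (the function name is illustrative and not part of the benchmark harness):

```python
def throughput_cost_pct(baseline_msg_s: float, encrypted_msg_s: float) -> float:
    """Percentage of baseline throughput lost when encryption is enabled."""
    if baseline_msg_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline_msg_s - encrypted_msg_s) / baseline_msg_s * 100.0


# Ceilings quoted above for the 8-node cluster at RF=3.
cost = throughput_cost_pct(19_400, 14_600)
print(f"{cost:.1f}% fewer messages per second")  # 24.7%, which the post rounds to 25%
```

The same helper can be pointed at the ceilings from your own rate sweep, since the post stresses that the absolute numbers are specific to this cluster.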
112112

113113
### The ceiling scales with CPU budget
114114

115115
The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
116116

117-
Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
117+
The single-producer ceiling at RF=3 is Kafka-limited, not proxy-limited — the ISR replication round-trip caps single-partition throughput regardless of how much CPU the proxy has. The proxy still had meaningful headroom: we ran four producers and aggregate throughput climbed higher, while proxy CPU sat at 570m/1000m. The proxy wasn't the constraint.
118118

119-
Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
119+
To find the proxy's real ceiling, you need a workload that doesn't hit the Kafka partition limit first: RF=1, spread across multiple topics. With that workload, the ceiling is squarely in the proxy — and it scales linearly with CPU. The mechanism: CPU limit controls `availableProcessors()`, which controls how many Netty event loop threads the proxy creates. More threads, more concurrent connections handled in parallel, higher aggregate ceiling.
120+
121+
| CPU limit | Comfortable ceiling | Saturation point |
122+
|-----------|--------------------|--------------------|
123+
| 1000m | ~80k msg/s | ~126k msg/s |
124+
| 2000m | ~80k msg/s | above 160k msg/s |
125+
| 4000m | ~160k msg/s | above 321k msg/s |
120126

121127
**The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.
122128

@@ -132,11 +138,13 @@ Numbers without guidance aren't very useful, so here's how to translate these re
132138

133139
1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula:
134140

135-
> **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`**
141+
> **`proxy CPU (millicores) = 10 × total proxy throughput (MB/s)`**
142+
>
143+
> where *total* = produce MB/s + (each consumer group's consume MB/s independently)
136144
137-
Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep.
145+
For a single produce:consume pair this simplifies to `20 × produce MB/s`. Fan-out multiplies: 100 MB/s produce to 3 consumer groups = 100 + 300 = 400 MB/s total → 4,000m. Add ×1.3 headroom for GC pauses and burst. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware using the rate sweep.
138146

139-
Worked example: 100k msg/s at 1 KB = 100 MB/s produce 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).
147+
Worked example: 100k msg/s at 1 KB, 1 consumer group = 100 MB/s produce + 100 MB/s consume = 200 MB/s × 10 = 2,000m, plus headroom → ~2,600m (~2.6 cores).
140148

141149
2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.
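The planning formula in item 1 can be sketched as a small Python helper. This is an illustration under stated assumptions, not a tool shipped with the proxy: the function and parameter names are hypothetical, every consumer group is assumed to read the full produce stream, and the coefficient of 10 mc per MB/s is the value quoted for AMD EPYC-Rome 2 GHz with AES-NI.

```python
def proxy_cpu_millicores(produce_mb_s: float,
                         consumer_groups: int = 1,
                         mc_per_mb_s: float = 10.0,
                         headroom: float = 1.3) -> tuple[float, float]:
    """Return (base, with_headroom) proxy CPU estimates in millicores.

    Assumption: each consumer group consumes the whole produce stream,
    so total proxy traffic = produce + consumer_groups * produce.
    """
    if produce_mb_s < 0 or consumer_groups < 0:
        raise ValueError("throughput and consumer group count must be non-negative")
    total_mb_s = produce_mb_s * (1 + consumer_groups)
    base = mc_per_mb_s * total_mb_s
    return base, base * headroom


# Worked example from the text: 100k msg/s at 1 KB = 100 MB/s produce, 1 consumer group.
base, sized = proxy_cpu_millicores(100)
print(round(base), round(sized))  # 2000 2600

# Fan-out example from the text: 100 MB/s produce to 3 consumer groups.
fan_base, fan_sized = proxy_cpu_millicores(100, consumer_groups=3)
print(round(fan_base), round(fan_sized))  # 4000 5200
```

Passing a different `mc_per_mb_s` is the natural place to plug in a coefficient calibrated on your own hardware.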
142150

_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md

Lines changed: 20 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ author_url: "https://github.com/SamBarker"
77
categories: benchmarking performance engineering
88
---
99

10-
How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, a six-node cluster, and a much more nuanced answer.
10+
How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, an eight-node cluster, and a much more nuanced answer.
1111

1212
Harder than expected. More interesting too.
1313

@@ -142,39 +142,36 @@ RF=1, 10 topics. With no replication hops, the round-trip drops to producer→le
142142

143143
### How much more?
144144

145-
The initial RF=1 run at 1000m CPU gave us a ceiling: ~40k msg/s. From that one measurement we could derive the coefficient:
145+
The RF=1 10-topic workload spread load across partitions. At 1000m, the run tells us: safe at 80k msg/s (91 ms p99), saturating at around 126k. The coefficient comes from JFR CPU data across the non-saturated probes:
146146

147147
```
148-
40k msg/s × 1 KB = 40 MB/s produce
149-
Matched consumer load: 40 MB/s encrypt + 40 MB/s decrypt = 80 MB/s bidirectional
150-
1000m / 80 MB/s ≈ 12.5 mc per MB/s bidirectional
151-
→ operator formula: ~20 mc per MB/s of produce throughput (conservative margin between mid-load and saturation)
148+
Measured: 9.7 mc per MB/s of total proxy traffic (±6.6 stdev, n=4 non-saturated probes)
149+
→ operator formula: 10 mc per MB/s of total proxy traffic
150+
→ for 1:1 produce:consume at 1 KB: 20 mc per MB/s of produce throughput
152151
```
153152

154-
If the ceiling scales linearly with CPU, a 4-core pod should give ~160k msg/s. We ran it.
153+
The mechanism: `cpu: 1000m → availableProcessors()=1 → one Netty event loop thread`. At 4000m that's four threads, each handling its share of connections in parallel. If the ceiling scales linearly with thread count, a 4-core pod should handle roughly four times as much. We ran it.
155154

156-
| CPU limit | Encryption ceiling |
157-
|-----------|-------------------|
158-
| 1000m | ~40k msg/s |
159-
| 4000m | ~160k msg/s |
155+
| CPU limit | Rate | p99 | Verdict |
156+
|-----------|------|-----|---------|
157+
| 1000m | 80k msg/s | 91 ms | Comfortable |
158+
| 1000m | ~126k msg/s | n/a | Saturating |
159+
| 4000m | 160k msg/s | 247 ms | Comfortable |
160+
| 4000m | 321k msg/s | 1,706 ms | Elevated |
161+
| 4000m | above 321k | n/a | Saturated |
160162

161-
Linear. At 4000m: comfortable at 160k (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
162-
163-
*(The proxy ran 4 Netty event loop threads regardless of CPU limit. Thread count doesn't change — what changes is the CPU time budget available to those threads. Empirically linear, even if the thread-scheduling mechanics are more subtle.)*
163+
At 4000m: comfortable at 160k (p99: 247 ms), elevated at 321k (p99: 1,706 ms). Above that — 64 producers matched 32-producer throughput: ceiling reached. The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
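The CPU-limit-to-thread-count mechanism described above can be modelled in a few lines of Python. This is a sketch under two assumptions: that a container-aware JVM rounds the CPU quota up to a whole number of processors, and that the proxy creates one event loop thread per available processor, as the text states for 1000m and 4000m. The function names are hypothetical; this does not call the JVM or Netty.

```python
import math


def available_processors(cpu_limit_millicores: int) -> int:
    """Model of what availableProcessors() reports for a given CPU limit.

    Assumption: the quota is rounded up to whole processors, minimum 1.
    """
    if cpu_limit_millicores <= 0:
        raise ValueError("CPU limit must be positive")
    return max(1, math.ceil(cpu_limit_millicores / 1000))


def event_loop_threads(cpu_limit_millicores: int) -> int:
    """Assumption from the text: one event loop thread per available processor."""
    return available_processors(cpu_limit_millicores)


for limit in (1000, 2000, 4000):
    print(f"{limit}m -> {event_loop_threads(limit)} event loop thread(s)")
```

Under this model a fractional limit such as 1500m would round up to 2 processors, which is one reason whole-core limits make the ceiling easier to reason about.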
164164

165165
### The prediction
166166

167-
One validated data point isn't a sizing model. We used the coefficient to make a falsifiable prediction: a 2-core pod should saturate at ~80k msg/s.
168-
169-
The 2-core sweep:
167+
One validated scaling point isn't a sizing model. The coefficient predicts that 2-core should sustain well past 80k msg/s and not saturate until well above 160k. We ran 2-core next.
170168

171-
| Rate | p99 | Verdict |
172-
|------------|------------|---------------------------------------|
173-
| 40k msg/s | 626 ms | Comfortable |
174-
| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling |
175-
| 160k msg/s | 175,277 ms | Catastrophic |
169+
| Rate | p99 | Verdict |
170+
|------------|---------|--------------------------------------------------|
171+
| 80k msg/s | 850 ms | Comfortable |
172+
| 160k msg/s | 720 ms | Sustaining — not yet saturated |
176173

177-
Held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
174+
At 160k across 10 partitions, each partition carries 16k msg/s — well within the budget of a single Netty thread. The 2-core saturation point sits above 160k; the model is consistent.
178175

179176
Setting `requests` equal to `limits` makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. Fix the CPU budget; fix the ceiling.
180177

performance.markdown

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,14 +5,14 @@ permalink: /performance/
55
toc: true
66
---
77

8-
This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment.
8+
This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop — it's a realistic deployment: an 8-node OpenShift cluster on Fyre (5 workers, 3 masters), IBM's internal cloud platform — a controlled environment.
99

1010
## Test environment
1111

1212
| Component | Details |
1313
|-----------|---------|
1414
| CPU | AMD EPYC-Rome, 2 GHz |
15-
| Cluster | 6-node OpenShift, RHCOS 9.6 |
15+
| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
1616
| Kafka | 3-broker Strimzi cluster, replication factor 3 |
1717
| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
1818
| KMS | HashiCorp Vault (in-cluster) |
@@ -65,9 +65,9 @@ Encryption adds measurable but predictable overhead. The cost scales with produc
6565

6666
| Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) |
6767
|----------|------------------------------------------------|
68-
| Baseline (direct Kafka) | ~50,000–52,000 msg/s |
69-
| Encryption (proxy + AES-256-GCM) | ~37,200 msg/s |
70-
| **Cost** | **~26% fewer messages per second per partition** |
68+
| Baseline (direct Kafka) | ~19,400 msg/s |
69+
| Encryption (proxy + AES-256-GCM) | ~14,600 msg/s |
70+
| **Cost** | **~25% fewer messages per second per partition** |
7171

7272
---
7373

@@ -79,7 +79,7 @@ Numbers without guidance aren't very useful, so here's how to translate these re
7979

8080
**With record encryption:**
8181

82-
- **Throughput**: use `proxy CPU (millicores) = 20 × produce throughput (MB/s)` as a planning formula, then add ×1.3 headroom. Assumes matched consumer load and AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB = 100 MB/s produce → 2000m + headroom → ~2600m.
82+
- **Throughput**: use `CPU (mc) = 10 × total proxy throughput (MB/s)` where total = produce MB/s + each consumer group's consume MB/s. For 1:1 produce:consume this simplifies to `20 × produce MB/s`. Add ×1.3 headroom. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB, 1 consumer group = 200 MB/s total → 2000m + headroom → ~2600m.
8383
- **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate
8484
- **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget — and therefore the throughput ceiling — deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy.
8585
- **KMS**: DEK caching means the KMS is not on the hot path. In testing, each benchmark run triggered only 5–19 DEK generation calls — the KMS is not a bottleneck
