Update benchmark numbers to 8-node reference cluster
Applies accurate numbers from the distributed 8-node cluster (5 workers,
3 masters) across all three files, replacing figures from the original
co-located cluster:
- Cluster description: 6-node → 8-node (5 workers, 3 masters)
- RF=3 throughput ceiling: 37.2k→14,600 msg/s (encryption),
50-52k→19,400 msg/s (baseline), 26%→25% reduction
- Coefficient: 12.5 mc/MB/s → 9.7 measured / 10 mc/MB/s operator formula
- Formula: expose general form (10 × total proxy MB/s) with fan-out
explanation; 20 × produce MB/s remains the 1:1 shorthand
- 1-core RF=1: ~40k ceiling replaced with safe at 80k (91ms p99),
saturating at ~126k
- 4-core validation: 447ms→247ms at 160k; catastrophic→elevated at 321k
(1,706ms); saturation above 321k
- 2-core: comfortable at 80k (850ms), sustaining at 160k (720ms) —
saturation not yet measured, consistent with model
- Netty aside corrected: thread count scales with availableProcessors()
(CPU limit), not fixed at 4
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
_posts/2026-05-21-benchmarking-the-proxy.md
19 additions & 11 deletions
@@ -25,12 +25,12 @@ We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchma
25
25
26
26
## Test environment
27
27
28
-
No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
28
+
No, we didn't run this on a laptop. It's a realistic deployment in a controlled environment: an 8-node OpenShift cluster (5 workers, 3 masters) on Fyre, IBM's internal cloud platform. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
36
36
| KMS | HashiCorp Vault (in-cluster) |
@@ -104,19 +104,25 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l
104
104
105
105
We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results:
106
106
107
-
- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster)
108
-
- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating
109
-
- **Cost: approximately 26% fewer messages per second per partition**
107
+
- **Baseline**: sustained up to ~19,400 msg/s (the ceiling at RF=3 on our test cluster)
108
+
- **Encryption**: sustained up to **~14,600 msg/s**, then started intermittently saturating
109
+
- **Cost: approximately 25% fewer messages per second per partition**
110
110
111
-
The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
111
+
The transition wasn't a clean cliff edge — the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
112
112
113
113
### The ceiling scales with CPU budget
114
114
115
115
The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
116
116
117
-
Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
117
+
The single-producer ceiling at RF=3 is Kafka-limited, not proxy-limited — the ISR replication round-trip caps single-partition throughput regardless of how much CPU the proxy has. The proxy still had meaningful headroom: we ran four producers and aggregate throughput climbed higher, while proxy CPU sat at 570m/1000m. The proxy wasn't the constraint.
118
118
119
-
Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
119
+
To find the proxy's real ceiling, you need a workload that doesn't hit the Kafka partition limit first: RF=1, spread across multiple topics. With that workload, the ceiling is squarely in the proxy — and it scales linearly with CPU. The mechanism: CPU limit controls `availableProcessors()`, which controls how many Netty event loop threads the proxy creates. More threads, more concurrent connections handled in parallel, higher aggregate ceiling.
120
+
121
+
| CPU limit | Comfortable ceiling | Saturation point |
**The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.
122
128
@@ -132,11 +138,13 @@ Numbers without guidance aren't very useful, so here's how to translate these re
132
138
133
139
1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula:
134
140
135
-
> **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`**
141
+
> **`proxy CPU (millicores) = 10 × total proxy throughput (MB/s)`**
142
+
>
143
+
> where *total* = produce MB/s + (each consumer group's consume MB/s independently)
136
144
137
-
Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep.
145
+
For a single produce:consume pair this simplifies to `20 × produce MB/s`. Fan-out multiplies: 100 MB/s produce to 3 consumer groups = 100 + 300 = 400 MB/s total → 4,000m. Add ×1.3 headroom for GC pauses and burst. Measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware using the rate sweep.
138
146
139
-
Worked example: 100k msg/s at 1 KB= 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).
147
+
Worked example: 100k msg/s at 1 KB, 1 consumer group = 100 MB/s produce + 100 MB/s consume = 200 MB/s × 10 = 2,000m, plus headroom → ~2,600m (~2.6 cores).
140
148
141
149
2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing: give yourself headroom and you'll barely notice it.
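
As a rough aid for applying the throughput formula above, here is a minimal Python sketch. It is not part of Kroxylicious: the function names `msg_rate_to_mb_s` and `proxy_cpu_millicores` are hypothetical, and it assumes that every consumer group reads the full produce stream and that 1 KB means 1,000 bytes (which is how 100k msg/s at 1 KB comes to 100 MB/s). The 10 mc per MB/s coefficient and the ×1.3 headroom are the values quoted in this post; calibrate both on your own hardware.

```python
def msg_rate_to_mb_s(msgs_per_sec, msg_bytes=1000):
    """Convert a message rate to MB/s, assuming 1 KB = 1,000 bytes."""
    return msgs_per_sec * msg_bytes / 1_000_000


def proxy_cpu_millicores(produce_mb_s, consumer_groups=1,
                         coefficient=10.0, headroom=1.3):
    """Return (base, with_headroom) proxy CPU estimates in millicores.

    total proxy MB/s = produce MB/s + consume MB/s of each consumer group.
    Assumption: every group consumes the full produce stream.
    """
    if produce_mb_s < 0 or consumer_groups < 0:
        raise ValueError("throughput and group count must be non-negative")
    total_mb_s = produce_mb_s * (1 + consumer_groups)
    base = coefficient * total_mb_s
    return base, base * headroom


if __name__ == "__main__":
    produce = msg_rate_to_mb_s(100_000)  # 100k msg/s at 1 KB
    base, sized = proxy_cpu_millicores(produce, consumer_groups=1)
    print(round(base), round(sized))     # 2000 2600

    # Fan-out: 100 MB/s produced to 3 consumer groups = 400 MB/s total
    base, sized = proxy_cpu_millicores(100, consumer_groups=3)
    print(round(base), round(sized))     # 4000 5200
```

With one consumer group the result matches the `20 × produce MB/s` shorthand; each extra consumer group adds another `10 × produce MB/s` before headroom.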
How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, a six-node cluster, and a much more nuanced answer.
10
+
How hard can it be? We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, an eight-node cluster, and a much more nuanced answer.
11
11
12
12
Harder than expected. More interesting too.
13
13
@@ -142,39 +142,36 @@ RF=1, 10 topics. With no replication hops, the round-trip drops to producer→le
142
142
143
143
### How much more?
144
144
145
-
The initial RF=1 run at 1000m CPU gave us a ceiling: ~40k msg/s. From that one measurement we could derive the coefficient:
145
+
The RF=1 10-topic workload spread load across partitions. At 1000m, the run tells us: safe at 80k msg/s (91 ms p99), saturating at around 126k. The coefficient comes from JFR CPU data across the non-saturated probes:
→ operator formula: ~20 mc per MB/s of produce throughput (conservative margin between mid-load and saturation)
148
+
Measured: 9.7 mc per MB/s of total proxy traffic (±6.6 stdev, n=4 non-saturated probes)
149
+
→ operator formula: 10 mc per MB/s of total proxy traffic
150
+
→ for 1:1 produce:consume at 1 KB: 20 mc per MB/s of produce throughput
152
151
```
153
152
154
-
If the ceiling scales linearly with CPU, a 4-core pod should give ~160k msg/s. We ran it.
153
+
The mechanism: `cpu: 1000m → availableProcessors()=1 → one Netty event loop thread`. At 4000m that's four threads, each handling its share of connections in parallel. If the ceiling scales linearly with thread count, a 4-core pod should handle roughly four times as much. We ran it.
155
154
156
-
| CPU limit | Encryption ceiling |
157
-
|-----------|-------------------|
158
-
| 1000m |~40k msg/s |
159
-
| 4000m |~160k msg/s |
155
+
| CPU limit | Rate | p99 | Verdict |
156
+
|-----------|------|-----|---------|
157
+
| 1000m | 80k msg/s | 91 ms | Comfortable |
158
+
| 1000m |~126k msg/s | — | Saturating |
159
+
| 4000m | 160k msg/s | 247 ms | Comfortable |
160
+
| 4000m | 321k msg/s | 1,706 ms | Elevated |
161
+
| 4000m | above 321k | — | Saturated |
160
162
161
-
Linear. At 4000m: comfortable at 160k (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
162
-
163
-
*(The proxy ran 4 Netty event loop threads regardless of CPU limit. Thread count doesn't change — what changes is the CPU time budget available to those threads. Empirically linear, even if the thread-scheduling mechanics are more subtle.)*
163
+
At 4000m: comfortable at 160k (p99: 247 ms), elevated at 321k (p99: 1,706 ms). Beyond that, 64 producers achieved no more throughput than 32 had, so the ceiling had been reached. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
164
164
165
165
### The prediction
166
166
167
-
One validated data point isn't a sizing model. We used the coefficient to make a falsifiable prediction: a 2-core pod should saturate at ~80k msg/s.
168
-
169
-
The 2-core sweep:
167
+
One validated scaling point isn't a sizing model. The coefficient predicts that 2-core should sustain well past 80k msg/s and not saturate until well above 160k. We ran 2-core next.
| 160k msg/s | 720 ms | Sustaining — not yet saturated |
176
173
177
-
Held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
174
+
At 160k across 10 partitions, each partition carries 16k msg/s — well within the budget of a single Netty thread. The 2-core saturation point sits above 160k; the model is consistent.
178
175
179
176
Setting `requests` equal to `limits` makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. Fix the CPU budget; fix the ceiling.
performance.markdown
6 additions & 6 deletions
@@ -5,14 +5,14 @@ permalink: /performance/
5
5
toc: true
6
6
---
7
7
8
-
This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment.
8
+
This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop. It's a realistic deployment in a controlled environment: an 8-node OpenShift cluster (5 workers, 3 masters) on Fyre, IBM's internal cloud platform.
| **Cost** | **~25% fewer messages per second per partition** |
71
71
72
72
---
73
73
@@ -79,7 +79,7 @@ Numbers without guidance aren't very useful, so here's how to translate these re
79
79
80
80
**With record encryption:**
81
81
82
-
- **Throughput**: use `proxy CPU (millicores) = 20 × produce throughput (MB/s)` as a planning formula, then add ×1.3 headroom. Assumes matched consumer load and AMD EPYC-Rome 2 GHz with AES-NI; calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB = 100 MB/s produce → 2000m + headroom → ~2600m.
82
+
- **Throughput**: use `CPU (mc) = 10 × total proxy throughput (MB/s)`, where total = produce MB/s + each consumer group's consume MB/s. For 1:1 produce:consume this simplifies to `20 × produce MB/s`. Add ×1.3 headroom. Measured on AMD EPYC-Rome 2 GHz with AES-NI; calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB, 1 consumer group = 200 MB/s total → 2000m + headroom → ~2600m.
83
83
- **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate
84
84
- **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget, and therefore the throughput ceiling, deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy.
85
85
- **KMS**: DEK caching means the KMS is not on the hot path. In testing, each benchmark run triggered only 5–19 DEK generation calls, so the KMS is not a bottleneck