Commit 174ca59

Update benchmarking posts with validated coefficient and corrected framing
- Shift publication dates to May 21 and May 28
- Replace speculative per-connection ceiling explanation with empirical finding: encryption throughput ceiling scales linearly with CPU budget (validated at 1000m, 2000m, 4000m)
- Add sizing formula: CPU (mc) = 20 × produce_MB_per_s, with worked example
- Add RF=3 masking caveat: initial 1-topic sweeps conflated Kafka replication ceiling with proxy CPU ceiling; coefficient derived from RF=1 multi-topic workloads
- Post 2: add full investigation narrative — workload isolation approach, coefficient derivation, 4-core confirmation, and 2-core prediction/validation
- Drop stale "future work" items that are now complete

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
1 parent 0fac222 commit 174ca59

3 files changed

Lines changed: 83 additions & 41 deletions

_posts/2026-05-01-benchmarking-the-proxy.md renamed to _posts/2026-05-21-benchmarking-the-proxy.md

Lines changed: 20 additions & 19 deletions
@@ -1,7 +1,7 @@
11
---
22
layout: post
33
title: "Does my proxy look big in this cluster?"
4-
date: 2026-05-01 00:00:00 +0000
4+
date: 2026-05-21 00:00:00 +0000
55
author: "Sam Barker"
66
author_url: "https://github.com/SamBarker"
77
categories: benchmarking performance
@@ -21,11 +21,11 @@ We ran three scenarios against the same Apache Kafka® cluster on the same hardw
2121
- **Passthrough proxy** — traffic routed through Kroxylicious with no filter chain configured
2222
- **Record encryption** — traffic through Kroxylicious with AES-256-GCM record encryption enabled, using HashiCorp Vault as the KMS
2323

24-
We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-08-benchmarking-the-proxy-under-the-hood %}).
24+
We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}).
2525

2626
## Test environment
2727

28-
All results were collected on a 6-node OpenShift cluster on Fyre, IBM's internal cloud environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit — one core.
28+
All results were collected on a 6-node OpenShift cluster on Fyre, IBM's internal cloud environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
2929

3030
| Component | Details |
3131
|-----------|---------|
@@ -107,20 +107,15 @@ We started at 34k (right where the latency table started getting interesting) an
107107

108108
The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
109109

110-
### The thing that surprised us: per-connection, not per-pod
110+
### The ceiling scales with CPU budget
111111

112112
The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
113113

114-
Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? The answer changes how you scale.
114+
Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
115115

116-
We ran the same test with 4 producers sharing the same single partition. With 4 connections the proxy sustained well past the single-producer ceiling; the Netty event loop queues stayed empty throughout, confirming the proxy had capacity to spare. The reason is how Netty works: each client connection gets its own event loop thread, and encryption happens synchronously on that thread. One producer connection saturates at ~37k msg/sec, but a second producer on a different connection gets its own thread and its own headroom. The proxy's aggregate capacity compounds with each connection.
116+
Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget: 1000m at ~40k msg/sec, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
117117

118-
<!-- TODO: replace with actual per-pod ceiling once connection sweep is complete.
119-
The 4-producer sweep ran to 58k msg/sec with event loops still idle — we stopped
120-
the sweep there, not because the proxy gave out. Need to run connection-sweep.sh
121-
(see CONNECTION-SWEEP-PLAN.md) to find the real per-pod limit. -->
122-
123-
**The practical implication**: if you're hitting the encryption ceiling, add producers before adding proxy pods. We haven't yet measured exactly where the per-pod ceiling sits — that's the next experiment — but the single-connection limit of ~37k is not the whole story.
118+
**The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.
124119

125120
---
126121

@@ -130,24 +125,30 @@ We ran the same test with 4 producers sharing the same single partition. With 4
130125

131126
**With record encryption:**
132127

133-
1. **Throughput budget**: encryption imposes a per-connection throughput ceiling driven by the CPU cost of AES-256-GCM on your hardware. On ours (AMD EPYC-Rome, 2GHz) that ceiling was about 26% lower than Kafka alone could sustain per producer connection — run the rate sweep on your own infrastructure to find yours.
128+
1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula:
129+
130+
> **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`**
131+
132+
Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep.
133+
134+
Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).
134135

135136
2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.
136137

137-
3. **Scaling**: the bottleneck is per-connection CPU (crypto, buffer management, and network I/O combined). Spread load across more producer connections first; then scale proxy pods horizontally.
138+
3. **Scaling**: set `requests` equal to `limits` in your pod spec — this makes the CPU budget deterministic, which makes the throughput ceiling predictable. To increase throughput, raise the CPU limit. For redundancy, add proxy pods.
138139

139140
4. **KMS overhead**: DEK caching means Vault isn't on the hot path for every record. Our tests triggered only 5–19 DEK generation calls per benchmark run. The KMS is not the thing to worry about.
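
The planning formula in item 1 can be sketched as a small Python helper. This is only an illustration of the arithmetic: the function and constant names are hypothetical and not part of Kroxylicious or the benchmark suite, and the coefficient applies to the measured conditions (1 KB messages, matched 1:1 produce:consume load, AMD EPYC-Rome 2 GHz with AES-NI).

```python
# Coefficient from the post: 20 millicores per MB/s of produce throughput.
MILLICORES_PER_MB_S = 20
# Suggested headroom multiplier for GC pauses and bursts.
HEADROOM = 1.3


def proxy_cpu_millicores(msgs_per_sec: float, msg_size_kb: float = 1.0,
                         headroom: float = HEADROOM) -> int:
    """Estimate a proxy CPU limit in millicores for a given produce rate."""
    # The post treats 1,000 messages of 1 KB as 1 MB.
    produce_mb_per_s = msgs_per_sec * msg_size_kb / 1000
    return round(produce_mb_per_s * MILLICORES_PER_MB_S * headroom)


# Worked example from the post: 100k msg/s at 1 KB.
print(proxy_cpu_millicores(100_000, headroom=1.0))  # before headroom
print(proxy_cpu_millicores(100_000))                # with x1.3 headroom
```

The two calls reproduce the 2000m and ~2600m figures from the worked example. For a different message size or hardware, the coefficient itself needs recalibrating with the rate sweep rather than just changing `msg_size_kb`.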
140141

141142
---
142143

143144
## Caveats and next steps
144145

145-
These results come from a single proxy pod, a single partition, and single-pass measurements at each rate point. We know what the gaps are:
146+
These results come from a single proxy pod and single-pass measurements at each rate point. A few things to keep in mind:
146147

147-
- **Connection sweep**: we saw 1 and 4 producers — we haven't yet swept 2, 8, 16 to characterise the full per-pod ceiling
148-
- **Horizontal scaling**: we expect more proxy pods to scale linearly, but haven't measured it yet
149-
- **Larger message sizes**: encryption overhead is almost certainly smaller in percentage terms for larger messages
148+
- **Message size**: all results use 1 KB messages. The coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages.
149+
- **Replication factor**: the 1-topic rate sweep ran at RF=3. At that replication factor, Kafka's ISR replication traffic creates a per-partition ceiling that sits close to where proxy CPU also saturates — the two limits are entangled in those results. The sizing coefficient was derived from RF=1 multi-topic workloads specifically to isolate proxy CPU. The [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) has that detail.
150+
- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod horizontal scaling hasn't been measured but is expected to follow the same coefficient.
150151

151-
For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-08-benchmarking-the-proxy-under-the-hood %}).
152+
For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}).
152153

153154
The full benchmark suite, quickstart guide, and sizing reference are in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious).

_posts/2026-05-08-benchmarking-the-proxy-under-the-hood.md renamed to _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md

Lines changed: 58 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,13 @@
11
---
22
layout: post
33
title: "Benchmarking a Kafka proxy: the engineering story"
4-
date: 2026-05-08 00:00:00 +0000
4+
date: 2026-05-28 00:00:00 +0000
55
author: "Sam Barker"
66
author_url: "https://github.com/SamBarker"
77
categories: benchmarking performance engineering
88
---
99

10-
The [first post]({% post_url 2026-05-01-benchmarking-the-proxy %}) covered what we measured and what the numbers mean for operators. This one is for the people who want to know how we measured it, what the flamegraphs actually show, and what we found when we started looking carefully at our own tooling.
10+
The [first post]({% post_url 2026-05-21-benchmarking-the-proxy %}) covered what we measured and what the numbers mean for operators. This one is for the people who want to know how we measured it, what the flamegraphs actually show, and what we found when we started looking carefully at our own tooling.
1111

1212
## Why not Kafka's own tools?
1313

@@ -141,15 +141,62 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti
141141

142142
If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead.
143143

144-
## The per-connection ceiling
144+
## Following the ceiling
145145

146-
The single-producer encryption ceiling at ~37k msg/sec raised the question of whether that was a per-pod limit or a per-connection limit.
146+
### A problem with the workload
147147

148-
The answer came from a 4-producer rate sweep. Four producers sharing the same partition drove 47k+ msg/sec aggregate through the proxy while proxy CPU held at 570m/1000m — well below pod saturation. The Kafka partition became the bottleneck first.
148+
The single-producer rate sweep hit a ceiling at ~37k msg/sec. Before drawing conclusions, we had to ask whether that was actually a proxy CPU ceiling — or something else.
149149

150-
The explanation: Netty assigns each client connection to its own event loop thread. Encryption happens synchronously on that thread. A single connection is bounded by one event loop's throughput, but additional connections get their own threads. The proxy's aggregate capacity is the sum of its event loop threads' individual capacities — until something else (the Kafka partition, the NIC, pod CPU) saturates first.
150+
Our initial sweeps ran with replication factor 3, the standard production default. At RF=3, every message the Kafka leader receives goes out to 2 follower replicas. With 1 KB messages at 37k msg/sec, that's ~37 MB/s inbound to the leader and ~111 MB/s outbound once the two follower replicas and one consumer are counted. The Fyre cluster nodes had 10 GbE NICs, so the NIC wasn't the ceiling. But RF=3 does create a real per-partition I/O ceiling on the Kafka leader, and it sits right around where we were measuring.
151151

152-
Worth noting: with replication factor 3, every message the Kafka leader receives goes out to 2 follower replicas plus potentially one consumer. At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone — confirming why the Fyre cluster nodes need 10 Gbps NICs.
152+
The fix: RF=1, 10-topic workload. Dropping to RF=1 removes replication overhead; spreading across 10 partitions distributes load so no single partition hits its ceiling. We validated the fix with the passthrough proxy scenario: at 160k msg/sec total (16k per topic), proxy-no-filters matched baseline — Kafka was not the bottleneck. The sweep scaled to 640k msg/sec before hitting some uninvestigated ceiling well above where encryption constrains anything.
153+
154+
### Is the encryption ceiling per-pod or per-connection?
155+
156+
With a clean workload that isolates proxy CPU, we re-examined the ~37k figure. Running the same workload with 4 producers: proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. So the single-producer ceiling is not the pod ceiling.
157+
158+
### The coefficient
159+
160+
With the workload isolation in place, we swept encryption across CPU allocations. The throughput ceiling scaled linearly:
161+
162+
| CPU limit | Encryption ceiling |
163+
|-----------|-------------------|
164+
| 1000m | ~40k msg/sec |
165+
| 2000m | ~80k msg/sec |
166+
| 4000m | ~160k msg/sec |
167+
168+
From the 4-core sweep: safe at 160k msg/sec (p99: 447 ms), catastrophic at 320k msg/sec (p99: 537,000 ms). The saturation point is predictably between those two steps.
169+
170+
Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages —
171+
172+
```
173+
160k msg/s × 1 KB = 160 MB/s produce throughput
174+
With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt
175+
→ 4000 mc / 320 MB/s bidirectional ≈ 12–13 mc per MB/s bidirectional
176+
→ equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce
177+
```
178+
179+
We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because of fixed per-connection overhead that's amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput (= 10 bidirectional × 2 for produce+consume), which sits between mid-utilisation and saturation and provides inherent conservatism.
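
As a cross-check of the derivation above, the same arithmetic can be applied to all three rows of the CPU sweep table. This is a minimal sketch with a hypothetical helper name; it assumes 1 KB messages and treats 1,000 KB as 1 MB, as the derivation does.

```python
def mc_per_mb_s(cpu_millicores: float, msgs_per_sec: float,
                msg_size_kb: float = 1.0) -> tuple[float, float]:
    """Return (mc per MB/s produce, mc per MB/s bidirectional)."""
    produce_mb_s = msgs_per_sec * msg_size_kb / 1000
    # Matched consumer load: every produced byte is also decrypted once.
    bidirectional_mb_s = 2 * produce_mb_s
    return cpu_millicores / produce_mb_s, cpu_millicores / bidirectional_mb_s


# (CPU limit in millicores, approximate ceiling in msg/sec) from the table.
for cpu, ceiling in [(1000, 40_000), (2000, 80_000), (4000, 160_000)]:
    per_produce, per_bidirectional = mc_per_mb_s(cpu, ceiling)
    print(cpu, per_produce, per_bidirectional)
```

Each row yields the same ratio at the ceiling (25 mc per MB/s produce, 12.5 bidirectional), which is what linear scaling implies. Because the ceilings in the table are approximate, the ratios are approximate too.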
180+
181+
One thing we observed: the proxy had 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The detailed relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit, and the formula holds.
182+
183+
### The prediction
184+
185+
Rather than just reporting the 4-core result, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly, a 2-core pod should saturate at ~80k msg/sec.
186+
187+
The 2-core sweep:
188+
189+
| Rate | p99 | Verdict |
190+
|------|-----|---------|
191+
| 40k msg/sec | 626 ms | Comfortable |
192+
| 80k msg/sec | 1,660 ms | Elevated — right at predicted ceiling |
193+
| 160k msg/sec | 175,277 ms | Catastrophic |
194+
195+
The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
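
The prediction step can be written down as a simple linear extrapolation from the 1-core ceiling. This sketch uses hypothetical names, takes ~40k msg/sec at 1000m as its only input, and does not model latency; it only places each sweep rate relative to the predicted ceiling.

```python
ONE_CORE_CEILING_MSG_S = 40_000  # ~40k msg/sec measured at a 1000m limit


def predicted_ceiling(cpu_millicores: int) -> int:
    """Linear extrapolation of the encryption ceiling from the 1-core result."""
    return ONE_CORE_CEILING_MSG_S * cpu_millicores // 1000


def position(rate_msg_s: int, cpu_millicores: int) -> str:
    """Place a sweep rate relative to the predicted ceiling."""
    ceiling = predicted_ceiling(cpu_millicores)
    if rate_msg_s < ceiling:
        return "below predicted ceiling"
    if rate_msg_s == ceiling:
        return "at predicted ceiling"
    return "above predicted ceiling"


# The three rates from the 2-core sweep.
for rate in (40_000, 80_000, 160_000):
    print(rate, position(rate, 2000))
```

The three positions line up with the Comfortable, Elevated, and Catastrophic verdicts in the 2-core table.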
196+
197+
Setting `requests` equal to `limits` makes this predictability practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient.
198+
199+
Worth noting: with RF=3 in production, every message the Kafka leader receives goes out to 2 follower replicas plus potentially one consumer. At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone, which confirms why the Fyre cluster nodes need 10 GbE NICs and why the replication ceiling matters for the benchmarking workload design.
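
The ~1.2 Gbps figure is consistent with counting three outbound copies of each message: two follower replicas plus one consumer. The sketch below shows that arithmetic; the function name and the default of one consumer are assumptions for illustration, and protocol overhead is ignored.

```python
def leader_egress_gbps(msgs_per_sec: float, msg_size_kb: float = 1.0,
                       follower_replicas: int = 2, consumers: int = 1) -> float:
    """Approximate outbound bandwidth from a partition leader in Gbps."""
    inbound_mb_s = msgs_per_sec * msg_size_kb / 1000  # 1,000 KB treated as 1 MB
    outbound_mb_s = inbound_mb_s * (follower_replicas + consumers)
    return outbound_mb_s * 8 / 1000  # MB/s -> Gbps


print(leader_egress_gbps(50_000))               # followers plus one consumer
print(leader_egress_gbps(50_000, consumers=0))  # replication traffic only
```

With no consumer the same rate gives about 0.8 Gbps, so the consumer copy accounts for the difference between replication-only traffic and the quoted total.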
153200

154201
## Bugs we found in our own tooling
155202

@@ -189,16 +236,10 @@ jbang src/main/java/io/kroxylicious/benchmarks/results/ResultComparator.java \
189236

190237
## What's still open
191238

192-
The gaps we know about and plan to fill:
193-
194-
1. **Connection sweep**: run 1, 2, 4, 8, 16 producers simultaneously at a fixed per-producer rate to characterise the per-pod aggregate ceiling with encryption. The plan is in `CONNECTION-SWEEP-PLAN.md`.
195-
196-
2. **Horizontal scaling**: verify that adding proxy pods scales aggregate throughput linearly.
197-
198-
3. **Multi-partition workloads**: isolate encryption cost without being bounded by Kafka's per-partition ceiling.
199-
200-
4. **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds, particularly in the saturation transition zone.
239+
The coefficient is validated at 1, 2, and 4 cores for 1 KB messages. Known gaps:
201240

202-
5. **Message size variation**: larger messages should show lower encryption overhead as a percentage; smaller messages may show higher overhead. 1 KB is a reasonable middle ground but not the whole picture.
241+
- **Message size variation**: larger messages should show lower overhead as a percentage; smaller messages may show higher. 1 KB is a reasonable middle ground but not the whole picture.
242+
- **Horizontal scaling**: multiple proxy pods haven't been measured; linear scaling is expected but not confirmed.
243+
- **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds in the saturation transition zone.
203244

204245
The operator-facing sizing reference and all the key tables are in `SIZING-GUIDE.md` in the benchmarks directory.
