You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Polish voice, fix typos, and mark stale flamegraph references
- Fix punctuation on OMB methodology comparability sentence
- Fix repeated "We leaned towards repeatable" in workload design section
- Fix tense: "will make" -> "makes" for workload design aside
- Fix typo: "died in the wool" -> "dyed in the wool"
- Add closing paragraph to flamegraph section: proxy wins are real but
we aren't going to make AES faster
- Replace stale 36k msg/s flamegraph references with FIXME pending
new profiler runs
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Copy file name to clipboardExpand all lines: _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md
+12-10Lines changed: 12 additions & 10 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -26,7 +26,7 @@ And critically, it's never heard of Kroxylicious... You have though, you're here
26
26
27
27
[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency tracking seriously — correcting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like?
28
28
29
-
Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable of course it's not the same hardware, network conditions or phase of the moon.
29
+
Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable, of course — it's not the same hardware, network conditions or phase of the moon.
30
30
31
31
## What we built on top of OMB
32
32
@@ -73,7 +73,7 @@ If you have your own KMS — and you will run this on your own infrastructure, r
73
73
74
74
### JSON always comes in megabytes
75
75
76
-
Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a died in the wool java dev, sue me)* pull out the signal:
76
+
Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a dyed in the wool java dev, sue me)* pull out the signal:
77
77
78
78
-**`RunMetadata`**: captures the run context — git commit, timestamp, cluster node specs (architecture, CPU, RAM), and on OpenShift, NIC speed read from the host via the MachineConfigDaemon pod. Generates `run-metadata.json` alongside each result so you can always tell what conditions produced a number. This is what makes run-to-run comparisons meaningful — and when a run takes 12 hours, trust me, you don't want to re-run it without good reason.
79
79
-**`ResultComparator`**: answers "did this change hurt?" — reads two scenario result directories and produces a markdown comparison table. Baseline vs encryption is the obvious use, but the tool is generic. Already running a proxy? proxy-no-filters vs encryption tells you the cost of the filter itself, not the proxy hop. Building your own filter? That's your comparison — measure the chain with and without it.
@@ -85,11 +85,11 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne
85
85
86
86
Benchmarks are artificial constructs. Your traffic patterns are never stable — message sizes vary, topic counts grow, producers burst — so there's always a tension between numbers that are *representative* and numbers that are actually *repeatable*. We leaned towards repeatable.
87
87
88
-
The primary workload will make Kafka experts wince *(I had to squirm to type it)* — **1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal.
88
+
The primary workload makes Kafka experts wince *(I had to squirm to type it)* — **1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal.
89
89
90
90
But Kafka is often described as a distributed append-only log, and we can't ignore the word "distributed" when it comes to latency. With RF=1, the proxy doubles the sequential hops in the critical path: one becomes two. That's not wrong, but it's not a fair picture either — nobody runs RF=1 in production. With RF=3, the leader waits for ISR acknowledgements before confirming the produce, so there's already replication latency in the critical path. The proxy adds a real, sequential hop — we're not trying to bury that — but it lands alongside a cost that's already there. One extra hop on top of a multi-hop round trip is a different picture from doubling a single-hop one. Three brokers, hot partition replicated across all of them.
91
91
92
-
We leaned towards repeatable — but we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds.
92
+
But we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds.
93
93
94
94
95
95
That covers the first dimension — the proxy's latency tax at normal load. For the second, throughput, the question is: how much does routing through the proxy reduce your maximum sustainable rate? That needs a different approach. We used rate sweeps: hold the connection count fixed, step the rate up incrementally, and watch what happens. Below the ceiling, achieved throughput tracks the target — the system keeps up. Above it, it can't, and falls behind. The point where achieved throughput diverges from the target rate — where we defined that as dropping below 95% — is the saturation point. That's the knee of the curve, and that's what we were hunting.
@@ -191,9 +191,9 @@ The flamegraphs below are fully interactive: hover over a frame to see its name
@@ -212,15 +212,15 @@ Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys
212
212
213
213
The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it.
@@ -248,6 +248,8 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti
248
248
249
249
If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead.
250
250
251
+
There are wins inside the proxy we haven't chased yet — serialisation and deserialisation we could avoid, buffer copies imposed by how memory records are structured. Some would be straightforward; others would require rethinking how Kafka records are modelled in memory. We haven't gone after them. But to put it plainly: we can optimise all we like inside the proxy, and we're still not going to make AES faster.
252
+
251
253
## Bugs we found in our own tooling
252
254
253
255
During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs.
0 commit comments