Commit 68774af

Polish voice, fix typos, and mark stale flamegraph references

- Fix punctuation on OMB methodology comparability sentence
- Fix repeated "We leaned towards repeatable" in workload design section
- Fix tense: "will make" -> "makes" for workload design aside
- Fix typo: "died in the wool" -> "dyed in the wool"
- Add closing paragraph to flamegraph section: proxy wins are real but we aren't going to make AES faster
- Replace stale 36k msg/s flamegraph references with FIXME pending new profiler runs

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

1 parent b0a0a90 commit 68774af

1 file changed

Lines changed: 12 additions & 10 deletions

_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md

@@ -26,7 +26,7 @@ And critically, it's never heard of Kroxylicious... You have though, you're here

 [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons - so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes latency tracking seriously — correcting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like?

-Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable of course it's not the same hardware, network conditions or phase of the moon.
+Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers aren't comparable, of course it's not the same hardware, network conditions or phase of the moon.

 ## What we built on top of OMB

@@ -73,7 +73,7 @@ If you have your own KMS — and you will run this on your own infrastructure, r

 ### JSON always comes in megabytes

-Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a died in the wool java dev, sue me)* pull out the signal:
+Each benchmark run produces a blob of structured JSON. Useful in principle; a wall of noise in practice. Three [JBang](https://www.jbang.dev/)-runnable Java programs *(I'm a dyed in the wool java dev, sue me)* pull out the signal:

 - **`RunMetadata`**: captures the run context — git commit, timestamp, cluster node specs (architecture, CPU, RAM), and on OpenShift, NIC speed read from the host via the MachineConfigDaemon pod. Generates `run-metadata.json` alongside each result so you can always tell what conditions produced a number. This is what makes run-to-run comparisons meaningful — and when a run takes 12 hours, trust me, you don't want to re-run it without good reason.
 - **`ResultComparator`**: answers "did this change hurt?" — reads two scenario result directories and produces a markdown comparison table. Baseline vs encryption is the obvious use, but the tool is generic. Already running a proxy? proxy-no-filters vs encryption tells you the cost of the filter itself, not the proxy hop. Building your own filter? That's your comparison — measure the chain with and without it.
@@ -85,11 +85,11 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne

 Benchmarks are artificial constructs. Your traffic patterns are never stable — message sizes vary, topic counts grow, producers burst — so there's always a tension between numbers that are *representative* and numbers that are actually *repeatable*. We leaned towards repeatable.

-The primary workload will make Kafka experts wince *(I had to squirm to type it)***1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal.
+The primary workload makes Kafka experts wince *(I had to squirm to type it)***1 topic, 1 partition, 1 KB messages**. Concentrating everything onto a single TopicPartition means we hit the limits earlier, at lower absolute volumes, which makes the proxy's contribution easier to isolate. Isolating the proxy is, after all, the goal.

 But Kafka is often described as a distributed append-only log, and we can't ignore the word "distributed" when it comes to latency. With RF=1, the proxy doubles the sequential hops in the critical path: one becomes two. That's not wrong, but it's not a fair picture either — nobody runs RF=1 in production. With RF=3, the leader waits for ISR acknowledgements before confirming the produce, so there's already replication latency in the critical path. The proxy adds a real, sequential hop — we're not trying to bury that — but it lands alongside a cost that's already there. One extra hop on top of a multi-hop round trip is a different picture from doubling a single-hop one. Three brokers, hot partition replicated across all of them.

-We leaned towards repeatable — but we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds.
+But we didn't abandon representative entirely. The multi-topic runs (10 and 100 topics) are the reconnection point: load spread across more topics, closer to what production actually looks like, at rates well below any saturation point. You're measuring the proxy's baseline tax — the cost you always pay, not just the cost when you're pushing hard. It holds.


 That covers the first dimension — the proxy's latency tax at normal load. For the second, throughput, the question is: how much does routing through the proxy reduce your maximum sustainable rate? That needs a different approach. We used rate sweeps: hold the connection count fixed, step the rate up incrementally, and watch what happens. Below the ceiling, achieved throughput tracks the target — the system keeps up. Above it, it can't, and falls behind. The point where achieved throughput diverges from the target rate — where we defined that as dropping below 95% — is the saturation point. That's the knee of the curve, and that's what we were hunting.
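
For readers who want the 95% rule spelled out, here is a minimal sketch of how a saturation point could be picked out of a rate sweep. It is not the code from the benchmark tooling: the `SaturationSketch` class, the `SweepStep` record and the sample numbers are all hypothetical, and it assumes the steps arrive sorted by ascending target rate.

```java
import java.util.List;
import java.util.OptionalDouble;

public class SaturationSketch {

    /** One step of a rate sweep (hypothetical shape): requested rate vs measured rate, in msg/s. */
    public record SweepStep(double targetRate, double achievedRate) {}

    /** Achieved throughput below this fraction of the target counts as "falling behind". */
    static final double THRESHOLD = 0.95;

    /**
     * Returns the first target rate at which achieved throughput drops below
     * 95% of the target, or empty if the sweep never saturates.
     * Assumes steps are ordered by ascending target rate.
     */
    public static OptionalDouble saturationPoint(List<SweepStep> steps) {
        for (SweepStep step : steps) {
            if (step.achievedRate() < THRESHOLD * step.targetRate()) {
                return OptionalDouble.of(step.targetRate());
            }
        }
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not measured results.
        List<SweepStep> sweep = List.of(
                new SweepStep(10_000, 9_980),
                new SweepStep(20_000, 19_950),
                new SweepStep(30_000, 29_100),
                new SweepStep(40_000, 33_000));
        System.out.println("Saturation point: " + saturationPoint(sweep));
    }
}
```

Returning an empty result when no step crosses the threshold is a deliberate choice in this sketch: a sweep that never saturates means the ceiling sits above the highest rate tried, which is worth distinguishing from a measured knee.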
@@ -191,9 +191,9 @@ The flamegraphs below are fully interactive: hover over a frame to see its name
 <iframe src="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html' | relative_url }}"
 width="100%" height="600"
 style="border: 1px solid #ddd; border-radius: 4px;"
-title="CPU flamegraph: no-filter proxy at 36,000 msg/s">
+title="CPU flamegraph: no-filter proxy at FIXME msg/s">
 </iframe>
-<figcaption>CPU flamegraph — passthrough proxy (no filters), 36,000 msg/s, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
+<figcaption>CPU flamegraph — passthrough proxy (no filters), FIXME msg/s, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
 </figure>

 | Category | CPU share |
@@ -212,15 +212,15 @@ Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys

 The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it.

-### Encryption proxy (same 36,000 msg/s rate)
+### Encryption proxy (same FIXME msg/s rate)

 <figure>
-<iframe src="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html' | relative_url }}"
+<iframe src="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-FIXME.html' | relative_url }}"
 width="100%" height="600"
 style="border: 1px solid #ddd; border-radius: 4px;"
-title="CPU flamegraph: encryption proxy at 36,000 msg/s">
+title="CPU flamegraph: encryption proxy at FIXME msg/s">
 </iframe>
-<figcaption>CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/s, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
+<figcaption>CPU flamegraph — encryption proxy (AES-256-GCM), FIXME msg/s, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-FIXME.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
 </figure>

 | Category | No-filters | Encryption | Delta |
@@ -248,6 +248,8 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti

 If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead.

+There are wins inside the proxy we haven't chased yet — serialisation and deserialisation we could avoid, buffer copies imposed by how memory records are structured. Some would be straightforward; others would require rethinking how Kafka records are modelled in memory. We haven't gone after them. But to put it plainly: we can optimise all we like inside the proxy, and we're still not going to make AES faster.
+
 ## Bugs we found in our own tooling

 During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs.
