Humanise flamegraph section and add OSS data sharing to run it yourself

- Rewrites flamegraph intro with personal motivation: hot path minimalism,
  Amdahl's law framing, and honest admission that the full sweep story
  didn't come together
- Adds forward reference to bugs section to stitch the structure together
- Moves OSS transparency point into "Run it yourself" where it naturally
  belongs, with a TODO placeholder for the raw data link
- Drops duplicate "we share our workings" phrase from flamegraph prose

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md: 8 additions & 2 deletions
@@ -177,7 +177,11 @@ Setting `requests` equal to `limits` makes this practical: a pod that can burst
 
 ## The flamegraph: where the CPU actually goes
 
-We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
+I care deeply that the proxy does as little work as possible on the hot path. Optimization is often less about swapping algorithms — if you only ever have five items, who cares how you sort them — and more about realising what work not to do, or finding a better time to do it. [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) governs this: the maximum speedup you can get from optimizing a component is bounded by how much of total execution time that component actually owns. If the proxy accounts for 2% of CPU, you can't optimize your way to a 10% win — not there.
+
+That framing is exactly why flamegraphs matter to me. Not as a debugging tool, but as a way of seeing the shape of the work. I was also hoping to tell a fuller story here — profiles across the full rate sweep, watching the mix shift as the proxy approaches saturation. Getting stable, reproducible numbers turned out to be harder than expected, and the bugs described in the next section cost us more runs than I'd like. So these are two snapshots at a single rate, not the sweep-correlated picture I had in mind. Still enough to see where the CPU goes. I hope to revisit this properly in the future — but right now the proxy's performance is good enough that I'm focused on functionality, and the benchmarking harness itself still has room to mature.
+
+We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
 
 The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth.
 
@@ -258,7 +262,9 @@ Spotting these required noticing that two different probe flamegraphs were pixel
 
 ## Run it yourself
 
-Everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings.
+We're an open source project — we share our workings. The raw OMB result JSON, JFR recordings, and flamegraph files that back this post are available [TODO: link to raw data]. If you want to verify the numbers, reproduce the analysis, or compare against your own runs, everything you need is there.
+
+If you want to run it against your own cluster, everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings.
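
Reviewer note on the Amdahl's law claim in the flamegraph hunk: the "2% of CPU cannot yield a 10% win" statement can be checked with a few lines of arithmetic. The sketch below is not part of the commit or the benchmark harness. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and the 2% figure is the illustrative number from the added prose, not a measured result.

```python
import math


def amdahl_speedup(fraction: float, component_speedup: float) -> float:
    """Overall speedup when `fraction` of total time runs `component_speedup` times faster.

    Standard Amdahl's law: 1 / ((1 - p) + p / s).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def amdahl_limit(fraction: float) -> float:
    """Upper bound on overall speedup if the component's cost dropped to zero."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    proxy_share = 0.02  # illustrative share of CPU, taken from the post's example
    print(f"proxy made 2x faster: {amdahl_speedup(proxy_share, 2.0):.4f}x overall")
    print(f"proxy made free:      {amdahl_limit(proxy_share):.4f}x overall")
```

Under these assumptions, even removing the component's cost entirely caps the overall gain at roughly 2%, which is consistent with the wording in the added paragraph.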