Full investigation arc: spare CPU shock → NIC elimination → 4-producer
test → anti-affinity attempt (3 nodes, 3 brokers, nowhere to go) →
new cluster → baseline shock → RTT math reveals co-location → second
penny drops on OMB scheduling → RF=1 unlocks proxy CPU ceiling →
coefficient → prediction.
Corrects several issues in the prior draft: Netty theory discarded
(proxy metrics showed minimal back pressure); co-location framed at
pod/node level not VM level; 37k flagged as the only figure from the
original cluster; all coefficient and sweep numbers confirmed as coming
from the new distributed cluster.
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md
84 additions & 63 deletions
@@ -94,6 +94,90 @@ We leaned towards repeatable — but we didn't abandon representative entirely.
94
94
95
95
That covers the first dimension: the proxy's latency tax at normal load. For the second, throughput, the question is: how much does routing through the proxy reduce your maximum sustainable rate? That needs a different approach. We used rate sweeps: hold the connection count fixed, step the rate up incrementally, and watch what happens. Below the ceiling, achieved throughput tracks the target and the system keeps up. Above it, the system can't keep up and falls behind. We defined the saturation point as the rate at which achieved throughput drops below 95% of the target rate. That's the knee of the curve, and that's what we were hunting.
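A minimal sketch of that saturation rule, assuming each sweep step is recorded as a (target rate, achieved rate) pair. The function name and the sample numbers are hypothetical illustrations, not output from our tooling:

```python
from typing import Optional, Sequence, Tuple


def find_saturation_point(
    sweep: Sequence[Tuple[float, float]], threshold: float = 0.95
) -> Optional[float]:
    """Return the first target rate whose achieved rate falls below
    threshold * target, or None if the sweep never saturates."""
    for target, achieved in sweep:
        if achieved < threshold * target:
            return target
    return None


# Hypothetical sweep: (target msg/s, achieved msg/s), in ascending target order.
example_sweep = [
    (10_000, 10_000),
    (20_000, 19_900),
    (30_000, 29_500),
    (40_000, 36_800),  # 92% of target: below the 95% line
    (50_000, 37_100),
]

knee = find_saturation_point(example_sweep)
```

The sketch reports the first step that misses the threshold, so the resolution of the knee is only as fine as the step size of the sweep.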
96
96
97
+
## False summit
98
+
99
+
The rate-sweep result was in: the encryption scenario hit a ceiling on our original cluster at around 37k msg/s. Summit reached.
100
+
101
+
Except — the proxy had spare CPU cycles. Not a little: meaningful headroom. If the proxy isn't CPU-saturated, whatever we hit isn't the proxy's ceiling.
102
+
103
+
**Was it the NIC?** At 37k msg/s and 1 KB messages, produce traffic alone is ~37 MB/s. Add RF=3 replication: the leader ships two copies outbound, ~74 MB/s more. That is ~111 MB/s in total, or roughly 0.9 Gbit/s. On gigabit NICs that would sit right at line rate, and the network would be the prime suspect. The Fyre nodes have 10 GbE NICs, so it is under a tenth of capacity. Network eliminated.
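The bandwidth arithmetic can be checked in a few lines. This is a rough sketch that treats 1 KB as 1,000 bytes, ignores protocol overhead, and adds inbound produce traffic to outbound replication traffic as a pessimistic single figure. The function names are illustrative only:

```python
def leader_traffic_mb_per_s(msg_per_s: float, msg_bytes: int = 1000,
                            replication_factor: int = 3):
    """Return (produce, replication, total) traffic at the leader in MB/s."""
    produce = msg_per_s * msg_bytes / 1e6
    # The leader ships one copy to each follower.
    replication = produce * (replication_factor - 1)
    return produce, replication, produce + replication


def link_utilisation(mb_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of nominal link rate used (1 MB/s = 8 Mbit/s)."""
    return (mb_per_s * 8 / 1000) / link_gbit_per_s


produce, replication, total = leader_traffic_mb_per_s(37_000)
print(f"produce={produce:.0f} MB/s replication={replication:.0f} MB/s total={total:.0f} MB/s")
print(f"10 GbE utilisation: {link_utilisation(total, 10):.1%}")
print(f" 1 GbE utilisation: {link_utilisation(total, 1):.1%}")
```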
104
+
105
+
**Was it the proxy pod, or just one connection?** The rate sweep runs with a single producer. We ran four at the same per-producer rate. Aggregate throughput climbed higher than one producer alone could push — the pod had headroom the single connection wasn't using. We checked proxy metrics: back pressure was minimal. The proxy wasn't the constraint. Whatever was limiting one connection, it wasn't us.
106
+
107
+
### We tried anti-affinity
108
+
109
+
Then a curveball: could it be node saturation? The original cluster had three worker nodes — and three Kafka brokers. Strimzi, being sensible, spreads brokers evenly: one per node. If the proxy had landed on the same node as a busy broker, that node could be the bottleneck rather than the proxy pod itself.
110
+
111
+
We added a hard anti-affinity rule to keep the proxy off broker nodes. It wouldn't schedule.
112
+
113
+
The penny drops: three worker nodes, three brokers, one per node — there is nowhere for the proxy to go that isn't already co-located with a broker. Obvious in hindsight. We needed a bigger cluster.
114
+
115
+
We provisioned one: five workers, three masters, 16 vCPU per node.
116
+
117
+
### The baseline shock
118
+
119
+
Baseline first. Direct Kafka, no proxy.
120
+
121
+
~17,000 msg/s. The original cluster had been sustaining ~50,000.
122
+
123
+
The proxy wasn't in the picture. We checked the obvious suspects: disk I/O — fine, local and unsaturated. OMB worker scaling — correct. Broker CPU: ~1.2 vCPU. Nothing was at a limit.
124
+
125
+
The answer was in the pipeline arithmetic. A Kafka producer has a maximum number of in-flight requests: batches sent but not yet acknowledged. With real round-trip times between nodes, that in-flight window bounds throughput, because the producer can complete at most (in-flight requests ÷ round-trip time) batches per second. We measured 0.87 ms between worker nodes; with three replication hops before the leader can confirm a produce at RF=3, that comes to roughly 3–4 ms per acknowledged request. Five in-flight requests (the Kafka producer default for `max.in.flight.requests.per.connection`) across that round trip gives a ceiling that matched ~17k msg/s almost exactly.
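A back-of-envelope version of that model, as a sketch only. It assumes the ceiling is simply in-flight requests divided by round-trip time, multiplied by messages per batch, and it ignores broker processing time. The post does not state the batch size, so the second function works backwards from the observed ceiling. The 3.5 ms figure is just the midpoint of the 3–4 ms estimate, and all names are hypothetical:

```python
def pipeline_ceiling_msg_per_s(in_flight: int, round_trip_s: float,
                               msgs_per_batch: float) -> float:
    """Upper bound on msg/s when at most `in_flight` batches can be
    outstanding and each takes `round_trip_s` to be acknowledged."""
    batches_per_s = in_flight / round_trip_s
    return batches_per_s * msgs_per_batch


def implied_msgs_per_batch(observed_msg_per_s: float, in_flight: int,
                           round_trip_s: float) -> float:
    """Batch size that would make the model match an observed ceiling."""
    return observed_msg_per_s * round_trip_s / in_flight


IN_FLIGHT = 5            # in-flight requests, as stated in the text
ROUND_TRIP_S = 0.0035    # ASSUMPTION: midpoint of the 3-4 ms estimate

batches_per_s = IN_FLIGHT / ROUND_TRIP_S
batch_size = implied_msgs_per_batch(17_000, IN_FLIGHT, ROUND_TRIP_S)
print(f"{batches_per_s:.0f} batches/s, implied batch size {batch_size:.1f} msgs")
```

Under these assumptions the window allows on the order of 1,400 acknowledged batches per second, so the ~17k msg/s ceiling corresponds to roughly a dozen messages per batch.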
126
+
127
+
On the original cluster, those nodes were almost certainly co-located on the same physical host. Round trips between nodes that share a host would be far shorter than the 0.87 ms we measured on the new cluster: effectively free. The original cluster's 50k baseline wasn't what a 3-broker Kafka cluster does. It was what a 3-broker Kafka cluster does when the network is a memcpy.
128
+
129
+
The new cluster was genuinely distributed. Real latency, real pipeline limits, real Kafka — and the cluster we used for everything from here.
130
+
131
+
*(The ~37k ceiling is the only figure in this post from the original cluster. Everything that follows — the coefficient, the CPU sweep, the prediction — was measured on the new cluster. The physics are part of what makes those numbers honest.)*
132
+
133
+
Another penny dropped. We'd had the same scheduling problem with OMB all along. The producer and consumer worker pods were landing on broker nodes — and when pods share a node, the SDN detects that traffic doesn't need to leave the node and bypasses the NIC entirely. The producers and consumers weren't paying for network transit at all.
134
+
135
+
The proxy pod was on a different node, but on a 3-node cluster where every node already had a broker, the odds of those nodes sharing a physical host on Fyre were high. Almost certainly getting the same benefit, just one layer down.
136
+
137
+
### Now push harder
138
+
139
+
The new cluster had an honest baseline — but RF=3 pipeline limits meant we couldn't push a single topic past ~17k msg/s. There was no room to find the proxy's CPU ceiling when Kafka's pipeline hits the wall first.
140
+
141
+
RF=1, 10 topics. With no replication hops, the round-trip drops to producer→leader only: 0.87 ms. Spread across 10 partitions, no single one becomes the bottleneck before the proxy does. We validated the workload with the passthrough proxy: throughput scaled well past anything encryption constrains. The ceiling we were now measuring was proxy CPU.
142
+
143
+
### How much more?
144
+
145
+
The initial RF=1 run at 1000m CPU gave us a ceiling: ~40k msg/s. From that one measurement we could derive the coefficient:
```
→ operator formula: ~20 mc per MB/s of produce throughput (conservative margin between mid-load and saturation)
```
153
+
154
+
If the ceiling scales linearly with CPU, a 4-core pod should give ~160k msg/s. We ran it.
155
+
156
+
| CPU limit | Encryption ceiling |
|-----------|--------------------|
| 1000m | ~40k msg/s |
| 4000m | ~160k msg/s |
160
+
161
+
Linear. At 4000m: comfortable at 160k (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
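The linear model is small enough to write down. This is a sketch under the assumptions stated in the text: the 1000m run saturating at ~40k msg/s is the only reference point, messages are 1 KB (taken here as 1,000 bytes), and the ceiling is strictly proportional to the CPU limit. The function names are illustrative:

```python
def predicted_ceiling_msg_per_s(cpu_millicores: float,
                                ref_cpu_millicores: float = 1000,
                                ref_ceiling_msg_per_s: float = 40_000) -> float:
    """Scale the reference ceiling linearly with the CPU limit."""
    return ref_ceiling_msg_per_s * cpu_millicores / ref_cpu_millicores


def saturation_coefficient_mc_per_mb(cpu_millicores: float,
                                     ceiling_msg_per_s: float,
                                     msg_bytes: int = 1000) -> float:
    """Millicores consumed per MB/s of produce throughput at saturation."""
    produce_mb_per_s = ceiling_msg_per_s * msg_bytes / 1e6
    return cpu_millicores / produce_mb_per_s


for limit in (1000, 2000, 4000):
    print(f"{limit}m -> ~{predicted_ceiling_msg_per_s(limit):,.0f} msg/s")

print(f"{saturation_coefficient_mc_per_mb(1000, 40_000):.1f} mc per MB/s produce at saturation")
```

Because the model is a single ratio, one more measurement at a different CPU limit is enough to falsify it.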
162
+
163
+
*(The proxy ran 4 Netty event loop threads regardless of CPU limit. Thread count doesn't change — what changes is the CPU time budget available to those threads. Empirically linear, even if the thread-scheduling mechanics are more subtle.)*
164
+
165
+
### The prediction
166
+
167
+
One validated data point isn't a sizing model. We used the coefficient to make a falsifiable prediction: a 2-core pod should saturate at ~80k msg/s.
| Rate | p99 | Verdict |
|------|-----|---------|
| 80k msg/s | 1,660 ms | Elevated: right at predicted ceiling |
| 160k msg/s | 175,277 ms | Catastrophic |
176
+
177
+
Held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
178
+
179
+
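Setting `requests` equal to `limits` makes this practical. A pod whose CPU request is lower than its limit is only guaranteed the request; whether it can burst up to the limit depends on what else is running on the node, and that headroom uncertainty breaks the model. Fix the CPU budget; fix the ceiling.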
180
+
97
181
## The flamegraph: where the CPU actually goes
98
182
99
183
We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
@@ -163,69 +247,6 @@ Total additional CPU: ~33%. This aligns closely with the ~26% throughput reducti
163
247
164
248
If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead.
165
249
166
-
## Following the ceiling
167
-
168
-
We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/s. We'd maxed out the proxy, right?
169
-
170
-
Well. The proxy had spare CPU cycles.
171
-
172
-
That's interesting. If the proxy isn't CPU-saturated, then whatever we hit isn't the proxy's ceiling — it's something else's. Time to work out what.
173
-
174
-
### What were we actually hitting?
175
-
176
-
Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/s with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring.
177
-
178
-
The ceiling on our hardware wasn't the proxy. It was Kafka.
179
-
180
-
The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/s total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything.
181
-
182
-
### We maxed out the proxy, right?
183
-
184
-
With a clean workload that actually isolates proxy CPU, we looked again. The connection sweep answered the question: with 4 producers at a fixed per-producer rate, aggregate throughput climbed well past the single-producer ceiling — and proxy CPU still had headroom. Kafka's partition ran out first.
185
-
186
-
So the single-producer ceiling on our cluster isn't the pod ceiling. It's what one connection could push on that hardware. The proxy had more to give.
187
-
188
-
### How much more?
189
-
190
-
We swept the CPU limit: 1000m, 2000m, 4000m. The throughput ceiling scaled linearly with the CPU budget:
191
-
192
-
| CPU limit | Encryption ceiling |
|-----------|--------------------|
| 1000m | ~40k msg/s |
| 2000m | ~80k msg/s |
| 4000m | ~160k msg/s |
197
-
198
-
At 4000m: comfortable at 160k msg/s (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
199
-
200
-
One thing we noticed along the way: the proxy ran 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit.
201
-
202
-
Deriving the coefficient: at 4000m and 160k msg/s with 1 KB messages —
```
→ 4000 mc / 320 MB/s bidirectional ≈ 12–13 mc per MB/s bidirectional
→ equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce
```
210
-
211
-
We measured the coefficient at mid-utilisation (80k msg/s, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism.
212
-
213
-
### The prediction
214
-
215
-
Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/s.
216
-
217
-
The 2-core sweep:
218
-
219
-
| Rate | p99 | Verdict |
|------|-----|---------|
| 40k msg/s | 626 ms | Comfortable |
| 80k msg/s | 1,660 ms | Elevated: right at predicted ceiling |
| 160k msg/s | 175,277 ms | Catastrophic |
224
-
225
-
The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
226
-
227
-
Setting `requests` equal to `limits` is what makes this practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient.
228
-
229
250
## Bugs we found in our own tooling
230
251
231
252
During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs.