_posts/2026-05-21-benchmarking-the-proxy.md (7 additions, 7 deletions)
@@ -45,7 +45,7 @@ One important caveat: this Kafka cluster is deliberately untuned. We're not tryi
45
45
46
46
Good news first. The proxy itself — with no filter chain, just routing traffic — adds almost nothing.
47
47
48
-
**10 topics, 1 KB messages (5,000 msg/sec per topic):**
48
+
**10 topics, 1 KB messages (5,000 msg/s per topic):**
49
49
50
50
| Metric | Baseline | Proxy | Delta |
51
51
|--------|----------|-------|-------|
@@ -55,7 +55,7 @@ Good news first. The proxy itself — with no filter chain, just routing traffic
55
55
| E2E latency p99 | 185.00 ms | 186.00 ms | +1.00 ms (+0.5%) |
56
56
| Publish rate | 5,002 msg/s | 5,002 msg/s | 0 |
57
57
58
-
**100 topics, 1 KB messages (500 msg/sec per topic):**
58
+
**100 topics, 1 KB messages (500 msg/s per topic):**
59
59
60
60
| Metric | Baseline | Proxy | Delta |
61
61
|--------|----------|-------|-------|
@@ -101,19 +101,19 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l
101
101
102
102
We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results:
103
103
104
-
- **Baseline**: sustained up to ~50,000–52,000 msg/sec (the ceiling we observed on our test cluster)
105
-
- **Encryption**: sustained up to **~37,200 msg/sec**, then started intermittently saturating
104
+
- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster)
105
+
- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating
106
106
- **Cost: approximately 26% fewer messages per second per partition**
107
107
108
-
The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
108
+
The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
109
109
110
110
### The ceiling scales with CPU budget
111
111
112
112
That the proxy adds so little latency didn't surprise me, but this did, and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.
113
113
114
-
Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
114
+
Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
115
115
116
-
Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/sec, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
116
+
Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
117
117
118
118
**The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.
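The linear relationship described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the 1000m result (~40k msg/s) as the reference point. The function name `predicted_ceiling` and the dictionary of measurements are hypothetical helpers for this example, not part of any proxy tooling, and the numbers apply only to the cluster measured here.

```python
# Encryption ceilings reported for the CPU sweep: millicores -> msg/s.
MEASURED_CEILINGS = {1000: 40_000, 2000: 80_000, 4000: 160_000}


def predicted_ceiling(cpu_millicores: int,
                      base_cpu: int = 1000,
                      base_ceiling: int = 40_000) -> float:
    """Extrapolate the throughput ceiling linearly from one reference point.

    Assumption: the ceiling is proportional to the CPU limit, as observed
    between 1000m and 4000m. Nothing here says it holds outside that range.
    """
    return base_ceiling * cpu_millicores / base_cpu


if __name__ == "__main__":
    for cpu, measured in MEASURED_CEILINGS.items():
        print(f"{cpu}m: predicted {predicted_ceiling(cpu):,.0f} msg/s, "
              f"measured ~{measured:,} msg/s")
```

Because the model is only as good as the CPU budget is stable, it fits naturally with pinning `requests` to `limits`: the input to the function then matches what the pod actually gets.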
_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md (21 additions, 21 deletions)
@@ -85,13 +85,13 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne
85
85
86
86
The primary workload used **1 topic, 1 partition, 1 KB messages**. This is deliberate. Concentrating all traffic on a single partition pushes things to their limits at lower absolute rates, which makes the proxy overhead easier to isolate: when the system saturates, it's the proxy, not a spread-out broker fleet.
87
87
88
-
Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/sec per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour.
88
+
Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/s per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour.
89
89
90
-
For throughput ceiling testing we used rate sweeps: start at 34,000 msg/sec, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point.
90
+
For throughput ceiling testing we used rate sweeps: start at 34,000 msg/s, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point.
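The sweep procedure above can be sketched as a small loop. This is an illustration only: `measure` stands in for a full OMB run at a given target rate and is a hypothetical callable, as are `rate_sweep`, `fake_broker`, and the `max_steps` guard. The start rate, 5% step, and 95% threshold are the values stated in the text.

```python
from typing import Callable, Optional


def rate_sweep(measure: Callable[[float], float],
               start_rate: float = 34_000,
               step: float = 0.05,
               threshold: float = 0.95,
               max_steps: int = 50) -> Optional[float]:
    """Return the last target rate that was sustained, or None if none was.

    A target counts as sustained when the achieved rate is at least
    `threshold` times the target. The sweep stops at the first miss.
    """
    target = float(start_rate)
    last_sustained = None
    for _ in range(max_steps):
        achieved = measure(target)
        if achieved < threshold * target:
            return last_sustained
        last_sustained = target
        target *= 1 + step
    return last_sustained


def fake_broker(target: float, ceiling: float = 37_200.0) -> float:
    """Toy stand-in for a benchmark run: throughput is capped at `ceiling`."""
    return min(target, ceiling)


if __name__ == "__main__":
    print(rate_sweep(fake_broker))
```

A real system does not fail as cleanly as the toy `fake_broker`: near the limit, runs can alternate between sustaining and saturating, so repeating each step before calling the knee is a reasonable refinement.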
91
91
92
92
## The flamegraph: where the CPU actually goes
93
93
94
-
We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/sec. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
94
+
We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
95
95
96
96
The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth.
97
97
@@ -101,9 +101,9 @@ The flamegraphs below are fully interactive: hover over a frame to see its name
@@ -122,15 +122,15 @@ Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys
122
122
123
123
The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it.
@@ -160,19 +160,19 @@ If you wanted to optimise this, the highest-impact areas would be: reducing buff
160
160
161
161
## Following the ceiling
162
162
163
-
We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/sec. We'd maxed out the proxy, right?
163
+
We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/s. We'd maxed out the proxy, right?
164
164
165
165
Well. The proxy had spare CPU cycles.
166
166
167
167
That's interesting. If the proxy isn't CPU-saturated, then whatever we hit isn't the proxy's ceiling — it's something else's. Time to work out what.
168
168
169
169
### What were we actually hitting?
170
170
171
-
Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/sec with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring.
171
+
Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/s with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring.
172
172
173
173
The ceiling on our hardware wasn't the proxy. It was Kafka.
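The RF=3 arithmetic can be made explicit with a short sketch. It assumes 1 KB means 1,000 bytes and that every follower fetches a full copy of each message. The function name and the returned keys are hypothetical. Under these assumptions, 37k msg/s is 37 MB/s of produce traffic, 74 MB/s bound for the two followers, and 111 MB/s when all three copies are counted; which of these accountings the ~111 MB/s figure uses is not stated in the text.

```python
def replication_traffic_mb_s(msg_per_s: float,
                             msg_bytes: int = 1_000,
                             replication_factor: int = 3) -> dict:
    """Estimate per-partition data rates in MB/s for a given produce rate.

    Assumptions: fixed message size, no compression, no protocol overhead.
    """
    produce = msg_per_s * msg_bytes / 1e6
    followers = replication_factor - 1
    return {
        "produce_in": produce,                       # client -> leader
        "replication_out": produce * followers,      # leader -> followers
        "all_copies": produce * replication_factor,  # every replica's copy
    }


if __name__ == "__main__":
    for name, rate in replication_traffic_mb_s(37_000).items():
        print(f"{name}: {rate:.0f} MB/s")
```

Setting `replication_factor=1` drops `replication_out` to zero, which is the effect the RF=1 workload relies on.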
174
174
175
-
The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/sec total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything.
175
+
The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/s total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything.
176
176
177
177
### We maxed out the proxy, right?
178
178
@@ -186,15 +186,15 @@ We swept the CPU limit: 1000m, 2000m, 4000m. The throughput ceiling scaled linea
186
186
187
187
| CPU limit | Encryption ceiling |
188
188
|-----------|-------------------|
189
-
| 1000m | ~40k msg/sec |
190
-
| 2000m | ~80k msg/sec |
191
-
| 4000m | ~160k msg/sec |
189
+
| 1000m | ~40k msg/s |
190
+
| 2000m | ~80k msg/s |
191
+
| 4000m | ~160k msg/s |
192
192
193
-
At 4000m: comfortable at 160k msg/sec (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
193
+
At 4000m: comfortable at 160k msg/s (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
194
194
195
195
One thing we noticed along the way: the proxy ran 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit.
196
196
197
-
Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages —
197
+
Deriving the coefficient: at 4000m and 160k msg/s with 1 KB messages —
```
→ equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce
```
205
205
206
-
We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism.
206
+
We measured the coefficient at mid-utilisation (80k msg/s, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism.
207
207
208
208
### The prediction
209
209
210
-
Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/sec.
210
+
Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/s.
211
211
212
212
The 2-core sweep:
213
213
214
214
| Rate | p99 | Verdict |
215
215
|------|-----|---------|
216
-
| 40k msg/sec | 626 ms | Comfortable |
217
-
| 80k msg/sec | 1,660 ms | Elevated — right at predicted ceiling |
218
-
| 160k msg/sec | 175,277 ms | Catastrophic |
216
+
| 40k msg/s | 626 ms | Comfortable |
217
+
| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling |
218
+
| 160k msg/s | 175,277 ms | Catastrophic |
219
219
220
220
The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.
performance.markdown (4 additions, 4 deletions)
@@ -25,7 +25,7 @@ All primary results used 1 KB messages on a single partition. Multi-topic worklo
25
25
26
26
The proxy layer itself adds negligible overhead. At sub-saturation rates the additional latency is sub-millisecond on average, with no measurable throughput impact.
27
27
28
-
**10 topics, 1 KB messages (5,000 msg/sec per topic):**
28
+
**10 topics, 1 KB messages (5,000 msg/s per topic):**
29
29
30
30
| Metric | Baseline | Proxy | Delta |
31
31
|--------|----------|-------|-------|
@@ -34,7 +34,7 @@ The proxy layer itself adds negligible overhead. At sub-saturation rates the add
34
34
| E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) |