Commit d282b37

Standardise on msg/s throughout (was mixed msg/s and msg/sec)
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
1 parent 2eb9d90 commit d282b37

3 files changed

Lines changed: 32 additions & 32 deletions
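A mechanical change like this one can be expressed as a single regular-expression substitution. The following is a minimal sketch of that idea, not the tooling actually used for this commit; the function name `standardise_units` is hypothetical.

```python
import re

# Matches "msg/sec" only when no further word character follows, so an
# existing "msg/s" or a longer token such as "msg/second" is left alone.
_MSG_PER_SEC = re.compile(r"msg/sec(?!\w)")


def standardise_units(text: str) -> tuple[str, int]:
    """Rewrite 'msg/sec' to 'msg/s'; return the new text and the replacement count."""
    return _MSG_PER_SEC.subn("msg/s", text)


if __name__ == "__main__":
    sample = "**10 topics, 1 KB messages (5,000 msg/sec per topic):**"
    fixed, count = standardise_units(sample)
    print(fixed)
    print(count)
```

Returning the replacement count makes it easy to compare against the per-file change counts reported in the commit summary.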

_posts/2026-05-21-benchmarking-the-proxy.md

Lines changed: 7 additions & 7 deletions
@@ -45,7 +45,7 @@ One important caveat: this Kafka cluster is deliberately untuned. We're not tryi

 Good news first. The proxy itself — with no filter chain, just routing traffic — adds almost nothing.

-**10 topics, 1 KB messages (5,000 msg/sec per topic):**
+**10 topics, 1 KB messages (5,000 msg/s per topic):**

 | Metric | Baseline | Proxy | Delta |
 |--------|----------|-------|-------|
@@ -55,7 +55,7 @@ Good news first. The proxy itself — with no filter chain, just routing traffic
 | E2E latency p99 | 185.00 ms | 186.00 ms | +1.00 ms (+0.5%) |
 | Publish rate | 5,002 msg/s | 5,002 msg/s | 0 |

-**100 topics, 1 KB messages (500 msg/sec per topic):**
+**100 topics, 1 KB messages (500 msg/s per topic):**

 | Metric | Baseline | Proxy | Delta |
 |--------|----------|-------|-------|
@@ -101,19 +101,19 @@ A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run l

 We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results:

-- **Baseline**: sustained up to ~50,000–52,000 msg/sec (the ceiling we observed on our test cluster)
-- **Encryption**: sustained up to **~37,200 msg/sec**, then started intermittently saturating
+- **Baseline**: sustained up to ~50,000–52,000 msg/s (the ceiling we observed on our test cluster)
+- **Encryption**: sustained up to **~37,200 msg/s**, then started intermittently saturating
 - **Cost: approximately 26% fewer messages per second per partition**

-The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
+The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/s the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/s, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.

 ### The ceiling scales with CPU budget

 The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy.

-Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.
+Once we had the single-producer encryption ceiling at ~37k msg/s, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first.

-Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/sec, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
+Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/s, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU.

 **The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits.

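The rate sweep described in this post (start at 34k, step up by 5%) and the stopping rule given in the companion post (achieved rate below 95% of target) can be sketched as a small loop. This is an illustrative sketch only, not the OMB tooling: `rate_sweep` and `run_at_rate` are hypothetical names, the 5% step is assumed to compound, and the loop stops at the first shortfall even though the post notes that the real transition alternated between sustaining and saturating.

```python
from typing import Callable, Optional


def rate_sweep(
    run_at_rate: Callable[[float], float],
    start_rate: float = 34_000,
    step: float = 0.05,
    sustain_fraction: float = 0.95,
    max_steps: int = 50,
) -> Optional[float]:
    """Return the highest target rate sustained before the first shortfall.

    run_at_rate(target) is a caller-supplied function that drives the system at
    `target` msg/s and returns the achieved msg/s. Returns None if even the
    starting rate is not sustained.
    """
    target = start_rate
    last_sustained = None
    for _ in range(max_steps):
        achieved = run_at_rate(target)
        if achieved < sustain_fraction * target:
            break  # achieved rate fell below 95% of target: treat as saturated
        last_sustained = target
        target *= 1 + step  # assumption: each step is 5% above the previous one
    return last_sustained
```

For a quick check without a cluster, `run_at_rate` can be a stub such as `lambda r: min(r, 37_200)`, which models a system with a hard ceiling.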

_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md

Lines changed: 21 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -85,13 +85,13 @@ Getting NIC speed from a Kubernetes node turned out to be non-trivial — you ne
8585

8686
The primary workload used **1 topic, 1 partition, 1 KB messages**. This is deliberate. Concentrating all traffic on a single partition pushes things to their limits at lower absolute rates, which makes the proxy overhead easier to isolate: when the system saturates, it's the proxy, not a spread-out broker fleet.
8787

88-
Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/sec per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour.
88+
Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/s per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour.
8989

90-
For throughput ceiling testing we used rate sweeps: start at 34,000 msg/sec, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point.
90+
For throughput ceiling testing we used rate sweeps: start at 34,000 msg/s, step up by 5% until achieved rate drops below 95% of target. The knee of that curve is the saturation point.
9191

9292
## The flamegraph: where the CPU actually goes
9393

94-
We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/sec. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
94+
We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load`, during the steady-state measurement phase at 36,000 msg/s. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
9595

9696
The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth.
9797

@@ -101,9 +101,9 @@ The flamegraphs below are fully interactive: hover over a frame to see its name
 <iframe src="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html' | relative_url }}"
 width="100%" height="600"
 style="border: 1px solid #ddd; border-radius: 4px;"
-title="CPU flamegraph: no-filter proxy at 36,000 msg/sec">
+title="CPU flamegraph: no-filter proxy at 36,000 msg/s">
 </iframe>
-<figcaption>CPU flamegraph — passthrough proxy (no filters), 36,000 msg/sec, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
+<figcaption>CPU flamegraph — passthrough proxy (no filters), 36,000 msg/s, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
 </figure>

 | Category | CPU share |
@@ -122,15 +122,15 @@ Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys

 The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it.

-### Encryption proxy (same 36,000 msg/sec rate)
+### Encryption proxy (same 36,000 msg/s rate)

 <figure>
 <iframe src="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html' | relative_url }}"
 width="100%" height="600"
 style="border: 1px solid #ddd; border-radius: 4px;"
-title="CPU flamegraph: encryption proxy at 36,000 msg/sec">
+title="CPU flamegraph: encryption proxy at 36,000 msg/s">
 </iframe>
-<figcaption>CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/sec, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
+<figcaption>CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/s, 1 topic, 1 KB messages. <a href="{{ '/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html' | relative_url }}" target="_blank">Open full screen ↗</a></figcaption>
 </figure>

 | Category | No-filters | Encryption | Delta |
@@ -160,19 +160,19 @@ If you wanted to optimise this, the highest-impact areas would be: reducing buff

 ## Following the ceiling

-We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/sec. We'd maxed out the proxy, right?
+We had a rate-sweep result. On our test cluster, the encryption scenario hit a ceiling — the proxy was saturating around 37k msg/s. We'd maxed out the proxy, right?

 Well. The proxy had spare CPU cycles.

 That's interesting. If the proxy isn't CPU-saturated, then whatever we hit isn't the proxy's ceiling — it's something else's. Time to work out what.

 ### What were we actually hitting?

-Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/sec with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring.
+Our initial sweeps ran with replication factor 3 — the standard production default, and for good reason. But RF=3 means every message the Kafka leader receives gets replicated to 2 followers. At 37k msg/s with 1 KB messages, that's ~111 MB/s of replication traffic outbound from the leader alone. The Fyre nodes have 10 GbE NICs so the network wasn't saturated, but RF=3 creates a real per-partition I/O ceiling on the Kafka leader — and it sits right around where we were measuring.

 The ceiling on our hardware wasn't the proxy. It was Kafka.

-The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/sec total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything.
+The fix: RF=1, 10-topic workload. Drop replication overhead; spread load across 10 partitions so no single partition hits its ceiling. We validated it with the passthrough proxy: at 160k msg/s total the proxy matched baseline, and the sweep scaled past 640k before hitting some uninvestigated ceiling far above where encryption constrains anything.

 ### We maxed out the proxy, right?


@@ -186,15 +186,15 @@ We swept the CPU limit: 1000m, 2000m, 4000m. The throughput ceiling scaled linea

 | CPU limit | Encryption ceiling |
 |-----------|-------------------|
-| 1000m | ~40k msg/sec |
-| 2000m | ~80k msg/sec |
-| 4000m | ~160k msg/sec |
+| 1000m | ~40k msg/s |
+| 2000m | ~80k msg/s |
+| 4000m | ~160k msg/s |

-At 4000m: comfortable at 160k msg/sec (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.
+At 4000m: comfortable at 160k msg/s (p99: 447 ms), catastrophic at 320k (p99: 537,000 ms). The proxy isn't hitting a fixed architectural wall — it's hitting a CPU budget wall, and that wall moves when you give it more CPU.

 One thing we noticed along the way: the proxy ran 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit.

-Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages —
+Deriving the coefficient: at 4000m and 160k msg/s with 1 KB messages —

 ```
 160k msg/s × 1 KB = 160 MB/s produce throughput
@@ -203,19 +203,19 @@ With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt
 → equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce
 ```

-We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism.
+We measured the coefficient at mid-utilisation (80k msg/s, 2000m) at ~10 mc/MB/s bidirectional — lower, because fixed per-connection overhead gets amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput, which sits between mid-utilisation and saturation and gives inherent conservatism.

 ### The prediction

-Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/sec.
+Rather than just report the results, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly with CPU budget, a 2-core pod should saturate at ~80k msg/s.

 The 2-core sweep:

 | Rate | p99 | Verdict |
 |------|-----|---------|
-| 40k msg/sec | 626 ms | Comfortable |
-| 80k msg/sec | 1,660 ms | Elevated — right at predicted ceiling |
-| 160k msg/sec | 175,277 ms | Catastrophic |
+| 40k msg/s | 626 ms | Comfortable |
+| 80k msg/s | 1,660 ms | Elevated — right at predicted ceiling |
+| 160k msg/s | 175,277 ms | Catastrophic |

 The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model.

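The coefficient derivation and the linear prediction in this file reduce to two lines of arithmetic. A minimal sketch follows, assuming 1 MB = 1000 KB as the post's own figures do; the function names are hypothetical, and the default of 40 msg/s per millicore is read off the 1000m row of the table, not taken from any published formula.

```python
def produce_mb_per_s(msg_per_s: float, msg_size_kb: float = 1.0) -> float:
    """Produce throughput in MB/s, treating 1 MB as 1000 KB."""
    return msg_per_s * msg_size_kb / 1000.0


def saturation_coefficient(
    cpu_millicores: float, msg_per_s: float, msg_size_kb: float = 1.0
) -> float:
    """Millicores consumed per MB/s of produce throughput at the measured ceiling."""
    return cpu_millicores / produce_mb_per_s(msg_per_s, msg_size_kb)


def predicted_ceiling_msg_per_s(
    cpu_millicores: float, msgs_per_millicore: float = 40.0
) -> float:
    """Linear extrapolation of the encryption ceiling from the ~40k msg/s at 1000m result."""
    return cpu_millicores * msgs_per_millicore


if __name__ == "__main__":
    print(saturation_coefficient(4000, 160_000))  # 4000 mc / 160 MB/s
    print(predicted_ceiling_msg_per_s(2000))      # the 2-core prediction
```

The numbers apply only to the cluster measured in the post; on other hardware the per-millicore constant would need to be re-measured.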

performance.markdown

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ All primary results used 1 KB messages on a single partition. Multi-topic worklo
2525

2626
The proxy layer itself adds negligible overhead. At sub-saturation rates the additional latency is sub-millisecond on average, with no measurable throughput impact.
2727

28-
**10 topics, 1 KB messages (5,000 msg/sec per topic):**
28+
**10 topics, 1 KB messages (5,000 msg/s per topic):**
2929

3030
| Metric | Baseline | Proxy | Delta |
3131
|--------|----------|-------|-------|
@@ -34,7 +34,7 @@ The proxy layer itself adds negligible overhead. At sub-saturation rates the add
 | E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) |
 | Publish rate | 5,002 msg/s | 5,002 msg/s | no change |

-**100 topics, 1 KB messages (500 msg/sec per topic):**
+**100 topics, 1 KB messages (500 msg/s per topic):**

 | Metric | Baseline | Proxy | Delta |
 |--------|----------|-------|-------|
@@ -65,8 +65,8 @@ Encryption adds measurable but predictable overhead. The cost scales with produc

 | Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) |
 |----------|------------------------------------------------|
-| Baseline (direct Kafka) | ~50,000–52,000 msg/sec |
-| Encryption (proxy + AES-256-GCM) | ~37,200 msg/sec |
+| Baseline (direct Kafka) | ~50,000–52,000 msg/s |
+| Encryption (proxy + AES-256-GCM) | ~37,200 msg/s |
 | **Cost** | **~26% fewer messages per second per partition** |

 ---
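The "~26% fewer messages" figure in the table above can be reproduced from the two ceilings. The sketch below uses a hypothetical helper, `throughput_cost`, and shows that the stated figure corresponds to the lower end of the baseline range; the upper end gives a somewhat larger cost.

```python
def throughput_cost(baseline_msg_per_s: float, encrypted_msg_per_s: float) -> float:
    """Fraction of baseline throughput lost when encryption is enabled."""
    return 1.0 - encrypted_msg_per_s / baseline_msg_per_s


if __name__ == "__main__":
    # Baseline range and encryption ceiling are the values from the table.
    for baseline in (50_000, 52_000):
        print(baseline, round(throughput_cost(baseline, 37_200), 3))
```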
