Skip to content

Commit 0cf0858

Browse files
committed
quick wins
1 parent bf13997 commit 0cf0858

44 files changed

Lines changed: 966 additions & 397 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
Lines changed: 265 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,265 @@
1+
---
2+
title: TurboHttp Performance Bottleneck Analysis
3+
date: '2026-04-08'
4+
type: analysis
5+
status: actionable
6+
tags:
7+
- performance
8+
- bottlenecks
9+
- throughput
10+
- allocations
11+
- flow-control
12+
---
13+
# TurboHttp Performance Bottleneck Analysis
14+
15+
> **Date:** 2026-04-08
16+
> **Scope:** Full pipeline deep-dive — Encoding, Decoding, Transport, Flow Control, Memory/Allocations
17+
> **Method:** 5 parallel code analysis agents covering all hot paths
18+
19+
---
20+
21+
## CRITICAL — Highest Impact
22+
23+
### 1. HPACK/QPACK Dynamic Table: LinkedList O(n) Lookup
24+
25+
Both dynamic tables use `LinkedList<T>` with linear search per header reference. For 100 headers this means **~5,050 pointer dereferences** per response.
26+
27+
| File | Lines | Issue |
28+
|------|-------|-------|
29+
| `Protocol/Http2/Hpack/HpackDecoder.cs` | 71-85 | `GetEntry()` — O(n) LinkedList walk per index |
30+
| `Protocol/Http3/Qpack/QpackDynamicTable.cs` | 118-133 | `GetEntry()` — O(n) LinkedList walk per absolute index |
31+
| `Protocol/Http3/Qpack/QpackEncoder.cs` | 509-550 | `FindDynamicExact()`/`FindDynamicName()` — linear search |
32+
33+
**Fix:** Replace with `List<T>` (index-based O(1)) or ring buffer with hash index.
34+
35+
---
36+
37+
### 2. HTTP/2 Request Body: Triple-Copy Pattern
38+
39+
A 10MB POST body gets **copied 3 times** before landing in frames:
40+
41+
| File | Line | Copy |
42+
|------|------|------|
43+
| `Protocol/Http2/Http2RequestEncoder.cs` | 70 | HttpContent → MemoryStream |
44+
| `Protocol/Http2/Http2RequestEncoder.cs` | 74 | MemoryStream → `new byte[bodyLen]` |
45+
| `Protocol/Http2/Http2RequestEncoder.cs` | 93-100 | byte[] → 16KB frame chunks |
46+
47+
**Impact:** ~7x memory overhead for large bodies.
48+
**Fix:** Stream directly from HttpContent into frame chunks without intermediate buffers.
49+
50+
---
51+
52+
### 3. HTTP/3 Encoding: Allocation per Header
53+
54+
QPACK encoder allocates `Encoding.UTF8.GetBytes()` **per header field per request** (5-30 allocations/request):
55+
56+
| File | Lines |
57+
|------|-------|
58+
| `Protocol/Http3/Qpack/QpackEncoder.cs` | 247, 254, 493, 502 |
59+
| `Protocol/Http3/Qpack/QpackEncoderInstructionWriter.cs` | 77, 113-114 |
60+
61+
**Fix:** `ArrayPool<byte>` with Span overload `GetBytes(string, Span<byte>)`.
62+
63+
---
64+
65+
### 4. Graph Materialization per Substream
66+
67+
`VersionDispatchStage` materializes the **entire engine pipeline** for every new endpoint group:
68+
69+
| File | Lines | Issue |
70+
|------|-------|-------|
71+
| `Streams/Stages/Internal/VersionDispatchStage.cs` | 112-121 | `SubFusingMaterializer` creates all stage logics from scratch |
72+
73+
**Impact:** 10 different endpoints = 10x full pipeline allocation (Encoder, Decoder, Correlation, Features).
74+
**Fix:** Flow caching per (Version, Endpoint).
75+
76+
---
77+
78+
### 5. HTTP/3 QUIC: Sequential Stream Opening
79+
80+
`SemaphoreSlim(1)` serializes QUIC stream opening — destroys multiplexing benefit:
81+
82+
| File | Lines | Issue |
83+
|------|-------|-------|
84+
| `Transport/Quic/QuicConnectionManager.cs` | 54-76 | `_spawnLock.WaitAsync()` blocks concurrent stream creation |
85+
86+
**Fix:** Remove lock.
87+
88+
---
89+
90+
## HIGH — Significant Impact
91+
92+
### 6. HTTP/2 Flow Control: Receive Window Too Small
93+
94+
Default `initialRecvWindowSize = 65535` bytes — at 50ms RTT this caps at **max ~1.3 Mbps per stream**.
95+
96+
| File | Line |
97+
|------|------|
98+
| `Streams/Stages/Decoding/Http20ConnectionStage.cs` | 81 |
99+
100+
**Fix:** Default to 1MB+, adapt based on BDP (Bandwidth-Delay Product).
101+
102+
---
103+
104+
### 7. HTTP/2 Stream State Pool Too Small
105+
106+
`StatePoolCapacity = 32`, but `maxConcurrentStreams = 100`. At CL>32 states are not recycled → GC churn:
107+
108+
| File | Line |
109+
|------|------|
110+
| `Streams/Stages/Decoding/Http20ConnectionStage.cs` | 208 |
111+
112+
**Fix:** Use direct "maxConcurrentStreams".
113+
114+
---
115+
116+
### 8. HPACK/QPACK: Repeated UTF-8 GetByteCount Calls
117+
118+
`EntrySize()` calls `Encoding.UTF8.GetByteCount()` **multiple times** for the same header (Add, Eviction, CheckSize):
119+
120+
| File | Lines |
121+
|------|-------|
122+
| `Protocol/Http2/Hpack/HpackDecoder.cs` | 108, 215, 322 |
123+
| `Protocol/Http3/Qpack/QpackDynamicTable.cs` | 164 |
124+
125+
**Fix:** Cache byte-length at insertion time (store in header struct).
126+
127+
---
128+
129+
### 9. HTTP/3 Frame Decoder: No Buffer Pooling
130+
131+
Every fragmented frame allocates `new byte[]` without ArrayPool:
132+
133+
| File | Lines | Issue |
134+
|------|-------|-------|
135+
| `Protocol/Http3/Http3FrameDecoder.cs` | 44, 62, 79 | `new byte[]` for combined/remainder |
136+
| `Protocol/Http3/Http3FrameDecoder.cs` | 199, 204, 235 | `.ToArray()` for frame payloads |
137+
| `Protocol/Http3/Http3ResponseDecoder.cs` | 123-149 | `List<byte[]>` body assembly with O(n²) copying |
138+
| `Protocol/Http3/Qpack/QpackInstructionDecoder.cs` | 332 | `new byte[]` for combined buffer |
139+
140+
**Fix:** `ArrayPool<byte>.Shared.Rent()` + `Memory<byte>` slices instead of `.ToArray()`.
141+
142+
---
143+
144+
### 10. HTTP/1.0 Decoder: Excessive ToArray()
145+
146+
Every response parse allocates multiple times via `.ToArray()`:
147+
148+
| File | Lines |
149+
|------|-------|
150+
| `Protocol/Http10/Http10Decoder.cs` | 79, 111, 116, 141, 155, 165, 207, 247, 252 |
151+
| `Protocol/Http10/Http10Decoder.cs` | 485 | `Combine()``new byte[]` without pooling |
152+
153+
---
154+
155+
### 11. HuffmanCodec: MemoryStream + ToArray()
156+
157+
Every encode/decode allocates MemoryStream and copies via `.ToArray()`:
158+
159+
| File | Lines |
160+
|------|-------|
161+
| `Protocol/HuffmanCodec.cs` | 110-112 | `new MemoryStream()` + `.ToArray()` in Encode |
162+
| `Protocol/HuffmanCodec.cs` | 138 | `new MemoryStream()` in Decode |
163+
164+
**Fix:** Span-based with pre-sized buffer.
165+
166+
---
167+
168+
## MEDIUM — Noticeable Under Load
169+
170+
### 12. Batch Weight Too Conservative
171+
172+
`MaxBatchWeight = 65536` (64KB) — at high throughput causes too many scheduler ticks:
173+
174+
| File | Line |
175+
|------|------|
176+
| `Streams/Http20Engine.cs` | 16 |
177+
178+
**Fix:** 256KB-512KB for high-throughput, adaptive.
179+
180+
---
181+
182+
### 13. MemoryStream Allocations Scattered Everywhere
183+
184+
~9+ locations create `new MemoryStream()` without pooling:
185+
186+
| File | Context |
187+
|------|---------|
188+
| `Protocol/Http3/Http3RequestEncoder.cs:77` | Per-request body |
189+
| `Protocol/Http10/Http10Encoder.cs:149` | Unknown-length body |
190+
| `Protocol/Semantics/ContentEncodingEncoder.cs:52,63,74` | Compression |
191+
| `Protocol/Semantics/ContentEncodingDecoder.cs:185` | Decompression |
192+
| `Streams/Stages/Features/ContentEncodingBidiStage.cs:299-332` | Multiple instances |
193+
194+
**Fix:** `RecyclableMemoryStreamManager`.
195+
196+
---
197+
198+
### 14. Per-Request Collection Allocations
199+
200+
`new List<T>` / `new Dictionary<T,V>` in hot paths:
201+
202+
| File | Lines | What |
203+
| -------------------------------------- | ------- | ------------------------------------------ |
204+
| `Protocol/Http2/Http2FrameDecoder.cs` | 109 | `new List<Http2Frame>()` per decode |
205+
| `Protocol/Http3/Http3FrameDecoder.cs` | 98 | `new List<Http3Frame>()` per decode |
206+
| `Protocol/Http2/Hpack/HpackDecoder.cs` | 193 | `new List<HpackHeader>()` per header block |
207+
| `Protocol/Http3/Qpack/QpackDecoder.cs` | 95, 140 | `new List<(string,string)>()` per decode |
208+
| `Protocol/Cookies/CookieJar.cs` | 112 | `new List<CookieEntry>()` per request |
209+
210+
**Fix:** `ArrayPool`-backed lists.
211+
212+
---
213+
214+
### 15. TcpConnectionStage: Task.Run per Connection
215+
216+
Every TCP connection spawns `Task.Run()` for the inbound pump:
217+
218+
| File | Line |
219+
|------|------|
220+
| `Transport/Tcp/TcpConnectionStage.cs` | 523 |
221+
| `Transport/Quic/QuicConnectionStage.cs` | 459 |
222+
223+
---
224+
225+
### 16. QPACK Encoder Instruction Blocking
226+
227+
When encoder instructions cannot be flushed, this **serializes all** subsequent requests:
228+
229+
| File | Lines |
230+
|------|-------|
231+
| `Streams/Stages/Encoding/Http30Request2FrameStage.cs` | 92-96 |
232+
233+
---
234+
235+
## LOW — Nice-to-Have
236+
237+
| # | Issue | File:Line |
238+
|---|-------|-----------|
239+
| 17 | `QpackStringCodec` allocates Huffman-Encode just to check length | `Qpack/QpackStringCodec.cs:29` |
240+
| 18 | `DateTime.UtcNow` per connection in eviction loop | `ConnectionManagerActor.cs:306` |
241+
| 19 | `GroupByRequestEndpointStage.RemoveDead()` allocates `List<int>` even when empty | `GroupByRequestEndpointStage.cs:159` |
242+
| 20 | Socket buffer sizes not configurable | `IClientProvider.cs:100` |
243+
| 21 | `HuffmanCodec._root` volatile instead of static initializer | `HuffmanCodec.cs:115` |
244+
| 22 | NetworkBuffer pool unbounded (no cap) | `Messages.cs:80` |
245+
246+
---
247+
248+
## Top 5 Quick Wins (Effort vs Impact)
249+
250+
| # | Fix | Expected Impact | Effort |
251+
|---|-----|-----------------|--------|
252+
| 1 | HPACK/QPACK `LinkedList``List<T>` | **~30% faster header decode** | 2-3h |
253+
| 2 | HTTP/2 body: direct streaming instead of triple-copy | **~7x less memory for POST** | 4-6h |
254+
| 3 | QPACK Encoder: `stackalloc`/`ArrayPool` instead of `GetBytes()` | **~20-30 fewer allocs/request** | 2-3h |
255+
| 4 | HTTP/3 FrameDecoder: `ArrayPool` instead of `new byte[]` | **GC pressure significantly reduced** | 1-2h |
256+
| 5 | Receive window → 1MB+ | **Throughput x10+ at latency >10ms** | 30min |
257+
258+
---
259+
260+
## Next Steps
261+
262+
- [ ] Create feature plans for top 5 quick wins
263+
- [ ] Run BenchmarkDotNet baselines before changes
264+
- [ ] Implement fixes in priority order
265+
- [ ] Re-benchmark after each fix to measure actual impact

src/TurboHttp.Benchmarks/Internal/BenchmarkBaseClass.cs

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,12 @@ public static byte[] GeneratePayload(int sizeBytes)
9292
/// </summary>
9393
public virtual void GlobalSetup()
9494
{
95+
// Ensure both clients start with the same ThreadPool minimum — without this,
96+
// BDN benchmark processes have very few IO threads and TurboHttp's Akka dispatchers
97+
// starve while HttpClient's SocketsHttpHandler does not, skewing the comparison.
98+
ThreadPool.GetMinThreads(out var w, out var io);
99+
ThreadPool.SetMinThreads(Math.Max(w, 256), Math.Max(io, 256));
100+
95101
lock (_serverLock)
96102
{
97103
if (_sharedServer is null)

src/TurboHttp.Benchmarks/TurboHttpComparativeBenchmarks.cs

Lines changed: 4 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -49,11 +49,11 @@ public static ClientHelper CreateClient(int port, Version version)
4949
// H1.x: 512 connections × MaxPipelineDepth(2) = 1024 in-flight capacity.
5050
MaxH1ConnectionsPerServer = 512,
5151
MaxPipelineDepth = 2,
52-
// H2: 2 connections × 200 streams = 400 in-flight capacity.
53-
// Mirrors HttpClient's strategy: multiplex deeply over few connections
54-
// rather than opening many connections each with shallow stream counts.
52+
// H2: 2 connections × 500 streams = 1000 in-flight capacity.
53+
// Server is configured with Http2.MaxStreamsPerConnection = 512; stay just
54+
// under that limit so we saturate each connection without triggering resets.
5555
MaxH2ConnectionsPerServer = 2,
56-
MaxH2ConcurrentStreams = 200,
56+
MaxH2ConcurrentStreams = 500,
5757
};
5858

5959
// Create and register the ActorSystem explicitly so it can be terminated on disposal.
@@ -210,11 +210,6 @@ public class TurboHttpConcurrentBenchmarks : BenchmarkBaseClass
210210
[GlobalSetup]
211211
public override void GlobalSetup()
212212
{
213-
// BDN spawns benchmark processes with minimal ThreadPool. Set before base.GlobalSetup()
214-
// so Akka's dispatchers start with sufficient threads from the first actor creation.
215-
ThreadPool.GetMinThreads(out var w, out var io);
216-
ThreadPool.SetMinThreads(Math.Max(w, 256), Math.Max(io, 256));
217-
218213
base.GlobalSetup();
219214
_clientHelper = ClientHelper.CreateClient(KestrelPort, HttpVersionValue);
220215
_tasks = new Task[ConcurrencyLevel];

src/TurboHttp.StreamTests/Http11/Http11CorrelationStageSpec.cs

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ namespace TurboHttp.StreamTests.Http11;
1212
/// Verifies that responses are matched to requests in FIFO order and that RequestMessage is correctly set.
1313
/// </summary>
1414
/// <remarks>
15-
/// Stage under test: <see cref="Http1XCorrelationStage"/>.
15+
/// Stage under test: <see cref="Http11CorrelationStage"/>.
1616
/// RFC 9112 §9: HTTP/1.x strict one-request-in-flight ordering and request-response pairing.
1717
/// </remarks>
1818
public sealed class Http11CorrelationStageSpec : StreamTestBase
@@ -30,7 +30,7 @@ private async Task<List<HttpResponseMessage>> RunStageAsync(
3030

3131
var graph = RunnableGraph.FromGraph(GraphDsl.Create(sink, (b, s) =>
3232
{
33-
var corr = b.Add(new Http1XCorrelationStage());
33+
var corr = b.Add(new Http11CorrelationStage());
3434
var reqSrc = b.Add(requestSource);
3535
var resSrc = b.Add(responseSource);
3636
var signalSink = b.Add(Sink.Ignore<IOutputItem>().MapMaterializedValue(_ => NotUsed.Instance));
@@ -163,7 +163,7 @@ public async Task Http11CorrelationStage_should_stay_alive_when_queues_empty_but
163163

164164
var graph = RunnableGraph.FromGraph(GraphDsl.Create(sink, (b, s) =>
165165
{
166-
var corr = b.Add(new Http1XCorrelationStage());
166+
var corr = b.Add(new Http11CorrelationStage());
167167
var reqSrc = b.Add(requestSource);
168168
var resSrc = b.Add(responseSource);
169169
var signalSink = b.Add(Sink.Ignore<IOutputItem>().MapMaterializedValue(_ => NotUsed.Instance));
@@ -201,7 +201,7 @@ public async Task Http11CorrelationStage_should_remain_open_when_in_flight_reque
201201

202202
var graph = RunnableGraph.FromGraph(GraphDsl.Create(sink, (b, s) =>
203203
{
204-
var corr = b.Add(new Http1XCorrelationStage());
204+
var corr = b.Add(new Http11CorrelationStage());
205205
var reqSrc = b.Add(Source.From([request1, request2]));
206206
var resSrc = b.Add(neverEndingResponses);
207207
var signalSink = b.Add(Sink.Ignore<IOutputItem>().MapMaterializedValue(_ => NotUsed.Instance));
@@ -233,7 +233,7 @@ public async Task Http11CorrelationStage_should_emit_one_stream_acquire_item_whe
233233

234234
var graph = RunnableGraph.FromGraph(GraphDsl.Create(signalSink, (b, s) =>
235235
{
236-
var corr = b.Add(new Http1XCorrelationStage());
236+
var corr = b.Add(new Http11CorrelationStage());
237237
var reqSrc = b.Add(Source.Single(request));
238238
var resSrc = b.Add(Source.Single(response));
239239
var responseSink = b.Add(Sink.Ignore<HttpResponseMessage>().MapMaterializedValue(_ => NotUsed.Instance));
@@ -275,7 +275,7 @@ public async Task Http11CorrelationStage_should_emit_two_stream_acquire_items_an
275275

276276
var graph = RunnableGraph.FromGraph(GraphDsl.Create(signalSink, (b, s) =>
277277
{
278-
var corr = b.Add(new Http1XCorrelationStage());
278+
var corr = b.Add(new Http11CorrelationStage());
279279
var reqSrc = b.Add(Source.From(requests));
280280
var resSrc = b.Add(Source.From(responses));
281281
var responseSink = b.Add(Sink.Ignore<HttpResponseMessage>().MapMaterializedValue(_ => NotUsed.Instance));

src/TurboHttp.StreamTests/Http11/Http11PipelineReconnectSpec.cs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ namespace TurboHttp.StreamTests.Http11;
1212
/// Verifies that each request waits for its own response before the next request is pulled.
1313
/// </summary>
1414
/// <remarks>
15-
/// Stage under test: <see cref="Http1XCorrelationStage"/>.
15+
/// Stage under test: <see cref="Http11CorrelationStage"/>.
1616
/// RFC 9112 §9: HTTP/1.x requests and responses MUST be sent and received in order.
1717
/// The InReset inlet has been removed; strict serial back-pressure is always enforced.
1818
/// </remarks>
@@ -31,7 +31,7 @@ private static HttpResponseMessage OkResponse()
3131

3232
var graph = RunnableGraph.FromGraph(GraphDsl.Create(signalSink, responseSink, (mat1, mat2) => (mat1, mat2), (b, sigSink, resSink) =>
3333
{
34-
var corr = b.Add(new Http1XCorrelationStage());
34+
var corr = b.Add(new Http11CorrelationStage());
3535
var reqSrc = b.Add(requestSource);
3636
var resSrc = b.Add(responseSource);
3737

0 commit comments

Comments
 (0)