| 1 | +--- |
| 2 | +title: TurboHttp Performance Bottleneck Analysis |
| 3 | +date: '2026-04-08' |
| 4 | +type: analysis |
| 5 | +status: actionable |
| 6 | +tags: |
| 7 | + - performance |
| 8 | + - bottlenecks |
| 9 | + - throughput |
| 10 | + - allocations |
| 11 | + - flow-control |
| 12 | +--- |
| 13 | +# TurboHttp Performance Bottleneck Analysis |
| 14 | + |
| 15 | +> **Date:** 2026-04-08 |
| 16 | +> **Scope:** Full pipeline deep-dive — Encoding, Decoding, Transport, Flow Control, Memory/Allocations |
| 17 | +> **Method:** 5 parallel code analysis agents covering all hot paths |
| 18 | + |
| 19 | +--- |
| 20 | + |
| 21 | +## CRITICAL — Highest Impact |
| 22 | + |
| 23 | +### 1. HPACK/QPACK Dynamic Table: LinkedList O(n) Lookup |
| 24 | + |
| 25 | +Both dynamic tables use `LinkedList<T>` and walk the list linearly for every indexed header reference. For 100 headers this can mean up to **~5,050 node hops** (1 + 2 + ... + 100) per response, when each reference lands one entry deeper than the last. |
| 26 | + |
| 27 | +| File | Lines | Issue | |
| 28 | +|------|-------|-------| |
| 29 | +| `Protocol/Http2/Hpack/HpackDecoder.cs` | 71-85 | `GetEntry()` — O(n) LinkedList walk per index | |
| 30 | +| `Protocol/Http3/Qpack/QpackDynamicTable.cs` | 118-133 | `GetEntry()` — O(n) LinkedList walk per absolute index | |
| 31 | +| `Protocol/Http3/Qpack/QpackEncoder.cs` | 509-550 | `FindDynamicExact()`/`FindDynamicName()` — linear search | |
| 32 | + |
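| 33 | +**Fix:** Replace with an index-addressable structure: a ring buffer (O(1) lookup, insert, and eviction), optionally with a hash index for the encoder's name/value searches. A plain `List<T>` also gives O(1) indexing, but inserting or evicting at the front shifts every element. |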
| 34 | + |
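
The ring-buffer option can be sketched as follows. This is a minimal illustration, not TurboHttp's actual type: the class name `RingDynamicTable` and its members are hypothetical, and the byte-size accounting and static-table index offset that a real HPACK/QPACK table needs are left out.

```csharp
using System;

// Hypothetical sketch of a dynamic table backed by a circular array.
// Index 0 is always the most recently added entry.
public sealed class RingDynamicTable
{
    private (string Name, string Value)[] _slots;
    private int _head;   // slot holding the newest entry
    private int _count;  // number of live entries

    public RingDynamicTable(int initialCapacity = 64)
    {
        _slots = new (string, string)[Math.Max(1, initialCapacity)];
    }

    public int Count => _count;

    // Insert as the newest entry: amortized O(1).
    public void Add(string name, string value)
    {
        if (_count == _slots.Length) Grow();
        _head = (_head - 1 + _slots.Length) % _slots.Length;
        _slots[_head] = (name, value);
        _count++;
    }

    // Drop the oldest entry: O(1).
    public void EvictOldest()
    {
        if (_count == 0) throw new InvalidOperationException("Table is empty.");
        int tail = (_head + _count - 1) % _slots.Length;
        _slots[tail] = default; // release the string references
        _count--;
    }

    // Lookup by relative index: O(1), no list walk.
    public (string Name, string Value) GetEntry(int index)
    {
        if ((uint)index >= (uint)_count)
            throw new ArgumentOutOfRangeException(nameof(index));
        return _slots[(_head + index) % _slots.Length];
    }

    // Double the backing array, copying entries newest-first.
    private void Grow()
    {
        var bigger = new (string, string)[_slots.Length * 2];
        for (int i = 0; i < _count; i++)
            bigger[i] = _slots[(_head + i) % _slots.Length];
        _slots = bigger;
        _head = 0;
    }
}
```

A real table would call `EvictOldest()` in a loop until the tracked size fits the negotiated maximum before each `Add`.
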
| 35 | +--- |
| 36 | + |
| 37 | +### 2. HTTP/2 Request Body: Triple-Copy Pattern |
| 38 | + |
| 39 | +A 10MB POST body gets **copied 3 times** before landing in frames: |
| 40 | + |
| 41 | +| File | Line | Copy | |
| 42 | +|------|------|------| |
| 43 | +| `Protocol/Http2/Http2RequestEncoder.cs` | 70 | HttpContent → MemoryStream | |
| 44 | +| `Protocol/Http2/Http2RequestEncoder.cs` | 74 | MemoryStream → `new byte[bodyLen]` | |
| 45 | +| `Protocol/Http2/Http2RequestEncoder.cs` | 93-100 | byte[] → 16KB frame chunks | |
| 46 | + |
| 47 | +**Impact:** ~7x memory overhead for large bodies. |
| 48 | +**Fix:** Stream directly from HttpContent into frame chunks without intermediate buffers. |
| 49 | + |
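
A minimal sketch of the streaming approach, under stated assumptions: `BodyFramer` and the `writeFrame` callback are hypothetical stand-ins for the real DATA-frame writer, and flow-control window checks are omitted. The only buffer involved is one pooled 16 KB array that is reused for every chunk.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BodyFramer
{
    public const int MaxFramePayload = 16 * 1024;

    // Reads the body in frame-sized chunks and hands each chunk to writeFrame.
    // writeFrame(payload, endStream) is a hypothetical callback; it must finish
    // with the payload before returning, because the buffer is reused.
    public static async Task<long> StreamBodyAsync(
        Stream body,
        Func<ReadOnlyMemory<byte>, bool, ValueTask> writeFrame,
        CancellationToken ct = default)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(MaxFramePayload);
        long total = 0;
        try
        {
            int read;
            while ((read = await body.ReadAsync(buffer.AsMemory(0, MaxFramePayload), ct)) > 0)
            {
                total += read;
                await writeFrame(buffer.AsMemory(0, read), false);
            }

            // Close the stream with an empty final frame.
            await writeFrame(ReadOnlyMemory<byte>.Empty, true);
            return total;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

With `HttpContent`, the source stream would come from `ReadAsStreamAsync()`. Peak memory then stays near one frame payload regardless of body size.
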
| 50 | +--- |
| 51 | + |
| 52 | +### 3. HTTP/3 Encoding: Allocation per Header |
| 53 | + |
| 54 | +The QPACK encoder calls `Encoding.UTF8.GetBytes()`, which allocates a new `byte[]`, **for every header field of every request** (5-30 allocations/request): |
| 55 | + |
| 56 | +| File | Lines | |
| 57 | +|------|-------| |
| 58 | +| `Protocol/Http3/Qpack/QpackEncoder.cs` | 247, 254, 493, 502 | |
| 59 | +| `Protocol/Http3/Qpack/QpackEncoderInstructionWriter.cs` | 77, 113-114 | |
| 60 | + |
| 61 | +**Fix:** Rent from `ArrayPool<byte>` and encode with the span overload `GetBytes(ReadOnlySpan<char>, Span<byte>)`. |
| 62 | + |
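
The pooled encoding pattern might look like the sketch below. `PooledUtf8`, `Utf8Consumer`, and `WithUtf8` are illustrative names, not part of TurboHttp; the callback shape is one assumption about how the encoded bytes would be handed to the QPACK writer.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical callback type: receives the encoded bytes for the duration of the call.
public delegate void Utf8Consumer(ReadOnlySpan<byte> utf8);

public static class PooledUtf8
{
    // Encodes value into a rented buffer, passes the bytes to consume,
    // and returns the buffer to the pool afterwards.
    public static void WithUtf8(string value, Utf8Consumer consume)
    {
        // Upper bound on the encoded size, so the rented buffer is always large enough.
        int max = Encoding.UTF8.GetMaxByteCount(value.Length);
        byte[] rented = ArrayPool<byte>.Shared.Rent(max);
        try
        {
            int written = Encoding.UTF8.GetBytes(value.AsSpan(), rented);
            consume(rented.AsSpan(0, written));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The consumer must not keep the span after it returns, since the buffer goes back to the pool. For short header names, `stackalloc` into a `Span<byte>` is an alternative that avoids the pool round trip.
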
| 63 | +--- |
| 64 | + |
| 65 | +### 4. Graph Materialization per Substream |
| 66 | + |
| 67 | +`VersionDispatchStage` materializes the **entire engine pipeline** for every new endpoint group: |
| 68 | + |
| 69 | +| File | Lines | Issue | |
| 70 | +|------|-------|-------| |
| 71 | +| `Streams/Stages/Internal/VersionDispatchStage.cs` | 112-121 | `SubFusingMaterializer` creates all stage logics from scratch | |
| 72 | + |
| 73 | +**Impact:** 10 different endpoints = 10x full pipeline allocation (Encoder, Decoder, Correlation, Features). |
| 74 | +**Fix:** Flow caching per (Version, Endpoint). |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +### 5. HTTP/3 QUIC: Sequential Stream Opening |
| 79 | + |
| 80 | +`SemaphoreSlim(1)` serializes QUIC stream opening — destroys multiplexing benefit: |
| 81 | + |
| 82 | +| File | Lines | Issue | |
| 83 | +|------|-------|-------| |
| 84 | +| `Transport/Quic/QuicConnectionManager.cs` | 54-76 | `_spawnLock.WaitAsync()` blocks concurrent stream creation | |
| 85 | + |
| 86 | +**Fix:** Remove lock. |
| 87 | + |
| 88 | +--- |
| 89 | + |
| 90 | +## HIGH — Significant Impact |
| 91 | + |
| 92 | +### 6. HTTP/2 Flow Control: Receive Window Too Small |
| 93 | + |
| 94 | +Default `initialRecvWindowSize = 65535` bytes. With at most one window in flight per round trip, a 50 ms RTT caps each stream at **~1.3 MB/s (~10.5 Mbit/s)**. |
| 95 | + |
| 96 | +| File | Line | |
| 97 | +|------|------| |
| 98 | +| `Streams/Stages/Decoding/Http20ConnectionStage.cs` | 81 | |
| 99 | + |
| 100 | +**Fix:** Default to 1MB+, adapt based on BDP (Bandwidth-Delay Product). |
| 101 | + |
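
The arithmetic behind the window cap and the BDP-based sizing can be sketched like this. `WindowMath` and its method names are hypothetical helpers for illustration. The model is the usual simplification: a receiver that only replenishes the window once per round trip limits throughput to window / RTT.

```csharp
using System;

public static class WindowMath
{
    // Upper bound on throughput when one full window is in flight per round trip.
    public static double MaxBytesPerSecond(long windowBytes, TimeSpan rtt)
        => windowBytes / rtt.TotalSeconds;

    // Bandwidth-delay product: the window needed to keep a link of the
    // given bit rate full at the given round-trip time.
    public static long RequiredWindowBytes(double targetBitsPerSecond, TimeSpan rtt)
        => (long)Math.Ceiling(targetBitsPerSecond / 8.0 * rtt.TotalSeconds);
}
```

For the default window, `MaxBytesPerSecond(65535, TimeSpan.FromMilliseconds(50))` gives roughly 1.31 million bytes per second, which is about 10.5 Mbit/s. In the other direction, filling a 1 Gbit/s link at 50 ms RTT needs a window of about 6.25 MB, which is why a fixed 1 MB default is a floor rather than a substitute for adapting to the measured BDP.
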
| 102 | +--- |
| 103 | + |
| 104 | +### 7. HTTP/2 Stream State Pool Too Small |
| 105 | + |
| 106 | +`StatePoolCapacity = 32`, but `maxConcurrentStreams = 100`. Above 32 concurrent streams, states are no longer recycled, which causes GC churn: |
| 107 | + |
| 108 | +| File | Line | |
| 109 | +|------|------| |
| 110 | +| `Streams/Stages/Decoding/Http20ConnectionStage.cs` | 208 | |
| 111 | + |
| 112 | +**Fix:** Size the pool from `maxConcurrentStreams` directly. |
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +### 8. HPACK/QPACK: Repeated UTF-8 GetByteCount Calls |
| 117 | + |
| 118 | +`EntrySize()` calls `Encoding.UTF8.GetByteCount()` **multiple times** for the same header (Add, Eviction, CheckSize): |
| 119 | + |
| 120 | +| File | Lines | |
| 121 | +|------|-------| |
| 122 | +| `Protocol/Http2/Hpack/HpackDecoder.cs` | 108, 215, 322 | |
| 123 | +| `Protocol/Http3/Qpack/QpackDynamicTable.cs` | 164 | |
| 124 | + |
| 125 | +**Fix:** Cache byte-length at insertion time (store in header struct). |
| 126 | + |
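
Caching the size at insertion could look like the sketch below. `SizedHeader` is a hypothetical type, not the existing header struct. The 32-byte per-entry overhead is the value HPACK (RFC 7541, section 4.1) and QPACK both use when sizing dynamic-table entries.

```csharp
using System.Text;

// Hypothetical header entry that computes its table size once, at construction.
public readonly struct SizedHeader
{
    // Fixed per-entry overhead defined by the HPACK/QPACK size rules.
    public const int EntryOverhead = 32;

    public string Name { get; }
    public string Value { get; }

    // Name bytes + value bytes + 32, cached so Add, eviction,
    // and size checks never re-scan the strings.
    public int Size { get; }

    public SizedHeader(string name, string value)
    {
        Name = name;
        Value = value;
        Size = Encoding.UTF8.GetByteCount(name)
             + Encoding.UTF8.GetByteCount(value)
             + EntryOverhead;
    }
}
```

Eviction loops can then subtract `entry.Size` from the running table size without touching the strings again.
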
| 127 | +--- |
| 128 | + |
| 129 | +### 9. HTTP/3 Frame Decoder: No Buffer Pooling |
| 130 | + |
| 131 | +Every fragmented frame allocates `new byte[]` without ArrayPool: |
| 132 | + |
| 133 | +| File | Lines | Issue | |
| 134 | +|------|-------|-------| |
| 135 | +| `Protocol/Http3/Http3FrameDecoder.cs` | 44, 62, 79 | `new byte[]` for combined/remainder | |
| 136 | +| `Protocol/Http3/Http3FrameDecoder.cs` | 199, 204, 235 | `.ToArray()` for frame payloads | |
| 137 | +| `Protocol/Http3/Http3ResponseDecoder.cs` | 123-149 | `List<byte[]>` body assembly with O(n²) copying | |
| 138 | +| `Protocol/Http3/Qpack/QpackInstructionDecoder.cs` | 332 | `new byte[]` for combined buffer | |
| 139 | + |
| 140 | +**Fix:** `ArrayPool<byte>.Shared.Rent()` + `Memory<byte>` slices instead of `.ToArray()`. |
| 141 | + |
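
One way to replace the per-fragment `new byte[]` for combined/remainder data is a small accumulator over a rented buffer. This is a sketch under assumptions: `PooledAccumulator` is a hypothetical helper, and a real decoder would parse frames out of `Pending` and call `Consume` for each complete frame.

```csharp
using System;
using System.Buffers;

// Hypothetical accumulator for partial frames, backed by ArrayPool.
public sealed class PooledAccumulator : IDisposable
{
    private byte[] _buffer = Array.Empty<byte>();
    private int _length;

    // View of the unparsed bytes; valid until the next Append, Consume, or Dispose.
    public ReadOnlyMemory<byte> Pending => _buffer.AsMemory(0, _length);

    // Appends incoming bytes, renting a larger buffer only when needed.
    public void Append(ReadOnlySpan<byte> data)
    {
        int needed = _length + data.Length;
        if (needed > _buffer.Length)
        {
            byte[] bigger = ArrayPool<byte>.Shared.Rent(needed);
            _buffer.AsSpan(0, _length).CopyTo(bigger);
            ReturnBuffer();
            _buffer = bigger;
        }
        data.CopyTo(_buffer.AsSpan(_length));
        _length = needed;
    }

    // Drops the first count bytes once a complete frame has been parsed from them.
    public void Consume(int count)
    {
        if ((uint)count > (uint)_length)
            throw new ArgumentOutOfRangeException(nameof(count));
        // Span.CopyTo handles the overlapping move correctly.
        _buffer.AsSpan(count, _length - count).CopyTo(_buffer);
        _length -= count;
    }

    public void Dispose()
    {
        ReturnBuffer();
        _buffer = Array.Empty<byte>();
        _length = 0;
    }

    private void ReturnBuffer()
    {
        if (_buffer.Length > 0) ArrayPool<byte>.Shared.Return(_buffer);
    }
}
```

Frame payloads can then be passed downstream as slices of `Pending` instead of `.ToArray()` copies, provided the consumer finishes with a slice before the next `Append` or `Consume`.
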
| 142 | +--- |
| 143 | + |
| 144 | +### 10. HTTP/1.0 Decoder: Excessive ToArray() |
| 145 | + |
| 146 | +Every response parse allocates multiple times via `.ToArray()`: |
| 147 | + |
| 148 | +| File | Lines | |
| 149 | +|------|-------| |
| 150 | +| `Protocol/Http10/Http10Decoder.cs` | 79, 111, 116, 141, 155, 165, 207, 247, 252 | |
| 151 | +| `Protocol/Http10/Http10Decoder.cs` | 485 | `Combine()` — `new byte[]` without pooling | |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +### 11. HuffmanCodec: MemoryStream + ToArray() |
| 156 | + |
| 157 | +Every encode/decode allocates MemoryStream and copies via `.ToArray()`: |
| 158 | + |
| 159 | +| File | Lines | |
| 160 | +|------|-------| |
| 161 | +| `Protocol/HuffmanCodec.cs` | 110-112 | `new MemoryStream()` + `.ToArray()` in Encode | |
| 162 | +| `Protocol/HuffmanCodec.cs` | 138 | `new MemoryStream()` in Decode | |
| 163 | + |
| 164 | +**Fix:** Span-based with pre-sized buffer. |
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +## MEDIUM — Noticeable Under Load |
| 169 | + |
| 170 | +### 12. Batch Weight Too Conservative |
| 171 | + |
| 172 | +`MaxBatchWeight = 65536` (64KB) — at high throughput causes too many scheduler ticks: |
| 173 | + |
| 174 | +| File | Line | |
| 175 | +|------|------| |
| 176 | +| `Streams/Http20Engine.cs` | 16 | |
| 177 | + |
| 178 | +**Fix:** Raise to 256-512 KB for high-throughput workloads, ideally adaptive. |
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +### 13. MemoryStream Allocations Scattered Everywhere |
| 183 | + |
| 184 | +At least 9 locations create `new MemoryStream()` without pooling: |
| 185 | + |
| 186 | +| File | Context | |
| 187 | +|------|---------| |
| 188 | +| `Protocol/Http3/Http3RequestEncoder.cs:77` | Per-request body | |
| 189 | +| `Protocol/Http10/Http10Encoder.cs:149` | Unknown-length body | |
| 190 | +| `Protocol/Semantics/ContentEncodingEncoder.cs:52,63,74` | Compression | |
| 191 | +| `Protocol/Semantics/ContentEncodingDecoder.cs:185` | Decompression | |
| 192 | +| `Streams/Stages/Features/ContentEncodingBidiStage.cs:299-332` | Multiple instances | |
| 193 | + |
| 194 | +**Fix:** `RecyclableMemoryStreamManager`. |
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +### 14. Per-Request Collection Allocations |
| 199 | + |
| 200 | +`new List<T>` / `new Dictionary<T,V>` in hot paths: |
| 201 | + |
| 202 | +| File | Lines | What | |
| 203 | +| -------------------------------------- | ------- | ------------------------------------------ | |
| 204 | +| `Protocol/Http2/Http2FrameDecoder.cs` | 109 | `new List<Http2Frame>()` per decode | |
| 205 | +| `Protocol/Http3/Http3FrameDecoder.cs` | 98 | `new List<Http3Frame>()` per decode | |
| 206 | +| `Protocol/Http2/Hpack/HpackDecoder.cs` | 193 | `new List<HpackHeader>()` per header block | |
| 207 | +| `Protocol/Http3/Qpack/QpackDecoder.cs` | 95, 140 | `new List<(string,string)>()` per decode | |
| 208 | +| `Protocol/Cookies/CookieJar.cs` | 112 | `new List<CookieEntry>()` per request | |
| 209 | + |
| 210 | +**Fix:** `ArrayPool`-backed lists. |
| 211 | + |
| 212 | +--- |
| 213 | + |
| 214 | +### 15. TcpConnectionStage: Task.Run per Connection |
| 215 | + |
| 216 | +Every TCP connection spawns `Task.Run()` for the inbound pump: |
| 217 | + |
| 218 | +| File | Line | |
| 219 | +|------|------| |
| 220 | +| `Transport/Tcp/TcpConnectionStage.cs` | 523 | |
| 221 | +| `Transport/Quic/QuicConnectionStage.cs` | 459 | |
| 222 | + |
| 223 | +--- |
| 224 | + |
| 225 | +### 16. QPACK Encoder Instruction Blocking |
| 226 | + |
| 227 | +When encoder instructions cannot be flushed, this **serializes all** subsequent requests: |
| 228 | + |
| 229 | +| File | Lines | |
| 230 | +|------|-------| |
| 231 | +| `Streams/Stages/Encoding/Http30Request2FrameStage.cs` | 92-96 | |
| 232 | + |
| 233 | +--- |
| 234 | + |
| 235 | +## LOW — Nice-to-Have |
| 236 | + |
| 237 | +| # | Issue | File:Line | |
| 238 | +|---|-------|-----------| |
| 239 | +| 17 | `QpackStringCodec` allocates Huffman-Encode just to check length | `Qpack/QpackStringCodec.cs:29` | |
| 240 | +| 18 | `DateTime.UtcNow` per connection in eviction loop | `ConnectionManagerActor.cs:306` | |
| 241 | +| 19 | `GroupByRequestEndpointStage.RemoveDead()` allocates `List<int>` even when empty | `GroupByRequestEndpointStage.cs:159` | |
| 242 | +| 20 | Socket buffer sizes not configurable | `IClientProvider.cs:100` | |
| 243 | +| 21 | `HuffmanCodec._root` volatile instead of static initializer | `HuffmanCodec.cs:115` | |
| 244 | +| 22 | NetworkBuffer pool unbounded (no cap) | `Messages.cs:80` | |
| 245 | + |
| 246 | +--- |
| 247 | + |
| 248 | +## Top 5 Quick Wins (Effort vs Impact) |
| 249 | + |
| 250 | +| # | Fix | Expected Impact | Effort | |
| 251 | +|---|-----|-----------------|--------| |
| 252 | +| 1 | HPACK/QPACK `LinkedList` → `List<T>` | **~30% faster header decode** | 2-3h | |
| 253 | +| 2 | HTTP/2 body: direct streaming instead of triple-copy | **~7x less memory for POST** | 4-6h | |
| 254 | +| 3 | QPACK Encoder: `stackalloc`/`ArrayPool` instead of `GetBytes()` | **~5-30 fewer allocs/request** | 2-3h | |
| 255 | +| 4 | HTTP/3 FrameDecoder: `ArrayPool` instead of `new byte[]` | **GC pressure significantly reduced** | 1-2h | |
| 256 | +| 5 | Receive window → 1MB+ | **Throughput x10+ at latency >10ms** | 30min | |
| 257 | + |
| 258 | +--- |
| 259 | + |
| 260 | +## Next Steps |
| 261 | + |
| 262 | +- [ ] Create feature plans for top 5 quick wins |
| 263 | +- [ ] Run BenchmarkDotNet baselines before changes |
| 264 | +- [ ] Implement fixes in priority order |
| 265 | +- [ ] Re-benchmark after each fix to measure actual impact |