| Component | Location | Description |
|---|---|---|
|**Load Generator**|`src/inference_endpoint/load_generator/`| Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
|**Endpoint Client**|`src/inference_endpoint/endpoint_client/`| Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
|**Dataset Manager**|`src/inference_endpoint/dataset_manager/`| Loads pickle, HuggingFace, and JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
|**Metrics**|`src/inference_endpoint/metrics/`|`EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
|**Config**|`src/inference_endpoint/config/`, `endpoint_client/config.py`| Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
|**CLI**|`src/inference_endpoint/main.py`, `commands/benchmark/cli.py`| cyclopts-based, auto-generated from the `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
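
The `Dataset` contract named in the table above can be sketched roughly as follows. This is a minimal sketch: only the `load_sample()`/`num_samples()` method names are taken from the table, and the signatures (e.g. whether `load_sample()` takes an index) and the `InMemoryDataset` helper are assumptions for illustration, not the repo's actual code.

```python
from abc import ABC, abstractmethod
from typing import Any


class Dataset(ABC):
    """Base interface for benchmark datasets (pickle, HuggingFace, JSONL)."""

    @abstractmethod
    def num_samples(self) -> int:
        """Total number of samples available."""

    @abstractmethod
    def load_sample(self, index: int) -> Any:
        """Return the sample at the given index."""


class InMemoryDataset(Dataset):
    """Trivial list-backed implementation (illustration only)."""

    def __init__(self, samples: list[Any]) -> None:
        self._samples = samples

    def num_samples(self) -> int:
        return len(self._samples)

    def load_sample(self, index: int) -> Any:
        return self._samples[index]
```

Any concrete loader (pickle, HuggingFace, JSONL) would plug into the load generator through this same two-method surface.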
`docs/CLIENT_PERFORMANCE_TUNING.md` (+30 lines)
---
## IPC Transport Buffer Sizes
The ZMQ transport uses a pre-allocated receive buffer (`bytearray`) for zero-copy message deserialization. If a serialized message exceeds this buffer, the worker crashes. Two settings control the buffer sizes:
| Setting | Default | Description |
|---|---|---|
|`recv_buffer_size`| 16 MB | Application receive buffer per socket |
|`send_buffer_size`| 16 MB | Kernel send buffer hint (advisory on IPC) |
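
The pre-allocated-buffer pattern described above can be illustrated with plain stdlib sockets standing in for the ZMQ IPC pair (a sketch; the names and the `socketpair` stand-in are illustrative, not the transport's actual code):

```python
import socket

RECV_BUFFER_SIZE = 16 * 1024 * 1024      # mirrors the 16 MB default
recv_buf = bytearray(RECV_BUFFER_SIZE)   # allocated once, reused for every message
view = memoryview(recv_buf)

# A connected socketpair stands in for the ZMQ IPC endpoints.
a, b = socket.socketpair()
a.sendall(b"serialized-message")
n = b.recv_into(view)        # bytes land directly in recv_buf, no intermediate copy
message = bytes(view[:n])
a.close()
b.close()
```

A message larger than `recv_buf` cannot be received whole into the buffer, which is the failure mode the settings above exist to avoid.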
**When to increase:** Multimodal workloads with large base64-encoded images in the request payload. A single VLM request with a high-resolution image can easily exceed 16 MB after msgspec serialization.
**When the default is fine:** Text-only workloads. A 32K-token prompt serializes to ~150 KB — well within the 16 MB buffer.
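
The base64 inflation behind the multimodal warning is easy to check: base64 expands raw bytes by roughly 4/3, so even a ~13 MB raw image crosses the 16 MB default before any serialization envelope is added (the 13 MB figure is illustrative):

```python
import base64

DEFAULT_RECV_BUFFER = 16 * 1024 * 1024      # the 16 MB default
raw_image = b"\x00" * (13 * 1024 * 1024)    # hypothetical 13 MB raw image
encoded = base64.b64encode(raw_image)       # ~4/3 inflation: about 17.3 MB
over_limit = len(encoded) > DEFAULT_RECV_BUFFER
```

By contrast, even a very long text prompt stays orders of magnitude below the buffer, which is why text-only workloads never need to touch these settings.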
```yaml
settings:
  client:
    transport:
      type: zmq
      recv_buffer_size: 67108864  # 64 MB for large multimodal payloads
      send_buffer_size: 67108864
```
**Note:** `recv_buffer_size` sets the application-level `recv_into` buffer, not a kernel limit. IPC (Unix domain) sockets ignore `SO_RCVBUF`/`SO_SNDBUF` — the OS handles arbitrarily large messages regardless. The `send_buffer_size` is passed to `zmq.SNDBUF` as a kernel hint but has no effect on IPC transport.
---
## Test Servers
Two built-in servers for benchmarking without a real GPU endpoint.