
Commit 465f559

Merge branch 'main' into arekay/doc-refactor

2 parents ac8d591 + 613a2ac

53 files changed: 1653 additions & 1568 deletions


AGENTS.md

Lines changed: 15 additions & 14 deletions
```diff
@@ -43,16 +43,16 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
 
 ### Key Components
 
-| Component           | Location                                                       | Purpose                                                                                                                    |
-| ------------------- | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
-| **Load Generator**  | `src/inference_endpoint/load_generator/`                       | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries    |
-| **Endpoint Client** | `src/inference_endpoint/endpoint_client/`                      | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point                          |
-| **Dataset Manager** | `src/inference_endpoint/dataset_manager/`                      | Loads pickle, HuggingFace, JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface              |
-| **Metrics**         | `src/inference_endpoint/metrics/`                              | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT)                         |
-| **Config**          | `src/inference_endpoint/config/`                               | Pydantic-based YAML schema (`schema.py`), ruleset registry for MLCommons compliance, `RuntimeSettings` for runtime state    |
-| **CLI**             | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py`  | cyclopts-based, auto-generated from `schema.py` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)`        |
-| **Async Utils**     | `src/inference_endpoint/async_utils/`                          | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, event publisher                                           |
-| **OpenAI/SGLang**   | `src/inference_endpoint/openai/`, `sglang/`                    | Protocol adapters and response accumulators for different API formats                                                       |
+| Component           | Location                                                       | Purpose                                                                                                                                       |
+| ------------------- | -------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Load Generator**  | `src/inference_endpoint/load_generator/`                       | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries                      |
+| **Endpoint Client** | `src/inference_endpoint/endpoint_client/`                      | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point                                            |
+| **Dataset Manager** | `src/inference_endpoint/dataset_manager/`                      | Loads pickle, HuggingFace, JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface                                |
+| **Metrics**         | `src/inference_endpoint/metrics/`                              | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT)                                           |
+| **Config**          | `src/inference_endpoint/config/`, `endpoint_client/config.py`  | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings`                  |
+| **CLI**             | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py`  | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)`   |
+| **Async Utils**     | `src/inference_endpoint/async_utils/`                          | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, event publisher                                                             |
+| **OpenAI/SGLang**   | `src/inference_endpoint/openai/`, `sglang/`                    | Protocol adapters and response accumulators for different API formats                                                                         |
 
 ### Hot-Path Architecture
 
@@ -114,7 +114,7 @@ src/inference_endpoint/
 │ ├── info.py # execute_info()
 │ ├── validate.py # execute_validate()
 │ └── init.py # execute_init()
-├── core/types.py # Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
+├── core/types.py # APIType, Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
 ├── load_generator/
 │ ├── session.py # BenchmarkSession - top-level orchestrator
 │ ├── load_generator.py # LoadGenerator, SchedulerBasedLoadGenerator
@@ -127,7 +127,7 @@ src/inference_endpoint/
 │ ├── worker_manager.py # Manages worker lifecycle
 │ ├── http.py # ConnectionPool, HttpRequestTemplate, raw HTTP
 │ ├── http_sample_issuer.py # Bridges load generator to HTTP client
-│ ├── config.py # HTTPClientConfig
+│ ├── config.py # HTTPClientConfig (single Pydantic model — CLI/YAML/runtime)
 │ ├── adapter_protocol.py # HttpRequestAdapter protocol
 │ ├── accumulator_protocol.py # Response accumulation protocol
 │ ├── cpu_affinity.py # CPU pinning
@@ -140,8 +140,9 @@ src/inference_endpoint/
 │ │ ├── event_logger/ # EventLoggerService: writes EventRecords to JSONL/SQLite
 │ │ └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
 │ └── transport/ # ZMQ-based IPC transport layer
-│ ├── protocol.py # Transport protocol definitions
-│ └── zmq/ # ZMQ implementation (context, pubsub, transport)
+│ ├── protocol.py # Transport protocols + TransportConfig base
+│ ├── record.py # Transport records
+│ └── zmq/ # ZMQ implementation (context, pubsub, transport, ZMQTransportConfig)
 ├── dataset_manager/
 │ ├── dataset.py # Dataset base class, DatasetFormat enum
 │ ├── factory.py # Dataset factory
```

docs/CLIENT_PERFORMANCE_TUNING.md

Lines changed: 30 additions & 0 deletions
````diff
@@ -95,6 +95,36 @@ For streaming workloads, also watch **SSE-pkts/s** — a small stream interval (
 
 ---
 
+## IPC Transport Buffer Sizes
+
+The ZMQ transport uses a pre-allocated receive buffer (`bytearray`) for zero-copy message deserialization. If a serialized message exceeds this buffer, the worker crashes with:
+
+```
+RuntimeError: ZMQ message truncated (18874368 > 16777216 bytes). Increase client.transport.recv_buffer_size in config.
+```
+
+| Setting            | Default | Description                               |
+| ------------------ | ------- | ----------------------------------------- |
+| `recv_buffer_size` | 16 MB   | Application receive buffer per socket     |
+| `send_buffer_size` | 16 MB   | Kernel send buffer hint (advisory on IPC) |
+
+**When to increase:** Multimodal workloads with large base64-encoded images in the request payload. A single VLM request with a high-resolution image can easily exceed 16 MB after msgspec serialization.
+
+**When the default is fine:** Text-only workloads. A 32K-token prompt serializes to ~150 KB — well within the 16 MB buffer.
+
+```yaml
+settings:
+  client:
+    transport:
+      type: zmq
+      recv_buffer_size: 67108864 # 64 MB for large multimodal payloads
+      send_buffer_size: 67108864
+```
+
+**Note:** `recv_buffer_size` sets the application-level `recv_into` buffer, not a kernel limit. IPC (Unix domain) sockets ignore `SO_RCVBUF`/`SO_SNDBUF` — the OS handles arbitrarily large messages regardless. The `send_buffer_size` is passed to `zmq.SNDBUF` as a kernel hint but has no effect on IPC transport.
+
+---
+
 ## Test Servers
 
 Two built-in servers for benchmarking without a real GPU endpoint.
````
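The truncation guard documented in the added section can be sketched in plain Python. This is a hypothetical stand-in for the real ZMQ transport (which calls `recv_into` on a socket); only the pre-allocated buffer, the default size, and the error text follow the doc:

```python
class ZMQReceiver:
    """Hypothetical sketch of the transport's pre-allocated receive buffer."""

    def __init__(self, recv_buffer_size: int = 16 * 1024 * 1024) -> None:
        # Allocated once and reused for every message (zero-copy deserialize).
        self._buf = bytearray(recv_buffer_size)

    def receive(self, message: bytes) -> memoryview:
        # A message larger than the buffer would be silently truncated by
        # recv_into, so the transport raises instead of corrupting data.
        if len(message) > len(self._buf):
            raise RuntimeError(
                f"ZMQ message truncated ({len(message)} > {len(self._buf)} bytes). "
                "Increase client.transport.recv_buffer_size in config."
            )
        self._buf[: len(message)] = message
        # Hand the deserializer a view into the buffer -- no extra copy.
        return memoryview(self._buf)[: len(message)]
```

With an 8-byte buffer, a 5-byte message comes back as a view while a 10-byte message raises the truncation error shown in the doc.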

docs/CLI_DESIGN.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -133,7 +133,7 @@ Validation is layered, executing in order:
 3. Sub-model validators:
    ├── RuntimeConfig._validate_durations → max >= min duration
    ├── LoadPattern._validate_completeness → poisson needs qps, concurrency needs target
-   └── ClientSettings._workers_not_zero → workers != 0
+   └── HTTPClientConfig._workers_not_zero → num_workers != 0
 4. BenchmarkConfig._resolve_and_validate:
    ├── resolve defaults (name, streaming, model name from submission_ref)
    ├── load pattern type vs test type (offline→max_throughput, online→poisson/concurrency)
```
````diff
@@ -156,7 +156,7 @@ $ inference-endpoint benchmark offline
 
 $ inference-endpoint benchmark offline --endpoints x --model M --dataset D --workers abc
 ╭── Error ─────────────────────────────────────────────────────────────────────╮
-│ settings.client.workers: Input should be a valid integer, unable to parse
+│ settings.client.num_workers: Input should be a valid integer, unable to
 │ string as an integer │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ```
````
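The `HTTPClientConfig._workers_not_zero` check renamed in this diff can be sketched as a standard Pydantic v2 `model_validator`. This is an illustrative reconstruction, not the project's actual code; only the field name and the non-zero rule come from the doc:

```python
from pydantic import BaseModel, ValidationError, model_validator


class HTTPClientConfig(BaseModel):
    # -1 means "auto": the worker count is derived at runtime; 0 is invalid.
    num_workers: int = -1

    @model_validator(mode="after")
    def _workers_not_zero(self) -> "HTTPClientConfig":
        # Mirrors the "num_workers != 0" sub-model rule from the diagram above.
        if self.num_workers == 0:
            raise ValueError("num_workers must be -1 (auto) or a positive count")
        return self


ok = HTTPClientConfig(num_workers=4)   # accepted
auto = HTTPClientConfig()              # accepted: default -1 (auto)
try:
    HTTPClientConfig(num_workers=0)    # rejected by the sub-model validator
    rejected = False
except ValidationError:
    rejected = True
```

Because the rule lives on the model rather than in CLI code, it fires identically for YAML configs and for CLI flags parsed into the same model.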
````diff
@@ -182,7 +182,7 @@ NotImplementedError 1 Unimplemented command (eval)
 Annotate the schema field — zero CLI code changes:
 
 ```python
-class ClientSettings(BaseModel):
+class HTTPClientConfig(WithUpdatesMixin, BaseModel):
     buffer_size: Annotated[
         int,
         cyclopts.Parameter(alias="--buffer-size", help="Socket buffer size"),
````

docs/CLI_QUICK_REFERENCE.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -97,7 +97,7 @@ Flag names shown as `--full.dotted.path --alias`. Both forms work.
 - `--model-params.streaming --streaming` - Streaming mode: auto/on/off (default: auto)
 - `--runtime.min-duration-ms --duration` - Min duration: ms default, or with suffix (600s, 10m) (default: 600000)
 - `--runtime.n-samples-to-issue --num-samples` - Explicit sample count override
-- `--client.workers --workers` - HTTP workers (-1=auto, default: -1)
+- `--client.num-workers --workers` - HTTP workers (-1=auto, default: -1)
 - `--client.max-connections --max-connections` - Max TCP connections (-1=unlimited)
 - `--endpoint-config.api-key --api-key` - API authentication
 - `--endpoint-config.api-type --api-type` - API type: openai/sglang (default: openai)
@@ -288,7 +288,7 @@ settings:
     type: "max_throughput"
     target_qps: 10.0
   client:
-    workers: -1 # auto
+    num_workers: -1 # auto
 
   metrics:
     collect: ["throughput", "latency", "ttft", "tpot"]
```
