
Commit 465f559

Merge branch 'main' into arekay/doc-refactor

2 parents ac8d591 + 613a2ac

53 files changed: 1653 additions & 1568 deletions


AGENTS.md

Lines changed: 15 additions & 14 deletions
```diff
@@ -43,16 +43,16 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint
 
 ### Key Components
 
-| Component           | Location                                                       | Purpose                                                                                                                    |
-| ------------------- | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
-| **Load Generator**  | `src/inference_endpoint/load_generator/`                       | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries    |
-| **Endpoint Client** | `src/inference_endpoint/endpoint_client/`                      | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point                          |
-| **Dataset Manager** | `src/inference_endpoint/dataset_manager/`                      | Loads pickle, HuggingFace, JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface              |
-| **Metrics**         | `src/inference_endpoint/metrics/`                              | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT)                         |
-| **Config**          | `src/inference_endpoint/config/`                               | Pydantic-based YAML schema (`schema.py`), ruleset registry for MLCommons compliance, `RuntimeSettings` for runtime state    |
-| **CLI**             | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py`  | cyclopts-based, auto-generated from `schema.py` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)`        |
-| **Async Utils**     | `src/inference_endpoint/async_utils/`                          | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, event publisher                                           |
-| **OpenAI/SGLang**   | `src/inference_endpoint/openai/`, `sglang/`                    | Protocol adapters and response accumulators for different API formats                                                       |
+| Component           | Location                                                       | Purpose                                                                                                                                       |
+| ------------------- | -------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Load Generator**  | `src/inference_endpoint/load_generator/`                       | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries                      |
+| **Endpoint Client** | `src/inference_endpoint/endpoint_client/`                      | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point                                            |
+| **Dataset Manager** | `src/inference_endpoint/dataset_manager/`                      | Loads pickle, HuggingFace, JSONL datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface                                |
+| **Metrics**         | `src/inference_endpoint/metrics/`                              | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT)                                           |
+| **Config**          | `src/inference_endpoint/config/`, `endpoint_client/config.py`  | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings`                  |
+| **CLI**             | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py`  | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)`   |
+| **Async Utils**     | `src/inference_endpoint/async_utils/`                          | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, event publisher                                                             |
+| **OpenAI/SGLang**   | `src/inference_endpoint/openai/`, `sglang/`                    | Protocol adapters and response accumulators for different API formats                                                                         |
 
 ### Hot-Path Architecture
 
@@ -114,7 +114,7 @@ src/inference_endpoint/
 │ ├── info.py # execute_info()
 │ ├── validate.py # execute_validate()
 │ └── init.py # execute_init()
-├── core/types.py # Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
+├── core/types.py # APIType, Query, QueryResult, StreamChunk, QueryStatus (msgspec Structs)
 ├── load_generator/
 │ ├── session.py # BenchmarkSession - top-level orchestrator
 │ ├── load_generator.py # LoadGenerator, SchedulerBasedLoadGenerator
@@ -127,7 +127,7 @@ src/inference_endpoint/
 │ ├── worker_manager.py # Manages worker lifecycle
 │ ├── http.py # ConnectionPool, HttpRequestTemplate, raw HTTP
 │ ├── http_sample_issuer.py # Bridges load generator to HTTP client
-│ ├── config.py # HTTPClientConfig
+│ ├── config.py # HTTPClientConfig (single Pydantic model — CLI/YAML/runtime)
 │ ├── adapter_protocol.py # HttpRequestAdapter protocol
 │ ├── accumulator_protocol.py # Response accumulation protocol
 │ ├── cpu_affinity.py # CPU pinning
@@ -140,8 +140,9 @@ src/inference_endpoint/
 │ │ ├── event_logger/ # EventLoggerService: writes EventRecords to JSONL/SQLite
 │ │ └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
 │ └── transport/ # ZMQ-based IPC transport layer
-│ ├── protocol.py # Transport protocol definitions
-│ └── zmq/ # ZMQ implementation (context, pubsub, transport)
+│ ├── protocol.py # Transport protocols + TransportConfig base
+│ ├── record.py # Transport records
+│ └── zmq/ # ZMQ implementation (context, pubsub, transport, ZMQTransportConfig)
 ├── dataset_manager/
 │ ├── dataset.py # Dataset base class, DatasetFormat enum
 │ ├── factory.py # Dataset factory
```

docs/CLIENT_PERFORMANCE_TUNING.md

Lines changed: 30 additions & 0 deletions
````diff
@@ -95,6 +95,36 @@ For streaming workloads, also watch **SSE-pkts/s** — a small stream interval (
 
 ---
 
+## IPC Transport Buffer Sizes
+
+The ZMQ transport uses a pre-allocated receive buffer (`bytearray`) for zero-copy message deserialization. If a serialized message exceeds this buffer, the worker crashes with:
+
+```
+RuntimeError: ZMQ message truncated (18874368 > 16777216 bytes). Increase client.transport.recv_buffer_size in config.
+```
+
+| Setting            | Default | Description                               |
+| ------------------ | ------- | ----------------------------------------- |
+| `recv_buffer_size` | 16 MB   | Application receive buffer per socket     |
+| `send_buffer_size` | 16 MB   | Kernel send buffer hint (advisory on IPC) |
+
+**When to increase:** Multimodal workloads with large base64-encoded images in the request payload. A single VLM request with a high-resolution image can easily exceed 16 MB after msgspec serialization.
+
+**When the default is fine:** Text-only workloads. A 32K-token prompt serializes to ~150 KB — well within the 16 MB buffer.
+
+```yaml
+settings:
+  client:
+    transport:
+      type: zmq
+      recv_buffer_size: 67108864 # 64 MB for large multimodal payloads
+      send_buffer_size: 67108864
+```
+
+**Note:** `recv_buffer_size` sets the application-level `recv_into` buffer, not a kernel limit. IPC (Unix domain) sockets ignore `SO_RCVBUF`/`SO_SNDBUF` — the OS handles arbitrarily large messages regardless. The `send_buffer_size` is passed to `zmq.SNDBUF` as a kernel hint but has no effect on IPC transport.
+
+---
+
 ## Test Servers
 
 Two built-in servers for benchmarking without a real GPU endpoint.
````
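The truncation guard documented in the added section can be sketched in plain Python. This is a hypothetical stand-in for the real ZMQ transport (which calls `recv_into` on a socket); only the pre-allocated buffer, the default size, and the error text follow the doc:

```python
class ZMQReceiver:
    """Hypothetical sketch of the transport's pre-allocated receive buffer."""

    def __init__(self, recv_buffer_size: int = 16 * 1024 * 1024) -> None:
        # Allocated once and reused for every message (zero-copy deserialize).
        self._buf = bytearray(recv_buffer_size)

    def receive(self, message: bytes) -> memoryview:
        # A message larger than the buffer would be silently truncated by
        # recv_into, so the transport raises instead of corrupting data.
        if len(message) > len(self._buf):
            raise RuntimeError(
                f"ZMQ message truncated ({len(message)} > {len(self._buf)} bytes). "
                "Increase client.transport.recv_buffer_size in config."
            )
        self._buf[: len(message)] = message
        # Hand the deserializer a view into the buffer -- no extra copy.
        return memoryview(self._buf)[: len(message)]
```

With an 8-byte buffer, a 5-byte message comes back as a view while a 10-byte message raises the truncation error shown in the doc.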

docs/CLI_DESIGN.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -133,7 +133,7 @@ Validation is layered, executing in order:
 3. Sub-model validators:
    ├── RuntimeConfig._validate_durations → max >= min duration
    ├── LoadPattern._validate_completeness → poisson needs qps, concurrency needs target
-   └── ClientSettings._workers_not_zero → workers != 0
+   └── HTTPClientConfig._workers_not_zero → num_workers != 0
 4. BenchmarkConfig._resolve_and_validate:
    ├── resolve defaults (name, streaming, model name from submission_ref)
    ├── load pattern type vs test type (offline→max_throughput, online→poisson/concurrency)
```
````diff
@@ -156,7 +156,7 @@ $ inference-endpoint benchmark offline
 
 $ inference-endpoint benchmark offline --endpoints x --model M --dataset D --workers abc
 ╭── Error ─────────────────────────────────────────────────────────────────────╮
-│ settings.client.workers: Input should be a valid integer, unable to parse
+│ settings.client.num_workers: Input should be a valid integer, unable to
 │ string as an integer │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ```
````
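The `HTTPClientConfig._workers_not_zero` check renamed in this diff can be sketched as a standard Pydantic v2 `model_validator`. This is an illustrative reconstruction, not the project's actual code; only the field name and the non-zero rule come from the doc:

```python
from pydantic import BaseModel, ValidationError, model_validator


class HTTPClientConfig(BaseModel):
    # -1 means "auto": the worker count is derived at runtime; 0 is invalid.
    num_workers: int = -1

    @model_validator(mode="after")
    def _workers_not_zero(self) -> "HTTPClientConfig":
        # Mirrors the "num_workers != 0" sub-model rule from the diagram above.
        if self.num_workers == 0:
            raise ValueError("num_workers must be -1 (auto) or a positive count")
        return self


ok = HTTPClientConfig(num_workers=4)   # accepted
auto = HTTPClientConfig()              # accepted: default -1 (auto)
try:
    HTTPClientConfig(num_workers=0)    # rejected by the sub-model validator
    rejected = False
except ValidationError:
    rejected = True
```

Because the rule lives on the model rather than in CLI code, it fires identically for YAML configs and for CLI flags parsed into the same model.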
````diff
@@ -182,7 +182,7 @@ NotImplementedError 1 Unimplemented command (eval)
 Annotate the schema field — zero CLI code changes:
 
 ```python
-class ClientSettings(BaseModel):
+class HTTPClientConfig(WithUpdatesMixin, BaseModel):
     buffer_size: Annotated[
         int,
         cyclopts.Parameter(alias="--buffer-size", help="Socket buffer size"),
````

docs/CLI_QUICK_REFERENCE.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -97,7 +97,7 @@ Flag names shown as `--full.dotted.path --alias`. Both forms work.
 - `--model-params.streaming --streaming` - Streaming mode: auto/on/off (default: auto)
 - `--runtime.min-duration-ms --duration` - Min duration: ms default, or with suffix (600s, 10m) (default: 600000)
 - `--runtime.n-samples-to-issue --num-samples` - Explicit sample count override
-- `--client.workers --workers` - HTTP workers (-1=auto, default: -1)
+- `--client.num-workers --workers` - HTTP workers (-1=auto, default: -1)
 - `--client.max-connections --max-connections` - Max TCP connections (-1=unlimited)
 - `--endpoint-config.api-key --api-key` - API authentication
 - `--endpoint-config.api-type --api-type` - API type: openai/sglang (default: openai)
@@ -288,7 +288,7 @@ settings:
     type: "max_throughput"
     target_qps: 10.0
   client:
-    workers: -1 # auto
+    num_workers: -1 # auto
 
   metrics:
     collect: ["throughput", "latency", "ttft", "tpot"]
```
