33 changes: 17 additions & 16 deletions AGENTS.md
@@ -74,17 +74,17 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint

### Key Components

| Component | Location | Purpose |
| ------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Load Generator** | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| **Endpoint Client** | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| **Metrics** | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| **Config** | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
| **Async Utils** | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, event publisher |
| **OpenAI/SGLang** | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats. `openai_completions` adapter (`completions_adapter.py`) sends pre-tokenized token IDs to `/v1/completions`, bypassing the server chat template — required for gpt-oss-120b on vLLM. `sglang` adapter sends to `/generate` via `input_ids`. Both apply `Harmonize()` client-side. |
| **VideoGen** | `src/inference_endpoint/videogen/` | Adapter for video-generation endpoints (e.g. trtllm-serve `POST /v1/videos/generations`, used by MLPerf WAN2.2-T2V-A14B). Defaults to `response_format=video_path` (server saves video to shared storage and returns path) to avoid large byte payloads; switch to `video_bytes` for accuracy mode. Dataset is ingested via the generic JSONL loader. |
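For orientation, a minimal sketch of what plugging into the `Dataset` interface might look like — the import path, constructor, and exact `load_sample()` signature below are assumptions for illustration, not code from this PR:

```python
from typing import Any

from inference_endpoint.dataset_manager import Dataset  # assumed import path


class TinySyntheticDataset(Dataset):
    """Hypothetical in-memory dataset exercising the two-method interface."""

    def __init__(self) -> None:
        self._samples: list[dict[str, Any]] = [
            {"prompt": "What is 2 + 2?"},
            {"prompt": "Name a prime number greater than 100."},
        ]

    def num_samples(self) -> int:
        return len(self._samples)

    def load_sample(self, index: int) -> dict[str, Any]:
        # Assumed contract: return one record by index.
        return self._samples[index]
```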

### Hot-Path Architecture

@@ -194,12 +194,13 @@ src/inference_endpoint/
│ ├── rulesets/mlcommons/ # MLCommons-specific rules, datasets, models
│ └── templates/ # YAML config templates (_template.yaml minimal, _template_full.yaml all defaults)
├── openai/ # OpenAI-compatible API types and adapters
│ ├── types.py # OpenAI response types (chat + text completion)
│ ├── openai_adapter.py # Chat completions adapter (/v1/chat/completions)
│ ├── openai_msgspec_adapter.py # msgspec-based chat completions adapter (fast path)
│ ├── completions_adapter.py # Text completions adapter (/v1/completions, pre-tokenized input)
│ ├── accumulator.py # Streaming response accumulator (shared by chat + completions)
│ └── harmony.py # openai_harmony integration
├── sglang/ # SGLang API adapter (/generate with input_ids)
├── videogen/ # Video generation adapter (e.g. WAN2.2 T2V workload)
│ ├── __init__.py
│ ├── types.py # Pydantic: VideoPathRequest, VideoPathResponse, VideoPayloadResponse
12 changes: 6 additions & 6 deletions examples/04_GPTOSS120B_Example/Readme.md
@@ -45,17 +45,17 @@ docker run --runtime nvidia --gpus all \

### Run Benchmark

[`vllm_gptoss_120b_example.yaml`](vllm_gptoss_120b_example.yaml) runs performance + AIME25 + GPQA + LiveCodeBench accuracy at concurrency 512:

```bash
uv run inference-endpoint benchmark from-config \
-c examples/04_GPTOSS120B_Example/vllm_gptoss_120b_example.yaml \
--timeout 60
```

The config uses `api_type: openai_completions`, which routes to `/v1/completions` with pre-tokenized
token IDs (`prompt: [id, id, ...]`). This applies the Harmony format client-side and bypasses vLLM's
chat template, producing the same token sequence as the SGLang path and matching SGLang accuracy scores.
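As a rough illustration of the request shape on the wire, a pre-tokenized completions call can be issued against vLLM's OpenAI-compatible server directly; the model name and token IDs below are placeholders, and the exact fields the adapter sets are not shown in this PR:

```python
import requests

# vLLM's /v1/completions accepts a list of token IDs as `prompt`,
# so no server-side chat template is applied.
payload = {
    "model": "openai/gpt-oss-120b",       # assumed served model name
    "prompt": [200006, 17360, 200008],    # placeholder Harmony-formatted token IDs
    "max_tokens": 16,
    "stream": False,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```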

### vllm bench serve (Reference Comparison)

@@ -145,8 +145,8 @@ uv run inference-endpoint benchmark from-config \
For a performance-only run, use [`gptoss_120b_example.yaml`](gptoss_120b_example.yaml). It is
configured for SGLang on `http://localhost:30000` by default. To target vLLM instead, update
`endpoint_config.endpoints` to your server (e.g. `http://localhost:8000`) **and** change
`endpoint_config.api_type` to `"openai_completions"` so requests route to `/v1/completions`
with pre-tokenized input rather than SGLang's `/generate`.

### LiveCodeBench Setup

20 changes: 9 additions & 11 deletions examples/04_GPTOSS120B_Example/vllm_gptoss_120b_example.yaml
@@ -16,16 +16,13 @@ datasets:
type: "performance"
path: "examples/04_GPTOSS120B_Example/data/perf_eval_ref.parquet"
parser:
# input_tokens: "input_tokens"
prompt: "prompt"
# livecodebench not included: vLLM chat completions endpoint does not support
# pre-tokenized input required by this scorer. Use sglang_gptoss_120b_example.yaml instead.
# - name: "livecodebench::gptoss"
# type: "accuracy"
# accuracy_config:
# eval_method: "code_bench_scorer"
# extractor: "python_code_extractor"
# num_repeats: 3
input_tokens: "input_tokens"
- name: "livecodebench::gptoss"
type: "accuracy"
accuracy_config:
eval_method: "code_bench_scorer"
extractor: "python_code_extractor"
num_repeats: 3
- name: "aime25::gptoss"
type: "accuracy"
accuracy_config:
@@ -40,6 +37,7 @@ datasets:
extractor: "abcd_extractor"
ground_truth: "ground_truth"
num_repeats: 5

settings:
runtime:
min_duration_ms: 3000
@@ -59,6 +57,6 @@ endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_key: null
  api_type: "openai_completions"

report_dir: "results/vllm_gptoss_120b_benchmark_full/"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -67,6 +67,7 @@ dependencies = [
"colorama==0.4.6",
# Fix pytz-2024 import warning
"pytz==2026.1.post1",
"urllib3==2.7.0",
]

[project.optional-dependencies]
@@ -72,7 +72,7 @@ endpoint_config:
  endpoints: # Endpoint URL(s). Must include scheme, e.g. 'http://host:port'.
    - http://localhost:8000
  api_key: null # API key
  api_type: openai # API type: openai, openai_completions, sglang, or videogen | options: openai, openai_completions, sglang, videogen
report_dir: null # Report output directory
timeout: null # Global timeout in seconds
verbose: false # Enable verbose logging
@@ -72,7 +72,7 @@ endpoint_config:
  endpoints: # Endpoint URL(s). Must include scheme, e.g. 'http://host:port'.
    - http://localhost:8000
  api_key: null # API key
  api_type: openai # API type: openai, openai_completions, sglang, or videogen | options: openai, openai_completions, sglang, videogen
report_dir: null # Report output directory
timeout: null # Global timeout in seconds
verbose: false # Enable verbose logging
@@ -72,7 +72,7 @@ endpoint_config:
  endpoints: # Endpoint URL(s). Must include scheme, e.g. 'http://host:port'.
    - http://localhost:8000
  api_key: null # API key
  api_type: openai # API type: openai, openai_completions, sglang, or videogen | options: openai, openai_completions, sglang, videogen
report_dir: null # Report output directory
timeout: null # Global timeout in seconds
verbose: false # Enable verbose logging
3 changes: 3 additions & 0 deletions src/inference_endpoint/core/types.py
@@ -35,6 +35,7 @@ class APIType(str, Enum):
"""

OPENAI = "openai"
OPENAI_COMPLETIONS = "openai_completions"
Comment thread
arekay-nv marked this conversation as resolved.
SGLANG = "sglang"
VIDEOGEN = "videogen"

@@ -43,6 +44,8 @@ def default_route(self) -> str:
        match self:
            case APIType.OPENAI:
                return "/v1/chat/completions"
            case APIType.OPENAI_COMPLETIONS:
                return "/v1/completions"
            case APIType.SGLANG:
                return "/generate"
            case APIType.VIDEOGEN:
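A quick sanity check of the new routing, assuming `default_route` is a plain (undecorated) method as the hunk suggests:

```python
from inference_endpoint.core.types import APIType

# New member routes to the text-completions endpoint; existing members unchanged.
assert APIType.OPENAI.default_route() == "/v1/chat/completions"
assert APIType.OPENAI_COMPLETIONS.default_route() == "/v1/completions"
assert APIType.SGLANG.default_route() == "/generate"
```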
@@ -21,10 +21,8 @@
    UserPromptFormatter,
)

_FORMAT = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."


def gptoss() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
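The doubled braces in `\\boxed{{}}` are `str.format` escapes, so only `{question}` is substituted; a standalone check of the expansion (assuming `UserPromptFormatter` ultimately applies the format string to a sample's columns):

```python
_FORMAT = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

print(_FORMAT.format(question="What is 2 + 2?"))
# What is 2 + 2?
# Please reason step by step, and put your final answer within \boxed{}.
```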
23 changes: 10 additions & 13 deletions src/inference_endpoint/dataset_manager/predefined/gpqa/presets.py
@@ -21,18 +21,15 @@
    UserPromptFormatter,
)

_FORMAT = (
    "{question}\n\n"
    "(A) {choice1}\n"
    "(B) {choice2}\n"
    "(C) {choice3}\n"
    "(D) {choice4}\n\n"
    "Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'."
)


def gptoss() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
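As with the AIME preset, the template expands over dataset columns; a sketch with an invented sample row (the LiveCodeBench preset below follows the same pattern):

```python
_FORMAT = (
    "{question}\n\n"
    "(A) {choice1}\n"
    "(B) {choice2}\n"
    "(C) {choice3}\n"
    "(D) {choice4}\n\n"
    "Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'."
)

# Invented sample row; real column names come from the GPQA dataset parser.
row = {
    "question": "Which particle mediates the electromagnetic force?",
    "choice1": "Gluon",
    "choice2": "Photon",
    "choice3": "W boson",
    "choice4": "Graviton",
}
print(_FORMAT.format(**row))
```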
@@ -21,19 +21,17 @@
    UserPromptFormatter,
)

_FORMAT = (
    "You are a python coding expert that solves problems step-by-step.\n"
    "You must provide the reasoning to arriving at your solution and the code to solve the problem.\n"
    "Do not try simulating the code execution. The code must be enclosed within ```python delimiters.\n\n\n"
    "{question}\n"
    "### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.\n"
    "```python\n"
    "{starter_code}\n"
    "```\n"
)


def gptoss() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
8 changes: 4 additions & 4 deletions src/inference_endpoint/endpoint_client/adapter_protocol.py
@@ -19,7 +19,7 @@

import re
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, Any

from inference_endpoint.core.types import Query, QueryResult

@@ -93,15 +93,15 @@ def decode_response(cls, response_bytes: bytes, query_id: str) -> QueryResult:

    @classmethod
    @abstractmethod
    def decode_sse_message(cls, json_bytes: bytes) -> Any:
        """
        Decode SSE message and extract content.

        Args:
            json_bytes: Raw JSON bytes from SSE stream

        Returns:
            Decoded SSE content (type depends on the adapter implementation)
        """
        raise NotImplementedError("decode_sse_message not implemented")
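For context, a hedged sketch of what a concrete adapter's `decode_sse_message` might do for OpenAI-style chat chunks — the chunk handling below is illustrative rather than the adapter shipped in this PR, and it shows why the widened `Any` return type is useful (a delta can be a content string or `None`):

```python
import json
from typing import Any


def decode_sse_message(json_bytes: bytes) -> Any:
    """Illustrative decoder for an OpenAI-style chat-completions SSE chunk."""
    chunk = json.loads(json_bytes)
    choices = chunk.get("choices") or []
    if not choices:
        # e.g. a usage-only final chunk carries no choices
        return None
    # Role-only first chunks have no "content" key; returning None is valid
    # under the Any-typed contract, where a str-only return was too restrictive.
    return choices[0].get("delta", {}).get("content")
```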
