33 changes: 17 additions & 16 deletions AGENTS.md
@@ -74,17 +74,17 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint

### Key Components

| Component | Location | Purpose |
| ------------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Load Generator** | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing, `LoadGenerator` issues queries |
| **Endpoint Client** | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point |
| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface |
| **Metrics** | `src/inference_endpoint/metrics/` | `EventRecorder` writes to SQLite, `MetricsReporter` reads and aggregates (QPS, latency, TTFT, TPOT) |
| **Config** | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` |
| **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` |
| **Async Utils** | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, event publisher |
| **OpenAI/SGLang** | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats. `openai_completions` adapter (`completions_adapter.py`) sends pre-tokenized token IDs to `/v1/completions`, bypassing the server chat template — required for gpt-oss-120b on vLLM. `sglang` adapter sends to `/generate` via `input_ids`. Both apply `Harmonize()` client-side. |
| **VideoGen** | `src/inference_endpoint/videogen/` | Adapter for video-generation endpoints (e.g. trtllm-serve `POST /v1/videos/generations`, used by MLPerf WAN2.2-T2V-A14B). Defaults to `response_format=video_path` (server saves video to shared storage and returns path) to avoid large byte payloads; switch to `video_bytes` for accuracy mode. Dataset is ingested via the generic JSONL loader. |
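For orientation, a minimal sketch of what plugging into the `Dataset` interface might look like — the import path, constructor, and exact `load_sample()` signature below are assumptions for illustration, not code from this PR:

```python
from typing import Any

from inference_endpoint.dataset_manager import Dataset  # assumed import path


class TinySyntheticDataset(Dataset):
    """Hypothetical in-memory dataset exercising the two-method interface."""

    def __init__(self) -> None:
        self._samples: list[dict[str, Any]] = [
            {"prompt": "What is 2 + 2?"},
            {"prompt": "Name a prime number greater than 100."},
        ]

    def num_samples(self) -> int:
        return len(self._samples)

    def load_sample(self, index: int) -> dict[str, Any]:
        # Assumed contract: return one record by index.
        return self._samples[index]
```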

### Hot-Path Architecture

@@ -194,12 +194,13 @@ src/inference_endpoint/
│ ├── rulesets/mlcommons/ # MLCommons-specific rules, datasets, models
│ └── templates/ # YAML config templates (_template.yaml minimal, _template_full.yaml all defaults)
├── openai/ # OpenAI-compatible API types and adapters
│ ├── types.py # OpenAI response types (chat + text completion)
│ ├── openai_adapter.py # Chat completions adapter (/v1/chat/completions)
│ ├── openai_msgspec_adapter.py # msgspec-based chat completions adapter (fast path)
│ ├── completions_adapter.py # Text completions adapter (/v1/completions, pre-tokenized input)
│ ├── accumulator.py # Streaming response accumulator (shared by chat + completions)
│ └── harmony.py # openai_harmony integration
├── sglang/ # SGLang API adapter (/generate with input_ids)
├── videogen/ # Video generation adapter (e.g. WAN2.2 T2V workload)
│ ├── __init__.py
│ ├── types.py # Pydantic: VideoPathRequest, VideoPathResponse, VideoPayloadResponse
12 changes: 6 additions & 6 deletions examples/04_GPTOSS120B_Example/Readme.md
@@ -45,17 +45,17 @@ docker run --runtime nvidia --gpus all \

### Run Benchmark

[`vllm_gptoss_120b_example.yaml`](vllm_gptoss_120b_example.yaml) runs performance + AIME25 + GPQA + LiveCodeBench accuracy at concurrency 512:

```bash
uv run inference-endpoint benchmark from-config \
-c examples/04_GPTOSS120B_Example/vllm_gptoss_120b_example.yaml \
--timeout 60
```

The config uses `api_type: openai_completions`, which routes to `/v1/completions` with pre-tokenized
token IDs (`prompt: [id, id, ...]`). This applies the Harmony format client-side and bypasses vLLM's
chat template, producing the same token sequence as the SGLang path and matching SGLang accuracy scores.
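As a rough illustration of the request shape on the wire, a pre-tokenized completions call can be issued against vLLM's OpenAI-compatible server directly; the model name and token IDs below are placeholders, and the exact fields the adapter sets are not shown in this PR:

```python
import requests

# vLLM's /v1/completions accepts a list of token IDs as `prompt`,
# so no server-side chat template is applied.
payload = {
    "model": "openai/gpt-oss-120b",       # assumed served model name
    "prompt": [200006, 17360, 200008],    # placeholder Harmony-formatted token IDs
    "max_tokens": 16,
    "stream": False,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```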

### vllm bench serve (Reference Comparison)

@@ -145,8 +145,8 @@ uv run inference-endpoint benchmark from-config \
For a performance-only run, use [`gptoss_120b_example.yaml`](gptoss_120b_example.yaml). It is
configured for SGLang on `http://localhost:30000` by default. To target vLLM instead, update
`endpoint_config.endpoints` to your server (e.g. `http://localhost:8000`) **and** change
`endpoint_config.api_type` to `"openai_completions"` so requests route to `/v1/completions`
with pre-tokenized input rather than SGLang's `/generate`.

### LiveCodeBench Setup

20 changes: 9 additions & 11 deletions examples/04_GPTOSS120B_Example/vllm_gptoss_120b_example.yaml
@@ -16,16 +16,13 @@ datasets:
type: "performance"
path: "examples/04_GPTOSS120B_Example/data/perf_eval_ref.parquet"
parser:
# input_tokens: "input_tokens"
prompt: "prompt"
# livecodebench not included: vLLM chat completions endpoint does not support
# pre-tokenized input required by this scorer. Use sglang_gptoss_120b_example.yaml instead.
# - name: "livecodebench::gptoss"
# type: "accuracy"
# accuracy_config:
# eval_method: "code_bench_scorer"
# extractor: "python_code_extractor"
# num_repeats: 3
input_tokens: "input_tokens"
- name: "livecodebench::gptoss"
type: "accuracy"
accuracy_config:
eval_method: "code_bench_scorer"
extractor: "python_code_extractor"
num_repeats: 3
- name: "aime25::gptoss"
type: "accuracy"
accuracy_config:
@@ -40,6 +37,7 @@ datasets:
extractor: "abcd_extractor"
ground_truth: "ground_truth"
num_repeats: 5

settings:
runtime:
min_duration_ms: 3000
@@ -59,6 +57,6 @@ endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_key: null
  api_type: "openai_completions"

report_dir: "results/vllm_gptoss_120b_benchmark_full/"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -67,6 +67,7 @@ dependencies = [
"colorama==0.4.6",
# Fix pytz-2024 import warning
"pytz==2026.1.post1",
"urllib3==2.7.0",
]

[project.optional-dependencies]
@@ -72,7 +72,7 @@ endpoint_config:
  endpoints: # Endpoint URL(s). Must include scheme, e.g. 'http://host:port'.
    - http://localhost:8000
  api_key: null # API key
  api_type: openai # API type: openai, openai_completions, sglang, or videogen | options: openai, openai_completions, sglang, videogen
report_dir: null # Report output directory
timeout: null # Global timeout in seconds
verbose: false # Enable verbose logging
@@ -72,7 +72,7 @@ endpoint_config:
  endpoints: # Endpoint URL(s). Must include scheme, e.g. 'http://host:port'.
    - http://localhost:8000
  api_key: null # API key
  api_type: openai # API type: openai, openai_completions, sglang, or videogen | options: openai, openai_completions, sglang, videogen
report_dir: null # Report output directory
timeout: null # Global timeout in seconds
verbose: false # Enable verbose logging
@@ -72,7 +72,7 @@ endpoint_config:
  endpoints: # Endpoint URL(s). Must include scheme, e.g. 'http://host:port'.
    - http://localhost:8000
  api_key: null # API key
  api_type: openai # API type: openai, openai_completions, sglang, or videogen | options: openai, openai_completions, sglang, videogen
report_dir: null # Report output directory
timeout: null # Global timeout in seconds
verbose: false # Enable verbose logging
3 changes: 3 additions & 0 deletions src/inference_endpoint/core/types.py
@@ -35,6 +35,7 @@ class APIType(str, Enum):
"""

OPENAI = "openai"
OPENAI_COMPLETIONS = "openai_completions"
Comment thread
arekay-nv marked this conversation as resolved.
SGLANG = "sglang"
VIDEOGEN = "videogen"

@@ -43,6 +44,8 @@ def default_route(self) -> str:
        match self:
            case APIType.OPENAI:
                return "/v1/chat/completions"
            case APIType.OPENAI_COMPLETIONS:
                return "/v1/completions"
            case APIType.SGLANG:
                return "/generate"
            case APIType.VIDEOGEN:
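A quick sanity check of the new routing, assuming `default_route` is a plain (undecorated) method as the hunk suggests:

```python
from inference_endpoint.core.types import APIType

# New member routes to the text-completions endpoint; existing members unchanged.
assert APIType.OPENAI.default_route() == "/v1/chat/completions"
assert APIType.OPENAI_COMPLETIONS.default_route() == "/v1/completions"
assert APIType.SGLANG.default_route() == "/generate"
```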
@@ -21,10 +21,8 @@
    UserPromptFormatter,
)

_FORMAT = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."


def gptoss() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
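The doubled braces in `\\boxed{{}}` are `str.format` escapes, so only `{question}` is substituted; a standalone check of the expansion (assuming `UserPromptFormatter` ultimately applies the format string to a sample's columns):

```python
_FORMAT = "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

print(_FORMAT.format(question="What is 2 + 2?"))
# What is 2 + 2?
# Please reason step by step, and put your final answer within \boxed{}.
```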
23 changes: 10 additions & 13 deletions src/inference_endpoint/dataset_manager/predefined/gpqa/presets.py
@@ -21,18 +21,15 @@
    UserPromptFormatter,
)

_FORMAT = (
    "{question}\n\n"
    "(A) {choice1}\n"
    "(B) {choice2}\n"
    "(C) {choice3}\n"
    "(D) {choice4}\n\n"
    "Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'."
)


def gptoss() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
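As with the AIME preset, the template expands over dataset columns; a sketch with an invented sample row (the LiveCodeBench preset below follows the same pattern):

```python
_FORMAT = (
    "{question}\n\n"
    "(A) {choice1}\n"
    "(B) {choice2}\n"
    "(C) {choice3}\n"
    "(D) {choice4}\n\n"
    "Express your final answer as the corresponding option 'A', 'B', 'C', or 'D'."
)

# Invented sample row; real column names come from the GPQA dataset parser.
row = {
    "question": "Which particle mediates the electromagnetic force?",
    "choice1": "Gluon",
    "choice2": "Photon",
    "choice3": "W boson",
    "choice4": "Graviton",
}
print(_FORMAT.format(**row))
```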
@@ -21,19 +21,17 @@
    UserPromptFormatter,
)

_FORMAT = (
    "You are a python coding expert that solves problems step-by-step.\n"
    "You must provide the reasoning to arriving at your solution and the code to solve the problem.\n"
    "Do not try simulating the code execution. The code must be enclosed within ```python delimiters.\n\n\n"
    "{question}\n"
    "### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.\n"
    "```python\n"
    "{starter_code}\n"
    "```\n"
)


def gptoss() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
8 changes: 4 additions & 4 deletions src/inference_endpoint/endpoint_client/adapter_protocol.py
@@ -19,7 +19,7 @@

import re
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, Any

from inference_endpoint.core.types import Query, QueryResult

@@ -93,15 +93,15 @@ def decode_response(cls, response_bytes: bytes, query_id: str) -> QueryResult:

    @classmethod
    @abstractmethod
    def decode_sse_message(cls, json_bytes: bytes) -> Any:
        """
        Decode SSE message and extract content.

        Args:
            json_bytes: Raw JSON bytes from SSE stream

        Returns:
            Decoded SSE content (type depends on the adapter implementation)
        """
        raise NotImplementedError("decode_sse_message not implemented")
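For context, a hedged sketch of what a concrete adapter's `decode_sse_message` might do for OpenAI-style chat chunks — the chunk handling below is illustrative rather than the adapter shipped in this PR, and it shows why the widened `Any` return type is useful (a delta can be a content string or `None`):

```python
import json
from typing import Any


def decode_sse_message(json_bytes: bytes) -> Any:
    """Illustrative decoder for an OpenAI-style chat-completions SSE chunk."""
    chunk = json.loads(json_bytes)
    choices = chunk.get("choices") or []
    if not choices:
        # e.g. a usage-only final chunk carries no choices
        return None
    # Role-only first chunks have no "content" key; returning None is valid
    # under the Any-typed contract, where a str-only return was too restrictive.
    return choices[0].get("delta", {}).get("content")
```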
