Commit d291e3f

feat: add compare_with_vllm.py example-03 (#38)
* feat: add compare_with_vllm.py
1 parent c82b166

8 files changed: 1,055 additions & 6 deletions

.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -186,5 +186,8 @@ data/
 results/
 outputs/

+# Example vLLM virtualenv
+examples/03_BenchmarkComparison/vllm_venv/
+
 # Cursor artifacts (local development only)
 .cursor_artifacts/
```

README.md

Lines changed: 17 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -62,7 +62,7 @@ inference-endpoint benchmark offline \
6262
--num-samples 5000
6363
```
6464

65-
### Local Testing
65+
### Running Locally
6666

6767
```bash
6868
# Start local echo server
@@ -80,6 +80,18 @@ pkill -f echo_server
8080

8181
See [Local Testing Guide](docs/LOCAL_TESTING.md) for detailed instructions.
8282

83+
### Running Tests and Examples
84+
85+
```bash
86+
# Install tests/ and examples/ dependencies
87+
pip install -r requirements/test.txt
88+
89+
# Run tests (excluding performance and explicit-run tests)
90+
pytest -m "not performance and not run_explicitly"
91+
92+
# Run examples: follow instructions in examples/*/README.md
93+
```
94+
8395
## 📚 Documentation
8496

8597
- [CLI Quick Reference](docs/CLI_QUICK_REFERENCE.md) - Command-line interface guide
@@ -93,14 +105,14 @@ The system follows a modular, event-driven architecture:
93105

94106
```
95107
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
96-
│ Dataset │ │ Load │ │ Endpoint │
97-
│ Manager │───▶│ Generator │───▶│ Client │
108+
│ Dataset │ │ Load │ │ Endpoint │
109+
│ Manager │───▶│ Generator │───▶│ Client │
98110
└─────────────────┘ └─────────────────┘ └─────────────────┘
99111
│ │ │
100112
▼ ▼ ▼
101113
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
102-
│ Metrics │ │ Configuration │ │ Endpoint │
103-
│ Collector │◄───│ Manager │ │ (External) │
114+
│ Metrics │ │ Configuration │ │ Endpoint │
115+
│ Collector │◄───│ Manager │ │ (External) │
104116
└─────────────────┘ └─────────────────┘ └─────────────────┘
105117
```
106118

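The last hunk only re-centers the diagram's labels, but the flow it depicts (prompts from the Dataset Manager driven through the Load Generator to the Endpoint Client, with timings funneled to the Metrics Collector) can be sketched in a few lines. The classes and methods below are illustrative stand-ins, not the project's actual API:

```python
# Illustrative wiring of the diagrammed components; names mirror the boxes
# above but the interfaces are hypothetical, NOT the project's real API.
import asyncio
from dataclasses import dataclass, field


@dataclass
class MetricsCollector:
    records: list = field(default_factory=list)

    def record(self, **event) -> None:
        self.records.append(event)


class EndpointClient:
    """Sends one request to the external endpoint and records timings."""

    async def send(self, prompt: str, metrics: MetricsCollector) -> None:
        # A real client would stream from the OpenAI-compatible endpoint
        # and record TTFT/ITL per token; this stub only records the event.
        await asyncio.sleep(0)
        metrics.record(prompt=prompt)


class LoadGenerator:
    """Pulls prompts from the dataset and fans requests out to workers."""

    def __init__(self, client: EndpointClient, workers: int = 1) -> None:
        self.client = client
        self.workers = asyncio.Semaphore(workers)

    async def run(self, prompts, metrics: MetricsCollector) -> None:
        async def one(prompt: str) -> None:
            async with self.workers:
                await self.client.send(prompt, metrics)

        await asyncio.gather(*(one(p) for p in prompts))


# The Dataset Manager is reduced to a plain list for this sketch.
metrics = MetricsCollector()
asyncio.run(LoadGenerator(EndpointClient(), workers=4).run(["hi"] * 8, metrics))
print(len(metrics.records), "requests recorded")
```
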
examples/03_BenchmarkComparison/README.md

Lines changed: 92 additions & 0 deletions

# Benchmark Comparison Example

Compare `inference-endpoint` with vLLM's benchmarking tool using identical prompts.

## Prerequisites

**Set up the vLLM virtualenv** (this isolates vLLM's dependencies from inference-endpoint):

```bash
cd examples/03_BenchmarkComparison
./setup_vllm_venv.sh
```

This creates a `vllm_venv` directory with vLLM installed. You can specify a custom location:

```bash
./setup_vllm_venv.sh /path/to/custom/venv
```

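The `setup_vllm_venv.sh` script itself isn't shown in this commit excerpt. As a rough Python equivalent of what such a script typically does (create the venv, install vLLM into it), with no claim to match the real script:

```python
# Hypothetical stand-in for setup_vllm_venv.sh; the real script may pin
# versions or perform extra checks. Assumes a POSIX venv layout (bin/pip).
import subprocess
import sys
import venv
from pathlib import Path


def setup_vllm_venv(target: Path = Path("vllm_venv")) -> None:
    venv.EnvBuilder(with_pip=True).create(target)  # like `python -m venv`
    pip = target / "bin" / "pip"
    subprocess.run([str(pip), "install", "vllm"], check=True)


if __name__ == "__main__":
    setup_vllm_venv(Path(sys.argv[1]) if len(sys.argv) > 1 else Path("vllm_venv"))
```
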
**Run an inference server** (OpenAI-compatible):

```bash
# vLLM comes from the venv created above
# (default location: examples/03_BenchmarkComparison/vllm_venv)
/path/to/custom/venv/bin/vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
```

## Usage

```bash
cd examples/03_BenchmarkComparison
python compare_with_vllm.py --model "Qwen/Qwen2.5-0.5B-Instruct" --endpoint http://localhost:8000
```

### Options

| Option                | Description                      | Default                 |
| --------------------- | -------------------------------- | ----------------------- |
| `--model`, `-m`       | Model name (required)            | -                       |
| `--num-prompts`, `-n` | Number of prompts                | 100                     |
| `--endpoint`          | Server URL                       | `http://localhost:8000` |
| `--max-output-tokens` | Max output tokens                | 2000                    |
| `--timeout`           | Timeout in seconds               | 900                     |
| `--workers`           | Number of workers                | 1                       |
| `--verbose`, `-v`     | Show full output from each run   | -                       |
| `--dry`               | Print commands without executing | -                       |
| `--vllm-venv-dir`     | Path to vLLM virtualenv          | `./vllm_venv`           |

### Example

```bash
python compare_with_vllm.py \
  --model "Qwen/Qwen2.5-0.5B-Instruct" \
  --num-prompts 200 \
  --max-output-tokens 1000
```

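Given `--dry` (print commands without executing) and `--vllm-venv-dir`, the script presumably assembles one command line per tool and shells out to each. A sketch of that pattern follows; the call shapes are invented for illustration, not taken from compare_with_vllm.py:

```python
# Sketch of the orchestration implied by --dry and --vllm-venv-dir; the
# actual argument lists in compare_with_vllm.py are not shown here.
import shlex
import subprocess


def run(cmd: list[str], dry: bool = False) -> str | None:
    """Print the command; execute it and capture stdout unless dry is set."""
    print("$", " ".join(shlex.quote(c) for c in cmd))
    if dry:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Placeholder call shapes, one per benchmark tool:
run(["echo", "inference-endpoint benchmark command here"], dry=True)
run(["echo", "vllm benchmark command here"], dry=True)
```
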
## Output

The script runs both benchmarks and displays a comparison table:

```
$ python examples/03_BenchmarkComparison/compare_with_vllm.py --model Qwen/Qwen2.5-0.5B-Instruct --num-prompts 10000

====================================================================================================
Metric                          | Inference Endpoint | vLLM Benchmark
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Test Duration (s)               | 284.28             | 300.69
----------------------------------------------------------------------------------------------------
Throughput (req/s)              | 35.18              | 33.26
Total Generated Tokens          | 4446263            | 4626060
Output Token Throughput (tok/s) | 15640.65           | 15384.64
----------------------------------------------------------------------------------------------------
Mean TTFT (ms)                  | 137112.88          | 146093.86
Median TTFT (ms)                | 137092.46          | 145656.40
P99 TTFT (ms)                   | 270902.92          | 281810.49
----------------------------------------------------------------------------------------------------
Mean TPOT (ms)                  | 15.85              | 15.60
Median TPOT (ms)                | 15.56              | 15.61
P99 TPOT (ms)                   | 36.47              | 23.49
----------------------------------------------------------------------------------------------------
Mean ITL (ms)                   | 15.85              | 15.42
Median ITL (ms)                 | 15.56              | 12.17
P99 ITL (ms)                    | 36.47              | 35.96
----------------------------------------------------------------------------------------------------
Mean Output Length (tokens)     | 444                | 462
Median Output Length (tokens)   | 401                | 406
P99 Output Length (tokens)      | 2000               | 2000
====================================================================================================
```
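
For readers new to the acronyms: TTFT is time to first token, TPOT is time per output token, and ITL is inter-token latency. Mean/median/P99 rows like those above can be derived from per-request samples as sketched below; the sample values and record layout are hypothetical, not taken from compare_with_vllm.py:

```python
# Deriving mean/median/P99 rows from per-request latency samples.
import statistics


def summarize(samples_ms: list[float]) -> tuple[float, float, float]:
    """Return (mean, median, p99) for one metric, in the input's units."""
    ordered = sorted(samples_ms)
    p99_index = min(len(ordered) - 1, round(0.99 * (len(ordered) - 1)))
    return statistics.mean(ordered), statistics.median(ordered), ordered[p99_index]


ttft_ms = [120.0, 135.5, 150.2, 310.9]  # hypothetical per-request TTFTs
print("Mean/Median/P99 TTFT (ms): %.2f / %.2f / %.2f" % summarize(ttft_ms))
```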
