
Commit 9942926

[doc] Doc and example to run via config (#11) (#22)
* Doc and example to run via config

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
1 parent 26daafe commit 9942926

4 files changed

Lines changed: 177 additions & 0 deletions


Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# Benchmarking an HF model via vLLM or SGLang

This document describes how to benchmark an inference server using inference endpoints.

## Model

We use [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) from Hugging Face to demonstrate how to benchmark vLLM and SGLang via inference endpoints.

## Launch the server

The following environment variables are used by the commands below to make the scripts easier to run:

```
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your HF_HOME, usually $HOME/.cache/huggingface>
export MODEL_NAME=<model to run, for instance meta-llama/Llama-3.1-8B-Instruct>
```

It is convenient to download the model before launching the server so that the container can reuse it instead of downloading it after launch. This can be done via `hf download $MODEL_NAME`; the downloaded models can be verified with `hf cache scan`.
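
For example:

```
# Pre-download the model so the serving container can reuse the local HF cache
hf download $MODEL_NAME

# List cached models to verify the download
hf cache scan
```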

### [vLLM](https://github.com/vllm-project/vllm)

We can launch the latest vLLM docker image using the command below:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME}
```
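
Once the server logs show it is ready, you can optionally confirm the endpoint is reachable through its OpenAI-compatible API (assuming the default port mapping above):

```
# List the models served by the vLLM OpenAI-compatible server
curl http://localhost:8000/v1/models
```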

### [SGLang](https://github.com/sgl-project/sglang)

For SGLang, we use a similar docker command:

```
docker run --gpus all --shm-size 32g --net host -v ${HF_HOME}:/root/.cache/huggingface --env HF_TOKEN=${HF_TOKEN} --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path ${MODEL_NAME} --host 0.0.0.0 --port 8000 --tp-size 1 --enable-metrics
```

### [Enroot](https://github.com/NVIDIA/enroot)

On some platforms, Docker is replaced by enroot to provide containerization. The following steps describe how to launch vLLM using enroot; the SGLang instructions are similar:

```
enroot import docker://vllm/vllm-openai:latest
enroot start -e HF_TOKEN=$HF_TOKEN -m $HF_HOME:/root/.cache/huggingface vllm+vllm-openai+latest.sqsh --model ${MODEL_NAME}
```

## Launching the client

Once the server is up and running, we can send requests to the endpoint by passing in the endpoint address via `-e` as well as the model name:

```
inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.pkl --model ${MODEL_NAME}
```

## Using a config file

To run [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) on a single NVIDIA H200 GPU, we first prepare the environment:

```
export MODEL_NAME=meta-llama/Llama-2-70b-chat-hf
export HF_TOKEN=<your Hugging Face token>
hf download $MODEL_NAME
```

Launch the docker container:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME} --gpu_memory_utilization 0.95
```

Then launch the benchmark using the config file `online_llama2_70b_cnn.yaml`. Note that you will first need to export the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset via:

```
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset["train"].to_json("cnn_dailymail_train.json")
```
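
The export writes `cnn_dailymail_train.json` (one JSON object per line) to the current working directory; each record includes an `article` field, which is what the `prompt: "article"` parser setting in the config reads. An optional quick check:

```
# Peek at the start of the first exported record; it should contain an "article" field
head -c 400 cnn_dailymail_train.json
```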

Then launch the benchmark using the example config:

```
inference-endpoint benchmark from-config -c examples/02_ServerBenchmarking/online_llama2_70b_cnn.yaml --timeout 600
```
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Offline Throughput Benchmark
name: "offline-llama3-8b-cnn-benchmark"
version: "1.0"
type: "offline"

model_params:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  temperature: 0.7
  top_p: 0.9
  max_new_tokens: 1024

datasets:
  - name: "perf-test"
    type: "performance"
    path: "cnn_dailymail_train.json"
    samples: 1000
    parser:
      prompt: "article"

settings:
  runtime:
    min_duration_ms: 6000    # 6 seconds
    max_duration_ms: 60000   # 1 minute
    scheduler_random_seed: 137   # For Poisson/distribution sampling
    dataloader_random_seed: 111  # For dataset shuffling

  load_pattern:
    type: "max_throughput"

  client:
    workers: 4
    max_concurrency: -1  # -1 = unlimited

  metrics:
    collect:
      - "throughput"
      - "latency"
      - "ttft"
      - "tpot"

endpoint_config:
  endpoint: "http://localhost:8000"
  api_key: null
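
As a usage sketch, this offline config can be launched the same way as the online example in the README above. The file's path is not shown in this view, so the name below is an assumption:

```
# Assumed filename; substitute the actual path of the config above
inference-endpoint benchmark from-config -c examples/02_ServerBenchmarking/offline_llama3_8b_cnn.yaml --timeout 600
```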
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Online Latency Benchmark
name: "online-llama2-70b-cnn-benchmark"
version: "1.0"
type: "online"

model_params:
  name: "meta-llama/Llama-2-70b-chat-hf"
  temperature: 0.7
  top_p: 0.9
  max_new_tokens: 1024

datasets:
  - name: "perf-test"
    type: "performance"
    path: "cnn_dailymail_train.json"
    samples: 1000
    parser:
      prompt: "article"

settings:
  runtime:
    min_duration_ms: 60000    # 1 minute
    max_duration_ms: 180000   # 3 minutes
    scheduler_random_seed: 42    # For Poisson/distribution sampling
    dataloader_random_seed: 42   # For dataset shuffling

  load_pattern:
    type: "max_throughput"
    target_qps: 10

  client:
    workers: 4
    max_concurrency: -1  # -1 = unlimited

  metrics:
    collect:
      - "throughput"
      - "latency"
      - "ttft"
      - "tpot"

endpoint_config:
  endpoint: "http://localhost:8000"
  api_key: null

examples/README.md

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,10 @@ This directory contains examples demonstrating how to use the MLPerf Inference E

Local model benchmarking with a small HuggingFace model, demonstrating custom DataLoader and event hooks.

### [02_ServerBenchmarking](02_ServerBenchmarking/)

Benchmarking a real-world model served via open-source serving systems such as [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang).

## Getting Help

- For general usage: See main [README](../README.md)
