
Commit 9942926

[doc] Doc and example to run via config (#11) (#22)
* Doc and example to run via config

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
1 parent 26daafe commit 9942926

4 files changed

Lines changed: 177 additions & 0 deletions


Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# Benchmarking an HF model via vLLM or SGLang

This document describes how to benchmark an inference server using inference endpoints.

## Model

We use [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) from Hugging Face to demonstrate how to benchmark vLLM and SGLang via inference endpoints.

## Launch the server

The following environment variables are used by the commands below to make the scripts easier to run:

```
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your HF_HOME, usually $HOME/.cache/huggingface>
export MODEL_NAME=<model to run, for instance meta-llama/Llama-3.1-8B-Instruct>
```

It is convenient to download the model before launching the server so that the container can reuse it instead of downloading it after launch. This can be done via `hf download $MODEL_NAME`; the downloaded models can be verified with `hf cache scan`.
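
For example:

```
# Pre-download the model so the serving container can reuse the local HF cache
hf download $MODEL_NAME

# List cached models to verify the download
hf cache scan
```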

### [vLLM](https://github.com/vllm-project/vllm)

We can launch the latest vLLM docker image using the command below:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME}
```
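
Once the server logs show it is ready, you can optionally confirm the endpoint is reachable through its OpenAI-compatible API (assuming the default port mapping above):

```
# List the models served by the vLLM OpenAI-compatible server
curl http://localhost:8000/v1/models
```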

### [SGLang](https://github.com/sgl-project/sglang)

For SGLang, we use a similar docker command:

```
docker run --gpus all --shm-size 32g --net host -v ${HF_HOME}:/root/.cache/huggingface --env HF_TOKEN=${HF_TOKEN} --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path ${MODEL_NAME} --host 0.0.0.0 --port 8000 --tp-size 1 --enable-metrics
```

### [Enroot](https://github.com/NVIDIA/enroot)

On some platforms, Docker is replaced by enroot to provide containerization. The following steps describe how to launch vLLM using enroot; the SGLang instructions are similar:

```
enroot import docker://vllm/vllm-openai:latest
enroot start -e HF_TOKEN=$HF_TOKEN -m $HF_HOME:/root/.cache/huggingface vllm+vllm-openai+latest.sqsh --model ${MODEL_NAME}
```

## Launching the client

Once the server is up and running, we can send requests to the endpoint by passing in the endpoint address via `-e` as well as the model name:

```
inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.pkl --model ${MODEL_NAME}
```

## Using a config file

To run [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) on a single NVIDIA H200 GPU, we first prepare the environment:

```
export MODEL_NAME=meta-llama/Llama-2-70b-chat-hf
export HF_TOKEN=<your Hugging Face token>
hf download $MODEL_NAME
```

Launch the docker container:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME} --gpu_memory_utilization 0.95
```

Then launch the benchmark using the config file `online_llama2_70b_cnn.yaml`. Note that you will first need to export the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset via:

```
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset["train"].to_json("cnn_dailymail_train.json")
```
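
The export writes `cnn_dailymail_train.json` (one JSON object per line) to the current working directory; each record includes an `article` field, which is what the `prompt: "article"` parser setting in the config reads. An optional quick check:

```
# Peek at the start of the first exported record; it should contain an "article" field
head -c 400 cnn_dailymail_train.json
```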

Then launch the benchmark using the example config:

```
inference-endpoint benchmark from-config -c examples/02_ServerBenchmarking/online_llama2_70b_cnn.yaml --timeout 600
```
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Offline Throughput Benchmark
name: "offline-llama3-8b-cnn-benchmark"
version: "1.0"
type: "offline"

model_params:
  name: "meta-llama/Llama-3.1-8B-Instruct"
  temperature: 0.7
  top_p: 0.9
  max_new_tokens: 1024

datasets:
  - name: "perf-test"
    type: "performance"
    path: "cnn_dailymail_train.json"
    samples: 1000
    parser:
      prompt: "article"

settings:
  runtime:
    min_duration_ms: 6000    # 6 seconds
    max_duration_ms: 60000   # 1 minute
    scheduler_random_seed: 137   # For Poisson/distribution sampling
    dataloader_random_seed: 111  # For dataset shuffling

  load_pattern:
    type: "max_throughput"

  client:
    workers: 4
    max_concurrency: -1  # -1 = unlimited

  metrics:
    collect:
      - "throughput"
      - "latency"
      - "ttft"
      - "tpot"

endpoint_config:
  endpoint: "http://localhost:8000"
  api_key: null
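
As a usage sketch, this offline config can be launched the same way as the online example in the README above. The file's path is not shown in this view, so the name below is an assumption:

```
# Assumed filename; substitute the actual path of the config above
inference-endpoint benchmark from-config -c examples/02_ServerBenchmarking/offline_llama3_8b_cnn.yaml --timeout 600
```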
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Online Latency Benchmark
name: "online-llama2-70b-cnn-benchmark"
version: "1.0"
type: "online"

model_params:
  name: "meta-llama/Llama-2-70b-chat-hf"
  temperature: 0.7
  top_p: 0.9
  max_new_tokens: 1024

datasets:
  - name: "perf-test"
    type: "performance"
    path: "cnn_dailymail_train.json"
    samples: 1000
    parser:
      prompt: "article"

settings:
  runtime:
    min_duration_ms: 60000    # 1 minute
    max_duration_ms: 180000   # 3 minutes
    scheduler_random_seed: 42    # For Poisson/distribution sampling
    dataloader_random_seed: 42   # For dataset shuffling

  load_pattern:
    type: "max_throughput"
    target_qps: 10

  client:
    workers: 4
    max_concurrency: -1  # -1 = unlimited

  metrics:
    collect:
      - "throughput"
      - "latency"
      - "ttft"
      - "tpot"

endpoint_config:
  endpoint: "http://localhost:8000"
  api_key: null

examples/README.md

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,10 @@ This directory contains examples demonstrating how to use the MLPerf Inference E

Local model benchmarking with a small HuggingFace model, demonstrating custom DataLoader and event hooks.

### [02_ServerBenchmarking](02_ServerBenchmarking/)

Benchmarking a real-world model served via open-source serving systems such as [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang).

## Getting Help

- For general usage: See main [README](../README.md)
