# Benchmarking a HF model via vLLM or SGLang

This document describes how to benchmark an inference server using the inference endpoints.

## Model

We use [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) from Hugging Face to demonstrate how to benchmark vLLM and SGLang via inference endpoints.

## Launch the server

The following environment variables are used by the commands below to make the scripts easier to run:

```
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your hf_home, usually /USERNAME/.cache/huggingface>
export MODEL_NAME=<model to run, for instance meta-llama/Llama-3.1-8B-Instruct>
```

It is convenient to download the model prior to launch so that the container can reuse it instead of having to download it after startup. This can be done via `hf download $MODEL_NAME`, and the downloaded models can be verified via `hf cache scan`.

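For example, to pre-fetch the model used above and confirm it is in the local cache:

```
hf download $MODEL_NAME   # downloads into the cache under $HF_HOME
hf cache scan             # lists cached repos to confirm the download
```
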
### [vLLM](https://github.com/vllm-project/vllm)

We can launch the latest Docker image for vLLM using the command below:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME}
```

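Once the container reports that the server is ready, a quick sanity check is to query the OpenAI-compatible routes exposed by the vLLM server (standard vLLM endpoints; adjust the port if you changed the mapping):

```
curl http://localhost:8000/health      # returns 200 when the server is up
curl http://localhost:8000/v1/models   # lists the served model name
```
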
### [SGLang](https://github.com/sgl-project/sglang)

For SGLang, we use a similar Docker command:

```
docker run --gpus all --shm-size 32g --net host -v ${HF_HOME}:/root/.cache/huggingface --env HF_TOKEN=${HF_TOKEN} --ipc=host lmsysorg/sglang:latest python3 -m sglang.launch_server --model-path ${MODEL_NAME} --host 0.0.0.0 --port 8000 --tp-size 1 --enable-metrics
```

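SGLang also serves an OpenAI-compatible API on the same port, and `--enable-metrics` additionally exposes Prometheus counters; a minimal check (endpoint paths assumed from recent SGLang releases and may differ by version):

```
curl http://localhost:8000/v1/models   # OpenAI-compatible model listing
curl http://localhost:8000/metrics     # Prometheus metrics enabled by --enable-metrics
```
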
### [Enroot](https://github.com/NVIDIA/enroot)

On some platforms, Docker is replaced by Enroot as the container runtime. The following steps describe how to launch vLLM using Enroot; the SGLang instructions are similar:

```
enroot import docker://vllm/vllm-openai:latest
enroot start -e HF_TOKEN=$HF_TOKEN -m $HF_HOME:/root/.cache/huggingface vllm+vllm-openai+latest.sqsh --model ${MODEL_NAME}
```

## Launching the client

Once the server is up and running, we can send requests to the endpoint by passing in the endpoint address via `-e` as well as the model name:

```
inference-endpoint benchmark offline -e http://localhost:8000 -d tests/datasets/dummy_1k.pkl --model ${MODEL_NAME}
```

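If the benchmark client cannot connect or returns errors, it can help to send a single request by hand first; both servers accept standard OpenAI-style chat completion requests (a quick smoke test, not part of the benchmark itself):

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${MODEL_NAME}\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": 32}"
```
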
# Using a config file

To run [llama2-70b](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) on a single NVIDIA H200 GPU, we first prepare the environment:

```
export MODEL_NAME=meta-llama/Llama-2-70b-chat-hf
export HF_TOKEN=<your Hugging Face token>
hf download $MODEL_NAME
```

Launch the Docker container:

```
docker run --runtime nvidia --gpus all -v ${HF_HOME}:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model ${MODEL_NAME} --gpu_memory_utilization 0.95
```

The benchmark is then launched using the config file `online_llama2_70b_cnn.yaml`. Note that you will first need to export the [cnn/dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset via

```
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset["train"].to_json("cnn_dailymail_train.json")
```
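
As a quick, optional check (assuming the default JSON Lines output of `to_json`), you can inspect the exported file from the shell:

```
head -n 1 cnn_dailymail_train.json   # first record; expect article/highlights/id fields
wc -l cnn_dailymail_train.json       # one line per training example
```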

Then launch the example template:

```
inference-endpoint benchmark from-config -c examples/02_ServerBenchmarking/online_llama2_70b_cnn.yaml --timeout 600
```