---
layout: post
title: "Deploying LLMs in Clusters #1: running “vLLM production-stack” on a cloud VM"
thumbnail-img: https://img.youtube.com/vi/EsTJbQtzj0g/0.jpg
share-img: https://img.youtube.com/vi/EsTJbQtzj0g/0.jpg
author: LMCache Team
image: https://img.youtube.com/vi/EsTJbQtzj0g/0.jpg
---
- vLLM boasts the largest open-source community in LLM serving, and “vLLM production-stack” offers a vLLM-based full inference stack with 10x better performance and easy cluster management.
- Today, we will give a step-by-step demonstration of how to deploy a proof-of-concept “vLLM production-stack” on a cloud VM.
- This is the beginning of our Deploying LLMs in Clusters series. We will be rolling out more blogs about serving LLMs on your own infrastructure over the next few weeks. Let us know which topic we should cover next! [poll]
[Github Link] | [More Tutorials] | [Interest Form]
<iframe width="500" height="315" src="https://www.youtube.com/embed/EsTJbQtzj0g" frameborder="0" allowfullscreen></iframe>

vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments.
vLLM Production-stack is an open-source reference implementation of an inference stack built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths.
vLLM production-stack offers superior performance to other LLM serving solutions, achieving higher throughput through smart routing and KV cache sharing.
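To build intuition for the smart-routing claim, here is a toy sketch of prefix-affinity routing. This is our own illustration, not the production-stack's actual algorithm; the replica names, hash choice, and prefix length are all assumptions. The idea: requests that share a prompt prefix are pinned to the same vLLM replica, so that replica's KV cache for the prefix can be reused.

```python
import hashlib

# Toy prefix-affinity router: requests sharing a prompt prefix go to the same
# replica, so that replica's KV cache for the prefix can be reused. This is an
# illustration only, not the production-stack's actual routing logic.
REPLICAS = ["vllm-0", "vllm-1", "vllm-2"]

def route(prompt: str, prefix_len: int = 32) -> str:
    # Hash only the first prefix_len characters so shared prefixes collide.
    digest = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

# Two requests with the same system-prompt prefix land on the same replica.
a = route("You are a helpful assistant. Summarize document A.")
b = route("You are a helpful assistant. Summarize document B.")
assert a == b
```

A real router would also weigh replica load when picking a target; the sketch only shows why prefix-sticky routing increases KV cache hits.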
In this section, we will go through the general steps to set up the vLLM production-stack service in the cloud. If you prefer watching videos, please follow this tutorial video.
In this example, we use a Lambda Labs GPU instance with an A40 GPU, but you can do the same thing with AWS EKS.
- Clone the repository and navigate to the `utils/` folder:

  ```bash
  git clone https://github.com/vllm-project/production-stack.git
  cd production-stack/utils
  ```

- Execute the script `install-kubectl.sh`:

  ```bash
  bash install-kubectl.sh
  ```

  This script downloads the latest version of `kubectl`, the Kubernetes command-line tool, and places it in your PATH for easy execution.

- Verify the installation with:

  ```bash
  kubectl version --client
  ```

  Example output:

  ```
  Client Version: v1.32.1
  ```
- Execute the script `install-helm.sh`:

  ```bash
  bash install-helm.sh
  ```

  This script downloads and installs Helm and places the Helm binary in your PATH. Helm is a package manager for Kubernetes that simplifies the deployment process.

- Verify the installation with:

  ```bash
  helm version
  ```

  Example output:

  ```
  version.BuildInfo{Version:"v3.17.0", GitCommit:"301108edc7ac2a8ba79e4ebf5701b0b6ce6a31e4", GitTreeState:"clean", GoVersion:"go1.23.4"}
  ```
- Execute the script `install-minikube-cluster.sh`:

  ```bash
  bash install-minikube-cluster.sh
  ```

  This script installs Minikube, enables the NVIDIA Container Toolkit so the system can run GPU workloads, and starts Minikube with GPU support.

- Expected output:

  ```
  😄  minikube v1.35.0 on Ubuntu 22.04 (kvm/amd64)
  ❗  minikube skips various validations when --force is supplied; this may lead to unexpected behavior
  ✨  Using the docker driver based on user configuration
  ......
  🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
  "nvidia" has been added to your repositories
  Hang tight while we grab the latest from your chart repositories...
  ......
  NAME: gpu-operator-1737507918
  LAST DEPLOYED: Wed Jan 22 01:05:21 2025
  NAMESPACE: gpu-operator
  STATUS: deployed
  REVISION: 1
  TEST SUITE: None
  ```
Now that you have everything needed, it is time to deploy the production-stack!
First, create a YAML file as shown below (the Helm command later assumes it is named `example.yaml`). Be sure to include your model name, model URL, replica count, and vLLM configurations in the file. Note that `pvcStorage` needs to be bigger than the model size.
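As an aside, `pvcStorage` is a Kubernetes quantity string (for example `10Gi`). If you want to sanity-check it against your model's on-disk size, here is a minimal sketch; the helper name is our own, and the roughly 250 MB figure for `facebook/opt-125m` (rounded up to `1Gi`) is our own loose estimate.

```python
# Minimal converter for Kubernetes quantity strings such as "10Gi" (a sketch,
# not an official parser). Used to confirm pvcStorage exceeds the model size.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def quantity_to_bytes(q: str) -> int:
    for suffix, factor in UNITS.items():
        if q.endswith(suffix):
            return int(float(q[: -len(suffix)]) * factor)
    return int(q)  # no suffix: plain bytes

# facebook/opt-125m weighs roughly 250 MB on disk; 1Gi is a loose upper bound,
# so the 10Gi pvcStorage used below has plenty of headroom.
assert quantity_to_bytes("10Gi") > quantity_to_bytes("1Gi")
```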
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 2
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 0.5
    pvcStorage: "10Gi"
    pvcAccessMode:
    - ReadWriteMany
    vllmConfig:
      maxModelLen: 1024
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.4"]
```

And deploy this configuration using Helm:
```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install mystack vllm/vllm-stack -f example.yaml
```

Monitor the deployment status using:

```bash
sudo kubectl get pods
```

Expected output:
- Pods for the `vllm` deployment should transition to the `Ready` and `Running` state.

```
NAME                                               READY   STATUS    RESTARTS   AGE
mystack-deployment-router-85d4ffc696-xkg67         1/1     Running   0          2m38s
mystack-opt125m-deployment-vllm-858f4894fc-hfcgg   1/1     Running   0          2m38s
mystack-opt125m-deployment-vllm-858f4894fc-nt6sl   1/1     Running   0          2m38s
```

Note: It may take some time for the vLLM instances to become ready!
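If you prefer to script the readiness check, here is a minimal sketch that parses `kubectl get pods` output. The sample table above is inlined so the snippet runs without a cluster; in practice you would capture the output with `subprocess`. This is an illustration, not part of the production-stack tooling.

```python
# Toy readiness check over `kubectl get pods` output (sample inlined from above).
sample = """\
NAME                                               READY   STATUS    RESTARTS   AGE
mystack-deployment-router-85d4ffc696-xkg67         1/1     Running   0          2m38s
mystack-opt125m-deployment-vllm-858f4894fc-hfcgg   1/1     Running   0          2m38s
mystack-opt125m-deployment-vllm-858f4894fc-nt6sl   1/1     Running   0          2m38s
"""

def all_ready(table: str) -> bool:
    # Skip the header row, then check the READY and STATUS columns of each pod.
    rows = [line.split() for line in table.strip().splitlines()[1:]]
    return all(ready == "1/1" and status == "Running"
               for _name, ready, status, *_rest in rows)

assert all_ready(sample)
```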
Expose the `mystack-router-service` port to the host machine:

```bash
sudo kubectl port-forward svc/mystack-router-service 30080:80
```

Test the stack's OpenAI-compatible API by querying the available models:

```bash
curl -o- http://localhost:30080/models
```

Expected output:
```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
```

Send a query to the OpenAI-compatible `/completions` endpoint to generate a completion for a prompt:
```bash
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```

Expected output:
```json
{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
```

To remove the deployment, run:
```bash
sudo helm uninstall mystack
```

We have demonstrated how to set up the vLLM production-stack on a GPU VM.
This is the first episode of our Deploying LLMs in Clusters series. Stay tuned for multi-node deployment on Amazon EKS, serving multiple models with one cluster, an LLM router deep dive, and much more! Fill out this one-question poll to let us know which one we should do next!
Join us to build a future where every application can harness the power of LLM inference—reliably, at scale, and without breaking a sweat. Happy deploying!
Contacts:
- Github: https://github.com/vllm-project/production-stack
- Chat with the Developers Interest Form
- vLLM slack
- LMCache slack


