
LLM Deployment using vLLM

Dynamo vLLM integrates vLLM engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while remaining fully compatible with vLLM's native engine arguments. Dynamo builds on vLLM's native KV cache events, NIXL-based KV transfer, and metrics reporting to implement KV-aware routing and prefill/decode (P/D) disaggregation.

Installation

Install Latest Release

We recommend using uv to install:

uv venv --python 3.12 --seed
uv pip install "ai-dynamo[vllm]"

This installs Dynamo with the compatible vLLM version.
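To confirm the install worked, you can check that the environment imports both packages. The following is a minimal sketch, not part of the Dynamo tooling: `check_dynamo_vllm_install` is a hypothetical helper, and it assumes that `uv venv` created `.venv` in the current directory and that the `ai-dynamo` package exposes a top-level `dynamo` module.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the uv-created environment can import Dynamo and vLLM.
# Assumptions: `uv venv` created .venv here, and ai-dynamo provides a `dynamo` module.
# Set PYTHON=/path/to/python to point at a different interpreter.
check_dynamo_vllm_install() {
  local py="${PYTHON:-.venv/bin/python}"
  # Import both packages and print the vLLM version that was pulled in.
  if "$py" -c 'import dynamo, vllm; print("vllm", vllm.__version__)'; then
    echo "ai-dynamo[vllm] is importable"
  else
    echo "import failed; try: uv pip install \"ai-dynamo[vllm]\"" >&2
    return 1
  fi
}
```

Run `check_dynamo_vllm_install` from the directory where you ran `uv venv`; a non-zero exit status means the import failed.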


Container

We have public images available on the NGC Catalog. Pull an image and start a container from it (replace <version> with a release tag):

docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>

Alternatively, build the image locally from the repository root and run it:

python container/render.py --framework vllm --output-short-filename
docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .
./container/run.sh -it --framework VLLM [--mount-workspace]
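If you script either container path, a dry-run mode lets you review the exact commands before anything runs. This is a sketch under stated assumptions: the function names `run`, `run_ngc_vllm_image`, and `build_and_run_local_vllm_image` and the `DRY_RUN` variable are hypothetical; only the docker, `render.py`, and `run.sh` invocations come from the commands listed for this section.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the container commands. With DRY_RUN=1 each
# command is printed instead of executed.

# Execute a command, or just print it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == 1 ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Pull a published NGC image for the given release tag and start a container from it.
run_ngc_vllm_image() {
  local version="$1"
  if [[ -z "$version" ]]; then
    echo "usage: run_ngc_vllm_image <version>" >&2
    return 1
  fi
  local image="nvcr.io/nvidia/ai-dynamo/vllm-runtime:${version}"
  run docker pull "$image" || return
  run ./container/run.sh -it --framework VLLM --image "$image"
}

# Render the Dockerfile, build the image locally, then start it.
# Extra arguments (for example --mount-workspace) are forwarded to run.sh.
build_and_run_local_vllm_image() {
  run python container/render.py --framework vllm --output-short-filename || return
  run docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm . || return
  run ./container/run.sh -it --framework VLLM "$@"
}
```

For example, `DRY_RUN=1 run_ngc_vllm_image <version>` prints the pull and run commands; drop `DRY_RUN=1` to execute them. Each step stops the sequence on failure, so a failed build does not launch a stale image.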

Development Setup

For development, use the devcontainer, which has all dependencies pre-installed.

Feature Support Matrix

| Feature | Status | Notes |
|---|---|---|
| Disaggregated Serving | | Prefill/decode separation with NIXL KV transfer |
| KV-Aware Routing | | |
| SLA-Based Planner | | |
| KVBM | | |
| LMCache | | CUDA 12.9 and arm64/aarch64 containers may require building LMCache from source |
| FlexKV | | |
| Multimodal Support | | Via vLLM-Omni integration |
| Observability | | Metrics and monitoring |
| WideEP | | Support for DeepEP |
| DP Rank Routing | | Hybrid load balancing via external DP rank control |
| LoRA | | Dynamic loading/unloading from S3-compatible storage |
| GB200 Support | | Container functional on main |

Quick Start

Start infrastructure services for local development:

docker compose -f dev/docker-compose.yml up -d

Launch an aggregated serving deployment:

cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
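The two Quick Start commands can be wrapped so that a missing `DYNAMO_HOME` or a mistyped script name fails fast instead of running from the wrong directory. This is a minimal sketch: `launch_vllm_example` is a hypothetical name, and it assumes the directory layout shown in the `cd` command above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the Quick Start launch step. It checks DYNAMO_HOME and
# the script path, then runs the script from the vLLM example directory inside a
# subshell, so the caller's working directory is left unchanged.
launch_vllm_example() {
  local script="${1:-agg.sh}"   # any script under launch/; defaults to agg.sh
  if [[ -z "${DYNAMO_HOME:-}" ]]; then
    echo "DYNAMO_HOME is not set; point it at your Dynamo checkout" >&2
    return 1
  fi
  local dir="$DYNAMO_HOME/examples/backends/vllm"
  if [[ ! -f "$dir/launch/$script" ]]; then
    echo "not found: $dir/launch/$script" >&2
    return 1
  fi
  ( cd "$dir" && bash "launch/$script" )
}
```

Calling `launch_vllm_example` with no arguments is equivalent to the `cd` plus `bash launch/agg.sh` pair above.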

Running launch scripts standalone. The launch/*.sh scripts expect etcd and NATS to be reachable on localhost. Bring them up first (run from the repo root, or use the absolute path shown):

docker compose -f "$DYNAMO_HOME/dev/docker-compose.yml" up -d

Then run the launch script. Without etcd and NATS running, workers register but the frontend cannot discover them, so requests hang.
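Because a missing etcd or NATS shows up only as hanging requests, it can help to wait until both services accept connections before launching. The sketch below assumes the compose file publishes the default client ports on localhost (2379 for etcd, 4222 for NATS); `port_open`, `wait_for_ports`, and `RETRIES` are hypothetical names. It uses bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Hypothetical readiness check for the infrastructure services.

# Return 0 if host:port accepts a TCP connection within 1 second.
port_open() {
  local host="$1" port="$2"
  timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Poll every host:port argument once per second, up to $RETRIES attempts (default 30).
wait_for_ports() {
  local retries="${RETRIES:-30}" attempt target missing
  for ((attempt = 1; attempt <= retries; attempt++)); do
    missing=0
    for target in "$@"; do
      # Split "host:port" on the last colon.
      port_open "${target%:*}" "${target##*:}" || missing=1
    done
    if (( missing == 0 )); then
      return 0
    fi
    if (( attempt < retries )); then sleep 1; fi
  done
  echo "still unreachable after ${retries} attempts: $*" >&2
  return 1
}
```

For example: `wait_for_ports localhost:2379 localhost:4222 && bash launch/agg.sh`. Adjust the ports if your compose file maps them differently.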

Next Steps