| title | Examples |
|---|
For quick start instructions, see the vLLM README. This document covers all deployment patterns for running vLLM with Dynamo, including aggregated, disaggregated, KV-routed, and expert-parallel configurations.
## Prerequisites

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
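To confirm the services came up before launching any workers, you can probe them; a minimal sketch assuming the compose file's default ports (2379 for etcd, 4222 for NATS):

```bash
# Sketch: verify etcd and NATS are reachable. Port numbers assume the
# defaults from deploy/docker-compose.yml (2379 for etcd, 4222 for NATS).
curl -s http://localhost:2379/health        # etcd HTTP health endpoint
nc -z localhost 4222 && echo "NATS is up"   # NATS has no HTTP endpoint here; just probe the port
```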
## Aggregated Serving

The simplest deployment pattern: a single worker handles both prefill and decode. Requires 1 GPU.

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
```
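Once the worker is up, you can smoke-test it through the frontend's OpenAI-compatible API; a sketch assuming the default frontend port of 8000, with the model name to be replaced by whatever agg.sh actually loads:

```bash
# Sketch: send one chat completion through the frontend. Assumes the
# default port 8000; replace the model name with the one agg.sh serves.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```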
## Aggregated Serving with KV Routing

Two workers behind a KV-aware router that maximizes cache reuse. Requires 2 GPUs.

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_router.sh
```

This launches the frontend in KV routing mode with two workers publishing KV events over ZMQ.
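To observe the effect of KV routing, send the same prompt twice; the repeat should be routed to the worker that already holds the prompt's prefix in its KV cache (a sketch, reusing the port and model assumptions above):

```bash
# Sketch: two identical requests. With KV routing, the second should land
# on the worker that cached the prefix, reducing prefill work.
BODY='{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Explain KV caching in transformers."}], "max_tokens": 32}'
curl -s localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$BODY" > /dev/null
curl -s localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$BODY"
```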
## Disaggregated Serving

Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg.sh
```

## Disaggregated Serving with KV Routing

Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_router.sh
```

The frontend runs in KV routing mode and automatically detects prefill workers to activate an internal prefill router.
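If you want to confirm that the frontend actually discovered both worker pools, the component registrations live in etcd; a debugging sketch (the key layout is internal to Dynamo and may change between releases):

```bash
# Sketch: list registered component keys in etcd to confirm that both
# prefill and decode workers registered. Key layout is internal/unstable.
etcdctl --endpoints=localhost:2379 get --prefix "" --keys-only | head -n 20
```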
## Data Parallel / Expert Parallel

Launches 4 data-parallel workers with expert parallelism behind a KV-aware router. Uses a Mixture-of-Experts model (Qwen/Qwen3-30B-A3B). Requires 4 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/dep.sh
```

## Speculative Decoding

Run Meta-Llama-3.1-8B-Instruct with Eagle3 as a draft model for faster inference while maintaining accuracy.
Guide: Speculative Decoding Quickstart
See also: Speculative Decoding Feature Overview for cross-backend documentation.
## Multimodal

Serve multimodal models using the vLLM-Omni integration.

Guide: vLLM-Omni
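Request shapes follow the OpenAI multimodal convention; a sketch of an image-plus-text request, assuming the default frontend port and a vision-capable model loaded by the launch script:

```bash
# Sketch: multimodal chat request using OpenAI-style content parts.
# The model name and image URL are placeholders.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<multimodal-model>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
          ]
        }]
      }'
```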
## Multi-node Deployments

Deploy vLLM across multiple nodes using Dynamo's distributed capabilities. Multi-node deployments require network connectivity between nodes and firewall rules allowing NATS/ETCD communication.
Start NATS/ETCD on the head node so all worker nodes can reach them:
```bash
# On head node
docker compose -f deploy/docker-compose.yml up -d

# Set on ALL nodes
export HEAD_NODE_IP="<your-head-node-ip>"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

For multi-node tensor/pipeline parallelism (when TP x PP exceeds the GPUs on a single node), see launch/multi_node_tp.sh. For details on distributed execution, see the vLLM multiprocessing docs.
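Before launching workers, it is worth verifying that each node can actually reach the head node's services; a sketch assuming nc (netcat) is available, with etcdctl as an optional deeper check:

```bash
# Sketch: run on each worker node to confirm the head node is reachable.
nc -zv "$HEAD_NODE_IP" 4222   # NATS client port
nc -zv "$HEAD_NODE_IP" 2379   # etcd client port
# Optional: a real etcd health check, if etcdctl is installed.
etcdctl --endpoints="$ETCD_ENDPOINTS" endpoint health
```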
## DeepSeek R1

Dynamo supports DeepSeek R1 with data parallel attention and wide expert parallelism. Each DP attention rank is a separate Dynamo component emitting its own KV events and metrics.
Run on 2 nodes (16 GPUs, dp=16):
```bash
# Node 0
cd $DYNAMO_HOME/examples/backends/vllm
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 0 --gpus-per-node 8 --master-addr <node-0-addr>

# Node 1
cd $DYNAMO_HOME/examples/backends/vllm
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 1 --gpus-per-node 8 --master-addr <node-0-addr>
```

See launch/dsr1_dep.sh for configurable options.
## Kubernetes Deployment

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the vLLM Kubernetes Deployment Guide. See also the Kubernetes Deployment Guide for general Dynamo K8s documentation.
## Troubleshooting

Ensure NIXL is installed and the side-channel ports are not in conflict. Each worker in a multi-worker setup needs a unique VLLM_NIXL_SIDE_CHANNEL_PORT.
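For example, when launching two workers on the same host, give each its own port; a sketch where the worker command and port values are placeholders to adapt to your launch script:

```bash
# Sketch: each co-located worker gets a unique NIXL side-channel port.
# The worker command (python -m dynamo.vllm) and ports are illustrative;
# match them to whatever your launch script actually runs.
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 python -m dynamo.vllm --model <model> &
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 python -m dynamo.vllm --model <model> &
```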
Ensure PYTHONHASHSEED=0 is set for all vLLM processes when using KV-aware routing. See Hashing Consistency for details.
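In practice this means exporting the variable in every shell (and on every node) before running a launch script, for example:

```bash
# Consistent Python hashing so KV block hashes agree across processes.
export PYTHONHASHSEED=0
bash launch/agg_router.sh
```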
If a previous run left orphaned GPU processes, the next launch may OOM. Check for zombie processes:

```bash
nvidia-smi        # look for lingering python processes
kill -9 <PID>     # force-kill the orphaned process by its PID
```

## See Also

- vLLM README: Quick start and feature overview
- Reference Guide: Configuration, arguments, and operational details
- Observability: Metrics and monitoring
- Benchmarking: Performance benchmarking tools
- Tuning Disaggregated Performance: P/D tuning guide