| title | TensorRT-LLM |
|---|
We recommend using the latest stable release of Dynamo to avoid breaking changes.
Dynamo TensorRT-LLM integrates TensorRT-LLM engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism.
| Feature | TensorRT-LLM | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | Not supported yet |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | Planned |
| KVBM | ✅ | |

| Feature | TensorRT-LLM | Notes |
|---|---|---|
| WideEP | ✅ | |
| DP Rank Routing | ✅ | |
| GB200 Support | ✅ | |

`yq` is used for in-place YAML edits. Install it with `wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq` or `pip install yq` (the latter is a different tool with the same name and similar syntax). If neither is available, a `sed` fallback is shown inline where `yq` is used.
| Container tag | Backend version | CUDA | Min NVIDIA driver |
|---|---|---|---|
| tensorrtllm-runtime:1.0.2 | TRT-LLM v1.3.0rc5.post1 | v13.1 | 580+ |
| vllm-runtime:1.0.2 | vLLM v0.16.0 | v12.9 | 575+ |
| vllm-runtime:1.0.2-cuda13 | vLLM v0.16.0 | v13.0 | 580+ |
| sglang-runtime:1.0.2 | SGLang v0.5.9 | v12.9 | 575+ |
| sglang-runtime:1.0.2-cuda13 | SGLang v0.5.9 | v13.0 | 580+ |

Source of truth: `docs/reference/support-matrix.md` and `docs/reference/release-artifacts.md`. If those differ from the values above, the source-of-truth files win.
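The minimum driver column can be checked on a host before pulling an image. The following is a minimal sketch, assuming "580+" means a driver major version of 580 or newer. `driver_meets_min` is a hypothetical helper and not part of Dynamo; the `nvidia-smi` query flags are the standard ones.

```shell
# driver_meets_min VERSION MIN_MAJOR (hypothetical helper):
# succeeds when the major component of VERSION is >= MIN_MAJOR.
driver_meets_min() {
  major="${1%%.*}"          # "580.65.06" -> "580"
  [ "$major" -ge "$2" ]
}

# Only query the driver when nvidia-smi is present on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
  ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  if driver_meets_min "$ver" 580; then
    echo "driver $ver meets the 580+ requirement"
  else
    echo "driver $ver is older than 580"
  fi
fi
```

Pass `575` instead of `580` when checking against the CUDA 12.9 images listed in the table.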
Step 1 (host terminal): Start infrastructure services:
docker compose -f dev/docker-compose.yml up -d

Step 2 (host terminal): Pull and run the prebuilt container:
DYNAMO_VERSION=1.0.2
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
docker run --gpus all -it --network host --ipc host \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION

Note: The `DYNAMO_VERSION` variable above can be set to any available version of the container. To find the available `tensorrtllm-runtime` versions for Dynamo, visit the NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime.
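As a small illustration of pinning the version, the sketch below builds the image reference from an explicit argument, falling back to `DYNAMO_VERSION` and then to `1.0.2`. `image_ref` is a hypothetical helper introduced here for illustration; only the registry path comes from the commands above.

```shell
# image_ref [VERSION] (hypothetical helper): prints the runtime image
# reference, preferring the argument, then $DYNAMO_VERSION, then 1.0.2.
image_ref() {
  version="${1:-${DYNAMO_VERSION:-1.0.2}}"
  printf 'nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:%s\n' "$version"
}

# Print the reference that docker pull / docker run would receive.
image_ref
```

The printed value can be passed directly to `docker pull "$(image_ref)"`.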
Step 3 (inside the container): Launch an aggregated serving deployment (uses Qwen/Qwen3-0.6B by default):
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh

The launch script automatically downloads the model and starts the TensorRT-LLM engine. You can override the model by setting the `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.
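The override can be sketched as follows, assuming the launch script reads both variables from the environment as described. `set_model_override` is a hypothetical helper, and defaulting the served name to the model path is a choice made for this example, not documented behavior.

```shell
# set_model_override MODEL [SERVED_NAME] (hypothetical helper):
# exports the two variables the launch script is described as reading.
set_model_override() {
  MODEL_PATH="$1"
  # Assumption for this sketch: reuse the model path when no name is given.
  SERVED_MODEL_NAME="${2:-$1}"
  export MODEL_PATH SERVED_MODEL_NAME
}

set_model_override "Qwen/Qwen3-0.6B"
echo "MODEL_PATH=$MODEL_PATH SERVED_MODEL_NAME=$SERVED_MODEL_NAME"

# Then, inside the container:
#   cd "$DYNAMO_HOME/examples/backends/trtllm" && ./launch/agg.sh
```

If you change `SERVED_MODEL_NAME`, use the same value in the `"model"` field of the verification request.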
Step 4 (host terminal): Verify the deployment:
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
    "stream": true,
    "max_tokens": 30
  }'

Deploy TensorRT-LLM with Dynamo on Kubernetes using a DynamoGraphDeployment. Before running `kubectl apply`, substitute the container image tag in the deployment YAML. The `sed` fallback is shown inline for environments without `yq`:
# yq
yq -i '(.spec.services[].extraPodSpec.mainContainer.image) |= sub(":1\.0\.2", ":<your-tag>")' deploy.yaml
# sed fallback
sed -i.bak 's|:1\.0\.2|:<your-tag>|g' deploy.yaml

For full Kubernetes deployment instructions, see the TensorRT-LLM Kubernetes Deployment Guide.
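To see what the `sed` fallback does before touching a real manifest, the sketch below applies the same substitution to a throwaway one-line file. `retag_images` is a hypothetical wrapper, and it assumes the new tag contains no `|` or `&` characters, which `sed` would treat specially.

```shell
# retag_images FILE OLD_TAG NEW_TAG (hypothetical helper)
retag_images() {
  file="$1"
  # Escape dots so "1.0.2" matches literally rather than as a regex wildcard.
  old_re=$(printf '%s' "$2" | sed 's/\./\\./g')
  # -i.bak edits in place and keeps the original as FILE.bak.
  sed -i.bak "s|:${old_re}|:$3|g" "$file"
}

# Demo on a temporary file containing a single sample image line.
demo=$(mktemp)
printf '%s\n' 'image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.2' > "$demo"
retag_images "$demo" 1.0.2 my-tag
cat "$demo"
rm -f "$demo" "$demo.bak"
```

The pattern is not anchored at the end of the tag, so a suffixed tag such as `1.0.2-cuda13` would become `my-tag-cuda13`. Inspect the `.bak` file or run `diff` after the edit if your manifest mixes image variants.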
- Reference Guide: Features, configuration, and operational details
- Examples: All deployment patterns with launch scripts
- KV Cache Transfer: KV cache transfer methods for disaggregated serving
- Observability: Metrics and monitoring
- Multinode Examples: Multi-node deployment with SLURM
- Deploying TensorRT-LLM with Dynamo on Kubernetes: Kubernetes deployment guide