Commit c5fa9f0

peterschmidt85 and Andrey Cheptsov authored
Add NVIDIA Dynamo blog post (dstackai#3949)
* Add NVIDIA Dynamo blog post
* Refine NVIDIA Dynamo blog value prop
* Update Dynamo compose path

---------

Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>
1 parent 5991b1e commit c5fa9f0

3 files changed

Lines changed: 213 additions & 2 deletions

mkdocs/blog/posts/nvidia-dynamo.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
---
title: "Deploying NVIDIA Dynamo PD disaggregation with dstack"
date: 2026-06-10
description: "Deploy NVIDIA Dynamo with Prefill-Decode disaggregation using dstack services."
slug: nvidia-dynamo
image: https://dstack.ai/static-assets/static-assets/images/nvidia-dynamo.png
categories:
  - Changelog
---

# Deploying NVIDIA Dynamo PD disaggregation with dstack

`dstack` is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases `dstack` supports out of the box.

With the latest update, `dstack` added native support for NVIDIA Dynamo with Prefill-Decode (PD) disaggregation, letting a service run a Dynamo router, prefill workers, and decode workers as separate replica groups.

<img src="https://dstack.ai/static-assets/static-assets/images/nvidia-dynamo.png" width="630" />

<!-- more -->

## About NVIDIA Dynamo

[NVIDIA Dynamo](https://docs.nvidia.com/dynamo/getting-started/introduction) is an open-source, high-throughput, low-latency inference framework for serving generative AI workloads in distributed environments. It adds a system-level layer above inference engines such as SGLang, vLLM, and TensorRT-LLM, coordinating them across GPUs and nodes.

Dynamo brings together disaggregated serving, intelligent routing, KV cache management, KV cache transfer, and automatic scaling to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.

!!! info "PD disaggregation"
    Prefill-Decode disaggregation separates the two phases of LLM inference: prompt processing (prefill) and token generation (decode). Prefill is compute-bound and parallelizable. Decode is memory-bound and sequential. Running them as separate pools allows each phase to be sized and scaled independently.

## PD disaggregation with dstack

To deploy NVIDIA Dynamo with PD disaggregation, define a [service](../../docs/concepts/services.md) with three [replica groups](../../docs/concepts/services.md#replicas-and-scaling):

- a Dynamo router
- prefill workers
- decode workers

The router replica group declares `router: { type: dynamo }`. This tells `dstack` to route external traffic only to the router replica and to inject `DSTACK_ROUTER_INTERNAL_IP` into the worker replicas after the router is provisioned.

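On the worker side, the injected variable is typically turned into service endpoints. The sketch below is a minimal illustration, not part of `dstack` or Dynamo: `dynamo_endpoints` is a hypothetical helper name, and the etcd port `2379` and NATS port `4222` are assumptions taken from the service configuration in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the etcd and NATS endpoints from the router IP
# that dstack injects into worker replicas as DSTACK_ROUTER_INTERNAL_IP.
dynamo_endpoints() {
  # Use the first argument if given, otherwise fall back to the injected variable.
  local ip="${1:-${DSTACK_ROUTER_INTERNAL_IP:-}}"
  if [ -z "$ip" ]; then
    echo "DSTACK_ROUTER_INTERNAL_IP is not set" >&2
    return 1
  fi
  # Ports are assumptions: 2379 for etcd, 4222 for NATS.
  echo "ETCD_ENDPOINTS=http://$ip:2379"
  echo "NATS_SERVER=nats://$ip:4222"
}
```

Calling `dynamo_endpoints 10.0.0.5` prints `ETCD_ENDPOINTS=http://10.0.0.5:2379` and `NATS_SERVER=nats://10.0.0.5:4222`. Failing early when the variable is missing makes a misconfigured worker easier to diagnose than a connection attempt to an empty host.
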
This support was introduced in [`0.20.20`](https://github.com/dstackai/dstack/releases/tag/0.20.20).

??? info "Prerequisites"
    Running PD disaggregation on `dstack` requires a [fleet](../../docs/concepts/fleets.md) with [cluster placement](../../docs/concepts/fleets.md#cluster-placement), because prefill and decode workers need a fast interconnect for KV cache transfer.

    The prefill and decode replicas run on GPUs. The router replica can run on CPU, but it must run in the same cluster.

## Deploying the service

Here's a complete service configuration that deploys `zai-org/GLM-4.5-Air-FP8` with NVIDIA Dynamo, SGLang workers, and PD disaggregation on `dstack`:

<div editor-title="dynamo-pd.dstack.yml">

```yaml
type: service
name: dynamo-pd

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    docker: true
    commands:
      - apt-get update
      - apt-get install -y python3-dev python3-venv
      - python3 -m venv ~/dyn-venv
      - source ~/dyn-venv/bin/activate
      - pip install -U pip
      - pip install "ai-dynamo[sglang]==1.1.1"
      - git clone https://github.com/ai-dynamo/dynamo.git
      # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
      - docker compose -f dynamo/dev/docker-compose.yml up -d
      - |
        python3 -m dynamo.frontend \
          --http-host 0.0.0.0 --http-port 8000 \
          --discovery-backend etcd --router-mode kv \
          --kv-cache-block-size 64
    resources:
      cpu: 4
    router:
      type: dynamo

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    python: "3.12"
    nvcc: true
    commands:
      # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica
      # is provisioned. Compose the etcd/NATS endpoints from it.
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      # Set to enable /health endpoint required by dstack probes.
      - export DYN_SYSTEM_PORT="8000"
      # Wait until the router's etcd and NATS ports are actually accepting connections.
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode prefill --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    python: "3.12"
    nvcc: true
    commands:
      - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379"
      - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222"
      - export DYN_SYSTEM_PORT="8000"
      - |
        until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \
          && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do
          echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3
        done
      - pip install "ai-dynamo[sglang]==1.1.1"
      - |
        python3 -m dynamo.sglang \
          --model-path $MODEL_ID --served-model-name $MODEL_ID \
          --discovery-backend etcd --host 0.0.0.0 \
          --page-size 64 \
          --disaggregation-mode decode --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s
```

</div>

The router replica group starts the Dynamo HTTP frontend and the NATS/etcd compose stack used by the workers. It declares `router: { type: dynamo }`, so `dstack` treats it as the service router.

The prefill and decode replica groups use the router's internal IP to set `ETCD_ENDPOINTS` and `NATS_SERVER`, wait for those services to become reachable, then start `dynamo.sglang` in either `prefill` or `decode` mode. `DYN_SYSTEM_PORT=8000` exposes the `/health` endpoint required by the `dstack` [probe](../../docs/concepts/services.md#probes).

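The wait step in the worker commands loops until etcd and NATS accept connections. A bounded variant can be sketched as follows. This is an illustration rather than what the configuration above runs: `wait_for_tcp` is a hypothetical helper, the 300-second default timeout is an arbitrary assumption, and the `/dev/tcp` redirection requires bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until HOST:PORT accepts TCP connections,
# giving up after TIMEOUT seconds (default 300) instead of looping forever.
wait_for_tcp() {
  local host="$1" port="$2" timeout="${3:-300}"
  local deadline=$(( $(date +%s) + timeout ))
  # /dev/tcp is a bash feature; the subshell closes the socket when it exits.
  until (echo > "/dev/tcp/$host/$port") 2>/dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $host:$port" >&2
      return 1
    fi
    sleep 1
  done
}

# Example (not executed here): wait for both router-side services.
# wait_for_tcp "$DSTACK_ROUTER_INTERNAL_IP" 2379 && wait_for_tcp "$DSTACK_ROUTER_INTERNAL_IP" 4222
```

A timeout turns a router that never comes up into a visible worker failure rather than a replica that waits indefinitely. Whether that is preferable depends on how the run is monitored.
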
In this setup, Dynamo uses etcd for worker discovery and NATS for worker and KV-cache events used by the router. NIXL handles KV cache transfer between prefill and decode workers. `dstack` handles provisioning, service exposure, health probes, and independent scaling of the prefill and decode replica groups.

> With the `dynamo` router, `dstack` can run SGLang, vLLM, or TensorRT-LLM prefill and decode workers.

Apply the configuration:

<div class="termy">

```shell
$ HF_TOKEN=...
$ dstack apply -f dynamo-pd.dstack.yml
```

</div>

Once provisioning completes, `dstack` exposes a single OpenAI-compatible endpoint. Without a gateway, the endpoint is available through the server proxy:

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/services/main/dynamo-pd/v1/chat/completions \
    -X POST \
    -H 'Authorization: Bearer <dstack token>' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "zai-org/GLM-4.5-Air-FP8",
      "messages": [
        {
          "role": "user",
          "content": "What is prefill-decode disaggregation?"
        }
      ],
      "max_tokens": 1024
    }'
```

</div>

If a [gateway](../../docs/concepts/gateways.md) is configured, the service endpoint is available at `https://dynamo-pd.<gateway domain>/`.

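The two endpoint forms can be captured in a small client-side helper. This is a sketch under stated assumptions: `service_base_url` is a hypothetical function, the `main` path segment is assumed to be the project name, and the server address defaults to the one used in the `curl` example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the base URL of a dstack service.
# With a gateway domain, use the gateway form; otherwise use the server proxy.
service_base_url() {
  local run_name="$1"
  local gateway_domain="${2:-}"
  local project="${3:-main}"                    # assumed project name
  local server="${4:-http://127.0.0.1:3000}"    # assumed dstack server address
  if [ -n "$gateway_domain" ]; then
    echo "https://$run_name.$gateway_domain"
  else
    echo "$server/proxy/services/$project/$run_name"
  fi
}
```

For example, `service_base_url dynamo-pd` prints `http://127.0.0.1:3000/proxy/services/main/dynamo-pd`, and `service_base_url dynamo-pd example.com` prints `https://dynamo-pd.example.com`, where `example.com` stands in for a real gateway domain. Appending `/v1/chat/completions` gives the chat endpoint in both cases.
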
!!! info "Limitations"
    - The router replica group must use `count: 1`.
    - Services with a Dynamo router cannot configure `retry`, because workers cache the router's internal IP at provisioning time.
    - In-place updates are blocked when they would replace the Dynamo router replica. If the router gets a new internal IP, already-running workers would still point to the old etcd and NATS endpoints. Stop the run and apply again for router-affecting changes.
    - The `scaling` blocks use [`dstack` service autoscaling](../../docs/reference/dstack.yml/service.md#scaling), which currently scales replica groups based on `rps`. Support for scaling based on inference metrics such as TTFT and ITL is planned.

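To get a feel for what the `rps` targets in the `scaling` blocks imply, the arithmetic can be sketched in shell. This is illustrative only: it assumes the desired replica count is the request rate divided by `target`, rounded up and clamped to the group's `count` range. The actual autoscaler may differ in how it measures `rps` and in when it applies changes. `desired_replicas` is a hypothetical helper that takes integer inputs.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only: assumes desired = ceil(rps / target),
# clamped to the replica group's count range [min, max].
desired_replicas() {
  local rps="$1" target="$2" min="$3" max="$4"
  local n=$(( (rps + target - 1) / target ))   # integer ceiling division
  if [ "$n" -lt "$min" ]; then n="$min"; fi
  if [ "$n" -gt "$max" ]; then n="$max"; fi
  echo "$n"
}
```

Under these assumptions, the prefill group (`target: 3`, `count: 1..4`) at 10 requests per second gives `desired_replicas 10 3 1 4`, which prints `4`. The decode group (`target: 2`, `count: 1..8`) at 5 requests per second gives `desired_replicas 5 2 1 8`, which prints `3`.
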
## Why this matters

Dynamo brings system-level inference optimizations such as disaggregated serving, KV-aware routing, KV cache transfer, and coordination across workers. `dstack` complements it with orchestration for provisioning compute, cluster placement, service exposure, health probes, and independent scaling of worker groups.

With native Dynamo support, `dstack` streamlines high-throughput inference with leading open-source serving frameworks, while avoiding custom deployment glue. The same `dstack` orchestration layer can be used for training, inference, and development across GPU clouds, Kubernetes clusters, and on-prem fleets.

## What's next?

1. Read the [NVIDIA Dynamo example](../../docs/examples/inference/dynamo.md)
2. Read about [services](../../docs/concepts/services.md), [fleets](../../docs/concepts/fleets.md), and [gateways](../../docs/concepts/gateways.md)
3. Review the [NVIDIA Dynamo documentation](https://docs.nvidia.com/dynamo/getting-started/introduction) and [Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo)
4. Join [Discord](https://discord.gg/u8SmfwPpMd)

mkdocs/docs/concepts/services.md

Lines changed: 1 addition & 1 deletion
@@ -446,7 +446,7 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:
       - pip install "ai-dynamo[sglang]==1.1.1"
       - git clone https://github.com/ai-dynamo/dynamo.git
       # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
-      - docker compose -f dynamo/deploy/docker-compose.yml up -d
+      - docker compose -f dynamo/dev/docker-compose.yml up -d
       - |
         python3 -m dynamo.frontend \
           --http-host 0.0.0.0 --http-port 8000 \

mkdocs/docs/examples/inference/dynamo.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ replicas:
       - pip install "ai-dynamo[sglang]==1.1.1"
       - git clone https://github.com/ai-dynamo/dynamo.git
       # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend.
-      - docker compose -f dynamo/deploy/docker-compose.yml up -d
+      - docker compose -f dynamo/dev/docker-compose.yml up -d
       - |
         python3 -m dynamo.frontend \
           --http-host 0.0.0.0 --http-port 8000 \
