Commit 417000b

Bihan and Andrey Cheptsov authored
[Docs]Add AMD Mi300x PD-Disaggregation Example (#3890)
* [Docs]Add AMD Mi300x PD-Disaggregation Example
* Explain RDMA/RoCE Library loading in AMD examples
* Update RCCL AMD PD disaggregation examples with short syntax
* Revert rccl test update with additional comment
* Add AMD PD disaggregation blog post

---------

Co-authored-by: Bihan Rana
Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>
1 parent ab4d4c0 commit 417000b

4 files changed

Lines changed: 445 additions & 6 deletions

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
1+
---
2+
title: "Deploying inference endpoints with PD disaggregation on AMD GPUs"
3+
date: 2026-05-21
4+
description: "A walkthrough of deploying PD disaggregated inference on AMD GPUs with dstack and Shepherd Model Gateway (SMG), using SGLang workers and the Mooncake Transfer Engine."
5+
slug: amd-pd-disaggregation
6+
image: https://dstack.ai/static-assets/static-assets/images/amd-pd-disaggregation.png
7+
categories:
8+
- Changelog
9+
---
10+
11+
# Deploying inference endpoints with PD disaggregation on AMD GPUs
12+
13+
`dstack` is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases `dstack` supports out of the box.
14+
15+
<img src="https://dstack.ai/static-assets/static-assets/images/amd-pd-disaggregation.png" width="630" />
16+
17+
`dstack` recently added native support for Prefill–Decode (PD) disaggregation. It works with [Shepherd Model Gateway](smg.md) (SMG) — a high-performance inference gateway evolved from the SGLang Router — on both NVIDIA and AMD, and with [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/) on NVIDIA. This post walks through deploying it on AMD GPUs with SMG.
18+
19+
<!-- more -->
20+
21+
## Why PD disaggregation
22+
23+
PD disaggregation is useful when a single LLM deployment has two different bottlenecks:
24+
25+
- **Prefill** processes the prompt. It is compute-bound, parallelizable, and has a direct impact on Time to First Token (TTFT).
26+
- **Decode** generates tokens one by one. It is memory-bound, sequential, and has a direct impact on inter-token latency.
27+
28+
When the same worker handles both phases, every replica has to serve both bottlenecks. With PD disaggregation, prefill and decode run as separate pools, and each pool can be sized and scaled independently.
29+
30+
The tradeoff is operational: for every request, the KV cache produced by the prefill worker must be transferred to the decode worker before generation can continue. That transfer sits on the TTFT path, so the cluster needs a high-bandwidth, low-latency interconnect such as RDMA over InfiniBand or RoCE, rather than TCP over a conventional NIC.
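To get a feel for why the interconnect matters, the sketch below estimates how much KV cache a single long prompt produces and how long an ideal transfer would take. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption for a 72B-class grouped-query-attention model, and the two link speeds are hypothetical. The calculation ignores protocol overhead and tensor-parallel sharding, so treat the numbers as rough orders of magnitude rather than measurements.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache produced for `num_tokens` prompt tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(num_bytes: int, gigabits_per_second: float) -> float:
    """Ideal wire time at a given line rate, ignoring protocol overhead."""
    return num_bytes * 8 / (gigabits_per_second * 1e9)


# Assumed model shape (not taken from the post): 80 layers, 8 KV heads,
# head_dim 128, bf16 values. The prompt length is also illustrative.
size = kv_cache_bytes(num_tokens=8192, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"KV cache for the prompt: {size / 2**30:.2f} GiB")

# Hypothetical link speeds: a conventional NIC vs. a fast RDMA fabric.
for gbps in (10, 400):
    print(f"{gbps} Gb/s: {transfer_seconds(size, gbps) * 1000:.0f} ms")
```

Under these assumptions, the gap between the two link speeds is the difference between a transfer that dominates TTFT and one that is a small fraction of it.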
31+
32+
In this walkthrough, [SMG](https://lightseek.org/smg/) routes requests between SGLang workers. On AMD, the workers use the [Mooncake Transfer Engine](https://github.com/kvcache-ai/Mooncake) to transfer KV cache over RDMA/RoCE. In the configuration we tested, the RDMA fabric is exposed by Broadcom `bnxt_re` Ethernet devices.
33+
34+
??? info "Prerequisites"
35+
Running PD disaggregation on `dstack` requires first creating a [fleet](https://dstack.ai/docs/concepts/fleets/) with `placement: cluster`, so that prefill and decode workers share a high-bandwidth interconnect. This can be a [backend fleet](https://dstack.ai/docs/concepts/fleets/#backend-fleets_1) provisioned by `dstack` on a cloud or Kubernetes cluster, or an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets_1) registered against bare-metal or VM hosts you already manage.
36+
37+
## Validating the interconnect
38+
39+
To measure end-to-end bandwidth across nodes, run the [NCCL/RCCL tests example](../../docs/examples/clusters/nccl-rccl-tests.md).
40+
41+
For a quick check that the RDMA devices are visible on a particular host, run:
42+
43+
<div class="termy">
44+
45+
```shell
$ ibv_devices
```
48+
49+
</div>
50+
51+
All eight `bnxt_re*` interfaces should be listed. Use `ibv_devinfo` to inspect port state and link details. If devices are missing or in an unexpected state, install or update the NIC driver and userspace RDMA library before proceeding.
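The same check can be scripted. The sketch below is a hypothetical helper that scans `ibv_devices` output for the expected device names and reports any that are missing. The sample text is illustrative only; real GUIDs and column layout may differ on your hosts.

```python
import re


def missing_rdma_devices(ibv_devices_output: str, expected: list[str]) -> list[str]:
    """Return the expected device names that do not appear in the output."""
    # Collect every bnxt_re<N> name that starts a line (after optional whitespace).
    found = set(re.findall(r"^\s*(bnxt_re\d+)\b", ibv_devices_output, flags=re.MULTILINE))
    return [name for name in expected if name not in found]


expected = [f"bnxt_re{i}" for i in range(8)]

# Illustrative output only (made-up GUIDs); two of eight devices are visible.
sample = """\
    device                 node GUID
    ------              ----------------
    bnxt_re0            0000000000000001
    bnxt_re1            0000000000000002
"""

print(missing_rdma_devices(sample, expected))
```

On a real host, the input could come from `subprocess.run(["ibv_devices"], capture_output=True, text=True, check=True).stdout`. An empty list means every expected device is visible.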
52+
53+
## Deploying the service
54+
55+
To deploy an inference endpoint with PD disaggregation using `dstack`, define a [service](../../docs/concepts/services.md) with three replica groups: an SMG router, a pool of prefill workers, and a pool of decode workers.
56+
57+
The example below deploys `Qwen/Qwen2.5-72B-Instruct` on a multi-node cluster with AMD MI300X GPUs:
58+
59+
<div editor-title="amd-pd.dstack.yml">
60+
61+
```yaml
type: service
name: amd-sglang-pd-service

image: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260427
privileged: true

env:
  - MODEL_ID=Qwen/Qwen2.5-72B-Instruct
  - HF_TOKEN
  - SGLANG_USE_AITER=0
  - SGLANG_ROCM_FUSED_DECODE_MLA=0
  - SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
  - SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600
  - RDMA_DEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7
  - NCCL_IB_DISABLE=1

replicas:
  - count: 1
    commands:
      - pip install smg
      - |
        smg launch \
          --pd-disaggregation \
          --host 0.0.0.0 \
          --port 30000
    resources:
      cpu: 4..
    router:
      type: sglang

  - count: 1..2
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disaggregation-bootstrap-port 8998 \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

  - count: 1..4
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --decode-attention-backend triton \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

port: 30000
model: Qwen/Qwen2.5-72B-Instruct

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s

volumes:
  - /usr/lib64/libibverbs/libbnxt_re-rdmav34.so:/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so
```
156+
157+
</div>
158+
159+
`dstack` provisions each group, registers workers with the router, runs health probes, and autoscales prefill and decode pools independently against RPS.
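As a rough mental model of RPS-based scaling, the sketch below computes a replica count from an observed request rate, a per-replica target, and a `count` range. It is a simplification for illustration, not `dstack`'s actual autoscaler, which may apply smoothing or delays that are not modeled here. The function name and the observed rate of 700 requests per second are hypothetical.

```python
import math


def desired_replicas(observed_rps: float, target_rps: float,
                     min_count: int, max_count: int) -> int:
    """Replicas needed so each handles at most `target_rps`, clamped to the range."""
    wanted = math.ceil(observed_rps / target_rps)
    return max(min_count, min(max_count, wanted))


# Hypothetical load applied to the two worker pools from the configuration:
# prefill has count 1..2 and decode has count 1..4, both with target 300.
observed = 700.0
print("prefill:", desired_replicas(observed, 300, 1, 2))
print("decode:", desired_replicas(observed, 300, 1, 4))
```

Because each replica group carries its own `count` range, the same load can yield different replica counts for the prefill and decode pools.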
160+
161+
The prefill and decode replicas run on GPU instances and bind to the Broadcom RDMA devices. The router replica needs only a CPU instance, but it must run in the same cluster.
162+
163+
!!! info "RoCE library"
164+
Mooncake uses the RDMA/RoCE interconnect for KV cache transfer. To use the RDMA/RoCE interconnect on Broadcom `bnxt_re` devices, Mooncake requires the Broadcom-specific userspace provider library `libbnxt_re-rdmav34.so` to be available inside the container at `/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so`. We make this library available by mounting the host provider library from `/usr/lib64/libibverbs/libbnxt_re-rdmav34.so`.
165+
166+
Apply the configuration:
167+
168+
<div class="termy">
169+
170+
```shell
$ export HF_TOKEN=...
$ dstack apply -f amd-pd.dstack.yml
```
174+
175+
</div>
176+
177+
Once provisioning completes, `dstack` exposes the service through a single endpoint:
178+
179+
<div class="termy">
180+
181+
```shell
$ curl http://localhost:3000/proxy/services/main/amd-sglang-pd-service/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "Qwen/Qwen2.5-72B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming."
            }
        ]
    }'
```
195+
196+
</div>
197+
198+
Requests are routed to SMG, which selects the prefill and decode workers for each request. The prefill worker processes the prompt, the decode worker continues generation, and Mooncake transfers the KV cache between them over RoCE. `dstack` registers and deregisters workers with SMG as replicas are added or removed, runs the `/health` probe on each replica, and scales each replica group independently.
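The same request can be issued from Python using only the standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper, the URL and token placeholder are copied from the curl example, and the response shape in the commented-out part assumes an OpenAI-compatible chat completions reply.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build a POST request for the service's chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:3000/proxy/services/main/amd-sglang-pd-service",
    token="<dstack token>",  # placeholder, as in the curl example
    model="Qwen/Qwen2.5-72B-Instruct",
    prompt="Compose a poem that explains the concept of recursion in programming.",
)

# Sending needs a running service, so it is left commented out.
# The response shape is assumed to follow the OpenAI-compatible format.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

The client talks only to the single service endpoint; routing to prefill and decode workers happens behind it.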
199+
200+
!!! info "Limitations"
201+
- Currently, only one router replica per service is supported.
202+
- The example uses the SGLang inference backend for prefill and decode workers. vLLM backend support is coming soon.
203+
- Autoscaling supports the RPS metric. TTFT and ITL-based autoscaling support is coming soon.
204+
205+
## Why this matters
206+
207+
`dstack` provides a single, simple interface for orchestrating training and inference across hardware vendors, serving frameworks, routers, and infrastructure. It removes the need to assemble multiple fragmented tools on top of Kubernetes or build your own orchestration layer in-house.
208+
209+
!!! info "Benchmarks"
210+
Benchmarks for PD disaggregation on AMD are in progress and will be published in a follow-up. If you are running AMD GPUs and would like to contribute workloads or collaborate on benchmarking, please get in touch.
211+
212+
Bug reports, feedback, and feature requests are welcome on the [issue tracker](https://github.com/dstackai/dstack/issues) and on [Discord](https://discord.gg/u8SmfwPpMd).
213+
214+
> *Thanks to Matthew Bettinger at AMD for the collaboration, testing time, and feedback that shaped this integration.*
215+
216+
## What's next?
217+
218+
1. Read about [services](https://dstack.ai/docs/concepts/services/) and [fleets](https://dstack.ai/docs/concepts/fleets/)
219+
2. Check the [NCCL/RCCL tests](https://dstack.ai/docs/examples/clusters/nccl-rccl-tests/) example
220+
3. Review the [Shepherd Model Gateway](https://lightseek.org/smg/getting-started/) and [SGLang PD disaggregation](https://docs.sglang.ai/advanced_features/pd_disaggregation.html) documentation
221+
4. Join [Discord](https://discord.gg/u8SmfwPpMd)

mkdocs/docs/concepts/services.md

Lines changed: 113 additions & 2 deletions
@@ -346,7 +346,9 @@ Since 0.20.17, `dstack` supports serving a model using Prefill-Decode disaggrega
346346

347347
`dstack` integrates with two routers for PD disaggregation: [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html) and [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo).
348348

349-
Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
349+
#### NVIDIA
350+
351+
Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:
350352

351353
=== "SMG"
352354

@@ -521,7 +523,116 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
521523

522524
</div>
523525

524-
> With the the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers.
526+
> With the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers.
527+
528+
#### AMD
529+
530+
The example below deploys `Qwen/Qwen2.5-72B-Instruct` on a multi-node cluster with AMD MI300X GPUs:
531+
532+
<div editor-title="amd-pd.dstack.yml">
533+
534+
```yaml
type: service
name: amd-sglang-pd-service

image: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260427
privileged: true

env:
  - MODEL_ID=Qwen/Qwen2.5-72B-Instruct
  - HF_TOKEN
  - SGLANG_USE_AITER=0
  - SGLANG_ROCM_FUSED_DECODE_MLA=0
  - SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
  - SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600
  - RDMA_DEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7
  - NCCL_IB_DISABLE=1

replicas:
  - count: 1
    commands:
      - pip install smg
      - |
        smg launch \
          --pd-disaggregation \
          --host 0.0.0.0 \
          --port 30000
    resources:
      cpu: 4..
    router:
      type: sglang

  - count: 1..2
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disaggregation-bootstrap-port 8998 \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

  - count: 1..4
    scaling:
      metric: rps
      target: 300
    commands:
      - |
        python3 -m sglang.launch_server \
          --model $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 30000 \
          --tp $DSTACK_GPUS_NUM \
          --trust-remote-code \
          --disaggregation-ib-device $RDMA_DEVICES \
          --disable-radix-cache \
          --disable-cuda-graph \
          --disable-overlap-schedule \
          --decode-attention-backend triton \
          --mem-fraction-static 0.8 \
          --max-running-requests 1024
    resources:
      gpu: MI300X:8
      cpu: 96..
      memory: 512GB..

port: 30000
model: Qwen/Qwen2.5-72B-Instruct

# Custom probe is required for PD disaggregation.
probes:
  - type: http
    url: /health
    interval: 15s

volumes:
  - /usr/lib64/libibverbs/libbnxt_re-rdmav34.so:/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so
```
629+
630+
</div>
631+
632+
!!! info "RoCE library"
633+
Mooncake uses the RDMA/RoCE interconnect for KV cache transfer. To use the RDMA/RoCE interconnect on Broadcom `bnxt_re` devices, Mooncake requires the Broadcom-specific userspace provider library `libbnxt_re-rdmav34.so` to be available inside the container at `/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so`. We make this library available by mounting the host provider library from `/usr/lib64/libibverbs/libbnxt_re-rdmav34.so`.
634+
635+
525636

526637
!!! info "Cluster"
527638
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.

mkdocs/docs/examples/clusters/nccl-rccl-tests.md

Lines changed: 3 additions & 3 deletions
@@ -111,9 +111,9 @@ Here's an example of a task that runs AllReduce test on 2 nodes, each with 4 GPU
111111
</div>
112112

113113
!!! info "RoCE library"
114-
Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom
115-
kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it
116-
using `LD_PRELOAD` when running MPI.
114+
RCCL tests use the RDMA/RoCE interconnect for internode communication. To use the RDMA/RoCE interconnect on Broadcom `bnxt_re` devices, RCCL requires the Broadcom-specific userspace provider library `libbnxt_re-rdmav34.so` to be available inside the container at `/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so`. We make this library available by mounting it from the host and using `LD_PRELOAD` when running MPI.
115+
116+
Alternatively, if you use a custom image with OpenMPI pre-installed, you can skip `LD_PRELOAD` and mount the library directly at `/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so`.
117117

118118
!!! info "Privileged"
119119
In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand).
