| 1 | +--- |
| 2 | +title: "Deploying inference endpoints with PD disaggregation on AMD GPUs" |
| 3 | +date: 2026-05-21 |
| 4 | +description: "A walkthrough of deploying PD disaggregated inference on AMD GPUs with dstack and Shepherd Model Gateway (SMG), using SGLang workers and the Mooncake Transfer Engine." |
| 5 | +slug: amd-pd-disaggregation |
| 6 | +image: https://dstack.ai/static-assets/static-assets/images/amd-pd-disaggregation.png |
| 7 | +categories: |
| 8 | + - Changelog |
| 9 | +--- |
| 10 | + |
| 11 | +# Deploying inference endpoints with PD disaggregation on AMD GPUs |
| 12 | + |
| 13 | +`dstack` is an open-source, AI-native orchestrator that works across clouds, Kubernetes clusters, on-prem fleets, hardware vendors, and frameworks. Alongside training, inference is one of the primary use cases `dstack` supports out of the box. |
| 14 | + |
| 15 | +<img src="https://dstack.ai/static-assets/static-assets/images/amd-pd-disaggregation.png" width="630" /> |
| 16 | + |
| 17 | +`dstack` recently added native support for Prefill–Decode (PD) disaggregation. It works with [Shepherd Model Gateway](smg.md) (SMG), a high-performance inference gateway evolved from the SGLang Router, on both NVIDIA and AMD GPUs, and with [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/) on NVIDIA GPUs. This post walks through deploying PD disaggregation on AMD GPUs with SMG. |
| 18 | + |
| 19 | +<!-- more --> |
| 20 | + |
| 21 | +## Why PD disaggregation |
| 22 | + |
| 23 | +PD disaggregation is useful when a single LLM deployment has two different bottlenecks: |
| 24 | + |
| 25 | +- **Prefill** processes the prompt. It is compute-bound, parallelizable, and has a direct impact on Time to First Token (TTFT). |
| 26 | +- **Decode** generates tokens one by one. It is memory-bandwidth-bound, sequential, and has a direct impact on inter-token latency. |
| 27 | + |
| 28 | +When the same worker handles both phases, every replica has to serve both bottlenecks. With PD disaggregation, prefill and decode run as separate pools, and each pool can be sized and scaled independently. |
| 29 | + |
| 30 | +The tradeoff is operational: for every request, the KV cache produced by the prefill worker must be transferred to the decode worker before generation can continue. That transfer sits on the TTFT path, so the cluster needs a high-bandwidth, low-latency interconnect such as RDMA over InfiniBand or RoCE, rather than TCP over a conventional NIC. |
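
To get a feel for why the interconnect matters, the KV cache size can be estimated from the model shape. The sketch below is a back-of-envelope calculation, not a benchmark. The shape values (80 layers, 8 KV heads, head dimension 128) are what we understand the published `Qwen2.5-72B-Instruct` config to specify, the 2-byte element size assumes a BF16/FP16 KV cache, and the prompt length and link speeds are hypothetical. It also assumes a single link running at full line rate with no protocol overhead, so real transfer times will differ.

```shell
# Estimate how long one request's KV cache takes to cross a link, in whole milliseconds.
# Arguments: layers kv_heads head_dim bytes_per_element prompt_tokens link_gbit_per_s
kv_transfer_ms() {
  # Keys and values (factor 2) are stored per layer, per KV head, per head dimension.
  bytes_per_token=$((2 * $1 * $2 * $3 * $4))
  total_bits=$((bytes_per_token * $5 * 8))
  # 1 Gbit/s moves 10^6 bits per millisecond.
  echo $((total_bits / ($6 * 1000000)))
}

kv_transfer_ms 80 8 128 2 8192 400   # hypothetical 400 Gbit/s RDMA link: prints 53
kv_transfer_ms 80 8 128 2 8192 10    # hypothetical 10 Gbit/s TCP link: prints 2147
```

Under these assumptions, an 8,192-token prompt produces about 2.5 GiB of KV cache (320 KiB per token). The slower link would add roughly two seconds to TTFT, while the faster one adds tens of milliseconds.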
| 31 | + |
| 32 | +In this walkthrough, [SMG](https://lightseek.org/smg/) routes requests between SGLang workers. On AMD, the workers use the [Mooncake Transfer Engine](https://github.com/kvcache-ai/Mooncake) to transfer KV cache over RDMA/RoCE. In the configuration we tested, the RDMA fabric is exposed by Broadcom `bnxt_re` Ethernet devices. |
| 33 | + |
| 34 | +??? info "Prerequisites" |
| 35 | + Running PD disaggregation on `dstack` requires first creating a [fleet](https://dstack.ai/docs/concepts/fleets/) with `placement: cluster`, so that prefill and decode workers share a high-bandwidth interconnect. This can be a [backend fleet](https://dstack.ai/docs/concepts/fleets/#backend-fleets_1) provisioned by `dstack` on a cloud or Kubernetes cluster, or an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets_1) registered against bare-metal or VM hosts you already manage. |
| 36 | + |
| 37 | +## Validating the interconnect |
| 38 | + |
| 39 | +To measure end-to-end bandwidth across nodes, run the [NCCL/RCCL tests example](../../docs/examples/clusters/nccl-rccl-tests.md). |
| 40 | + |
| 41 | +For a quick check that the RDMA devices are visible on a particular host, run: |
| 42 | + |
| 43 | +<div class="termy"> |
| 44 | + |
| 45 | +```shell |
| 46 | +$ ibv_devices |
| 47 | +``` |
| 48 | + |
| 49 | +</div> |
| 50 | + |
| 51 | +On the hosts used in this walkthrough, all eight `bnxt_re*` devices (`bnxt_re0` through `bnxt_re7`) should be listed. Use `ibv_devinfo` to inspect port state and link details. If devices are missing or in an unexpected state, install or update the NIC driver and userspace RDMA library before proceeding. |
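
The visual check above can also be scripted. The helper below is a minimal sketch: `missing_rdma_devices` is a hypothetical function name, and it assumes the device name is the first whitespace-separated column of each `ibv_devices` output line. It reads the listing from stdin, so it can be tried without RDMA hardware.

```shell
# Print each expected RDMA device that is absent from an `ibv_devices` listing.
# Argument: comma-separated device names. Stdin: the output of `ibv_devices`.
missing_rdma_devices() {
  listing=$(cat)
  for dev in $(printf '%s' "$1" | tr ',' ' '); do
    # Match the device name exactly against the first column of every line.
    if ! printf '%s\n' "$listing" |
        awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }'; then
      printf '%s\n' "$dev"
    fi
  done
}
```

On a worker host, piping `ibv_devices` into `missing_rdma_devices` with the comma-separated device list from the service configuration should print nothing when every device is present.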
| 52 | + |
| 53 | +## Deploying the service |
| 54 | + |
| 55 | +To deploy an inference endpoint with PD disaggregation using `dstack`, define a [service](../../docs/concepts/services.md) with three replica groups: an SMG router, a pool of prefill workers, and a pool of decode workers. |
| 56 | + |
| 57 | +The example below deploys `Qwen/Qwen2.5-72B-Instruct` on a multi-node cluster with AMD MI300X GPUs: |
| 58 | + |
| 59 | +<div editor-title="amd-pd.dstack.yml"> |
| 60 | + |
| 61 | +```yaml |
| 62 | +type: service |
| 63 | +name: amd-sglang-pd-service |
| 64 | + |
| 65 | +image: rocm/sgl-dev:v0.5.10.post1-rocm720-mi30x-20260427 |
| 66 | +privileged: true |
| 67 | + |
| 68 | +env: |
| 69 | + - MODEL_ID=Qwen/Qwen2.5-72B-Instruct |
| 70 | + - HF_TOKEN |
| 71 | + - SGLANG_USE_AITER=0 |
| 72 | + - SGLANG_ROCM_FUSED_DECODE_MLA=0 |
| 73 | + - SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 |
| 74 | + - SGLANG_DISAGGREGATION_WAITING_TIMEOUT=600 |
| 75 | + - RDMA_DEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 |
| 76 | + - NCCL_IB_DISABLE=1 |
| 77 | + |
| 78 | +replicas: |
| 79 | + - count: 1 |
| 80 | + commands: |
| 81 | + - pip install smg |
| 82 | + - | |
| 83 | + smg launch \ |
| 84 | + --pd-disaggregation \ |
| 85 | + --host 0.0.0.0 \ |
| 86 | + --port 30000 |
| 87 | + resources: |
| 88 | + cpu: 4.. |
| 89 | + router: |
| 90 | + type: sglang |
| 91 | + |
| 92 | + - count: 1..2 |
| 93 | + scaling: |
| 94 | + metric: rps |
| 95 | + target: 300 |
| 96 | + commands: |
| 97 | + - | |
| 98 | + python3 -m sglang.launch_server \ |
| 99 | + --model $MODEL_ID \ |
| 100 | + --disaggregation-mode prefill \ |
| 101 | + --disaggregation-transfer-backend mooncake \ |
| 102 | + --host 0.0.0.0 \ |
| 103 | + --port 30000 \ |
| 104 | + --tp $DSTACK_GPUS_NUM \ |
| 105 | + --trust-remote-code \ |
| 106 | + --disaggregation-ib-device $RDMA_DEVICES \ |
| 107 | + --disaggregation-bootstrap-port 8998 \ |
| 108 | + --disable-radix-cache \ |
| 109 | + --disable-cuda-graph \ |
| 110 | + --disable-overlap-schedule \ |
| 111 | + --mem-fraction-static 0.8 \ |
| 112 | + --max-running-requests 1024 |
| 113 | + resources: |
| 114 | + gpu: MI300X:8 |
| 115 | + cpu: 96.. |
| 116 | + memory: 512GB.. |
| 117 | + |
| 118 | + - count: 1..4 |
| 119 | + scaling: |
| 120 | + metric: rps |
| 121 | + target: 300 |
| 122 | + commands: |
| 123 | + - | |
| 124 | + python3 -m sglang.launch_server \ |
| 125 | + --model $MODEL_ID \ |
| 126 | + --disaggregation-mode decode \ |
| 127 | + --disaggregation-transfer-backend mooncake \ |
| 128 | + --host 0.0.0.0 \ |
| 129 | + --port 30000 \ |
| 130 | + --tp $DSTACK_GPUS_NUM \ |
| 131 | + --trust-remote-code \ |
| 132 | + --disaggregation-ib-device $RDMA_DEVICES \ |
| 133 | + --disable-radix-cache \ |
| 134 | + --disable-cuda-graph \ |
| 135 | + --disable-overlap-schedule \ |
| 136 | + --decode-attention-backend triton \ |
| 137 | + --mem-fraction-static 0.8 \ |
| 138 | + --max-running-requests 1024 |
| 139 | + resources: |
| 140 | + gpu: MI300X:8 |
| 141 | + cpu: 96.. |
| 142 | + memory: 512GB.. |
| 143 | + |
| 144 | +port: 30000 |
| 145 | +model: Qwen/Qwen2.5-72B-Instruct |
| 146 | + |
| 147 | +# Custom probe is required for PD disaggregation. |
| 148 | +probes: |
| 149 | + - type: http |
| 150 | + url: /health |
| 151 | + interval: 15s |
| 152 | + |
| 153 | +volumes: |
| 154 | + - /usr/lib64/libibverbs/libbnxt_re-rdmav34.so:/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so |
| 155 | +``` |
| 156 | + |
| 157 | +</div> |
| 158 | + |
| 159 | +`dstack` provisions each group, registers workers with the router, runs health probes, and autoscales prefill and decode pools independently against RPS. |
| 160 | + |
| 161 | +The prefill and decode replicas run on GPU instances and bind to the Broadcom RDMA devices, while the router replica only needs a CPU instance in the same cluster. |
| 162 | + |
| 163 | +!!! info "RoCE library" |
| 164 | + Mooncake transfers the KV cache over the RDMA/RoCE interconnect. On Broadcom `bnxt_re` devices, this requires the Broadcom-specific userspace provider library `libbnxt_re-rdmav34.so` to be available inside the container at `/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so`. The `volumes` entry in the configuration mounts it from the host path `/usr/lib64/libibverbs/libbnxt_re-rdmav34.so`. |
| 165 | + |
| 166 | +Apply the configuration: |
| 167 | + |
| 168 | +<div class="termy"> |
| 169 | + |
| 170 | +```shell |
| 171 | +$ export HF_TOKEN=... |
| 172 | +$ dstack apply -f amd-pd.dstack.yml |
| 173 | +``` |
| 174 | + |
| 175 | +</div> |
| 176 | + |
| 177 | +Once provisioning completes, `dstack` exposes the service through a single endpoint: |
| 178 | + |
| 179 | +<div class="termy"> |
| 180 | + |
| 181 | +```shell |
| 182 | +$ curl http://localhost:3000/proxy/services/main/amd-sglang-pd-service/v1/chat/completions \ |
| 183 | + -H 'Content-Type: application/json' \ |
| 184 | + -H 'Authorization: Bearer <dstack token>' \ |
| 185 | + -d '{ |
| 186 | + "model": "Qwen/Qwen2.5-72B-Instruct", |
| 187 | + "messages": [ |
| 188 | + { |
| 189 | + "role": "user", |
| 190 | + "content": "Compose a poem that explains the concept of recursion in programming." |
| 191 | + } |
| 192 | + ] |
| 193 | + }' |
| 194 | +``` |
| 195 | + |
| 196 | +</div> |
| 197 | + |
| 198 | +Requests are routed to SMG, which selects the prefill and decode workers for each request. The prefill worker processes the prompt, the decode worker continues generation, and Mooncake transfers the KV cache between them over RoCE. `dstack` registers and deregisters workers with SMG as replicas are added or removed, runs the `/health` probe on each replica, and scales each replica group independently. |
| 199 | + |
| 200 | +!!! info "Limitations" |
| 201 | + - Currently, only one router replica per service is supported. |
| 202 | + - The example uses the SGLang inference backend for prefill and decode workers. vLLM backend support is coming soon. |
| 203 | + - Autoscaling currently supports only the RPS metric. Support for autoscaling on TTFT and inter-token latency (ITL) is coming soon. |
| 204 | + |
| 205 | +## Why this matters |
| 206 | + |
| 207 | +`dstack` provides a single, simple interface for orchestrating training and inference across hardware vendors, serving frameworks, routers, and infrastructure. It removes the need to assemble multiple fragmented tools on top of Kubernetes or build your own orchestration layer in-house. |
| 208 | + |
| 209 | +!!! info "Benchmarks" |
| 210 | + Benchmarks for PD disaggregation on AMD are in progress and will be published in a follow-up. If you are running AMD GPUs and would like to contribute workloads or collaborate on benchmarking, please get in touch. |
| 211 | + |
| 212 | +Bug reports, feedback, and feature requests are welcome on the [issue tracker](https://github.com/dstackai/dstack/issues) and on [Discord](https://discord.gg/u8SmfwPpMd). |
| 213 | + |
| 214 | +> *Thanks to Matthew Bettinger at AMD for the collaboration, testing time, and feedback that shaped this integration.* |
| 215 | + |
| 216 | +## What's next? |
| 217 | + |
| 218 | +1. Read about [services](https://dstack.ai/docs/concepts/services/) and [fleets](https://dstack.ai/docs/concepts/fleets/) |
| 219 | +2. Check the [NCCL/RCCL tests](https://dstack.ai/docs/examples/clusters/nccl-rccl-tests/) example |
| 220 | +3. Review the [Shepherd Model Gateway](https://lightseek.org/smg/getting-started/) and [SGLang PD disaggregation](https://docs.sglang.ai/advanced_features/pd_disaggregation.html) documentation |
| 221 | +4. Join [Discord](https://discord.gg/u8SmfwPpMd) |