Skip to content

Commit 20771c3

Browse files
BihanBihan  Ranapeterschmidt85
authored
[Docs] PD disaggregation (#3592)
* Add pd-disaggregation docs * Add pd.dstack.yml file * Minor update * Update gateway and services docs * [Docs] Minor changes related to PD disaggregation --------- Co-authored-by: Bihan Rana <bihan@Bihans-MacBook-Pro.local> Co-authored-by: peterschmidt85 <andrey.cheptsov@gmail.com>
1 parent 7c4314b commit 20771c3

File tree

4 files changed

+158
-12
lines changed

4 files changed

+158
-12
lines changed

docs/docs/concepts/gateways.md

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -110,7 +110,11 @@ router:
110110

111111
</div>
112112

113-
!!! info "Policy"
113+
If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
114+
115+
> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
116+
117+
??? info "Policy"
114118
The `policy` property allows you to configure the routing policy:
115119

116120
* `cache_aware` &mdash; Default policy; combines cache locality with load balancing, falling back to shortest queue.
@@ -119,9 +123,6 @@ router:
119123
* `round_robin` &mdash; Cycles through workers in order.
120124

121125

122-
> Currently, services using this type of gateway must run standard SGLang workers. See the [example](../../examples/inference/sglang/index.md).
123-
>
124-
> Support for prefill/decode disaggregation and auto-scaling based on inter-token latency is coming soon.
125126

126127
### Public IP
127128

docs/docs/concepts/services.md

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -182,6 +182,8 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
182182

183183
> The `scaling` property requires creating a [gateway](gateways.md).
184184

185+
<span id="replica-groups"></span>
186+
185187
??? info "Replica groups"
186188
A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules.
187189

@@ -230,8 +232,9 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
230232

231233
> Properties such as `regions`, `port`, `image`, `env` and some other cannot be configured per replica group. This support is coming soon.
232234

233-
??? info "Disaggregated serving"
234-
Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
235+
### PD disaggregation
236+
237+
If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
235238

236239
### Authorization
237240

examples/inference/sglang/README.md

Lines changed: 97 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ This example shows how to deploy DeepSeek-R1-Distill-Llama 8B and 70B using [SGL
99

1010
## Apply a configuration
1111

12-
Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SgLang.
12+
Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B using SGLang.
1313

1414
=== "NVIDIA"
1515

@@ -108,15 +108,106 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
108108
```
109109
</div>
110110

111-
!!! info "SGLang Model Gateway"
112-
If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
111+
!!! info "Router policy"
112+
If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
113113

114-
> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
114+
> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
115+
116+
## Configuration options
117+
118+
### PD disaggregation
119+
120+
If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
121+
122+
<div editor-title="examples/inference/sglang/pd.dstack.yml">
123+
124+
```yaml
125+
type: service
126+
name: prefill-decode
127+
image: lmsysorg/sglang:latest
128+
129+
env:
130+
- HF_TOKEN
131+
- MODEL_ID=zai-org/GLM-4.5-Air-FP8
132+
133+
replicas:
134+
- count: 1..4
135+
scaling:
136+
metric: rps
137+
target: 3
138+
commands:
139+
- |
140+
python -m sglang.launch_server \
141+
--model-path $MODEL_ID \
142+
--disaggregation-mode prefill \
143+
--disaggregation-transfer-backend mooncake \
144+
--host 0.0.0.0 \
145+
--port 8000 \
146+
--disaggregation-bootstrap-port 8998
147+
resources:
148+
gpu: H200
149+
150+
- count: 1..8
151+
scaling:
152+
metric: rps
153+
target: 2
154+
commands:
155+
- |
156+
python -m sglang.launch_server \
157+
--model-path $MODEL_ID \
158+
--disaggregation-mode decode \
159+
--disaggregation-transfer-backend mooncake \
160+
--host 0.0.0.0 \
161+
--port 8000
162+
resources:
163+
gpu: H200
164+
165+
port: 8000
166+
model: zai-org/GLM-4.5-Air-FP8
167+
168+
# Custom probe is required for PD disaggregation
169+
probes:
170+
- type: http
171+
url: /health_generate
172+
interval: 15s
173+
174+
router:
175+
type: sglang
176+
pd_disaggregation: true
177+
```
178+
179+
</div>
180+
181+
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
182+
183+
#### Gateway
184+
185+
Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
186+
187+
For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
188+
189+
<div editor-title="gateway.dstack.yml">
190+
191+
```yaml
192+
type: gateway
193+
name: gateway-name
194+
195+
backend: kubernetes
196+
region: any
197+
198+
domain: example.com
199+
router:
200+
type: sglang
201+
```
202+
203+
</div>
204+
205+
<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
115206

116207
## Source code
117208

118-
The source-code of this example can be found in
119-
[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang).
209+
The source-code of these examples can be found in
210+
[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang).
120211

121212
## What's next?
122213

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
type: service
2+
name: prefill-decode
3+
image: lmsysorg/sglang:latest
4+
5+
env:
6+
- HF_TOKEN
7+
- MODEL_ID=zai-org/GLM-4.5-Air-FP8
8+
9+
replicas:
10+
- count: 1..4
11+
scaling:
12+
metric: rps
13+
target: 3
14+
commands:
15+
- |
16+
python -m sglang.launch_server \
17+
--model-path $MODEL_ID \
18+
--disaggregation-mode prefill \
19+
--disaggregation-transfer-backend mooncake \
20+
--host 0.0.0.0 \
21+
--port 8000 \
22+
--disaggregation-bootstrap-port 8998
23+
resources:
24+
gpu: 1
25+
26+
- count: 1..8
27+
scaling:
28+
metric: rps
29+
target: 2
30+
commands:
31+
- |
32+
python -m sglang.launch_server \
33+
--model-path $MODEL_ID \
34+
--disaggregation-mode decode \
35+
--disaggregation-transfer-backend mooncake \
36+
--host 0.0.0.0 \
37+
--port 8000
38+
resources:
39+
gpu: 1
40+
41+
port: 8000
42+
model: zai-org/GLM-4.5-Air-FP8
43+
44+
probes:
45+
- type: http
46+
url: /health_generate
47+
interval: 15s
48+
49+
router:
50+
type: sglang
51+
pd_disaggregation: true

0 commit comments

Comments
 (0)