Skip to content

Commit 59d246b

Browse files
Bihan  RanaBihan  Rana
authored andcommitted
Update docs for router as replica
1 parent cbb13f0 commit 59d246b

File tree

7 files changed

+173
-63
lines changed

7 files changed

+173
-63
lines changed

docs/blog/posts/pd-disaggregation.md

Lines changed: 36 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ For inference, `dstack` provides a [services](../../docs/concepts/services.md) a
2828
2929
## Services
3030

31-
With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.
31+
With `dstack` `0.20.17`, you can define a service with separate replica groups for Router, Prefill and Decode workers and run PD disaggregated Inference.
3232

3333
<div editor-title="glm45air.dstack.yml">
3434

@@ -43,6 +43,21 @@ env:
4343
image: lmsysorg/sglang:latest
4444

4545
replicas:
46+
- count: 1
47+
# For now replica group with router must have count: 1
48+
commands:
49+
- pip install sglang_router
50+
- |
51+
python -m sglang_router.launch_router \
52+
--host 0.0.0.0 \
53+
--port 8000 \
54+
--pd-disaggregation \
55+
--prefill-policy cache_aware
56+
router:
57+
type: sglang
58+
resources:
59+
cpu: 4
60+
4661
- count: 1..4
4762
scaling:
4863
metric: rps
@@ -52,7 +67,7 @@ replicas:
5267
python -m sglang.launch_server \
5368
--model-path $MODEL_ID \
5469
--disaggregation-mode prefill \
55-
--disaggregation-transfer-backend mooncake \
70+
--disaggregation-transfer-backend nixl \
5671
--host 0.0.0.0 \
5772
--port 8000 \
5873
--disaggregation-bootstrap-port 8998
@@ -68,7 +83,7 @@ replicas:
6883
python -m sglang.launch_server \
6984
--model-path $MODEL_ID \
7085
--disaggregation-mode decode \
71-
--disaggregation-transfer-backend mooncake \
86+
--disaggregation-transfer-backend nixl \
7287
--host 0.0.0.0 \
7388
--port 8000
7489
resources:
@@ -79,12 +94,8 @@ model: zai-org/GLM-4.5-Air-FP8
7994

8095
probes:
8196
- type: http
82-
url: /health_generate
97+
url: /health
8398
interval: 15s
84-
85-
router:
86-
type: sglang
87-
pd_disaggregation: true
8899
```
89100
90101
</div>
@@ -100,32 +111,32 @@ $ dstack apply -f glm45air.dstack.yml
100111

101112
</div>
102113

103-
### Gateway
114+
### SSH fleet
104115

105-
Just like `dstack` relies on the SGLang router for cache-aware routing, Prefill–Decode disaggregation also requires a [gateway](../../docs/concepts/gateways.md#sglang) configured with the SGLang router.
116+
Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are in the same network.
106117

107-
<div editor-title="gateway-sglang.dstack.yml">
118+
<div editor-title="pd-fleet.dstack.yml">
108119

109120
```yaml
110-
type: gateway
111-
name: inference-gateway
112-
113-
backends: [kubernetes]
114-
region: any
115-
116-
domain: example.com
117-
118-
router:
119-
type: sglang
120-
policy: cache_aware
121+
type: fleet
122+
name: pd-disagg
123+
124+
placement: cluster
125+
126+
ssh_config:
127+
user: ubuntu
128+
identity_file: ~/.ssh/id_rsa
129+
hosts:
130+
- 89.169.108.16 # CPU Host (router)
131+
- 89.169.123.100 # GPU Host (prefill/decode workers)
132+
- 89.169.110.65 # GPU Host (prefill/decode workers)
121133
```
122134
123135
</div>
124136
125137
## Limitations
126-
127-
* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must be running in the same cluster.
128-
* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).
138+
* The router replica group is currently limited to `count: 1` (no HA yet). Support for multiple router replicas for HA is planned.
139+
* Prefill–Decode disaggregation is currently available with the SGLang backend (Nvidia-dynamo and vLLM support is coming).
129140
* Autoscaling supports RPS as the metric for now; TTFT and ITL metrics are planned next.
130141

131142
With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, and Kubernetes or bare-metal clusters.

docs/docs/concepts/gateways.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -95,7 +95,11 @@ router:
9595

9696
If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
9797

98-
> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
98+
!!! note "PD disaggregation"
99+
To run services with PD disaggregation see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).
100+
101+
!!! note "Deprecation"
102+
Configuring the SGLang router in a gateway will be deprecated in a future release.
99103

100104
??? info "Policy"
101105
The `policy` property allows you to configure the routing policy:

docs/docs/concepts/services.md

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -107,7 +107,6 @@ If [authorization](#authorization) is not disabled, the service endpoint require
107107
Here are cases where a service may need a [gateway](gateways.md):
108108

109109
* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
110-
* To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
111110
* To enable HTTPS for the endpoint and map it to your domain
112111
* If your service requires WebSockets
113112
* If your service cannot work with a [path prefix](#path-prefix)
@@ -234,7 +233,7 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
234233

235234
### PD disaggregation
236235

237-
If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
236+
You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
238237

239238
### Authorization
240239

examples/inference/sglang/README.md

Lines changed: 41 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -108,16 +108,16 @@ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
108108
```
109109
</div>
110110

111-
!!! info "Router policy"
112-
If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
111+
!!! info "Run router and workers separately"
112+
To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).
113113

114114
> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
115115
116116
## Configuration options
117117

118118
### PD disaggregation
119119

120-
If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
120+
To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.
121121

122122
<div editor-title="examples/inference/sglang/pd.dstack.yml">
123123

@@ -131,6 +131,21 @@ env:
131131
- MODEL_ID=zai-org/GLM-4.5-Air-FP8
132132

133133
replicas:
134+
- count: 1
135+
# For now replica group with router must have count: 1
136+
commands:
137+
- pip install sglang_router
138+
- |
139+
python -m sglang_router.launch_router \
140+
--host 0.0.0.0 \
141+
--port 8000 \
142+
--pd-disaggregation \
143+
--prefill-policy cache_aware
144+
router:
145+
type: sglang
146+
resources:
147+
cpu: 4
148+
134149
- count: 1..4
135150
scaling:
136151
metric: rps
@@ -140,7 +155,7 @@ replicas:
140155
python -m sglang.launch_server \
141156
--model-path $MODEL_ID \
142157
--disaggregation-mode prefill \
143-
--disaggregation-transfer-backend mooncake \
158+
--disaggregation-transfer-backend nixl \
144159
--host 0.0.0.0 \
145160
--port 8000 \
146161
--disaggregation-bootstrap-port 8998
@@ -156,53 +171,53 @@ replicas:
156171
python -m sglang.launch_server \
157172
--model-path $MODEL_ID \
158173
--disaggregation-mode decode \
159-
--disaggregation-transfer-backend mooncake \
174+
--disaggregation-transfer-backend nixl \
160175
--host 0.0.0.0 \
161176
--port 8000
162177
resources:
163178
gpu: H200
164179

165180
port: 8000
166181
model: zai-org/GLM-4.5-Air-FP8
182+
# SSH fleet containing both router (CPU) and workers (GPU).
183+
fleets: [pd-disagg]
167184

168-
# Custom probe is required for PD disaggregation
185+
# Custom probe is required for PD disaggregation.
169186
probes:
170187
- type: http
171-
url: /health_generate
188+
url: /health
172189
interval: 15s
173-
174-
router:
175-
type: sglang
176-
pd_disaggregation: true
177190
```
178191
179192
</div>
180193
181194
Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
182195

183-
#### Gateway
184-
185-
Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
196+
#### SSH fleet
186197

187-
For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
198+
Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are in the same network.
188199

189-
<div editor-title="gateway.dstack.yml">
200+
<div editor-title="pd-fleet.dstack.yml">
190201

191202
```yaml
192-
type: gateway
193-
name: gateway-name
194-
195-
backend: kubernetes
196-
region: any
197-
198-
domain: example.com
199-
router:
200-
type: sglang
203+
type: fleet
204+
name: pd-disagg
205+
206+
placement: cluster
207+
208+
ssh_config:
209+
user: ubuntu
210+
identity_file: ~/.ssh/id_rsa
211+
hosts:
212+
- 89.169.108.16 # CPU Host (router)
213+
- 89.169.123.100 # GPU Host (prefill/decode workers)
214+
- 89.169.110.65 # GPU Host (prefill/decode workers)
201215
```
202216

203217
</div>
204218

205-
<!-- TODO: Gateway creation using fleets is coming to simplify this. -->
219+
!!! note "Gateway-based routing (deprecated)"
220+
If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method will be deprecated in the future in favor of running the router as a replica.
206221

207222
## Source code
208223

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
type: fleet
2+
name: pd-disagg
3+
4+
placement: cluster
5+
6+
ssh_config:
7+
user: ubuntu
8+
identity_file: ~/.ssh/id_rsa
9+
hosts:
10+
- 89.169.108.16 # CPU Host (router)
11+
- 89.169.123.100 # GPU Host (prefill/decode workers)
12+
- 89.169.110.65 # GPU Host (prefill/decode workers)
Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
# DEPRECATED: Gateway-based PD disaggregation config.
2+
# Use `pd.dstack.yml` instead (router runs as a replica).
3+
4+
type: service
5+
name: prefill-decode
6+
image: lmsysorg/sglang:latest
7+
8+
env:
9+
- HF_TOKEN
10+
- MODEL_ID=zai-org/GLM-4.5-Air-FP8
11+
12+
replicas:
13+
- count: 1..4
14+
scaling:
15+
metric: rps
16+
target: 3
17+
commands:
18+
- |
19+
python -m sglang.launch_server \
20+
--model-path $MODEL_ID \
21+
--disaggregation-mode prefill \
22+
--disaggregation-transfer-backend mooncake \
23+
--host 0.0.0.0 \
24+
--port 8000 \
25+
--disaggregation-bootstrap-port 8998
26+
resources:
27+
gpu: 1
28+
29+
- count: 1..8
30+
scaling:
31+
metric: rps
32+
target: 2
33+
commands:
34+
- |
35+
python -m sglang.launch_server \
36+
--model-path $MODEL_ID \
37+
--disaggregation-mode decode \
38+
--disaggregation-transfer-backend mooncake \
39+
--host 0.0.0.0 \
40+
--port 8000
41+
resources:
42+
gpu: 1
43+
44+
port: 8000
45+
model: zai-org/GLM-4.5-Air-FP8
46+
47+
probes:
48+
- type: http
49+
url: /health_generate
50+
interval: 15s
51+
52+
router:
53+
type: sglang
54+
pd_disaggregation: true

0 commit comments

Comments
 (0)