docs/blog/posts/pd-disaggregation.md (+36 −25)

````diff
@@ -28,7 +28,7 @@ For inference, `dstack` provides a [services](../../docs/concepts/services.md) a
 ## Services
 
-With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.
+With `dstack` `0.20.17`, you can define a service with separate replica groups for Router, Prefill, and Decode workers and run PD-disaggregated inference.
 
 <div editor-title="glm45air.dstack.yml">
````
````diff
@@ -43,6 +43,21 @@ env:
   image: lmsysorg/sglang:latest
 
 replicas:
+  - count: 1
+    # For now, the replica group with the router must have count: 1
````
````diff
-Just like `dstack` relies on the SGLang router for cache-aware routing, Prefill–Decode disaggregation also requires a [gateway](../../docs/concepts/gateways.md#sglang) configured with the SGLang router.
+Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are in the same network.
 
-* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must be running in the same cluster.
-* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).
+* The router replica group is currently limited to `count: 1` (no HA yet). Support for multiple router replicas for HA is planned.
+* Prefill–Decode disaggregation is currently available with the SGLang backend (NVIDIA Dynamo and vLLM support is coming).
 * Autoscaling supports RPS as the metric for now; TTFT and ITL metrics are planned next.
 
 With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, and Kubernetes or bare-metal clusters.
````
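The replica-group layout the post describes can be sketched as a single service skeleton. This is a simplified, hypothetical outline (worker commands and most flags omitted; the service name and GPU type follow the GLM-4.5-Air example elsewhere in this PR), not the exact shipped config:

```yaml
type: service
name: glm45air            # illustrative name

replicas:
  # Router group: CPU-only; currently must have count: 1
  - count: 1
    router:
      type: sglang
    resources:
      cpu: 4
  # Prefill worker group: autoscaled on RPS
  - count: 1..4
    scaling:
      metric: rps
    resources:
      gpu: H200
  # Decode worker group: autoscaled on RPS
  - count: 1..4
    scaling:
      metric: rps
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8
fleets: [pd-disagg]       # SSH fleet with the CPU and GPU hosts
```

The key design point is that the router runs as an ordinary replica group on a CPU host, rather than inside the gateway as before.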
docs/docs/concepts/gateways.md (+5 −1)

````diff
@@ -95,7 +95,11 @@ router:
 If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
 
-> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
+!!! note "PD disaggregation"
+    To run services with PD disaggregation, see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).
+
+!!! note "Deprecation"
+    Configuring the SGLang router in a gateway will be deprecated in a future release.
 
 ??? info "Policy"
 
     The `policy` property allows you to configure the routing policy:
````
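For reference, a gateway configured with the `sglang` router (the setup the deprecation note refers to) looks roughly like this; the name, backend, region, and domain are placeholder values:

```yaml
type: gateway
name: sglang-gateway          # placeholder
backend: aws                  # placeholder
region: eu-west-1             # placeholder
domain: example.company.com   # placeholder

router:
  type: sglang
```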
docs/docs/concepts/services.md (+1 −2)

````diff
@@ -107,7 +107,6 @@ If [authorization](#authorization) is not disabled, the service endpoint require
 Here are cases where a service may need a [gateway](gateways.md):
 
 * To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
-* To enable a custom router, e.g. the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
 * To enable HTTPS for the endpoint and map it to your domain
 * If your service requires WebSockets
 * If your service cannot work with a [path prefix](#path-prefix)
@@ -234,7 +233,7 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 ### PD disaggregation
 
-If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
+You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
````
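To ground the auto-scaling bullet above, here is a minimal sketch of a replica range with RPS-based scaling; the `target` value is illustrative:

```yaml
type: service
# ... image, commands, port, etc.

replicas: 0..4   # a minimum of 0 lets the service scale down to zero
scaling:
  metric: rps    # requests per second; currently the only supported metric
  target: 10     # illustrative per-replica target
```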
````diff
 If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
 
+!!! info "Run router and workers separately"
+    To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).
 
 > If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
 
 ## Configuration options
 
 ### PD disaggregation
 
-If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
+To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.
````
````diff
+    # For now, the replica group with the router must have count: 1
+    commands:
+      - pip install sglang_router
+      - |
+        python -m sglang_router.launch_router \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --pd-disaggregation \
+          --prefill-policy cache_aware
+    router:
+      type: sglang
+    resources:
+      cpu: 4
+
 - count: 1..4
   scaling:
     metric: rps
````
````diff
@@ -140,7 +155,7 @@ replicas:
       python -m sglang.launch_server \
         --model-path $MODEL_ID \
         --disaggregation-mode prefill \
-        --disaggregation-transfer-backend mooncake \
+        --disaggregation-transfer-backend nixl \
         --host 0.0.0.0 \
         --port 8000 \
         --disaggregation-bootstrap-port 8998
````
````diff
@@ -156,53 +171,53 @@ replicas:
       python -m sglang.launch_server \
         --model-path $MODEL_ID \
         --disaggregation-mode decode \
-        --disaggregation-transfer-backend mooncake \
+        --disaggregation-transfer-backend nixl \
         --host 0.0.0.0 \
         --port 8000
   resources:
     gpu: H200
 
 port: 8000
 model: zai-org/GLM-4.5-Air-FP8
+
+# SSH fleet containing both router (CPU) and workers (GPU).
+fleets: [pd-disagg]
 
-# Custom probe is required for PD disaggregation
+# Custom probe is required for PD disaggregation.
 probes:
   - type: http
-    url: /health_generate
+    url: /health
     interval: 15s
-
-router:
-  type: sglang
-  pd_disaggregation: true
 ```
 
 </div>
````
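Stripped of the diff markers, the router replica group this PR adds reads as follows (indentation reconstructed from context):

```yaml
replicas:
  # For now, the replica group with the router must have count: 1
  - count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4
```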
````diff
 Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
 
-#### Gateway
-
-Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
+#### SSH fleet
 
-For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
+Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are in the same network.
 
 <!-- TODO: Gateway creation using fleets is coming to simplify this. -->
+
+!!! note "Gateway-based routing (deprecated)"
+    If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method will be deprecated in the future in favor of running the router as a replica.
````
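An SSH fleet matching the `fleets: [pd-disagg]` reference might look like this; the user, key path, and host IPs are placeholders:

```yaml
type: fleet
name: pd-disagg

ssh_config:
  user: ubuntu                  # placeholder
  identity_file: ~/.ssh/id_rsa  # placeholder
  hosts:
    - 10.0.0.10   # CPU host for the router
    - 10.0.0.11   # GPU host for prefill/decode workers
    - 10.0.0.12   # GPU host for prefill/decode workers
```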