docs/docs/concepts/gateways.md
5 additions & 4 deletions
@@ -110,7 +110,11 @@ router:

</div>

-!!! info "Policy"
+If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
+
+> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
+
+??? info "Policy"
The `policy` property allows you to configure the routing policy:

* `cache_aware` — Default policy; combines cache locality with load balancing, falling back to shortest queue.
@@ -119,9 +123,6 @@ router:
* `round_robin` — Cycles through workers in order.


-> Currently, services using this type of gateway must run standard SGLang workers. See the [example](../../examples/inference/sglang/index.md).
->
-> Support for prefill/decode disaggregation and auto-scaling based on inter-token latency is coming soon.
docs/docs/concepts/services.md
5 additions & 2 deletions
@@ -182,6 +182,8 @@ Setting the minimum number of replicas to `0` allows the service to scale down t

> The `scaling` property requires creating a [gateway](gateways.md).

+<span id="replica-groups"></span>
+
??? info "Replica groups"
A service can include multiple replica groups. Each group can define its own `commands`, `resources` requirements, and `scaling` rules.

@@ -230,8 +232,9 @@ Setting the minimum number of replicas to `0` allows the service to scale down t

> Properties such as `regions`, `port`, `image`, `env` and some other cannot be configured per replica group. This support is coming soon.

-??? info "Disaggregated serving"
-Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
+!!! info "Router policy"
+If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.

-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
+
+## Configuration options
+
+### PD disaggregation
+
+If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
[`examples/llms/deepseek/sglang`](https://github.com/dstackai/dstack/blob/master/examples/llms/deepseek/sglang) and [`examples/inference/sglang`](https://github.com/dstackai/dstack/blob/master/examples/inference/sglang).