docs/blog/posts/pd-disaggregation.md (+36 −25)

````diff
@@ -28,7 +28,7 @@ For inference, `dstack` provides a [services](../../docs/concepts/services.md) a
 ## Services
 
-With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.
+With `dstack` `0.20.17`, you can define a service with separate replica groups for Router, Prefill, and Decode workers and run PD-disaggregated inference.
 
 <div editor-title="glm45air.dstack.yml">
````
````diff
@@ -43,6 +43,21 @@ env:
   image: lmsysorg/sglang:latest
 
 replicas:
+  - count: 1
+    # For now, the replica group with the router must have count: 1
````
````diff
-Just like `dstack` relies on the SGLang router for cache-aware routing, Prefill–Decode disaggregation also requires a [gateway](../../docs/concepts/gateways.md#sglang) configured with the SGLang router.
+Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are in the same network.
 
-* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must be running in the same cluster.
-* Prefill–Decode disaggregation is currently available with the SGLang backend (vLLM support is coming).
+* The router replica group is currently limited to `count: 1` (no HA yet). Support for multiple router replicas for HA is planned.
+* Prefill–Decode disaggregation is currently available with the SGLang backend (NVIDIA Dynamo and vLLM support is coming).
 * Autoscaling supports RPS as the metric for now; TTFT and ITL metrics are planned next.
 
 With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, and Kubernetes or bare-metal clusters.
````
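The replica-group layout the post describes can be sketched as a single service skeleton. This is a simplified, hypothetical outline (worker commands and most flags omitted; the service name and GPU type follow the GLM-4.5-Air example elsewhere in this PR), not the exact shipped config:

```yaml
type: service
name: glm45air            # illustrative name

replicas:
  # Router group: CPU-only; currently must have count: 1
  - count: 1
    router:
      type: sglang
    resources:
      cpu: 4
  # Prefill worker group: autoscaled on RPS
  - count: 1..4
    scaling:
      metric: rps
    resources:
      gpu: H200
  # Decode worker group: autoscaled on RPS
  - count: 1..4
    scaling:
      metric: rps
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8
fleets: [pd-disagg]       # SSH fleet with the CPU and GPU hosts
```

The key design point is that the router runs as an ordinary replica group on a CPU host, rather than inside the gateway as before.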
docs/docs/concepts/gateways.md (+5 −1)

````diff
@@ -95,7 +95,11 @@ router:
 If you configure the `sglang` router, [services](../concepts/services.md) can run either [standard SGLang workers](../../examples/inference/sglang/index.md) or [Prefill-Decode workers](../../examples/inference/sglang/index.md#pd-disaggregation) (aka PD disaggregation).
 
-> Note, if you want to run services with PD disaggregation, the gateway must currently run in the same cluster as the service.
+!!! note "PD disaggregation"
+    To run services with PD disaggregation, see [SGLang PD disaggregation](https://dstack.ai/examples/inference/sglang/#pd-disaggregation).
+
+!!! note "Deprecation"
+    Configuring the SGLang router in a gateway will be deprecated in a future release.
 
 ??? info "Policy"
 
     The `policy` property allows you to configure the routing policy:
````
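For reference, a gateway configured with the `sglang` router (the setup the deprecation note refers to) looks roughly like this; the name, backend, region, and domain are placeholder values:

```yaml
type: gateway
name: sglang-gateway          # placeholder
backend: aws                  # placeholder
region: eu-west-1             # placeholder
domain: example.company.com   # placeholder

router:
  type: sglang
```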
docs/docs/concepts/services.md (+1 −2)

````diff
@@ -107,7 +107,6 @@ If [authorization](#authorization) is not disabled, the service endpoint require
 Here are cases where a service may need a [gateway](gateways.md):
 
 * To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
-* To enable a custom router, e.g. the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
 * To enable HTTPS for the endpoint and map it to your domain
 * If your service requires WebSockets
 * If your service cannot work with a [path prefix](#path-prefix)
@@ -234,7 +233,7 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 ### PD disaggregation
 
-If you create a gateway with the [`sglang` router](gateways.md#sglang), you can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
+You can run SGLang with [Prefill-Decode disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html). See the [corresponding example](../../examples/inference/sglang/index.md#pd-disaggregation).
````
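To ground the auto-scaling bullet above, here is a minimal sketch of a replica range with RPS-based scaling; the `target` value is illustrative:

```yaml
type: service
# ... image, commands, port, etc.

replicas: 0..4   # a minimum of 0 lets the service scale down to zero
scaling:
  metric: rps    # requests per second; currently the only supported metric
  target: 10     # illustrative per-replica target
```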
````diff
 If you'd like to use a custom routing policy, create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
 
+!!! info "Run router and workers separately"
+    To run the SGLang router and workers separately, use replica groups (router as a CPU replica group, workers as GPU replica groups). See [PD disaggregation](#pd-disaggregation).
 
 > If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
 
 ## Configuration options
 
 ### PD disaggregation
 
-If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
+To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/pd_disaggregation.html), run the **router as a replica** on a CPU-only host, while running **prefill/decode workers** as replicas on GPU hosts.
````
````diff
+    # For now, the replica group with the router must have count: 1
+    commands:
+      - pip install sglang_router
+      - |
+        python -m sglang_router.launch_router \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --pd-disaggregation \
+          --prefill-policy cache_aware
+    router:
+      type: sglang
+    resources:
+      cpu: 4
+
 - count: 1..4
   scaling:
     metric: rps
````
````diff
@@ -140,7 +155,7 @@ replicas:
       python -m sglang.launch_server \
         --model-path $MODEL_ID \
         --disaggregation-mode prefill \
-        --disaggregation-transfer-backend mooncake \
+        --disaggregation-transfer-backend nixl \
         --host 0.0.0.0 \
         --port 8000 \
         --disaggregation-bootstrap-port 8998
````
````diff
@@ -156,53 +171,53 @@ replicas:
       python -m sglang.launch_server \
         --model-path $MODEL_ID \
         --disaggregation-mode decode \
-        --disaggregation-transfer-backend mooncake \
+        --disaggregation-transfer-backend nixl \
         --host 0.0.0.0 \
         --port 8000
   resources:
     gpu: H200
 
 port: 8000
 model: zai-org/GLM-4.5-Air-FP8
+
+# SSH fleet containing both router (CPU) and workers (GPU).
+fleets: [pd-disagg]
 
-# Custom probe is required for PD disaggregation
+# Custom probe is required for PD disaggregation.
 probes:
   - type: http
-    url: /health_generate
+    url: /health
     interval: 15s
-
-router:
-  type: sglang
-  pd_disaggregation: true
 ```
 
 </div>
````
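Stripped of the diff markers, the router replica group this PR adds reads as follows (indentation reconstructed from context):

```yaml
replicas:
  # For now, the replica group with the router must have count: 1
  - count: 1
    commands:
      - pip install sglang_router
      - |
        python -m sglang_router.launch_router \
          --host 0.0.0.0 \
          --port 8000 \
          --pd-disaggregation \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4
```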
````diff
 Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
 
-#### Gateway
-
-Note, running services with PD disaggregation currently requires the gateway to run in the same cluster as the service.
+#### SSH fleet
 
-For example, if you run services on the `kubernetes` backend, make sure to also create the gateway in the same backend:
+Create an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#apply-a-configuration) that includes one CPU host for the router and one or more GPU hosts for the workers. Make sure the CPU and GPU hosts are in the same network.
 
 <!-- TODO: Gateway creation using fleets is coming to simplify this. -->
+
+!!! note "Gateway-based routing (deprecated)"
+    If you create a gateway with the [`sglang` router](https://dstack.ai/docs/concepts/gateways/#sglang), you can also run SGLang with PD disaggregation. This method will be deprecated in the future in favor of running the router as a replica.
````
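An SSH fleet matching the `fleets: [pd-disagg]` reference might look like this; the user, key path, and host IPs are placeholders:

```yaml
type: fleet
name: pd-disagg

ssh_config:
  user: ubuntu                  # placeholder
  identity_file: ~/.ssh/id_rsa  # placeholder
  hosts:
    - 10.0.0.10   # CPU host for the router
    - 10.0.0.11   # GPU host for prefill/decode workers
    - 10.0.0.12   # GPU host for prefill/decode workers
```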