Add llm-d as a second router backend alongside vllm-router, selected via
`[inference.deployment.router] type = "llm-d"`. Built on the external-LB
launch substrate — the per-rank vLLM engines launch identically; only the
router control plane differs.
- `LlmdRouterConfig`: discriminated-union member with `scorers` (base) +
per-profile `prefill_scorer_overrides`/`decode_scorer_overrides`,
`non_cached_tokens`, `decode_sidecar_port`, and a known-scorer validator.
- `launch_router` helper gains an llm-d branch: renders per-replica EPP +
Envoy + file-discovery endpoints and launches `epp` + `envoy` instead of
vllm-router. Call sites stay router-agnostic.
- pd-sidecar on each decode node (P/D) for remote-prefill orchestration + NIXL.
- Entrypoints pass the `router` config object to the templates.
- Reject `enable_return_routed_experts` with llm-d (breaks P/D, unverified
for multi-node).
- SLURM cleanup also clears stale `epp`/`envoy`/`pd-sidecar` processes.
- Presets under `templates/llmd/` + `scripts/install_llmd.sh`; docs + install skill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs/inference.md: 17 additions & 4 deletions
@@ -30,7 +30,7 @@ We support 3 distinct deployment shapes:
 
 Most of the features are supported for all deployment shapes, with few exceptions. These exceptions are rejected on validation.
 
-You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This is a config-field that allows you to set the deployment shape, deployment-specific knobs such as `num_nodes`, `num_replicas`, `router_port`, `backend_port`, etc.
+You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This is a config-field that allows you to set the deployment shape, deployment-specific knobs such as `num_nodes`, `num_replicas`, `backend_port`, and a `[...deployment.router]` block, etc.
 
 ```toml
 [inference.deployment]
@@ -102,7 +102,7 @@ tp = 2
 dp = 4
 ```
 
-This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing will be handled by the `vllm-router` instance running on the same node as the 1st replica. We aim to support more advanced routing options, such as `llm-d` or `dynamo` in the future. You can read more about the supported routing options in the [router](#router) section.
+This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing is handled by a router instance running on the same node as the 1st replica — either `vllm-router` (default) or the upstream `llm-d` EPP+Envoy, selected via the `[...deployment.router]` block. You can read more about the supported routing options in the [router](#router) section.
 
 ### Wide-EP
 
@@ -167,9 +167,22 @@ This will run 3 inference replicas, each running on 6 nodes. Each replica will r
 
 ## Router
 
-We use our own fork of [vllm-router](https://github.com/PrimeIntellect-ai/router) as the request handler. We plan to support more advanced proxy options in the future.
+Multi-node and disaggregated deployments front their vLLM backends with a router, configured via a discriminated `[...deployment.router]` block (`type = "vllm-router" | "llm-d"`):
 
-Right now, router handles 2 most important things:
+```toml
+[inference.deployment.router]   # or [deployment.router] for the standalone inference entrypoint
+type = "llm-d"   # "vllm-router" (default) or "llm-d"
+# llm-d-only knobs (all optional):
+scorers = { "prefix-cache-scorer" = 3.0, "active-request-scorer" = 2.0 }   # base, applied to every profile
+decode_scorer_overrides = {}   # merged onto the P/D decode profile
+non_cached_tokens = 16   # below this many non-cached prompt tokens, skip remote prefill (P/D)
+```
+
+- **`vllm-router`** (default) — our fork of [vllm-router](https://github.com/PrimeIntellect-ai/router). Knob: `policy`.
+- **`llm-d`** — the upstream [llm-d](https://llm-d.ai) Endpoint Picker (EPP) + Envoy proxy. Routing combines **prefix-cache affinity** (grouped rollouts reuse a cached prefix and skip prefill) with the **`active-request-scorer`** — an in-flight load balancer that spreads requests across ranks immediately, unlike the metrics-scraped `queue-scorer` / `kv-cache-utilization-scorer` / `load-aware-scorer` (which lag and concentrate bursts of same-prefix requests). The scorer weights follow the upstream llm-d P/D guide; tune via `scorers` (base) + `prefill_scorer_overrides` / `decode_scorer_overrides` (per-profile, P/D). Does not support `enable_return_routed_experts` (router replay).
+
+Both backends support the 2 most important things:
 - Request routing - KV cache re-use and balanced routing
 - P/D disaggregation - handling the prefill and decode stages separately
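
The way the `scorers` weights interact can be illustrated with a small Python sketch. This is not the EPP implementation (the EPP is a separate binary); it assumes, for illustration only, that each scorer returns a value in [0, 1] per endpoint, that the router takes a weighted sum, and that the highest total wins. The `pick_endpoint` helper, the toy cache and load numbers, and `MAX_IN_FLIGHT` are all hypothetical.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical scorer type: returns a score in [0, 1] for an endpoint name.
Scorer = Callable[[str], float]


def pick_endpoint(
    endpoints: Sequence[str],
    scorers: Mapping[str, Scorer],
    weights: Mapping[str, float],
) -> str:
    """Return the endpoint with the highest weighted sum of scorer outputs."""

    def total(endpoint: str) -> float:
        return sum(w * scorers[name](endpoint) for name, w in weights.items())

    return max(endpoints, key=total)


# Toy state (made up): rank0 holds the whole prompt prefix but is busy,
# rank1 has nothing cached and is idle.
cached_prefix_fraction = {"rank0": 1.0, "rank1": 0.0}
in_flight = {"rank0": 8, "rank1": 0}
MAX_IN_FLIGHT = 8  # assumed normalisation constant for the load score

scorers = {
    "prefix-cache-scorer": lambda ep: cached_prefix_fraction[ep],
    "active-request-scorer": lambda ep: 1.0 - in_flight[ep] / MAX_IN_FLIGHT,
}

# Default weights from the docs above: cache affinity outweighs load.
weights = {"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0}
choice = pick_endpoint(["rank0", "rank1"], scorers, weights)
print(choice)  # rank0 (3.0 vs 2.0)
```

Under these assumptions, lowering the `prefix-cache-scorer` weight to 1.0 flips the choice to the idle rank, which is the trade-off the `scorers` knob exposes.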
+# Known llm-d EPP scorer plugins (used to guard the ``scorers`` map against typos).
+KNOWN_SCORERS = frozenset(
+    {
+        "prefix-cache-scorer",
+        "precise-prefix-cache-scorer",
+        "queue-scorer",
+        "kv-cache-utilization-scorer",
+        "active-request-scorer",
+        "load-aware-scorer",
+        "running-requests-size-scorer",
+        "token-load-scorer",
+        "latency-scorer",
+        "session-affinity-scorer",
+        "lora-affinity-scorer",
+    }
+)
+
+
 class VllmRouterConfig(BaseConfig):
-    """PrimeIntellect vllm-router fronting the per-rank (external-LB) endpoints."""
+    """PrimeIntellect vllm-router."""
 
     type: Literal["vllm-router"] = "vllm-router"
 
@@ -179,9 +197,57 @@ class VllmRouterConfig(BaseConfig):
     """Routing policy, e.g. ``consistent_hash`` or ``round_robin``."""
 
 
-# Discriminated on ``type`` so additional router backends can be added to the
-# union (a single member needs no discriminator yet).
-RouterConfig: TypeAlias = VllmRouterConfig
+class LlmdRouterConfig(BaseConfig):
+    """llm-d router backend (EPP + Envoy)."""
+
+    type: Literal["llm-d"] = "llm-d"
+
+    port: int = 8000
+    """Port the Envoy gateway listens on — becomes the client-facing router URL."""
+
+    scorers: dict[str, float] = {
+        "prefix-cache-scorer": 3.0,
+        "active-request-scorer": 2.0,
+    }
+    """EPP scorer name → weight, applied to every routing profile (before the per-profile P/D overrides). Defaults to prefix-cache affinity plus in-flight (active-request) load balancing. Unknown scorer names are rejected."""
+
+    prefill_scorer_overrides: dict[str, float] = {
+        "queue-scorer": 2.0,
+        "kv-cache-utilization-scorer": 2.0,
+    }
+    """P/D only: scorer → weight merged onto ``scorers`` for the prefill profile (a per-profile weight overrides the base)."""
+
+    decode_scorer_overrides: dict[str, float] = {}
+    """P/D only: scorer → weight merged onto ``scorers`` for the decode profile (a per-profile weight overrides the base); empty by default."""
+
+    non_cached_tokens: int = 16
+    """P/D only: requests with fewer than this many non-cached prompt tokens skip remote prefill and run decode-only."""
+
+    decode_sidecar_port: int = 8300
+    """P/D only: port the decode-side llm-d sidecar listens on."""
+
+    @property
+    def prefill_scorers(self) -> dict[str, float]:
+        """Effective prefill-profile scorers: ``scorers`` merged with ``prefill_scorer_overrides``."""
 Verify: `uv run python -c 'import deep_ep; print(deep_ep.__file__)'`.
 
+### llm-d router backend
+
+Multi-node / disaggregated deployments can route through the upstream llm-d Endpoint Picker instead of `vllm-router` (set `[...deployment.router] type = "llm-d"`). It needs three native binaries — install once:
+
+```bash
+bash scripts/install_llmd.sh   # builds epp + pd-sidecar from a pinned llm-d-router commit (vendored Go), fetches envoy
+```
+
+Binaries land in `third_party/llmd/bin/{epp,envoy,pd-sidecar}` (a shared path, so SLURM nodes see them). `epp` is pinned to the commit that includes the `vllmhttp-parser` (PR #1248) so prime-rl's renderer/TITO `/inference/v1/generate` path routes correctly. Override the pin with `LLMD_ROUTER_REF=<sha>`. The EPP + Envoy + endpoints configs are rendered from `templates/llmd/*.yaml.j2` (included into the SLURM script); only the per-node IPv4 addresses are filled in inline at launch time.
+
 ## Key files
 
 - `pyproject.toml` — dependencies, extras, dependency groups
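
The `LlmdRouterConfig` docstrings in this commit describe three behaviours whose bodies are not visible in the diff: rejecting unknown scorer names, merging per-profile overrides onto the base `scorers`, and skipping remote prefill below `non_cached_tokens`. A minimal standalone sketch of those rules, assuming a plain dict merge where the override wins and a strict "fewer than" comparison; the function names here are hypothetical and the scorer set is a subset of `KNOWN_SCORERS`:

```python
# Subset of the KNOWN_SCORERS set from the diff, repeated so the sketch runs alone.
KNOWN_SCORERS = frozenset(
    {
        "prefix-cache-scorer",
        "queue-scorer",
        "kv-cache-utilization-scorer",
        "active-request-scorer",
    }
)


def validate_scorers(scorers: dict[str, float]) -> dict[str, float]:
    """Reject scorer names outside KNOWN_SCORERS (guards against typos)."""
    unknown = sorted(set(scorers) - KNOWN_SCORERS)
    if unknown:
        raise ValueError(f"unknown llm-d scorer(s): {unknown}")
    return dict(scorers)


def merge_scorers(base: dict[str, float], overrides: dict[str, float]) -> dict[str, float]:
    """Per-profile weights override the base; other base entries are kept."""
    return {**base, **overrides}


def skips_remote_prefill(prompt_tokens: int, cached_tokens: int, non_cached_tokens: int = 16) -> bool:
    """True when fewer than `non_cached_tokens` prompt tokens are uncached."""
    return (prompt_tokens - cached_tokens) < non_cached_tokens


base = validate_scorers({"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0})
prefill = merge_scorers(
    base,
    validate_scorers({"queue-scorer": 2.0, "kv-cache-utilization-scorer": 2.0}),
)
decode = merge_scorers(base, {})  # empty overrides leave the base unchanged
```

With the defaults shown in the diff, the prefill profile ends up with four scorers and the decode profile with the two base scorers; a misspelled name such as `prefx-cache-scorer` raises `ValueError` in this sketch.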