Commit 44e47f9

S1ro1 and claude committed
feat: llm-d (EPP+Envoy) router backend
Add llm-d as a second router backend alongside vllm-router, selected via `[inference.deployment.router] type = "llm-d"`. Built on the external-LB launch substrate: the per-rank vLLM engines launch identically; only the router control plane differs.

- `LlmdRouterConfig`: discriminated-union member with `scorers` (base) + per-profile `prefill_scorer_overrides`/`decode_scorer_overrides`, `non_cached_tokens`, `decode_sidecar_port`, and a known-scorer validator.
- `launch_router` helper gains an llm-d branch: renders per-replica EPP + Envoy + file-discovery endpoints and launches `epp` + `envoy` instead of vllm-router. Call sites stay router-agnostic.
- pd-sidecar on each decode node (P/D) for remote-prefill orchestration + NIXL.
- Entrypoints pass the `router` config object to the templates.
- Reject `enable_return_routed_experts` with llm-d (breaks P/D, unverified for multi-node).
- SLURM cleanup also clears stale `epp`/`envoy`/`pd-sidecar` processes.
- Presets under `templates/llmd/` + `scripts/install_llmd.sh`; docs + install skill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 0ec66ef commit 44e47f9

14 files changed

Lines changed: 526 additions & 31 deletions

docs/inference.md

Lines changed: 17 additions & 4 deletions
@@ -30,7 +30,7 @@ We support 3 distinct deployment shapes:
 
 Most of the features are supported for all deployment shapes, with few exceptions. These exceptions are rejected on validation.
 
-You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This is a config-field that allows you to set the deployment shape, deployment-specific knobs such as `num_nodes`, `num_replicas`, `router_port`, `backend_port`, etc.
+You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This is a config-field that allows you to set the deployment shape, deployment-specific knobs such as `num_nodes`, `num_replicas`, `backend_port`, and a `[...deployment.router]` block, etc.
 
 ```toml
 [inference.deployment]
@@ -102,7 +102,7 @@ tp = 2
 dp = 4
 ```
 
-This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing will be handled by the `vllm-router` instance running on the same node as the 1st replica. We aim to support more advanced routing options, such as `llm-d` or `dynamo` in the future. You can read more about the supported routing options in the [router](#router) section.
+This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing is handled by a router instance running on the same node as the 1st replica — either `vllm-router` (default) or the upstream `llm-d` EPP+Envoy, selected via the `[...deployment.router]` block. You can read more about the supported routing options in the [router](#router) section.
 
 ### Wide-EP
 
@@ -167,9 +167,22 @@ This will run 3 inference replicas, each running on 6 nodes. Each replica will r
 
 ## Router
 
-We use our own fork of [vllm-router](https://github.com/PrimeIntellect-ai/router) as the request handler. We plan to support more advanced proxy options in the future.
+Multi-node and disaggregated deployments front their vLLM backends with a router, configured via a discriminated `[...deployment.router]` block (`type = "vllm-router" | "llm-d"`):
 
-Right now, router handles 2 most important things:
+```toml
+[inference.deployment.router] # or [deployment.router] for the standalone inference entrypoint
+type = "llm-d" # "vllm-router" (default) or "llm-d"
+# llm-d-only knobs (all optional):
+scorers = { "prefix-cache-scorer" = 3.0, "active-request-scorer" = 2.0 } # base, applied to every profile
+prefill_scorer_overrides = { "queue-scorer" = 2.0, "kv-cache-utilization-scorer" = 2.0 } # merged onto the P/D prefill profile
+decode_scorer_overrides = {} # merged onto the P/D decode profile
+non_cached_tokens = 16 # below this many non-cached prompt tokens, skip remote prefill (P/D)
+```
+
+- **`vllm-router`** (default) — our fork of [vllm-router](https://github.com/PrimeIntellect-ai/router). Knob: `policy`.
+- **`llm-d`** — the upstream [llm-d](https://llm-d.ai) Endpoint Picker (EPP) + Envoy proxy. Routing combines **prefix-cache affinity** (grouped rollouts reuse a cached prefix and skip prefill) with the **`active-request-scorer`** — an in-flight load balancer that spreads requests across ranks immediately, unlike the metrics-scraped `queue-scorer` / `kv-cache-utilization-scorer` / `load-aware-scorer` (which lag and concentrate bursts of same-prefix requests). The scorer weights follow the upstream llm-d P/D guide; tune via `scorers` (base) + `prefill_scorer_overrides` / `decode_scorer_overrides` (per-profile, P/D). Does not support `enable_return_routed_experts` (router replay).
+
+Both backends support the 2 most important things:
 - Request routing - KV cache re-use and balanced routing
 - P/D disaggregation - handling the prefill and decode stages separately
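To make the interplay of the two default scorer weights concrete, here is a minimal sketch. It assumes the EPP combines scorers as a weighted sum of per-endpoint scores in [0, 1] and picks the highest total; the function `pick_endpoint`, the endpoint names, and the score values are hypothetical and are not taken from prime-rl or llm-d.

```python
def pick_endpoint(
    scores: dict[str, dict[str, float]], weights: dict[str, float]
) -> tuple[str, dict[str, float]]:
    """Return the endpoint with the highest weighted total, plus all totals.

    Assumption: each scorer yields a score in [0, 1] per endpoint and the
    router sums weight * score. This is an illustration, not llm-d's code.
    """
    totals: dict[str, float] = {}
    for scorer, weight in weights.items():
        for endpoint, score in scores.get(scorer, {}).items():
            totals[endpoint] = totals.get(endpoint, 0.0) + weight * score
    # Sort names first so that ties resolve deterministically.
    best = max(sorted(totals), key=lambda name: totals[name])
    return best, totals


# Default base weights from the docs above.
weights = {"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0}

# rank0 holds the full cached prefix but is fairly busy; rank1 is idle and cold.
warm_busy = {
    "prefix-cache-scorer": {"rank0": 1.0, "rank1": 0.0},
    "active-request-scorer": {"rank0": 0.1, "rank1": 1.0},
}

# rank0 holds only half of the prefix and is saturated; rank1 is idle and cold.
partial_saturated = {
    "prefix-cache-scorer": {"rank0": 0.5, "rank1": 0.0},
    "active-request-scorer": {"rank0": 0.0, "rank1": 1.0},
}

print(pick_endpoint(warm_busy, weights)[0])          # cache affinity wins
print(pick_endpoint(partial_saturated, weights)[0])  # load balancing wins
```

Under these assumed scores, a full prefix hit outweighs a busy rank, while a partial hit on a saturated rank loses to an idle one, which is the trade-off the two weights are meant to tune.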

packages/prime-rl-configs/src/prime_rl/configs/inference.py

Lines changed: 82 additions & 4 deletions
@@ -167,8 +167,26 @@ def to_connector_dict(self) -> dict[str, Any]:
 ]
 
 
+# Known llm-d EPP scorer plugins (used to guard the ``scorers`` map against typos).
+KNOWN_SCORERS = frozenset(
+    {
+        "prefix-cache-scorer",
+        "precise-prefix-cache-scorer",
+        "queue-scorer",
+        "kv-cache-utilization-scorer",
+        "active-request-scorer",
+        "load-aware-scorer",
+        "running-requests-size-scorer",
+        "token-load-scorer",
+        "latency-scorer",
+        "session-affinity-scorer",
+        "lora-affinity-scorer",
+    }
+)
+
+
 class VllmRouterConfig(BaseConfig):
-    """PrimeIntellect vllm-router fronting the per-rank (external-LB) endpoints."""
+    """PrimeIntellect vllm-router."""
 
     type: Literal["vllm-router"] = "vllm-router"
 
@@ -179,9 +197,57 @@ class VllmRouterConfig(BaseConfig):
     """Routing policy, e.g. ``consistent_hash`` or ``round_robin``."""
 
 
-# Discriminated on ``type`` so additional router backends can be added to the
-# union (a single member needs no discriminator yet).
-RouterConfig: TypeAlias = VllmRouterConfig
+class LlmdRouterConfig(BaseConfig):
+    """llm-d router backend (EPP + Envoy)."""
+
+    type: Literal["llm-d"] = "llm-d"
+
+    port: int = 8000
+    """Port the Envoy gateway listens on — becomes the client-facing router URL."""
+
+    scorers: dict[str, float] = {
+        "prefix-cache-scorer": 3.0,
+        "active-request-scorer": 2.0,
+    }
+    """EPP scorer name → weight, applied to every routing profile (before the per-profile P/D overrides). Defaults to prefix-cache affinity plus in-flight (active-request) load balancing. Unknown scorer names are rejected."""
+
+    prefill_scorer_overrides: dict[str, float] = {
+        "queue-scorer": 2.0,
+        "kv-cache-utilization-scorer": 2.0,
+    }
+    """P/D only: scorer → weight merged onto ``scorers`` for the prefill profile (a per-profile weight overrides the base)."""
+
+    decode_scorer_overrides: dict[str, float] = {}
+    """P/D only: scorer → weight merged onto ``scorers`` for the decode profile (a per-profile weight overrides the base); empty by default."""
+
+    non_cached_tokens: int = 16
+    """P/D only: requests with fewer than this many non-cached prompt tokens skip remote prefill and run decode-only."""
+
+    decode_sidecar_port: int = 8300
+    """P/D only: port the decode-side llm-d sidecar listens on."""
+
+    @property
+    def prefill_scorers(self) -> dict[str, float]:
+        """Effective prefill-profile scorers: ``scorers`` merged with ``prefill_scorer_overrides``."""
+        return {**self.scorers, **self.prefill_scorer_overrides}
+
+    @property
+    def decode_scorers(self) -> dict[str, float]:
+        """Effective decode-profile scorers: ``scorers`` merged with ``decode_scorer_overrides``."""
+        return {**self.scorers, **self.decode_scorer_overrides}
+
+    @model_validator(mode="after")
+    def validate_scorers(self):
+        unknown = (
+            set(self.scorers) | set(self.prefill_scorer_overrides) | set(self.decode_scorer_overrides)
+        ) - KNOWN_SCORERS
+        if unknown:
+            raise ValueError(f"Unknown llm-d scorer(s): {sorted(unknown)}. Known scorers: {sorted(KNOWN_SCORERS)}.")
+        return self
+
+
+# Discriminated on ``type`` so the launch path can pick the router backend.
+RouterConfig: TypeAlias = Annotated[VllmRouterConfig | LlmdRouterConfig, Field(discriminator="type")]
 
 
 class BaseInferenceDeploymentConfig(BaseConfig):
@@ -366,6 +432,18 @@ def validate_multi_node_requires_slurm(self):
             raise ValueError("Must use SLURM for multi-node / disaggregated deployment.")
         return self
 
+    @model_validator(mode="after")
+    def validate_llmd_no_routed_experts(self):
+        """Reject routed-expert return with the llm-d router (breaks P/D, unverified for multi-node)."""
+        router = getattr(self.deployment, "router", None)
+        if router is not None and router.type == "llm-d" and self.enable_return_routed_experts:
+            raise ValueError(
+                "The llm-d router backend does not support routed-expert return "
+                "(enable_return_routed_experts): it breaks P/D and is unverified for multi-node. "
+                "Use router type 'vllm-router' for routed-expert runs."
+            )
+        return self
+
     @model_validator(mode="after")
     def auto_setup_kv_cache_offload(self):
         if self.kv_cache_offload is not None:
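The merge and validation semantics of `LlmdRouterConfig` can be sketched without Pydantic. The snippet below is a standalone illustration using plain dicts; `effective_scorers` and `unknown_scorers` are hypothetical helper names, and `KNOWN` is only a subset of the `KNOWN_SCORERS` set from the diff.

```python
# Subset of KNOWN_SCORERS from the diff, kept short for the example.
KNOWN = frozenset(
    {
        "prefix-cache-scorer",
        "active-request-scorer",
        "queue-scorer",
        "kv-cache-utilization-scorer",
    }
)

BASE = {"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0}
PREFILL_OVERRIDES = {"queue-scorer": 2.0, "kv-cache-utilization-scorer": 2.0}


def effective_scorers(base: dict[str, float], overrides: dict[str, float]) -> dict[str, float]:
    """Merge per-profile overrides onto the base map; overrides win on conflict."""
    return {**base, **overrides}


def unknown_scorers(*maps: dict[str, float], known: frozenset[str] = KNOWN) -> list[str]:
    """Return the sorted scorer names that are not in the known set."""
    names = set().union(*maps)  # iterating a dict yields its keys
    return sorted(names - known)


# Default prefill profile: base scorers plus the two prefill-only scorers.
print(effective_scorers(BASE, PREFILL_OVERRIDES))

# A per-profile weight replaces the base weight for the same scorer.
print(effective_scorers(BASE, {"prefix-cache-scorer": 1.0}))

# A typo is caught because the name is not in the known set.
print(unknown_scorers(BASE, {"prefx-cache-scorer": 1.0}))
```

Because the overrides are merged rather than substituted, a profile cannot drop a base scorer by omission; under this reading it would have to set that scorer's weight explicitly.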

packages/prime-rl-configs/src/prime_rl/configs/rl.py

Lines changed: 19 additions & 0 deletions
@@ -435,6 +435,25 @@ def auto_setup_router_replay(self):
             )
         return self
 
+    @model_validator(mode="after")
+    def validate_llmd_no_routed_experts(self):
+        """Reject routed-expert return with the llm-d router (breaks P/D, unverified for multi-node).
+
+        Runs after ``auto_setup_router_replay`` so it also catches the
+        ``trainer.enable_router_replay`` path, which sets the inference flag here
+        (after InferenceConfig's own validators, which therefore miss it).
+        """
+        if self.inference is not None and self.inference.enable_return_routed_experts:
+            router = getattr(self.inference.deployment, "router", None)
+            if router is not None and router.type == "llm-d":
+                raise ValueError(
+                    "The llm-d router backend does not support routed-expert return "
+                    "(inference.enable_return_routed_experts / trainer.enable_router_replay): it "
+                    "breaks P/D and is unverified for multi-node. Use router type 'vllm-router' "
+                    "for router-replay runs."
+                )
+        return self
+
     @model_validator(mode="after")
     def validate_router_replay_without_kv_offload(self):
         if (
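The docstring above explains why the same check exists at two levels. The ordering problem can be reproduced with a small standalone sketch; the dataclasses `Inference` and `RL` and the function `check_llmd` are simplified, hypothetical stand-ins for the real Pydantic configs and validators.

```python
from dataclasses import dataclass, field


def check_llmd(router_type: str, return_routed_experts: bool) -> None:
    """Shared rule: llm-d cannot be combined with routed-expert return."""
    if router_type == "llm-d" and return_routed_experts:
        raise ValueError("llm-d does not support routed-expert return")


@dataclass
class Inference:
    router_type: str = "vllm-router"
    enable_return_routed_experts: bool = False

    def __post_init__(self) -> None:
        # Inner check: runs when the inference config is built.
        check_llmd(self.router_type, self.enable_return_routed_experts)


@dataclass
class RL:
    inference: Inference = field(default_factory=Inference)
    enable_router_replay: bool = False

    def __post_init__(self) -> None:
        # Stand-in for auto_setup_router_replay: flips the inference flag
        # after the inner check has already passed.
        if self.enable_router_replay:
            self.inference.enable_return_routed_experts = True
        # Outer check: without it, the combination below would slip through.
        check_llmd(self.inference.router_type, self.inference.enable_return_routed_experts)


inference = Inference(router_type="llm-d")  # passes: the flag is still False
try:
    RL(inference=inference, enable_router_replay=True)
except ValueError as err:
    print(f"rejected at the RL level: {err}")
```

The inner check alone accepts the config because the flag is only set later by the outer object, so the second validator is what closes the gap.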

scripts/install_llmd.sh

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# Install the llm-d standalone (no-Kubernetes) routing binaries into
+# third_party/llmd/bin:
+# - epp : llm-d-router Endpoint Picker (the routing brain)
+# - pd-sidecar : decode-side proxy for P/D disaggregation
+# - envoy : the data-plane proxy that calls the EPP via ext_proc
+#
+# epp/pd-sidecar are built from a pinned llm-d-router commit. We currently build
+# from a small fork that adds P/D disaggregation for vLLM's token-in
+# /inference/v1/generate endpoint (prime-rl's renderer / TITO rollout path) —
+# upstream's pd-sidecar only disaggregates the OpenAI endpoints, so token-in P/D
+# silently runs decode-only. The fork is pending upstream PR
+# llm-d/llm-d-router#1458; switch LLMD_ROUTER_REPO back to the upstream repo and
+# bump LLMD_ROUTER_REF once it merges. The fork is branched off the upstream
+# commit that added the EPP vllmhttp parser (PR #1248), which the renderer path
+# also needs.
+#
+# System Go is not required: a Go toolchain is vendored under third_party/llmd/go
+# and used to bootstrap; GOTOOLCHAIN=auto fetches the exact version the module
+# pins. Envoy is pulled as a static binary from its release container image
+# (docker is the only widely-available extraction path on bare-metal nodes).
+set -euo pipefail
+
+LLMD_ROUTER_REPO="${LLMD_ROUTER_REPO:-https://github.com/S1ro1/llm-d-router.git}"
+LLMD_ROUTER_REF="${LLMD_ROUTER_REF:-1ca4243ec84c657b4a5f507a1776d6c15a618d5b}"
+GO_BOOTSTRAP_VERSION="${GO_BOOTSTRAP_VERSION:-1.23.4}"
+ENVOY_VERSION="${ENVOY_VERSION:-1.36.0}"
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+PROJECT_DIR=$(cd "$SCRIPT_DIR/.." && pwd)
+LLMD_DIR="$PROJECT_DIR/third_party/llmd"
+BIN_DIR="$LLMD_DIR/bin"
+GO_ROOT="$LLMD_DIR/go"
+SRC_DIR="$LLMD_DIR/src"
+mkdir -p "$BIN_DIR"
+
+# --- Go toolchain (vendored bootstrap; auto-upgrades to the module's version) ---
+if [ ! -x "$GO_ROOT/bin/go" ]; then
+    echo "[install_llmd] downloading bootstrap Go $GO_BOOTSTRAP_VERSION"
+    rm -rf "$GO_ROOT"
+    curl -fsSL "https://go.dev/dl/go${GO_BOOTSTRAP_VERSION}.linux-amd64.tar.gz" | tar -xz -C "$LLMD_DIR"
+fi
+export GOROOT="$GO_ROOT"
+export PATH="$GO_ROOT/bin:$PATH"
+export GOTOOLCHAIN=auto
+echo "[install_llmd] bootstrap $(go version)"
+
+# --- Fetch llm-d-router source at the pinned ref ---
+echo "[install_llmd] fetching ${LLMD_ROUTER_REPO}@${LLMD_ROUTER_REF}"
+if [ ! -d "$SRC_DIR/.git" ]; then
+    rm -rf "$SRC_DIR"
+    git clone --quiet "$LLMD_ROUTER_REPO" "$SRC_DIR"
+fi
+git -C "$SRC_DIR" remote set-url origin "$LLMD_ROUTER_REPO"
+git -C "$SRC_DIR" fetch --quiet origin
+git -C "$SRC_DIR" checkout --quiet "$LLMD_ROUTER_REF"
+
+# --- Build epp + pd-sidecar ---
+echo "[install_llmd] building epp + pd-sidecar"
+( cd "$SRC_DIR" && go build -o "$BIN_DIR/epp" ./cmd/epp && go build -o "$BIN_DIR/pd-sidecar" ./cmd/pd-sidecar )
+
+# --- Envoy static binary (extract from the release image; keep if present) ---
+if [ ! -x "$BIN_DIR/envoy" ]; then
+    echo "[install_llmd] extracting Envoy ${ENVOY_VERSION} from envoyproxy/envoy image"
+    cid=$(docker create "envoyproxy/envoy:v${ENVOY_VERSION}")
+    docker cp "${cid}:/usr/local/bin/envoy" "$BIN_DIR/envoy"
+    docker rm "$cid" >/dev/null
+    chmod +x "$BIN_DIR/envoy"
+fi
+
+echo "[install_llmd] installed binaries in $BIN_DIR:"
+"$BIN_DIR/epp" --version 2>&1 | head -1 || true
+"$BIN_DIR/envoy" --version 2>&1 | head -1 || true
+"$BIN_DIR/pd-sidecar" --help >/dev/null 2>&1 && echo " pd-sidecar OK" || true
+echo "[install_llmd] done"
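The script's final checks are best-effort (each ends in `|| true`), so a missing binary does not fail the install. A stricter pre-launch check could look like the sketch below. It is not part of the commit; `missing_binaries` is a hypothetical helper, and the only facts taken from the script are the three binary names and the `third_party/llmd/bin` directory.

```python
import os
from pathlib import Path

# Binary names installed by scripts/install_llmd.sh.
REQUIRED_BINARIES = ("epp", "envoy", "pd-sidecar")


def missing_binaries(bin_dir: Path) -> list[str]:
    """Return the required binaries that are absent or not executable in bin_dir."""
    missing = []
    for name in REQUIRED_BINARIES:
        path = bin_dir / name
        if not (path.is_file() and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


# Relative to the project root; prints an empty list after a successful install.
print(missing_binaries(Path("third_party/llmd/bin")))
```

A launcher could call this before submitting a SLURM job and abort with the list of missing names instead of failing later on a compute node.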

skills/install/SKILL.md

Lines changed: 10 additions & 0 deletions
@@ -61,6 +61,16 @@ Flags: `--workspace DIR`, `--deepep-ref REF` (default `73b6ea4`), `--nvshmem-ver
 
 Verify: `uv run python -c 'import deep_ep; print(deep_ep.__file__)'`.
 
+### llm-d router backend
+
+Multi-node / disaggregated deployments can route through the upstream llm-d Endpoint Picker instead of `vllm-router` (set `[...deployment.router] type = "llm-d"`). It needs three native binaries — install once:
+
+```bash
+bash scripts/install_llmd.sh # builds epp + pd-sidecar from a pinned llm-d-router commit (vendored Go), fetches envoy
+```
+
+Binaries land in `third_party/llmd/bin/{epp,envoy,pd-sidecar}` (a shared path, so SLURM nodes see them). `epp` is pinned to the commit that includes the `vllmhttp-parser` (PR #1248) so prime-rl's renderer/TITO `/inference/v1/generate` path routes correctly. Override the pin with `LLMD_ROUTER_REF=<sha>`. The EPP + Envoy + endpoints configs are rendered from `templates/llmd/*.yaml.j2` (included into the SLURM script); only the per-node IPv4 addresses are filled in inline at launch time.
+
 ## Key files
 
 - `pyproject.toml` — dependencies, extras, dependency groups

src/prime_rl/entrypoints/inference.py

Lines changed: 4 additions & 5 deletions
@@ -47,7 +47,7 @@ def write_slurm_script(config: InferenceConfig, config_path: Path, script_path:
         dp_per_node=dp_per_node,
         num_nodes=getattr(config.deployment, "num_nodes", 1),
         port=config.server.port,
-        disaggregated=is_disaggregated,
+        is_disaggregated=is_disaggregated,
         kv_offload=offload is not None,
         kv_offload_mooncake=is_mooncake,
         kv_offload_cpu_bytes=int(offload.cpu.num_bytes) if is_mooncake else 0,
@@ -65,20 +65,19 @@ def write_slurm_script(config: InferenceConfig, config_path: Path, script_path:
             num_decode_replicas=config.deployment.num_decode_replicas,
             prefill_port=config.deployment.prefill_port,
             decode_port=config.deployment.decode_port,
-            router_port=config.deployment.router.port,
-            router_policy=config.deployment.router.policy,
+            router=config.deployment.router,
             data_parallel_rpc_port=config.data_parallel_rpc_port,
             use_deep_gemm=config.use_deep_gemm,
             prefill_env_overrides=config.deployment.prefill_env_overrides,
             decode_env_overrides=config.deployment.decode_env_overrides,
         )
     elif is_multi_node:
         template_vars.update(
-            router_port=config.deployment.router.port,
+            router=config.deployment.router,
             backend_port=config.deployment.backend_port,
-            router_policy=config.deployment.router.policy,
             data_parallel_rpc_port=config.data_parallel_rpc_port,
             enable_expert_parallel=config.enable_expert_parallel,
+            infer_nodes_per_replica=config.deployment.num_nodes,
         )
 
     script = template.render(**template_vars)
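Passing the whole `router` object instead of `router_port` / `router_policy` lets the consumer branch on `router.type` in one place. The sketch below illustrates that idea in Python rather than in the Jinja templates; the dataclasses `VllmRouter` and `LlmdRouter` and the function `launch_plan` are hypothetical stand-ins, the default ports are assumptions based on the values visible in this diff, and the process names come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class VllmRouter:
    type: str = "vllm-router"
    port: int = 8000  # assumed default, matching the old fallback in entrypoints/rl.py


@dataclass
class LlmdRouter:
    type: str = "llm-d"
    port: int = 8000
    decode_sidecar_port: int = 8300


def launch_plan(router, is_disaggregated: bool) -> list[str]:
    """Return the names of the routing processes a deployment would involve."""
    if router.type == "vllm-router":
        return ["vllm-router"]
    if router.type == "llm-d":
        processes = ["epp", "envoy"]
        if is_disaggregated:
            # Per the commit message, P/D adds a pd-sidecar on each decode node.
            processes.append("pd-sidecar")
        return processes
    raise ValueError(f"unknown router type: {router.type!r}")


print(launch_plan(VllmRouter(), is_disaggregated=False))
print(launch_plan(LlmdRouter(), is_disaggregated=True))
```

With this shape the call site only forwards the object, and adding another backend means adding one branch where the launch decision is made.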

src/prime_rl/entrypoints/rl.py

Lines changed: 4 additions & 2 deletions
@@ -12,6 +12,7 @@
 import pynvml
 import tomli_w
 
+from prime_rl.configs.inference import VllmRouterConfig
 from prime_rl.configs.rl import RLConfig
 from prime_rl.utils.config import cli
 from prime_rl.utils.logger import get_logger, setup_logger
@@ -371,7 +372,7 @@ def write_slurm_script(config: RLConfig, config_dir: Path, script_path: Path) ->
             num_prefill_replicas=infer_deploy.num_prefill_replicas,
             num_decode_replicas=infer_deploy.num_decode_replicas,
             gpus_per_node=config.deployment.gpus_per_node,
-            router_port=infer_deploy.router.port,
+            router=infer_deploy.router,
             prefill_port=infer_deploy.prefill_port,
             decode_port=infer_deploy.decode_port,
             inference_tp=config.inference.parallel.tp,
@@ -396,7 +397,8 @@ def write_slurm_script(config: RLConfig, config_dir: Path, script_path: Path) ->
             nodes_per_infer_replica=config.deployment.num_infer_nodes,
             num_infer_replicas=config.deployment.num_infer_replicas,
             gpus_per_node=config.deployment.gpus_per_node,
-            router_port=config.inference.deployment.router.port if config.inference else 8000,
+            router=config.inference.deployment.router if config.inference else VllmRouterConfig(),
+            infer_nodes_per_replica=config.deployment.num_infer_nodes,
             backend_port=config.inference.deployment.backend_port if config.inference else 8100,
             inference_tp=config.inference.parallel.tp if config.inference else 1,
             inference_enable_expert_parallel=config.inference.enable_expert_parallel if config.inference else False,
