PrimeIntellect-ai · S1ro1 · Jun 3, 2026 · Jun 3, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,7 @@
 
 Documenting **breaking** configuration changes — renamed, removed, or moved fields that require users to update existing configs.
 
+- **`client.dp_rank_count` removed**: With external-LB data parallelism each DP rank is its own endpoint (one `client.base_url` per rank), so the client no longer pins a rollout to an internal DP shard via the `X-data-parallel-rank` header. The `[orchestrator.student.client] dp_rank_count` field (and its per-rank client expansion) is gone — the router load-balances across the per-rank endpoints. Existing configs setting `dp_rank_count` must drop it (`extra="forbid"` rejects it); there is nothing to migrate. (2026-06-03)
 - **`inference.kv_cache_offload.cpu_bytes` removed → discriminated `type` config**: The flat `[inference.kv_cache_offload]` block with a single `cpu_bytes` field is replaced by a backend-discriminated union with composable `cpu`/`disk` tiers. Migrate native CPU offload from `[inference.kv_cache_offload]\ncpu_bytes = N` to `[inference.kv_cache_offload]\ntype = "native"` plus `[inference.kv_cache_offload.cpu]\nnum_bytes = N`. A `type = "mooncake"` backend (per-node distributed store; multi-node/SLURM only) and an optional `[inference.kv_cache_offload.disk]\npath = "..."` tier (layered behind cpu) are also available. `extra="forbid"` rejects the old `cpu_bytes` key, so existing configs must migrate. (2026-06-02)
 - **Orchestrator async-pipeline rewrite** (collection of removals/renames). The orchestrator was rewritten to overlap train/eval rollouts on a shared concurrency limiter; several config fields were removed or renamed.
   - **`orchestrator.seed` removed**: was only consumed by the deleted buffer; no replacement.

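The two newest changelog entries above amount to a small, mechanical rewrite of an existing config. A minimal sketch of that rewrite on an already-parsed config dict is shown here. `migrate_config` is a hypothetical helper written for illustration (it is not part of prime-rl), and the sample values are made up.

```python
import copy


def migrate_config(cfg: dict) -> dict:
    """Return a copy of ``cfg`` with the two removed fields rewritten.

    Hypothetical helper: the key names come from the changelog entries,
    the function itself is only an illustration.
    """
    out = copy.deepcopy(cfg)

    # `client.dp_rank_count` was removed with no replacement: just drop it.
    client = out.get("orchestrator", {}).get("student", {}).get("client")
    if isinstance(client, dict):
        client.pop("dp_rank_count", None)

    # `kv_cache_offload.cpu_bytes = N` becomes `type = "native"` plus
    # `[inference.kv_cache_offload.cpu] num_bytes = N`.
    offload = out.get("inference", {}).get("kv_cache_offload")
    if isinstance(offload, dict) and "cpu_bytes" in offload:
        num_bytes = offload.pop("cpu_bytes")
        offload.setdefault("type", "native")
        offload.setdefault("cpu", {})["num_bytes"] = num_bytes

    return out


# Made-up example config in the pre-change layout.
old_cfg = {
    "orchestrator": {
        "student": {
            "client": {
                "base_url": ["http://localhost:8000/v1"],
                "dp_rank_count": 4,
            }
        }
    },
    "inference": {"kv_cache_offload": {"cpu_bytes": 1073741824}},
}
new_cfg = migrate_config(old_cfg)
print(new_cfg)
```

The helper copies its input, so the original dict can still be inspected after the rewrite.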
diff --git a/docs/inference.md b/docs/inference.md
@@ -30,7 +30,7 @@ We support 3 distinct deployment shapes:
 
 Most of the features are supported for all deployment shapes, with few exceptions. These exceptions are rejected on validation.
 
-You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This is a config-field that allows you to set the deployment shape, deployment-specific knobs such as `num_nodes`, `num_replicas`, `router_port`, `backend_port`, etc.
+Select the deployment shape with `InferenceDeploymentConfig` in your config file. This config field sets the deployment shape along with deployment-specific knobs such as `num_nodes`, `num_replicas`, `backend_port`, and a `[...deployment.router]` block.
 
 ```toml
 [inference.deployment]
@@ -102,7 +102,7 @@ tp = 2
 dp = 4
 ```
 
-This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing will be handled by the `vllm-router` instance running on the same node as the 1st replica. We aim to support more advanced routing options, such as `llm-d` or `dynamo` in the future. You can read more about the supported routing options in the [router](#router) section.
+This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing is handled by a router instance running on the same node as the 1st replica — either `vllm-router` (default) or the upstream `llm-d` EPP+Envoy, selected via the `[...deployment.router]` block. You can read more about the supported routing options in the [router](#router) section.
 
 ### Wide-EP
 
@@ -167,9 +167,22 @@ This will run 3 inference replicas, each running on 6 nodes. Each replica will r
 
 ## Router
 
-We use our own fork of [vllm-router](https://github.com/PrimeIntellect-ai/router) as the request handler. We plan to support more advanced proxy options in the future.
+Multi-node and disaggregated deployments front their vLLM backends with a router, configured via a discriminated `[...deployment.router]` block (`type = "vllm-router" | "llm-d"`):
 
-Right now, router handles 2 most important things:
+```toml
+[inference.deployment.router]   # or [deployment.router] for the standalone inference entrypoint
+type = "llm-d"                  # "vllm-router" (default) or "llm-d"
+# llm-d-only knobs (all optional):
+scorers = { "prefix-cache-scorer" = 3.0, "active-request-scorer" = 2.0 }   # base, applied to every profile
+prefill_scorer_overrides = { "queue-scorer" = 2.0, "kv-cache-utilization-scorer" = 2.0 }  # merged onto the P/D prefill profile
+decode_scorer_overrides = {}    # merged onto the P/D decode profile
+non_cached_tokens = 16          # below this many non-cached prompt tokens, skip remote prefill (P/D)
+```
+
+- **`vllm-router`** (default) — our fork of [vllm-router](https://github.com/PrimeIntellect-ai/router). Knob: `policy`.
+- **`llm-d`** — the upstream [llm-d](https://llm-d.ai) Endpoint Picker (EPP) + Envoy proxy. Routing combines **prefix-cache affinity** (grouped rollouts reuse a cached prefix and skip prefill) with the **`active-request-scorer`** — an in-flight load balancer that spreads requests across ranks immediately, unlike the metrics-scraped `queue-scorer` / `kv-cache-utilization-scorer` / `load-aware-scorer` (which lag and concentrate bursts of same-prefix requests). The scorer weights follow the upstream llm-d P/D guide; tune via `scorers` (base) + `prefill_scorer_overrides` / `decode_scorer_overrides` (per-profile, P/D). Does not support `enable_return_routed_experts` (router replay).
+
+Both backends handle the two core responsibilities:
 - Request routing - KV cache re-use and balanced routing
 - P/D disaggregation - handling the prefill and decode stages separately
 

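To make the interplay between scorer weights concrete, here is a simplified model of weighted scoring. It assumes each scorer yields a per-endpoint score in `[0, 1]` and that the endpoint with the highest weighted sum wins. That is an assumption made for illustration, not a description of the actual EPP implementation; `pick_endpoint` and the sample scores are hypothetical.

```python
def pick_endpoint(scores_by_scorer, weights):
    """Pick the endpoint with the highest weighted score sum.

    scores_by_scorer: scorer name -> {endpoint: score in [0, 1]}
    weights:          scorer name -> weight (as in the ``scorers`` map)
    Illustrative model only; the real EPP logic lives upstream.
    """
    totals = {}
    for scorer, weight in weights.items():
        for endpoint, score in scores_by_scorer.get(scorer, {}).items():
            totals[endpoint] = totals.get(endpoint, 0.0) + weight * score
    if not totals:
        raise ValueError("no endpoint received a score")
    best = max(totals, key=totals.get)
    return best, totals


# Default base weights from the docs above.
weights = {"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0}

# Made-up scores: rank0 holds the cached prefix but is busier than rank1.
warm_and_busy = {
    "prefix-cache-scorer": {"rank0": 1.0, "rank1": 0.0},
    "active-request-scorer": {"rank0": 0.2, "rank1": 1.0},
}
best_warm, _ = pick_endpoint(warm_and_busy, weights)

# Made-up scores: rank0 has only half the prefix cached and is saturated.
half_warm_saturated = {
    "prefix-cache-scorer": {"rank0": 0.5, "rank1": 0.0},
    "active-request-scorer": {"rank0": 0.0, "rank1": 1.0},
}
best_saturated, _ = pick_endpoint(half_warm_saturated, weights)

print(best_warm, best_saturated)
```

Under this model, the default weights keep a rollout on the rank that holds its prefix until that rank's load penalty outweighs the cache benefit, at which point the request moves to a less loaded rank.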
diff --git a/packages/prime-rl-configs/src/prime_rl/configs/inference.py b/packages/prime-rl-configs/src/prime_rl/configs/inference.py
@@ -167,8 +167,26 @@ def to_connector_dict(self) -> dict[str, Any]:
 ]
 
 
+# Known llm-d EPP scorer plugins (used to guard the ``scorers`` map against typos).
+KNOWN_SCORERS = frozenset(
+    {
+        "prefix-cache-scorer",
+        "precise-prefix-cache-scorer",
+        "queue-scorer",
+        "kv-cache-utilization-scorer",
+        "active-request-scorer",
+        "load-aware-scorer",
+        "running-requests-size-scorer",
+        "token-load-scorer",
+        "latency-scorer",
+        "session-affinity-scorer",
+        "lora-affinity-scorer",
+    }
+)
+
+
 class VllmRouterConfig(BaseConfig):
-    """PrimeIntellect vllm-router fronting the per-rank (external-LB) endpoints."""
+    """PrimeIntellect vllm-router."""
 
     type: Literal["vllm-router"] = "vllm-router"
 
@@ -179,9 +197,57 @@ class VllmRouterConfig(BaseConfig):
     """Routing policy, e.g. ``consistent_hash`` or ``round_robin``."""
 
 
-# Discriminated on ``type`` so additional router backends can be added to the
-# union (a single member needs no discriminator yet).
-RouterConfig: TypeAlias = VllmRouterConfig
+class LlmdRouterConfig(BaseConfig):
+    """llm-d router backend (EPP + Envoy)."""
+
+    type: Literal["llm-d"] = "llm-d"
+
+    port: int = 8000
+    """Port the Envoy gateway listens on — becomes the client-facing router URL."""
+
+    scorers: dict[str, float] = {
+        "prefix-cache-scorer": 3.0,
+        "active-request-scorer": 2.0,
+    }
+    """EPP scorer name → weight, applied to every routing profile (before the per-profile P/D overrides). Defaults to prefix-cache affinity plus in-flight (active-request) load balancing. Unknown scorer names are rejected."""
+
+    prefill_scorer_overrides: dict[str, float] = {
+        "queue-scorer": 2.0,
+        "kv-cache-utilization-scorer": 2.0,
+    }
+    """P/D only: scorer → weight merged onto ``scorers`` for the prefill profile (a per-profile weight overrides the base)."""
+
+    decode_scorer_overrides: dict[str, float] = {}
+    """P/D only: scorer → weight merged onto ``scorers`` for the decode profile (a per-profile weight overrides the base); empty by default."""
+
+    non_cached_tokens: int = 16
+    """P/D only: requests with fewer than this many non-cached prompt tokens skip remote prefill and run decode-only."""
+
+    decode_sidecar_port: int = 8300
+    """P/D only: port the decode-side llm-d sidecar listens on."""
+
+    @property
+    def prefill_scorers(self) -> dict[str, float]:
+        """Effective prefill-profile scorers: ``scorers`` merged with ``prefill_scorer_overrides``."""
+        return {**self.scorers, **self.prefill_scorer_overrides}
+
+    @property
+    def decode_scorers(self) -> dict[str, float]:
+        """Effective decode-profile scorers: ``scorers`` merged with ``decode_scorer_overrides``."""
+        return {**self.scorers, **self.decode_scorer_overrides}
+
+    @model_validator(mode="after")
+    def validate_scorers(self):
+        unknown = (
+            set(self.scorers) | set(self.prefill_scorer_overrides) | set(self.decode_scorer_overrides)
+        ) - KNOWN_SCORERS
+        if unknown:
+            raise ValueError(f"Unknown llm-d scorer(s): {sorted(unknown)}. Known scorers: {sorted(KNOWN_SCORERS)}.")
+        return self
+
+
+# Discriminated on ``type`` so the launch path can pick the router backend.
+RouterConfig: TypeAlias = Annotated[VllmRouterConfig | LlmdRouterConfig, Field(discriminator="type")]
 
 
 class BaseInferenceDeploymentConfig(BaseConfig):
@@ -366,6 +432,18 @@ def validate_multi_node_requires_slurm(self):
             raise ValueError("Must use SLURM for multi-node / disaggregated deployment.")
         return self
 
+    @model_validator(mode="after")
+    def validate_llmd_no_routed_experts(self):
+        """Reject routed-expert return with the llm-d router (breaks P/D, unverified for multi-node)."""
+        router = getattr(self.deployment, "router", None)
+        if router is not None and router.type == "llm-d" and self.enable_return_routed_experts:
+            raise ValueError(
+                "The llm-d router backend does not support routed-expert return "
+                "(enable_return_routed_experts): it breaks P/D and is unverified for multi-node. "
+                "Use router type 'vllm-router' for routed-expert runs."
+            )
+        return self
+
     @model_validator(mode="after")
     def auto_setup_kv_cache_offload(self):
         if self.kv_cache_offload is not None:

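The merge and validation rules in `LlmdRouterConfig` can be exercised without pydantic. The sketch below mirrors them with plain functions: `effective_scorers`, `check_scorers`, and `skips_remote_prefill` are hypothetical stand-ins written for illustration, while the scorer names and default values are copied from the diff above.

```python
KNOWN_SCORERS = frozenset(
    {
        "prefix-cache-scorer",
        "precise-prefix-cache-scorer",
        "queue-scorer",
        "kv-cache-utilization-scorer",
        "active-request-scorer",
        "load-aware-scorer",
        "running-requests-size-scorer",
        "token-load-scorer",
        "latency-scorer",
        "session-affinity-scorer",
        "lora-affinity-scorer",
    }
)


def check_scorers(*maps):
    """Reject scorer names outside KNOWN_SCORERS (guards against typos)."""
    unknown = set().union(*maps) - KNOWN_SCORERS
    if unknown:
        raise ValueError(f"Unknown llm-d scorer(s): {sorted(unknown)}")


def effective_scorers(base, overrides):
    """Merge per-profile overrides onto the base map; overrides win."""
    check_scorers(base, overrides)
    return {**base, **overrides}


def skips_remote_prefill(prompt_tokens, cached_tokens, non_cached_tokens=16):
    """True when fewer than ``non_cached_tokens`` prompt tokens are uncached."""
    return (prompt_tokens - cached_tokens) < non_cached_tokens


base = {"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0}
prefill = effective_scorers(
    base, {"queue-scorer": 2.0, "kv-cache-utilization-scorer": 2.0}
)
# A per-profile weight replaces the base weight for the same scorer.
decode = effective_scorers(base, {"prefix-cache-scorer": 1.0})
print(prefill)
print(decode)
```

Because the merge is a plain dict update, an override can change a base weight or add a scorer, but it cannot remove one that the base map already sets.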
diff --git a/packages/prime-rl-configs/src/prime_rl/configs/rl.py b/packages/prime-rl-configs/src/prime_rl/configs/rl.py
@@ -435,6 +435,25 @@ def auto_setup_router_replay(self):
                 )
         return self
 
+    @model_validator(mode="after")
+    def validate_llmd_no_routed_experts(self):
+        """Reject routed-expert return with the llm-d router (breaks P/D, unverified for multi-node).
+
+        Runs after ``auto_setup_router_replay`` so it also catches the
+        ``trainer.enable_router_replay`` path, which sets the inference flag here
+        (after InferenceConfig's own validators, which therefore miss it).
+        """
+        if self.inference is not None and self.inference.enable_return_routed_experts:
+            router = getattr(self.inference.deployment, "router", None)
+            if router is not None and router.type == "llm-d":
+                raise ValueError(
+                    "The llm-d router backend does not support routed-expert return "
+                    "(inference.enable_return_routed_experts / trainer.enable_router_replay): it "
+                    "breaks P/D and is unverified for multi-node. Use router type 'vllm-router' "
+                    "for router-replay runs."
+                )
+        return self
+
     @model_validator(mode="after")
     def validate_router_replay_without_kv_offload(self):
         if (
@@ -600,16 +619,13 @@ def auto_setup_disaggregated_inference(self):
     def auto_setup_inference_client(self):
         """Auto-configure orchestrator student client from the inference server config.
 
-        For all modes, sets dp_rank_count from inference DP size. For SFT mode,
-        also sets base_url - rl/opd rely on the ClientConfig default
+        For SFT mode, sets base_url - rl/opd rely on the ClientConfig default
         (``["http://localhost:8000/v1"]``) which already matches the auto-launched
         student vLLM at inference.server.port = 8000.
         """
         if self.inference is None:
             return self
         client = self.orchestrator.student.client
-        if "dp_rank_count" not in client.model_fields_set:
-            client.dp_rank_count = self.inference.data_parallel_size_local or self.inference.parallel.dp
         if self.orchestrator.training_mode == "sft" and "base_url" not in client.model_fields_set:
             host = self.inference.server.host or "localhost"
             port = self.inference.server.port

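The docstring on the RL-level `validate_llmd_no_routed_experts` hinges on validator ordering: the nested inference config is validated before the RL-level auto-setup flips the flag. The sketch below reproduces that ordering with plain dataclasses. All class and function names are hypothetical stand-ins, and the flat `router_type` field is a simplification of `deployment.router.type`.

```python
from dataclasses import dataclass


@dataclass
class InferenceSketch:
    router_type: str = "vllm-router"
    enable_return_routed_experts: bool = False


@dataclass
class RLConfigSketch:
    inference: InferenceSketch
    enable_router_replay: bool = False  # stands in for trainer.enable_router_replay


def reject_llmd_with_routed_experts(inference):
    """The check itself: llm-d plus routed-expert return is rejected."""
    if inference.router_type == "llm-d" and inference.enable_return_routed_experts:
        raise ValueError("llm-d does not support routed-expert return")


def auto_setup_router_replay(cfg):
    """RL-level auto-setup: router replay turns on the inference flag."""
    if cfg.enable_router_replay:
        cfg.inference.enable_return_routed_experts = True


def validate(cfg):
    # 1. Nested inference config validates first; the flag is still False.
    reject_llmd_with_routed_experts(cfg.inference)
    # 2. RL-level auto-setup flips the flag afterwards.
    auto_setup_router_replay(cfg)
    # 3. Repeating the check at RL level catches the late-set flag.
    reject_llmd_with_routed_experts(cfg.inference)
    return cfg


cfg = RLConfigSketch(InferenceSketch(router_type="llm-d"), enable_router_replay=True)
try:
    validate(cfg)
    outcome = "accepted"
except ValueError as err:
    outcome = f"rejected: {err}"
print(outcome)
```

Without step 3, the config would pass even though it ends up with the unsupported combination, which is the gap the second validator closes.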
diff --git a/packages/prime-rl-configs/src/prime_rl/configs/shared.py b/packages/prime-rl-configs/src/prime_rl/configs/shared.py
@@ -127,9 +127,6 @@ class ClientConfig(BaseConfig):
     skip_model_check: bool = False
     """Skip checking that the model is available in the inference pool. Useful for external APIs or keys that do not expose ``/models``."""
 
-    dp_rank_count: int = Field(1, ge=1)
-    """Number of data-parallel ranks behind each base URL. When > 1, each URL is expanded into ``dp_rank_count`` logical clients pinned via the ``X-data-parallel-rank`` header, so every request within a rollout hits the same DP engine and reuses KV cache. Auto-set from the inference config when using the RL entrypoint."""
-
     admin_base_url: list[str] | None = None
     """Separate base URLs for admin operations (weight updates, health checks). When set, admin clients bypass routers and hit each server directly — used in disaggregated P/D deployments where the router must not handle admin traffic."""
 

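For readers comparing the two client models, here is a rough sketch of what the removed field did versus the new one-endpoint-per-rank layout. Both functions are hypothetical illustrations, not prime-rl code. The header name comes from the removed docstring, while the header value format (the rank index as a string) and the sample URLs are assumptions.

```python
def logical_clients_before(base_urls, dp_rank_count):
    """Old model: each URL fans out into ``dp_rank_count`` pinned clients."""
    return [
        (url, {"X-data-parallel-rank": str(rank)})  # value format is assumed
        for url in base_urls
        for rank in range(dp_rank_count)
    ]


def logical_clients_after(base_urls):
    """New model: one client per URL, no rank-pinning header."""
    return [(url, {}) for url in base_urls]


# Made-up endpoints for illustration.
before = logical_clients_before(["http://node0:8000/v1"], dp_rank_count=2)
after = logical_clients_after(["http://node0:8000/v1", "http://node0:8001/v1"])
print(before)
print(after)
```

In both cases the number of logical clients matches the number of DP ranks. The difference is that rank selection moves from a request header into the endpoint list itself.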
diff --git a/scripts/install_llmd.sh b/scripts/install_llmd.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# Install the llm-d standalone (no-Kubernetes) routing binaries into
+# third_party/llmd/bin:
+#   - epp         : llm-d-router Endpoint Picker (the routing brain)
+#   - pd-sidecar  : decode-side proxy for P/D disaggregation
+#   - envoy       : the data-plane proxy that calls the EPP via ext_proc
+#
+# epp/pd-sidecar are built from a pinned llm-d-router commit. We currently build
+# from a small fork that adds P/D disaggregation for vLLM's token-in
+# /inference/v1/generate endpoint (prime-rl's renderer / TITO rollout path) —
+# upstream's pd-sidecar only disaggregates the OpenAI endpoints, so token-in P/D
+# silently runs decode-only. The fork is pending upstream PR
+# llm-d/llm-d-router#1458; switch LLMD_ROUTER_REPO back to the upstream repo and
+# bump LLMD_ROUTER_REF once it merges. The fork is branched off the upstream
+# commit that added the EPP vllmhttp parser (PR #1248), which the renderer path
+# also needs.
+#
+# System Go is not required: a Go toolchain is vendored under third_party/llmd/go
+# and used to bootstrap; GOTOOLCHAIN=auto fetches the exact version the module
+# pins. Envoy is pulled as a static binary from its release container image
+# (docker is the only widely-available extraction path on bare-metal nodes).
+set -euo pipefail
+
+LLMD_ROUTER_REPO="${LLMD_ROUTER_REPO:-https://github.com/S1ro1/llm-d-router.git}"
+LLMD_ROUTER_REF="${LLMD_ROUTER_REF:-1ca4243ec84c657b4a5f507a1776d6c15a618d5b}"
+GO_BOOTSTRAP_VERSION="${GO_BOOTSTRAP_VERSION:-1.23.4}"
+ENVOY_VERSION="${ENVOY_VERSION:-1.36.0}"
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+PROJECT_DIR=$(cd "$SCRIPT_DIR/.." && pwd)
+LLMD_DIR="$PROJECT_DIR/third_party/llmd"
+BIN_DIR="$LLMD_DIR/bin"
+GO_ROOT="$LLMD_DIR/go"
+SRC_DIR="$LLMD_DIR/src"
+mkdir -p "$BIN_DIR"
+
+# --- Go toolchain (vendored bootstrap; auto-upgrades to the module's version) ---
+if [ ! -x "$GO_ROOT/bin/go" ]; then
+    echo "[install_llmd] downloading bootstrap Go $GO_BOOTSTRAP_VERSION"
+    rm -rf "$GO_ROOT"
+    curl -fsSL "https://go.dev/dl/go${GO_BOOTSTRAP_VERSION}.linux-amd64.tar.gz" | tar -xz -C "$LLMD_DIR"
+fi
+export GOROOT="$GO_ROOT"
+export PATH="$GO_ROOT/bin:$PATH"
+export GOTOOLCHAIN=auto
+echo "[install_llmd] bootstrap $(go version)"
+
+# --- Fetch llm-d-router source at the pinned ref ---
+echo "[install_llmd] fetching ${LLMD_ROUTER_REPO}@${LLMD_ROUTER_REF}"
+if [ ! -d "$SRC_DIR/.git" ]; then
+    rm -rf "$SRC_DIR"
+    git clone --quiet "$LLMD_ROUTER_REPO" "$SRC_DIR"
+fi
+git -C "$SRC_DIR" remote set-url origin "$LLMD_ROUTER_REPO"
+git -C "$SRC_DIR" fetch --quiet origin
+git -C "$SRC_DIR" checkout --quiet "$LLMD_ROUTER_REF"
+
+# --- Build epp + pd-sidecar ---
+echo "[install_llmd] building epp + pd-sidecar"
+( cd "$SRC_DIR" && go build -o "$BIN_DIR/epp" ./cmd/epp && go build -o "$BIN_DIR/pd-sidecar" ./cmd/pd-sidecar )
+
+# --- Envoy static binary (extract from the release image; keep if present) ---
+if [ ! -x "$BIN_DIR/envoy" ]; then
+    echo "[install_llmd] extracting Envoy ${ENVOY_VERSION} from envoyproxy/envoy image"
+    cid=$(docker create "envoyproxy/envoy:v${ENVOY_VERSION}")
+    docker cp "${cid}:/usr/local/bin/envoy" "$BIN_DIR/envoy"
+    docker rm "$cid" >/dev/null
+    chmod +x "$BIN_DIR/envoy"
+fi
+
+echo "[install_llmd] installed binaries in $BIN_DIR:"
+"$BIN_DIR/epp" --version 2>&1 | head -1 || true
+"$BIN_DIR/envoy" --version 2>&1 | head -1 || true
+"$BIN_DIR/pd-sidecar" --help >/dev/null 2>&1 && echo "  pd-sidecar OK" || true
+echo "[install_llmd] done"
diff --git a/skills/install/SKILL.md b/skills/install/SKILL.md
@@ -61,6 +61,16 @@ Flags: `--workspace DIR`, `--deepep-ref REF` (default `73b6ea4`), `--nvshmem-ver
 
 Verify: `uv run python -c 'import deep_ep; print(deep_ep.__file__)'`.
 
+### llm-d router backend
+
+Multi-node / disaggregated deployments can route through the upstream llm-d Endpoint Picker instead of `vllm-router` (set `[...deployment.router] type = "llm-d"`). It needs three native binaries — install once:
+
+```bash
+bash scripts/install_llmd.sh   # builds epp + pd-sidecar from a pinned llm-d-router commit (vendored Go), fetches envoy
+```
+
+Binaries land in `third_party/llmd/bin/{epp,envoy,pd-sidecar}` (a shared path, so SLURM nodes see them). `epp` and `pd-sidecar` are built from a pinned llm-d-router commit that includes the `vllmhttp-parser` (PR #1248), so prime-rl's renderer/TITO `/inference/v1/generate` path routes correctly. Override the source repository with `LLMD_ROUTER_REPO=<url>` and the pin with `LLMD_ROUTER_REF=<sha>`. The EPP, Envoy, and endpoints configs are rendered from `templates/llmd/*.yaml.j2` (included into the SLURM script); only the per-node IPv4 addresses are filled in inline at launch time.
+
 ## Key files
 
 - `pyproject.toml` — dependencies, extras, dependency groups

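Before launching with `type = "llm-d"`, it can help to confirm that all three binaries are present and executable. A minimal sketch of such a check is below. `missing_llmd_binaries` is a hypothetical helper (not part of prime-rl); the binary names and default directory are taken from the section above.

```python
import os
from pathlib import Path

LLMD_BINARIES = ("epp", "envoy", "pd-sidecar")


def missing_llmd_binaries(bin_dir="third_party/llmd/bin"):
    """Return the names of llm-d binaries that are absent or not executable."""
    bin_dir = Path(bin_dir)
    missing = []
    for name in LLMD_BINARIES:
        path = bin_dir / name
        if not path.is_file() or not os.access(path, os.X_OK):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_llmd_binaries()
    if missing:
        print(f"missing llm-d binaries: {missing}; run scripts/install_llmd.sh")
    else:
        print("all llm-d binaries found")
```

Run it from the repository root so the default relative path resolves, or pass an absolute `bin_dir`.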
diff --git a/src/prime_rl/entrypoints/inference.py b/src/prime_rl/entrypoints/inference.py
@@ -47,7 +47,7 @@ def write_slurm_script(config: InferenceConfig, config_path: Path, script_path:
         dp_per_node=dp_per_node,
         num_nodes=getattr(config.deployment, "num_nodes", 1),
         port=config.server.port,
-        disaggregated=is_disaggregated,
+        is_disaggregated=is_disaggregated,
         kv_offload=offload is not None,
         kv_offload_mooncake=is_mooncake,
         kv_offload_cpu_bytes=int(offload.cpu.num_bytes) if is_mooncake else 0,
@@ -65,20 +65,19 @@ def write_slurm_script(config: InferenceConfig, config_path: Path, script_path:
             num_decode_replicas=config.deployment.num_decode_replicas,
             prefill_port=config.deployment.prefill_port,
             decode_port=config.deployment.decode_port,
-            router_port=config.deployment.router.port,
-            router_policy=config.deployment.router.policy,
+            router=config.deployment.router,
             data_parallel_rpc_port=config.data_parallel_rpc_port,
             use_deep_gemm=config.use_deep_gemm,
             prefill_env_overrides=config.deployment.prefill_env_overrides,
             decode_env_overrides=config.deployment.decode_env_overrides,
         )
     elif is_multi_node:
         template_vars.update(
-            router_port=config.deployment.router.port,
+            router=config.deployment.router,
             backend_port=config.deployment.backend_port,
-            router_policy=config.deployment.router.policy,
             data_parallel_rpc_port=config.data_parallel_rpc_port,
             enable_expert_parallel=config.enable_expert_parallel,
+            infer_nodes_per_replica=config.deployment.num_nodes,
         )
 
     script = template.render(**template_vars)