1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

Documenting **breaking** configuration changes — renamed, removed, or moved fields that require users to update existing configs.

- **`client.dp_rank_count` removed**: With external-LB data parallelism each DP rank is its own endpoint (one `client.base_url` per rank), so the client no longer pins a rollout to an internal DP shard via the `X-data-parallel-rank` header. The `[orchestrator.student.client] dp_rank_count` field (and its per-rank client expansion) is gone — the router load-balances across the per-rank endpoints. Existing configs setting `dp_rank_count` must drop it (`extra="forbid"` rejects it); there is nothing to migrate. (2026-06-03)
- **`inference.kv_cache_offload.cpu_bytes` removed → discriminated `type` config**: The flat `[inference.kv_cache_offload]` block with a single `cpu_bytes` field is replaced by a backend-discriminated union with composable `cpu`/`disk` tiers. Migrate native CPU offload from `[inference.kv_cache_offload]\ncpu_bytes = N` to `[inference.kv_cache_offload]\ntype = "native"` plus `[inference.kv_cache_offload.cpu]\nnum_bytes = N`. A `type = "mooncake"` backend (per-node distributed store; multi-node/SLURM only) and an optional `[inference.kv_cache_offload.disk]\npath = "..."` tier (layered behind cpu) are also available. `extra="forbid"` rejects the old `cpu_bytes` key, so existing configs must migrate. (2026-06-02)
- **Orchestrator async-pipeline rewrite** (collection of removals/renames). The orchestrator was rewritten to overlap train/eval rollouts on a shared concurrency limiter; several config fields were removed or renamed.
- **`orchestrator.seed` removed**: was only consumed by the deleted buffer; no replacement.
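
The `kv_cache_offload` migration described above can be sketched as a small dictionary rewrite. This is a minimal illustration, not part of prime-rl: `migrate_kv_cache_offload` is a hypothetical helper, and it assumes the block has already been parsed from TOML into a plain `dict`.

```python
from typing import Any


def migrate_kv_cache_offload(old: dict[str, Any]) -> dict[str, Any]:
    """Convert a legacy flat block ({"cpu_bytes": N}) to the discriminated layout.

    Hypothetical helper for illustration only; prime-rl does not ship it.
    """
    if "cpu_bytes" not in old:
        # Already in the new layout (or empty): return an untouched copy.
        return dict(old)
    migrated = {key: value for key, value in old.items() if key != "cpu_bytes"}
    migrated["type"] = "native"  # native CPU offload backend
    migrated["cpu"] = {"num_bytes": old["cpu_bytes"]}  # value moves to the cpu tier
    return migrated


legacy = {"cpu_bytes": 8 * 1024**3}
print(migrate_kv_cache_offload(legacy))
# {'type': 'native', 'cpu': {'num_bytes': 8589934592}}
```

The result corresponds to `[inference.kv_cache_offload] type = "native"` plus `[inference.kv_cache_offload.cpu] num_bytes = N` in the config file.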
21 changes: 17 additions & 4 deletions docs/inference.md
@@ -30,7 +30,7 @@ We support 3 distinct deployment shapes:

Most features are supported for all deployment shapes, with a few exceptions. Unsupported combinations are rejected at validation time.

You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This is a config-field that allows you to set the deployment shape, deployment-specific knobs such as `num_nodes`, `num_replicas`, `router_port`, `backend_port`, etc.
You can select the deployment shape with `InferenceDeploymentConfig` in your config file. This config field sets the deployment shape together with deployment-specific knobs such as `num_nodes`, `num_replicas`, and `backend_port`, plus a `[...deployment.router]` block.

```toml
[inference.deployment]
@@ -102,7 +102,7 @@ tp = 2
dp = 4
```

This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing will be handled by the `vllm-router` instance running on the same node as the 1st replica. We aim to support more advanced routing options, such as `llm-d` or `dynamo` in the future. You can read more about the supported routing options in the [router](#router) section.
This configuration will run 2 independent vLLM replicas, each with `tp=2` and `dp=4`. Routing is handled by a router instance running on the same node as the first replica: either `vllm-router` (default) or the upstream `llm-d` EPP+Envoy, selected via the `[...deployment.router]` block. You can read more about the supported routing options in the [router](#router) section.
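
As a rough sizing check for the example above, the GPU footprint can be estimated from the replica count and parallelism degrees. This sketch assumes one GPU per (tensor-parallel, data-parallel) rank and no pipeline parallelism; the function names are illustrative and not part of prime-rl.

```python
def gpus_per_replica(tp: int, dp: int) -> int:
    """GPUs used by one replica, assuming one GPU per (tp, dp) rank."""
    return tp * dp


def total_gpus(num_replicas: int, tp: int, dp: int) -> int:
    """GPUs used by all replicas together."""
    return num_replicas * gpus_per_replica(tp, dp)


# 2 replicas, each with tp=2 and dp=4
print(gpus_per_replica(tp=2, dp=4))              # 8
print(total_gpus(num_replicas=2, tp=2, dp=4))    # 16
```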

### Wide-EP

@@ -167,9 +167,22 @@ This will run 3 inference replicas, each running on 6 nodes. Each replica will r

## Router

We use our own fork of [vllm-router](https://github.com/PrimeIntellect-ai/router) as the request handler. We plan to support more advanced proxy options in the future.
Multi-node and disaggregated deployments front their vLLM backends with a router, configured via a discriminated `[...deployment.router]` block (`type = "vllm-router" | "llm-d"`):

Right now, router handles 2 most important things:
```toml
[inference.deployment.router] # or [deployment.router] for the standalone inference entrypoint
type = "llm-d" # "vllm-router" (default) or "llm-d"
# llm-d-only knobs (all optional):
scorers = { "prefix-cache-scorer" = 3.0, "active-request-scorer" = 2.0 } # base, applied to every profile
prefill_scorer_overrides = { "queue-scorer" = 2.0, "kv-cache-utilization-scorer" = 2.0 } # merged onto the P/D prefill profile
decode_scorer_overrides = {} # merged onto the P/D decode profile
non_cached_tokens = 16 # below this many non-cached prompt tokens, skip remote prefill (P/D)
```

- **`vllm-router`** (default) — our fork of [vllm-router](https://github.com/PrimeIntellect-ai/router). Knob: `policy`.
- **`llm-d`** — the upstream [llm-d](https://llm-d.ai) Endpoint Picker (EPP) + Envoy proxy. Routing combines **prefix-cache affinity** (grouped rollouts reuse a cached prefix and skip prefill) with the **`active-request-scorer`** — an in-flight load balancer that spreads requests across ranks immediately, unlike the metrics-scraped `queue-scorer` / `kv-cache-utilization-scorer` / `load-aware-scorer` (which lag and concentrate bursts of same-prefix requests). The scorer weights follow the upstream llm-d P/D guide; tune via `scorers` (base) + `prefill_scorer_overrides` / `decode_scorer_overrides` (per-profile, P/D). Does not support `enable_return_routed_experts` (router replay).

Both backends handle the two core responsibilities:
- Request routing - KV cache re-use and balanced routing
- P/D disaggregation - handling the prefill and decode stages separately
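
The way the llm-d scorer knobs combine can be sketched in a few lines. The merge mirrors the `prefill_scorers` / `decode_scorers` properties in this PR (a per-profile weight replaces the base weight for the same scorer). The helper names `effective_scorers` and `skips_remote_prefill` are illustrative, and the threshold check is an assumption about how the router applies `non_cached_tokens`, based only on its description.

```python
# Defaults copied from the llm-d router config in this PR.
BASE_SCORERS = {"prefix-cache-scorer": 3.0, "active-request-scorer": 2.0}
PREFILL_OVERRIDES = {"queue-scorer": 2.0, "kv-cache-utilization-scorer": 2.0}
DECODE_OVERRIDES: dict[str, float] = {}


def effective_scorers(base: dict[str, float], overrides: dict[str, float]) -> dict[str, float]:
    """Merge per-profile overrides onto the base map; overrides win on conflicts."""
    return {**base, **overrides}


def skips_remote_prefill(non_cached_prompt_tokens: int, threshold: int = 16) -> bool:
    """Assumed semantics: fewer non-cached tokens than the threshold means decode-only."""
    return non_cached_prompt_tokens < threshold


print(effective_scorers(BASE_SCORERS, PREFILL_OVERRIDES))  # prefill profile: four scorers
print(effective_scorers(BASE_SCORERS, DECODE_OVERRIDES))   # decode profile: base only
print(skips_remote_prefill(8), skips_remote_prefill(64))   # True False
```

With the defaults, the prefill profile scores on all four plugins while the decode profile keeps only prefix-cache affinity and in-flight load balancing.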

86 changes: 82 additions & 4 deletions packages/prime-rl-configs/src/prime_rl/configs/inference.py
@@ -167,8 +167,26 @@ def to_connector_dict(self) -> dict[str, Any]:
]


# Known llm-d EPP scorer plugins (used to guard the ``scorers`` map against typos).
KNOWN_SCORERS = frozenset(
{
"prefix-cache-scorer",
"precise-prefix-cache-scorer",
"queue-scorer",
"kv-cache-utilization-scorer",
"active-request-scorer",
"load-aware-scorer",
"running-requests-size-scorer",
"token-load-scorer",
"latency-scorer",
"session-affinity-scorer",
"lora-affinity-scorer",
}
)


class VllmRouterConfig(BaseConfig):
"""PrimeIntellect vllm-router fronting the per-rank (external-LB) endpoints."""
"""PrimeIntellect vllm-router."""

type: Literal["vllm-router"] = "vllm-router"

@@ -179,9 +197,57 @@ class VllmRouterConfig(BaseConfig):
"""Routing policy, e.g. ``consistent_hash`` or ``round_robin``."""


# Discriminated on ``type`` so additional router backends can be added to the
# union (a single member needs no discriminator yet).
RouterConfig: TypeAlias = VllmRouterConfig
class LlmdRouterConfig(BaseConfig):
"""llm-d router backend (EPP + Envoy)."""

type: Literal["llm-d"] = "llm-d"

port: int = 8000
"""Port the Envoy gateway listens on — becomes the client-facing router URL."""

scorers: dict[str, float] = {
"prefix-cache-scorer": 3.0,
"active-request-scorer": 2.0,
}
"""EPP scorer name → weight, applied to every routing profile (before the per-profile P/D overrides). Defaults to prefix-cache affinity plus in-flight (active-request) load balancing. Unknown scorer names are rejected."""

prefill_scorer_overrides: dict[str, float] = {
"queue-scorer": 2.0,
"kv-cache-utilization-scorer": 2.0,
}
"""P/D only: scorer → weight merged onto ``scorers`` for the prefill profile (a per-profile weight overrides the base)."""

decode_scorer_overrides: dict[str, float] = {}
"""P/D only: scorer → weight merged onto ``scorers`` for the decode profile (a per-profile weight overrides the base); empty by default."""

non_cached_tokens: int = 16
"""P/D only: requests with fewer than this many non-cached prompt tokens skip remote prefill and run decode-only."""

decode_sidecar_port: int = 8300
"""P/D only: port the decode-side llm-d sidecar listens on."""

@property
def prefill_scorers(self) -> dict[str, float]:
"""Effective prefill-profile scorers: ``scorers`` merged with ``prefill_scorer_overrides``."""
return {**self.scorers, **self.prefill_scorer_overrides}

@property
def decode_scorers(self) -> dict[str, float]:
"""Effective decode-profile scorers: ``scorers`` merged with ``decode_scorer_overrides``."""
return {**self.scorers, **self.decode_scorer_overrides}

@model_validator(mode="after")
def validate_scorers(self):
unknown = (
set(self.scorers) | set(self.prefill_scorer_overrides) | set(self.decode_scorer_overrides)
) - KNOWN_SCORERS
if unknown:
raise ValueError(f"Unknown llm-d scorer(s): {sorted(unknown)}. Known scorers: {sorted(KNOWN_SCORERS)}.")
return self


# Discriminated on ``type`` so the launch path can pick the router backend.
RouterConfig: TypeAlias = Annotated[VllmRouterConfig | LlmdRouterConfig, Field(discriminator="type")]


class BaseInferenceDeploymentConfig(BaseConfig):
@@ -366,6 +432,18 @@ def validate_multi_node_requires_slurm(self):
raise ValueError("Must use SLURM for multi-node / disaggregated deployment.")
return self

@model_validator(mode="after")
def validate_llmd_no_routed_experts(self):
"""Reject routed-expert return with the llm-d router (breaks P/D, unverified for multi-node)."""
router = getattr(self.deployment, "router", None)
if router is not None and router.type == "llm-d" and self.enable_return_routed_experts:
raise ValueError(
"The llm-d router backend does not support routed-expert return "
"(enable_return_routed_experts): it breaks P/D and is unverified for multi-node. "
"Use router type 'vllm-router' for routed-expert runs."
)
return self

@model_validator(mode="after")
def auto_setup_kv_cache_offload(self):
if self.kv_cache_offload is not None:
24 changes: 20 additions & 4 deletions packages/prime-rl-configs/src/prime_rl/configs/rl.py
@@ -435,6 +435,25 @@ def auto_setup_router_replay(self):
)
return self

@model_validator(mode="after")
def validate_llmd_no_routed_experts(self):
"""Reject routed-expert return with the llm-d router (breaks P/D, unverified for multi-node).

Runs after ``auto_setup_router_replay`` so it also catches the
``trainer.enable_router_replay`` path, which sets the inference flag here
(after InferenceConfig's own validators, which therefore miss it).
"""
if self.inference is not None and self.inference.enable_return_routed_experts:
router = getattr(self.inference.deployment, "router", None)
if router is not None and router.type == "llm-d":
raise ValueError(
"The llm-d router backend does not support routed-expert return "
"(inference.enable_return_routed_experts / trainer.enable_router_replay): it "
"breaks P/D and is unverified for multi-node. Use router type 'vllm-router' "
"for router-replay runs."
)
return self

@model_validator(mode="after")
def validate_router_replay_without_kv_offload(self):
if (
@@ -600,16 +619,13 @@ def auto_setup_disaggregated_inference(self):
def auto_setup_inference_client(self):
"""Auto-configure orchestrator student client from the inference server config.

For all modes, sets dp_rank_count from inference DP size. For SFT mode,
also sets base_url - rl/opd rely on the ClientConfig default
For SFT mode, sets base_url - rl/opd rely on the ClientConfig default
(``["http://localhost:8000/v1"]``) which already matches the auto-launched
student vLLM at inference.server.port = 8000.
"""
if self.inference is None:
return self
client = self.orchestrator.student.client
if "dp_rank_count" not in client.model_fields_set:
client.dp_rank_count = self.inference.data_parallel_size_local or self.inference.parallel.dp
if self.orchestrator.training_mode == "sft" and "base_url" not in client.model_fields_set:
host = self.inference.server.host or "localhost"
port = self.inference.server.port
3 changes: 0 additions & 3 deletions packages/prime-rl-configs/src/prime_rl/configs/shared.py
@@ -127,9 +127,6 @@ class ClientConfig(BaseConfig):
skip_model_check: bool = False
"""Skip checking that the model is available in the inference pool. Useful for external APIs or keys that do not expose ``/models``."""

dp_rank_count: int = Field(1, ge=1)
"""Number of data-parallel ranks behind each base URL. When > 1, each URL is expanded into ``dp_rank_count`` logical clients pinned via the ``X-data-parallel-rank`` header, so every request within a rollout hits the same DP engine and reuses KV cache. Auto-set from the inference config when using the RL entrypoint."""

admin_base_url: list[str] | None = None
"""Separate base URLs for admin operations (weight updates, health checks). When set, admin clients bypass routers and hit each server directly — used in disaggregated P/D deployments where the router must not handle admin traffic."""

75 changes: 75 additions & 0 deletions scripts/install_llmd.sh
@@ -0,0 +1,75 @@
#!/usr/bin/env bash
# Install the llm-d standalone (no-Kubernetes) routing binaries into
# third_party/llmd/bin:
# - epp : llm-d-router Endpoint Picker (the routing brain)
# - pd-sidecar : decode-side proxy for P/D disaggregation
# - envoy : the data-plane proxy that calls the EPP via ext_proc
#
# epp/pd-sidecar are built from a pinned llm-d-router commit. We currently build
# from a small fork that adds P/D disaggregation for vLLM's token-in
# /inference/v1/generate endpoint (prime-rl's renderer / TITO rollout path) —
# upstream's pd-sidecar only disaggregates the OpenAI endpoints, so token-in P/D
# silently runs decode-only. The fork is pending upstream PR
# llm-d/llm-d-router#1458; switch LLMD_ROUTER_REPO back to the upstream repo and
# bump LLMD_ROUTER_REF once it merges. The fork is branched off the upstream
# commit that added the EPP vllmhttp parser (PR #1248), which the renderer path
# also needs.
#
# System Go is not required: a Go toolchain is vendored under third_party/llmd/go
# and used to bootstrap; GOTOOLCHAIN=auto fetches the exact version the module
# pins. Envoy is pulled as a static binary from its release container image
# (docker is the only widely-available extraction path on bare-metal nodes).
set -euo pipefail

LLMD_ROUTER_REPO="${LLMD_ROUTER_REPO:-https://github.com/S1ro1/llm-d-router.git}"
LLMD_ROUTER_REF="${LLMD_ROUTER_REF:-1ca4243ec84c657b4a5f507a1776d6c15a618d5b}"
GO_BOOTSTRAP_VERSION="${GO_BOOTSTRAP_VERSION:-1.23.4}"
ENVOY_VERSION="${ENVOY_VERSION:-1.36.0}"

SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
PROJECT_DIR=$(cd "$SCRIPT_DIR/.." && pwd)
LLMD_DIR="$PROJECT_DIR/third_party/llmd"
BIN_DIR="$LLMD_DIR/bin"
GO_ROOT="$LLMD_DIR/go"
SRC_DIR="$LLMD_DIR/src"
mkdir -p "$BIN_DIR"

# --- Go toolchain (vendored bootstrap; auto-upgrades to the module's version) ---
if [ ! -x "$GO_ROOT/bin/go" ]; then
echo "[install_llmd] downloading bootstrap Go $GO_BOOTSTRAP_VERSION"
rm -rf "$GO_ROOT"
curl -fsSL "https://go.dev/dl/go${GO_BOOTSTRAP_VERSION}.linux-amd64.tar.gz" | tar -xz -C "$LLMD_DIR"
fi
export GOROOT="$GO_ROOT"
export PATH="$GO_ROOT/bin:$PATH"
export GOTOOLCHAIN=auto
echo "[install_llmd] bootstrap $(go version)"

# --- Fetch llm-d-router source at the pinned ref ---
echo "[install_llmd] fetching ${LLMD_ROUTER_REPO}@${LLMD_ROUTER_REF}"
if [ ! -d "$SRC_DIR/.git" ]; then
rm -rf "$SRC_DIR"
git clone --quiet "$LLMD_ROUTER_REPO" "$SRC_DIR"
fi
git -C "$SRC_DIR" remote set-url origin "$LLMD_ROUTER_REPO"
git -C "$SRC_DIR" fetch --quiet origin
git -C "$SRC_DIR" checkout --quiet "$LLMD_ROUTER_REF"

# --- Build epp + pd-sidecar ---
echo "[install_llmd] building epp + pd-sidecar"
( cd "$SRC_DIR" && go build -o "$BIN_DIR/epp" ./cmd/epp && go build -o "$BIN_DIR/pd-sidecar" ./cmd/pd-sidecar )

# --- Envoy static binary (extract from the release image; keep if present) ---
if [ ! -x "$BIN_DIR/envoy" ]; then
echo "[install_llmd] extracting Envoy ${ENVOY_VERSION} from envoyproxy/envoy image"
cid=$(docker create "envoyproxy/envoy:v${ENVOY_VERSION}")
docker cp "${cid}:/usr/local/bin/envoy" "$BIN_DIR/envoy"
docker rm "$cid" >/dev/null
chmod +x "$BIN_DIR/envoy"
fi

echo "[install_llmd] installed binaries in $BIN_DIR:"
"$BIN_DIR/epp" --version 2>&1 | head -1 || true
"$BIN_DIR/envoy" --version 2>&1 | head -1 || true
"$BIN_DIR/pd-sidecar" --help >/dev/null 2>&1 && echo " pd-sidecar OK" || true
echo "[install_llmd] done"
10 changes: 10 additions & 0 deletions skills/install/SKILL.md
@@ -61,6 +61,16 @@ Flags: `--workspace DIR`, `--deepep-ref REF` (default `73b6ea4`), `--nvshmem-ver

Verify: `uv run python -c 'import deep_ep; print(deep_ep.__file__)'`.

### llm-d router backend

Multi-node / disaggregated deployments can route through the upstream llm-d Endpoint Picker instead of `vllm-router` (set `[...deployment.router] type = "llm-d"`). It needs three native binaries — install once:

```bash
bash scripts/install_llmd.sh # builds epp + pd-sidecar from a pinned llm-d-router commit (vendored Go), fetches envoy
```

Binaries land in `third_party/llmd/bin/{epp,envoy,pd-sidecar}` (a shared path, so SLURM nodes see them). `epp` is pinned to the commit that includes the `vllmhttp-parser` (PR #1248) so prime-rl's renderer/TITO `/inference/v1/generate` path routes correctly. Override the pin with `LLMD_ROUTER_REF=<sha>`. The EPP + Envoy + endpoints configs are rendered from `templates/llmd/*.yaml.j2` (included into the SLURM script); only the per-node IPv4 addresses are filled in inline at launch time.
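
Before launching with `type = "llm-d"`, it can be useful to confirm that all three binaries are in place. The following is a minimal sketch, not part of prime-rl: `missing_llmd_binaries` is a hypothetical helper, and the relative path assumes the check runs from the repository root.

```python
import os
from pathlib import Path

# Binaries that scripts/install_llmd.sh is expected to produce.
REQUIRED_BINARIES = ("epp", "envoy", "pd-sidecar")


def missing_llmd_binaries(bin_dir: Path) -> list[str]:
    """Return the required binaries that are absent or not executable in bin_dir."""
    return [
        name
        for name in REQUIRED_BINARIES
        if not (bin_dir / name).is_file() or not os.access(bin_dir / name, os.X_OK)
    ]


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    missing = missing_llmd_binaries(Path("third_party/llmd/bin"))
    if missing:
        print(f"missing llm-d binaries: {missing}; run scripts/install_llmd.sh")
    else:
        print("all llm-d binaries present")
```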

## Key files

- `pyproject.toml` — dependencies, extras, dependency groups
9 changes: 4 additions & 5 deletions src/prime_rl/entrypoints/inference.py
@@ -47,7 +47,7 @@ def write_slurm_script(config: InferenceConfig, config_path: Path, script_path:
dp_per_node=dp_per_node,
num_nodes=getattr(config.deployment, "num_nodes", 1),
port=config.server.port,
disaggregated=is_disaggregated,
is_disaggregated=is_disaggregated,
kv_offload=offload is not None,
kv_offload_mooncake=is_mooncake,
kv_offload_cpu_bytes=int(offload.cpu.num_bytes) if is_mooncake else 0,
@@ -65,20 +65,19 @@ def write_slurm_script(config: InferenceConfig, config_path: Path, script_path:
num_decode_replicas=config.deployment.num_decode_replicas,
prefill_port=config.deployment.prefill_port,
decode_port=config.deployment.decode_port,
router_port=config.deployment.router.port,
router_policy=config.deployment.router.policy,
router=config.deployment.router,
data_parallel_rpc_port=config.data_parallel_rpc_port,
use_deep_gemm=config.use_deep_gemm,
prefill_env_overrides=config.deployment.prefill_env_overrides,
decode_env_overrides=config.deployment.decode_env_overrides,
)
elif is_multi_node:
template_vars.update(
router_port=config.deployment.router.port,
router=config.deployment.router,
backend_port=config.deployment.backend_port,
router_policy=config.deployment.router.policy,
data_parallel_rpc_port=config.data_parallel_rpc_port,
enable_expert_parallel=config.enable_expert_parallel,
infer_nodes_per_replica=config.deployment.num_nodes,
)

script = template.render(**template_vars)