feat: llm-d (EPP+Envoy) router backend #2697

Open
S1ro1 wants to merge 2 commits into main from feat/llm-d-router-v2

S1ro1 (Collaborator) commented Jun 3, 2026

What

Adds llm-d (the upstream llm-d Endpoint Picker + Envoy) as a second router backend alongside vllm-router, selected via [inference.deployment.router] type = "llm-d". Built on the external-LB substrate from #2696 (merged) — the per-rank vLLM engines launch identically; only the router control plane differs.

2 commits:

  1. refactor: drop client-side DP-rank pinning for external-LB — external-LB gives each DP rank its own endpoint (the URL is the rank selector), so the client no longer needs the hybrid-LB X-data-parallel-rank header. Removes dp_rank_count + the per-rank client expansion (one client per base URL; the router balances). Also fixes llm-d: the EPP forwards the header to the dp=1 backend, which rejected it.
  2. feat: llm-d (EPP+Envoy) router backend. Adds LlmdRouterConfig (discriminated-union member: scorers + per-profile prefill_scorer_overrides/decode_scorer_overrides, non_cached_tokens, decode_sidecar_port, known-scorer validator); an llm-d branch in the shared launch_router helper (renders per-replica EPP + Envoy + file-discovery endpoints, launches epp+envoy); the pd-sidecar on decode nodes for P/D; entrypoints pass the router config object to the templates; SLURM cleanup clears stale epp/envoy/pd-sidecar procs; presets under templates/llmd/ + scripts/install_llmd.sh; docs + install skill. Rejects enable_return_routed_experts / trainer.enable_router_replay with llm-d (breaks P/D, unverified for multi-node).
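
The client-side effect of the first commit can be sketched as follows. The header name and `dp_rank_count` come from the PR; the two builder functions, the dict shape of a "client", and the example URL are hypothetical stand-ins, not the orchestrator's actual code.

```python
def build_clients_hybrid_lb(base_urls, dp_rank_count):
    """Removed behaviour: one client per (URL, DP rank), pinned via a header."""
    return [
        {"base_url": url, "headers": {"X-data-parallel-rank": str(rank)}}
        for url in base_urls
        for rank in range(dp_rank_count)
    ]


def build_clients_external_lb(base_urls):
    """New behaviour: one client per base URL and no rank header.

    The router (vllm-router or the llm-d EPP) balances across per-rank endpoints.
    """
    return [{"base_url": url, "headers": {}} for url in base_urls]


router_urls = ["http://router:8000/v1"]  # hypothetical router address
old = build_clients_hybrid_lb(router_urls, dp_rank_count=8)
new = build_clients_external_lb(router_urls)
print(len(old), len(new))  # 8 1
```

Dropping the header also avoids the failure described for llm-d, where the EPP forwarded it to a dp=1 backend that rejected any rank other than its own.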

E2E verification (SLURM)

| Path | Result |
| --- | --- |
| llm-d RL multi-node (Qwen3-0.6B) | EPP+Envoy balanced across 8 per-rank endpoints, completed all steps, 0 errors |
| llm-d RL P/D (Qwen3-30B-A3B) | routed to all 8 prefill + all 8 decode ranks, pd-sidecar + NIXL KV transfer, errored=0, RL trainer finished (incl. weight broadcast) |
| vllm-router (MN + P/D) | regression: still balanced, 0 errors |
| routed-experts rejection | rejected on both InferenceConfig (direct enable_return_routed_experts) and RLConfig (trainer.enable_router_replay); vllm-router + router replay passes |

Decode request counts far exceeding prefill counts (decode ≫ prefill) are intended: short or prefix-cached prompts skip remote prefill (non_cached_tokens).
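
A rough sketch of why the counts diverge, assuming `non_cached_tokens` acts as a threshold on the uncached part of the prompt. The `needs_remote_prefill` function, the `>=` comparison, and the example numbers are assumptions made for illustration; the PR only states that short or prefix-cached prompts skip remote prefill.

```python
def needs_remote_prefill(prompt_tokens: int, cached_prefix_tokens: int,
                         non_cached_tokens: int) -> bool:
    """Hypothetical rule: use a prefill worker only if enough tokens are uncached."""
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached >= non_cached_tokens


# (prompt_tokens, cached_prefix_tokens) for three illustrative requests
requests = [(4096, 0), (4096, 4000), (128, 0)]
THRESHOLD = 256  # illustrative value for non_cached_tokens

decode_count = len(requests)  # every request is served by a decode rank
prefill_count = sum(needs_remote_prefill(p, c, THRESHOLD) for p, c in requests)
print(decode_count, prefill_count)  # 3 1
```

Under this rule only the long, uncached prompt reaches a prefill rank, so decode counts exceed prefill counts without any imbalance in the router.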

Install

bash scripts/install_llmd.sh builds epp/envoy/pd-sidecar into third_party/llmd/bin.

Supersedes #2691 (the pre-external-LB version).


Note

Medium Risk
Touches multi-node/SLURM inference routing, P/D sidecars, and a breaking removal of dp_rank_count; misconfiguration is mostly caught by validators, but operational risk is in new native binaries and routing behavior changes.

Overview
Adds llm-d (EPP + Envoy) as a second inference router backend alongside vllm-router, selected via a discriminated [inference.deployment.router] block (type = "llm-d"). New LlmdRouterConfig exposes scorer weights, P/D prefill/decode overrides, non_cached_tokens, and decode_sidecar_port, with validation for unknown scorers and a hard error when router replay / enable_return_routed_experts is combined with llm-d (on both InferenceConfig and RLConfig).
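
The validation described above might look roughly like this. This is a sketch with plain dataclasses, not the PR's `LlmdRouterConfig`: the class name `LlmdRouterSketch`, the scorer names in `KNOWN_SCORERS`, the name-to-weight dict shape, and the "overrides replace base weights" merge rule are all assumptions.

```python
from dataclasses import dataclass, field

# Assumption: placeholder scorer names; the real allowed set lives in the PR's validator.
KNOWN_SCORERS = {"prefix-cache-scorer", "queue-scorer", "kv-cache-utilization-scorer"}


@dataclass
class LlmdRouterSketch:
    scorers: dict[str, float] = field(default_factory=dict)
    prefill_scorer_overrides: dict[str, float] = field(default_factory=dict)
    decode_scorer_overrides: dict[str, float] = field(default_factory=dict)

    def __post_init__(self):
        # Known-scorer validation: reject any name outside the allowed set.
        for group in (self.scorers, self.prefill_scorer_overrides,
                      self.decode_scorer_overrides):
            unknown = sorted(set(group) - KNOWN_SCORERS)
            if unknown:
                raise ValueError(f"unknown scorers: {unknown}")

    def effective_scorers(self, profile: str) -> dict[str, float]:
        """Base weights with the per-profile overrides applied on top (assumed rule)."""
        overrides = {
            "prefill": self.prefill_scorer_overrides,
            "decode": self.decode_scorer_overrides,
        }[profile]
        return {**self.scorers, **overrides}


def check_routed_experts(router_type: str, enable_return_routed_experts: bool) -> None:
    """Hard error when routed-expert return is combined with llm-d."""
    if router_type == "llm-d" and enable_return_routed_experts:
        raise ValueError("enable_return_routed_experts is not supported with llm-d")


cfg = LlmdRouterSketch(
    scorers={"prefix-cache-scorer": 2.0, "queue-scorer": 1.0},
    decode_scorer_overrides={"queue-scorer": 3.0},
)
print(cfg.effective_scorers("decode"))
```

Validating at construction time means a misspelled scorer fails when the config is loaded rather than after a SLURM job has started.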

SLURM launch is extended: _launch_router.sh.j2 renders per-replica EPP/Envoy/file-discovery configs from templates/llmd/, starts epp and envoy, and on P/D decode nodes starts pd-sidecar; cleanup kills stale llm-d processes. scripts/install_llmd.sh and install docs pin-build epp, pd-sidecar, and Envoy into third_party/llmd/bin.

Breaking client change: removes orchestrator.student.client.dp_rank_count and per-rank client expansion via X-data-parallel-rank—with external-LB, each DP rank is its own URL and the router load-balances across endpoints (documented in CHANGELOG.md).

Entrypoints pass the full router object into templates (replacing separate port/policy vars); inference/RL sbatch templates use is_disaggregated and infer_nodes_per_replica for llm-d endpoint wiring.

Reviewed by Cursor Bugbot for commit 5d0f4ea. Bugbot is set up for automated code reviews on this repo.

S1ro1 marked this pull request as ready for review June 3, 2026 03:01
Comment thread src/prime_rl/templates/_launch_router.sh.j2
Comment thread packages/prime-rl-configs/src/prime_rl/configs/shared.py
S1ro1 force-pushed the feat/llm-d-router-v2 branch from 0701783 to 1f39804 June 3, 2026 03:03
Comment thread src/prime_rl/templates/multi_node_rl.sbatch.j2
S1ro1 force-pushed the feat/llm-d-router-v2 branch from 1f39804 to 74c911a June 3, 2026 03:13
Comment thread packages/prime-rl-configs/src/prime_rl/configs/rl.py
S1ro1 force-pushed the feat/llm-d-router-v2 branch from 74c911a to 44e47f9 June 3, 2026 08:33
S1ro1 and others added 2 commits June 3, 2026 14:46
With external-LB data parallelism each DP rank is its own API server on its
own port (the URL is the rank selector), so the client no longer needs the
hybrid-LB `X-data-parallel-rank` header to pin a rollout to an internal DP
shard. Remove the `dp_rank_count` client field + its auto-setup and the
per-rank client expansion: one client per base URL, no rank header. The
router (vllm-router or llm-d EPP) balances across the per-rank endpoints.

This also fixes llm-d routing: the EPP forwards the header to the dp=1
backend, which rejected it ("data_parallel_rank N is out of range").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add llm-d as a second router backend alongside vllm-router, selected via
`[inference.deployment.router] type = "llm-d"`. Built on the external-LB
launch substrate — the per-rank vLLM engines launch identically; only the
router control plane differs.

- `LlmdRouterConfig`: discriminated-union member with `scorers` (base) +
  per-profile `prefill_scorer_overrides`/`decode_scorer_overrides`,
  `non_cached_tokens`, `decode_sidecar_port`, and a known-scorer validator.
- `launch_router` helper gains an llm-d branch: renders per-replica EPP +
  Envoy + file-discovery endpoints and launches `epp` + `envoy` instead of
  vllm-router. Call sites stay router-agnostic.
- pd-sidecar on each decode node (P/D) for remote-prefill orchestration + NIXL.
- Entrypoints pass the `router` config object to the templates.
- Reject `enable_return_routed_experts` with llm-d (breaks P/D, unverified
  for multi-node).
- SLURM cleanup also clears stale `epp`/`envoy`/`pd-sidecar` processes.
- Presets under `templates/llmd/` + `scripts/install_llmd.sh`; docs + install skill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
S1ro1 force-pushed the feat/llm-d-router-v2 branch from 44e47f9 to 5d0f4ea June 3, 2026 09:17

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


# Start the router on node 0 (balances across per-rank endpoints — no intra-node DP header)
if [ "$INFER_NODE_RANK" -eq 0 ]; then
launch_router regular "$ROUTER_ARGS" "$ROUTER_PORT" "{{ router_policy }}" "$OUTPUT_DIR/logs/inference/router.log"
Decode sidecar missing extra nodes

Medium Severity

With llm-d P/D, file-discovery lists a decode endpoint per DP rank on every decode node (sidecar base port + rank). pd-sidecar is only started when ROLE_RANK is 0, so when num_decode_nodes exceeds num_decode_replicas (e.g. two decode nodes, one replica), non-head decode nodes never run a sidecar while the EPP still routes traffic there.
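
The mismatch Bugbot describes can be reproduced with a small model. It assumes that `ROLE_RANK` is the node's index within its replica (node index modulo nodes per replica); that mapping and the `nodes_missing_sidecar` helper are assumptions for illustration, not taken from the templates.

```python
def nodes_missing_sidecar(num_decode_nodes: int, num_decode_replicas: int) -> list[int]:
    """Decode nodes that are listed in file-discovery but never start pd-sidecar."""
    nodes_per_replica = num_decode_nodes // num_decode_replicas
    # File-discovery lists sidecar endpoints on every decode node.
    listed = set(range(num_decode_nodes))
    # Assumed: ROLE_RANK == node index within its replica; sidecar starts only at 0.
    started = {node for node in listed if node % nodes_per_replica == 0}
    return sorted(listed - started)


print(nodes_missing_sidecar(2, 1))  # [1]  two decode nodes, one replica
print(nodes_missing_sidecar(2, 2))  # []   one node per replica: every node is a head
```

Under this model, starting the sidecar on every decode node, or listing only head-node endpoints in file-discovery, would both make the returned list empty.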
