feat: llm-d (EPP+Envoy) router backend #2697
Open
S1ro1 wants to merge 2 commits into
With external-LB data parallelism each DP rank is its own API server on its
own port (the URL is the rank selector), so the client no longer needs the
hybrid-LB `X-data-parallel-rank` header to pin a rollout to an internal DP
shard. Remove the `dp_rank_count` client field + its auto-setup and the
per-rank client expansion: one client per base URL, no rank header. The
router (vllm-router or llm-d EPP) balances across the per-rank endpoints.
This also fixes llm-d routing: the EPP forwards the header to the dp=1
backend, which rejected it ("data_parallel_rank N is out of range").
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
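A minimal sketch of the client-side change described in this commit message, to make the before/after concrete. `ClientSpec` and both function names are hypothetical and do not come from the PR; only the `X-data-parallel-rank` header and the `dp_rank_count` field are taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ClientSpec:
    """Hypothetical description of one inference client (not the PR's type)."""
    base_url: str
    headers: dict = field(default_factory=dict)


def expand_clients_hybrid_lb(base_urls, dp_rank_count):
    """Removed behaviour: one client per (URL, rank), pinned via a header."""
    return [
        ClientSpec(url, {"X-data-parallel-rank": str(rank)})
        for url in base_urls
        for rank in range(dp_rank_count)
    ]


def build_clients_external_lb(base_urls):
    """New behaviour: one client per base URL and no rank header.

    With external-LB the URL already selects the rank, and the router
    balances across the per-rank endpoints.
    """
    return [ClientSpec(url) for url in base_urls]
```

Under these assumptions, a dp=1 backend behind the router never sees a rank header, which is the condition that previously triggered the "out of range" rejection.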
Add llm-d as a second router backend alongside vllm-router, selected via `[inference.deployment.router] type = "llm-d"`. Built on the external-LB launch substrate: the per-rank vLLM engines launch identically; only the router control plane differs.

- `LlmdRouterConfig`: discriminated-union member with `scorers` (base) + per-profile `prefill_scorer_overrides`/`decode_scorer_overrides`, `non_cached_tokens`, `decode_sidecar_port`, and a known-scorer validator.
- `launch_router` helper gains an llm-d branch: renders per-replica EPP + Envoy + file-discovery endpoints and launches `epp` + `envoy` instead of vllm-router. Call sites stay router-agnostic.
- pd-sidecar on each decode node (P/D) for remote-prefill orchestration + NIXL.
- Entrypoints pass the `router` config object to the templates.
- Reject `enable_return_routed_experts` with llm-d (breaks P/D, unverified for multi-node).
- SLURM cleanup also clears stale `epp`/`envoy`/`pd-sidecar` processes.
- Presets under `templates/llmd/` + `scripts/install_llmd.sh`; docs + install skill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
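The validation rules listed in this commit message can be sketched roughly as follows. This is an illustration using plain dataclasses, not the PR's `LlmdRouterConfig`: the class name `LlmdRouterSketch`, the placeholder scorer names in `KNOWN_SCORERS`, the helper `check_router_replay`, and the rule that per-profile overrides are layered on top of the base `scorers` are all assumptions.

```python
from dataclasses import dataclass, field

# Placeholder names only; the PR's actual known-scorer list is not shown here.
KNOWN_SCORERS = {"scorer-a", "scorer-b"}


@dataclass
class LlmdRouterSketch:
    """Hypothetical stand-in for the llm-d member of the router config union."""
    type: str = "llm-d"
    scorers: dict = field(default_factory=dict)  # scorer name -> weight
    prefill_scorer_overrides: dict = field(default_factory=dict)
    decode_scorer_overrides: dict = field(default_factory=dict)

    def __post_init__(self):
        # Known-scorer validation: reject names outside the allowed set.
        groups = (self.scorers, self.prefill_scorer_overrides,
                  self.decode_scorer_overrides)
        for group in groups:
            unknown = sorted(set(group) - KNOWN_SCORERS)
            if unknown:
                raise ValueError(f"unknown scorer(s): {unknown}")

    def effective_scorers(self, profile):
        """Assumed merge rule: profile overrides replace base weights."""
        if profile == "prefill":
            overrides = self.prefill_scorer_overrides
        else:
            overrides = self.decode_scorer_overrides
        return {**self.scorers, **overrides}


def check_router_replay(router_type, enable_return_routed_experts):
    """Reject routed-expert return when the llm-d backend is selected."""
    if router_type == "llm-d" and enable_return_routed_experts:
        raise ValueError(
            "enable_return_routed_experts is not supported with llm-d")
```

Failing at config-construction time keeps a bad scorer name or an unsupported flag combination from reaching the SLURM launch step.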
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5d0f4ea. Configure here.
# Start the router on node 0 (balances across per-rank endpoints — no intra-node DP header)
if [ "$INFER_NODE_RANK" -eq 0 ]; then
    launch_router regular "$ROUTER_ARGS" "$ROUTER_PORT" "{{ router_policy }}" "$OUTPUT_DIR/logs/inference/router.log"
Decode sidecar missing extra nodes
Medium Severity
With llm-d P/D, file-discovery lists a decode endpoint per DP rank on every decode node (sidecar base port + rank). pd-sidecar is only started when ROLE_RANK is 0, so when num_decode_nodes exceeds num_decode_replicas (e.g. two decode nodes, one replica), non-head decode nodes never run a sidecar while the EPP still routes traffic there.
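The mismatch reported above can be reproduced with a small model. This sketch assumes that `ROLE_RANK` is a node's index within its replica, that decode nodes divide evenly across replicas, and that the sidecar base port is 9000; all function names and the port are hypothetical and not taken from the templates.

```python
def decode_endpoints(num_decode_nodes, dp_ranks_per_node, sidecar_base_port):
    """(node, port) pairs listed by file-discovery: one per DP rank per node."""
    return [
        (node, sidecar_base_port + rank)
        for node in range(num_decode_nodes)
        for rank in range(dp_ranks_per_node)
    ]


def sidecar_nodes(num_decode_nodes, num_decode_replicas):
    """Nodes that start pd-sidecar when the guard is ROLE_RANK == 0."""
    nodes_per_replica = num_decode_nodes // num_decode_replicas
    return {
        node for node in range(num_decode_nodes)
        if node % nodes_per_replica == 0  # assumed meaning of ROLE_RANK == 0
    }


def endpoints_without_sidecar(num_decode_nodes, num_decode_replicas,
                              dp_ranks_per_node, sidecar_base_port=9000):
    """Endpoints the EPP can route to although no sidecar is listening."""
    served = sidecar_nodes(num_decode_nodes, num_decode_replicas)
    return [
        endpoint
        for endpoint in decode_endpoints(
            num_decode_nodes, dp_ranks_per_node, sidecar_base_port)
        if endpoint[0] not in served
    ]
```

With two decode nodes, one replica, and two DP ranks per node, `endpoints_without_sidecar(2, 1, 2)` returns the two endpoints on node 1. With one node per replica the list is empty, which matches the report that the issue appears only when `num_decode_nodes` exceeds `num_decode_replicas`.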


What
Adds llm-d (the upstream llm-d Endpoint Picker + Envoy) as a second router backend alongside `vllm-router`, selected via `[inference.deployment.router] type = "llm-d"`. Built on the external-LB substrate from #2696 (merged): the per-rank vLLM engines launch identically; only the router control plane differs.

2 commits:
- `refactor: drop client-side DP-rank pinning for external-LB`: external-LB gives each DP rank its own endpoint (the URL is the rank selector), so the client no longer needs the hybrid-LB `X-data-parallel-rank` header. Removes `dp_rank_count` + the per-rank client expansion (one client per base URL; the router balances). Also fixes llm-d: the EPP forwards the header to the `dp=1` backend, which rejected it.
- `feat: llm-d (EPP+Envoy) router backend`: `LlmdRouterConfig` (discriminated-union member: `scorers` + per-profile `prefill_scorer_overrides`/`decode_scorer_overrides`, `non_cached_tokens`, `decode_sidecar_port`, known-scorer validator); an llm-d branch in the shared `launch_router` helper (renders per-replica EPP + Envoy + file-discovery endpoints, launches `epp` + `envoy`); the pd-sidecar on decode nodes for P/D; entrypoints pass the `router` config object to the templates; SLURM cleanup clears stale `epp`/`envoy`/`pd-sidecar` procs; presets under `templates/llmd/` + `scripts/install_llmd.sh`; docs + install skill. Rejects `enable_return_routed_experts` / `trainer.enable_router_replay` with llm-d (breaks P/D, unverified for multi-node).

E2E verification (SLURM)
- `errored=0`, `RL trainer finished` (incl. weight broadcast)
- `InferenceConfig` (direct `enable_return_routed_experts`) and `RLConfig` (`trainer.enable_router_replay`); vllm-router + router replay passes

decode ≫ prefill request counts are intended: short / prefix-cached prompts skip remote prefill (`non_cached_tokens`).

Install
`bash scripts/install_llmd.sh` builds `epp`/`envoy`/`pd-sidecar` into `third_party/llmd/bin`.

Supersedes #2691 (the pre-external-LB version).
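
The E2E note about decode ≫ prefill request counts can be illustrated with a small sketch. It assumes `non_cached_tokens` acts as a minimum number of uncached prompt tokens below which a request is served by decode alone; the function names, the `>=` comparison, and the way the cached prefix is measured are assumptions, not the EPP's actual logic.

```python
def needs_remote_prefill(prompt_tokens, cached_prefix_tokens, non_cached_tokens):
    """Return True if the request should go through a prefill worker.

    Assumption: only the part of the prompt that is not already
    prefix-cached counts toward the `non_cached_tokens` threshold.
    """
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached >= non_cached_tokens


def count_requests(requests, non_cached_tokens):
    """Count decode and prefill requests for (prompt, cached) token pairs."""
    decode = len(requests)  # every request is decoded
    prefill = sum(
        needs_remote_prefill(prompt, cached, non_cached_tokens)
        for prompt, cached in requests
    )
    return decode, prefill
```

Under this model every request reaches a decode worker, while only long, mostly uncached prompts also reach a prefill worker, so a decode count well above the prefill count is the expected shape rather than a routing fault.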

Note
Medium Risk
Touches multi-node/SLURM inference routing, P/D sidecars, and a breaking removal of `dp_rank_count`; misconfiguration is mostly caught by validators, but operational risk is in new native binaries and routing behavior changes.

Overview
Adds llm-d (EPP + Envoy) as a second inference router backend alongside `vllm-router`, selected via a discriminated `[inference.deployment.router]` block (`type = "llm-d"`). New `LlmdRouterConfig` exposes scorer weights, P/D prefill/decode overrides, `non_cached_tokens`, and `decode_sidecar_port`, with validation for unknown scorers and a hard error when router replay / `enable_return_routed_experts` is combined with llm-d (on both `InferenceConfig` and `RLConfig`).

SLURM launch is extended: `_launch_router.sh.j2` renders per-replica EPP/Envoy/file-discovery configs from `templates/llmd/`, starts `epp` and `envoy`, and on P/D decode nodes starts `pd-sidecar`; cleanup kills stale llm-d processes. `scripts/install_llmd.sh` and install docs pin-build `epp`, `pd-sidecar`, and Envoy into `third_party/llmd/bin`.

Breaking client change: removes `orchestrator.student.client.dp_rank_count` and per-rank client expansion via `X-data-parallel-rank`. With external-LB, each DP rank is its own URL and the router load-balances across endpoints (documented in `CHANGELOG.md`).

Entrypoints pass the full `router` object into templates (replacing separate port/policy vars); inference/RL sbatch templates use `is_disaggregated` and `infer_nodes_per_replica` for llm-d endpoint wiring.

Reviewed by Cursor Bugbot for commit 5d0f4ea.