Skip to content

fix(configs): remove vestigial client dp_rank_count / X-data-parallel-rank#2708

Open
S1ro1 wants to merge 1 commit into
mainfrom
fix/drop-vestigial-dp-rank-header
Open

fix(configs): remove vestigial client dp_rank_count / X-data-parallel-rank#2708
S1ro1 wants to merge 1 commit into
mainfrom
fix/drop-vestigial-dp-rank-header

Conversation

@S1ro1
Copy link
Copy Markdown
Collaborator

@S1ro1 S1ro1 commented Jun 4, 2026

Summary

Removes the vestigial client-side dp_rank_count / X-data-parallel-rank mechanism.

Background

dp_rank_count + the X-data-parallel-rank header come from #1940 ("add dp-rank-routing"): with a single hybrid-LB server fronting multiple DP engines, the client pinned each rollout to a DP shard for KV reuse. #2696 ("use external-LB everywhere, drop hybrid-LB") replaced that with one endpoint per DP rank, router-balanced — so the header no longer has a job.

It isn't just dead, it's harmful. The regular vllm-router forwards client headers to the worker, and vLLM validates the rank:

data_parallel_rank N is out of range [0, num_ranks).

Single-DP backends (--data-parallel-size 1, how dense external-LB launches) have num_ranks = 1, so any request carrying rank > 0 is rejected. Affinity / KV reuse is now the router's job (prefix-cache scoring), not the client's.

Change

  • Remove client.dp_rank_count from ClientConfig.
  • setup_clients: one client per base_url, no X-data-parallel-rank header.
  • Drop the dp_rank_count auto-set in the RL resolver and the field pass-through in the elastic pool.
  • client_identity keys on api_base_url alone (each DP rank is its own endpoint).
  • CHANGELOG + tests updated.

No checked-in config sets dp_rank_count, and the server-side header reader already handles its absence. Breaking only for external configs that set it (extra="forbid" rejects it); nothing to migrate.

Related: #2707 (NCCL world-size) — together they fully fix dense multi-node external-LB.

🤖 Generated with Claude Code


Note

Medium Risk
Breaking config removal (extra="forbid") and changed inference routing identity; fixes harmful rank headers on single-DP vLLM but requires one base URL per DP rank in user configs.

Overview
Removes client-side dp_rank_count and the X-data-parallel-rank header path that expanded each base_url into multiple pinned DP clients. With external-LB, each DP rank is already a separate endpoint in client.base_url, so the orchestrator now builds one inference client per URL and load-balances on api_base_url only.

ClientConfig drops the field (breaking for configs that still set it). setup_clients no longer loops ranks or sets rank headers. The RL auto_setup_inference_client hook stops copying inference DP size into the client. Elastic client rebuilds no longer forward dp_rank_count. CHANGELOG documents the migration (delete the key; no replacement).

Reviewed by Cursor Bugbot for commit 55d3fa8. Bugbot is set up for automated code reviews on this repo. Configure here.

…-rank

dp-rank routing (#1940) let the client pin a rollout to a DP shard within a single hybrid-LB server via the X-data-parallel-rank header. #2696 moved to external-LB everywhere (one endpoint per DP rank, router-balanced) and dropped hybrid-LB, leaving the header vestigial. Worse, the regular vllm-router forwards client headers to the worker and a single-DP backend rejects an out-of-range rank ("data_parallel_rank N is out of range"), so dense multi-node external-LB could fail on every request with rank > 0.

Remove client.dp_rank_count and the per-rank client expansion: setup_clients now creates one client per base_url with no DP-rank header, and the router load-balances across the per-rank endpoints. The server-side reader already tolerates an absent header.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@S1ro1 S1ro1 marked this pull request as ready for review June 4, 2026 02:19
Copy link
Copy Markdown
Member

@samsja samsja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah wait actually do we have router on single node ? if not we cant remove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants