[codex] Fix external-LB inference config sizing #2705
Draft
samsja wants to merge 2 commits into
Summary
- `inference_world_size` matches the allocated inference GPUs, not `nodes * global_api_server_count * tp`.
- Router `dp_rank_count = 1`, so requests do not send invalid vLLM `X-data-parallel-rank` headers; admin URLs still cover every backend for weight updates.
- Explicit overrides of `inference_world_size` or router `dp_rank_count` that would reintroduce the mismatch fail during config resolution.

Details
For external-LB multi-node dense inference, `api_server_count` is already the global backend count exposed through the router/admin URL list. Multiplying it by the inference-node count double-counts workers on 2 inference nodes and makes NCCL wait for ranks that do not exist.

Those dense backends are independent TP-sharded servers with vLLM DP size 1, so the router should handle request distribution instead of the client sending global DP-rank headers.
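To make the double-counting concrete, here is a minimal arithmetic sketch. The function names and the example values `nodes=2` and `tp=4` are illustrative assumptions, chosen to be consistent with the dry-run values listed under Validation, and it assumes each backend occupies exactly `tp` GPUs. It is not the code in `rl.py`.

```python
# Illustrative arithmetic only. Names and values are assumptions,
# not the actual prime-rl resolver.


def old_world_size(nodes: int, global_api_server_count: int, tp: int) -> int:
    """Previous sizing: multiplies the already-global backend count by the node count."""
    return nodes * global_api_server_count * tp


def allocated_inference_gpus(global_api_server_count: int, tp: int) -> int:
    """Each backend is one TP-sharded server with DP size 1, so it uses `tp` GPUs."""
    return global_api_server_count * tp


# Assumed example: 2 inference nodes, 4 backends in total, TP size 4.
nodes, api_server_count, tp = 2, 4, 4

print(old_world_size(nodes, api_server_count, tp))     # 32
print(allocated_inference_gpus(api_server_count, tp))  # 16
```

Under these assumptions the old formula asks NCCL for 32 ranks while only 16 GPUs run inference workers, so the collective would wait for 16 ranks that never join.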
The resolver now sets those values correctly by default and rejects explicit overrides that would reintroduce the mismatch, so this is enforced by `RLConfig` rather than a manual dry-run inspection note.

Validation
- `UV_NO_SYNC=1 uv run pytest tests/unit/test_configs.py::test_multi_node_dense_nccl_world_size_matches_inference_gpu_count_and_router_client tests/unit/test_configs.py::test_multi_node_dense_rejects_invalid_router_dp_rank_count tests/unit/test_configs.py::test_multi_node_nccl_rejects_invalid_inference_world_size_override`
- `UV_NO_SYNC=1 uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/rl.py tests/unit/test_configs.py`
- `UV_NO_SYNC=1 uv run rl @ configs/nemotron_debug/rl.toml --dry-run --output-dir /tmp/prime-rl-nemotron-dryrun-config-pr` resolved `api_server_count=4`, `data_parallel_size_local=2`, `inference_world_size=16`, and `dp_rank_count=1`.
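The default-and-reject behavior described under Details could look roughly like the sketch below. It is a hypothetical stand-in built on a plain dataclass: the class name `DenseExternalLBSketch`, its fields, the `resolve` method, and the error messages are all assumptions and do not reflect the actual `RLConfig` implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DenseExternalLBSketch:
    """Hypothetical stand-in for the relevant slice of the config (not RLConfig)."""

    api_server_count: int  # global backend count behind the router
    tp: int  # tensor-parallel size of each backend
    inference_world_size: Optional[int] = None  # None means "use the default"
    dp_rank_count: Optional[int] = None  # None means "use the default"

    def resolve(self) -> "DenseExternalLBSketch":
        # Assumption: each backend occupies exactly `tp` GPUs.
        expected = self.api_server_count * self.tp
        if self.inference_world_size is None:
            self.inference_world_size = expected
        elif self.inference_world_size != expected:
            raise ValueError(
                f"inference_world_size={self.inference_world_size} does not match "
                f"the {expected} allocated inference GPUs"
            )
        # Dense backends run with DP size 1, so the router sees one DP rank.
        if self.dp_rank_count is None:
            self.dp_rank_count = 1
        elif self.dp_rank_count != 1:
            raise ValueError("router dp_rank_count must be 1 for dense external-LB")
        return self


resolved = DenseExternalLBSketch(api_server_count=4, tp=4).resolve()
print(resolved.inference_world_size, resolved.dp_rank_count)  # 16 1
```

Failing inside the resolve step, rather than at launch, means a bad override surfaces before any NCCL group is created.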