[codex] Fix external-LB inference config sizing#2705

Draft: samsja wants to merge 2 commits into main from fix/external-lb-inference-config

samsja commented Jun 4, 2026

Summary

  • Fix dense multi-node external-LB NCCL broadcast sizing so inference_world_size matches allocated inference GPUs, not nodes * global_api_server_count * tp.
  • Keep dense router-backed student clients at dp_rank_count = 1 so requests do not send invalid vLLM X-data-parallel-rank headers; admin URLs still cover every backend for weight updates.
  • Validate the resolved multi-node inference shape so explicit bad overrides for NCCL inference_world_size or router dp_rank_count fail during config resolution.

Details

For external-LB multi-node dense inference, api_server_count is already the global backend count exposed through the router/admin URL list. Multiplying it by the inference-node count double-counts workers: with 2 inference nodes, the computed world size is twice the number of allocated GPUs, so NCCL waits for ranks that do not exist.
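The double-counting can be sketched with a little arithmetic. This is an illustration only, not the resolver code in rl.py: the function names are hypothetical, and tp = 4 is an assumption chosen to be consistent with the dry-run values reported under Validation (api_server_count=4, inference_world_size=16).

```python
def old_world_size(num_inference_nodes: int, global_api_server_count: int, tp: int) -> int:
    # Previous sizing: treats the global backend count as if it were per node.
    return num_inference_nodes * global_api_server_count * tp


def fixed_world_size(global_api_server_count: int, tp: int) -> int:
    # Each dense backend is one TP-sharded server, so it owns exactly `tp` GPUs.
    return global_api_server_count * tp


# Assumed shape: 2 inference nodes, 4 backends in total, tp = 4 (assumption).
nodes, api_server_count, tp = 2, 4, 4

print(old_world_size(nodes, api_server_count, tp))  # 32: twice the allocated GPUs
print(fixed_world_size(api_server_count, tp))       # 16: matches the allocated GPUs
```

Under these assumptions the old formula asks NCCL to wait for 32 ranks when only 16 inference GPUs exist.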

Those dense backends are independent TP-sharded servers with vLLM DP size 1, so the router should handle request distribution instead of the client sending global DP-rank headers.
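As a rough sketch of why dp_rank_count = 1 keeps the header off the wire: the helper below is hypothetical and is not the prime-rl client code. It assumes a client that pins requests to ranks round-robin only when more than one DP rank is configured.

```python
DP_RANK_HEADER = "X-data-parallel-rank"


def build_request_headers(dp_rank_count: int, request_index: int) -> dict[str, str]:
    # With a single DP rank there is nothing to pin, so no header is sent
    # and the router is free to pick any backend.
    if dp_rank_count <= 1:
        return {}
    # Otherwise pin the request to a rank (round-robin is an assumption here).
    return {DP_RANK_HEADER: str(request_index % dp_rank_count)}


print(build_request_headers(dp_rank_count=1, request_index=7))
print(build_request_headers(dp_rank_count=4, request_index=7))
```

With dp_rank_count = 1 the first call returns an empty dict, so a backend whose vLLM DP size is 1 never receives a rank it cannot serve.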

The resolver now sets those values correctly by default and rejects explicit overrides that would reintroduce the mismatch, so the invariant is enforced by RLConfig rather than by manually inspecting a dry run.
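A minimal sketch of that kind of check, assuming a standalone function rather than the actual RLConfig validator (the function name, parameters, and error messages are all hypothetical):

```python
def validate_dense_external_lb_shape(
    inference_gpu_count: int,
    inference_world_size: int,
    dp_rank_count: int,
) -> None:
    # The NCCL broadcast group must contain exactly the allocated inference GPUs.
    if inference_world_size != inference_gpu_count:
        raise ValueError(
            f"inference_world_size={inference_world_size} does not match "
            f"the {inference_gpu_count} allocated inference GPUs"
        )
    # Dense router-backed clients must not address individual DP ranks.
    if dp_rank_count != 1:
        raise ValueError(
            f"dp_rank_count={dp_rank_count} is invalid for dense router-backed "
            "inference; expected 1"
        )


# A consistent shape passes silently.
validate_dense_external_lb_shape(16, inference_world_size=16, dp_rank_count=1)

# An explicit bad override fails at resolution time instead of hanging in NCCL.
try:
    validate_dense_external_lb_shape(16, inference_world_size=32, dp_rank_count=1)
except ValueError as err:
    print(f"rejected: {err}")
```

Failing early in config resolution is the design choice here: a wrong world size would otherwise only surface as an NCCL rendezvous that never completes.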

Validation

  • UV_NO_SYNC=1 uv run pytest tests/unit/test_configs.py::test_multi_node_dense_nccl_world_size_matches_inference_gpu_count_and_router_client tests/unit/test_configs.py::test_multi_node_dense_rejects_invalid_router_dp_rank_count tests/unit/test_configs.py::test_multi_node_nccl_rejects_invalid_inference_world_size_override
  • UV_NO_SYNC=1 uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/rl.py tests/unit/test_configs.py
  • UV_NO_SYNC=1 uv run rl @ configs/nemotron_debug/rl.toml --dry-run --output-dir /tmp/prime-rl-nemotron-dryrun-config-pr, which resolved api_server_count=4, data_parallel_size_local=2, inference_world_size=16, and dp_rank_count=1.
