[codex] Fix external-LB inference config sizing by samsja · Pull Request #2705 · PrimeIntellect-ai/prime-rl

samsja · 2026-06-04T01:31:01Z

Summary

Fix dense multi-node external-LB NCCL broadcast sizing so inference_world_size matches allocated inference GPUs, not nodes * global_api_server_count * tp.
Keep dense router-backed student clients at dp_rank_count = 1 so requests do not send invalid vLLM X-data-parallel-rank headers; admin URLs still cover every backend for weight updates.
Validate the resolved multi-node inference shape so explicit bad overrides for NCCL inference_world_size or router dp_rank_count fail during config resolution.

Details

For external-LB multi-node dense inference, api_server_count is already the global backend count exposed through the router/admin URL list. Multiplying it by the inference-node count double-counts workers: with 2 inference nodes the computed world size is twice the number of allocated GPUs, so NCCL waits for ranks that do not exist.
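The double-counting can be illustrated with a small arithmetic sketch. The function names are hypothetical and are not taken from prime-rl; tp = 4 is an assumption chosen so that the result agrees with the api_server_count=4 and inference_world_size=16 values reported under Validation, and the corrected formula assumes each dense backend occupies exactly tp GPUs.

```python
def old_world_size(inference_nodes: int, global_api_server_count: int, tp: int) -> int:
    # Formula described as wrong in the summary: the global backend count
    # is multiplied by the node count a second time.
    return inference_nodes * global_api_server_count * tp


def expected_world_size(global_api_server_count: int, tp: int) -> int:
    # Assumption: every dense backend is one TP group of `tp` GPUs, so the
    # allocated inference GPU count is backends * tp.
    return global_api_server_count * tp


INFERENCE_NODES = 2   # from the PR description
API_SERVER_COUNT = 4  # from the dry-run output
TP = 4                # assumed for illustration

print(old_world_size(INFERENCE_NODES, API_SERVER_COUNT, TP))  # 32
print(expected_world_size(API_SERVER_COUNT, TP))              # 16
```

Under these assumptions the old formula asks NCCL for 32 ranks while only 16 inference GPUs exist, which matches the hang described above.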

Those dense backends are independent TP-sharded servers with vLLM DP size 1, so the router should handle request distribution instead of the client sending global DP-rank headers.
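A rough sketch of the client-side consequence, assuming a hypothetical helper that builds per-request headers (the helper name and the round-robin rank choice are illustrative, not prime-rl's client code; only the X-data-parallel-rank header name comes from the description above):

```python
def request_headers(dp_rank_count: int, request_index: int) -> dict[str, str]:
    # With dp_rank_count == 1 the client sends no DP-rank header and the
    # router is free to pick any backend.
    if dp_rank_count <= 1:
        return {}
    # Hypothetical round-robin choice for setups where the client does
    # address individual DP ranks.
    rank = request_index % dp_rank_count
    return {"X-data-parallel-rank": str(rank)}


print(request_headers(1, 7))  # {}
print(request_headers(4, 6))  # {'X-data-parallel-rank': '2'}
```

With dp_rank_count = 1, no request carries a rank that a DP-size-1 backend would have to reject.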

The resolver now sets those values correctly by default and rejects explicit overrides that would reintroduce the mismatch, so this is enforced by RLConfig rather than a manual dry-run inspection note.
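The kind of invariant check described here could look roughly like the following. This is a minimal sketch, not the actual RLConfig validator: the ResolvedInferenceShape class, its field names, and the error messages are hypothetical, and the expected world size again assumes one TP group of tp GPUs per backend.

```python
from dataclasses import dataclass


@dataclass
class ResolvedInferenceShape:
    # Hypothetical container for the resolved values; not a prime-rl type.
    api_server_count: int
    tp: int
    inference_world_size: int
    dp_rank_count: int


def validate_dense_multi_node(shape: ResolvedInferenceShape) -> None:
    # Assumed invariant: world size equals allocated inference GPUs.
    expected = shape.api_server_count * shape.tp
    if shape.inference_world_size != expected:
        raise ValueError(
            f"inference_world_size={shape.inference_world_size} "
            f"does not match allocated inference GPUs ({expected})"
        )
    # Dense router-backed clients must not address individual DP ranks.
    if shape.dp_rank_count != 1:
        raise ValueError(
            f"dp_rank_count must be 1 for dense router-backed clients, "
            f"got {shape.dp_rank_count}"
        )


# Passes silently: matches the dry-run values with an assumed tp of 4.
validate_dense_multi_node(ResolvedInferenceShape(4, 4, 16, 1))
```

Raising during config resolution means a bad explicit override fails before any process group is created, rather than surfacing later as a stalled broadcast.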

Validation

UV_NO_SYNC=1 uv run pytest tests/unit/test_configs.py::test_multi_node_dense_nccl_world_size_matches_inference_gpu_count_and_router_client tests/unit/test_configs.py::test_multi_node_dense_rejects_invalid_router_dp_rank_count tests/unit/test_configs.py::test_multi_node_nccl_rejects_invalid_inference_world_size_override
UV_NO_SYNC=1 uv run ruff check packages/prime-rl-configs/src/prime_rl/configs/rl.py tests/unit/test_configs.py
UV_NO_SYNC=1 uv run rl @ configs/nemotron_debug/rl.toml --dry-run --output-dir /tmp/prime-rl-nemotron-dryrun-config-pr
The dry run resolved api_server_count=4, data_parallel_size_local=2, inference_world_size=16, and dp_rank_count=1.

fix multi-node inference broadcast sizing

1ea47c1

samsja mentioned this pull request Jun 4, 2026

[codex] Fix vLLM layerwise reload alias buffers #2701

Draft

validate multi-node inference config invariants

cf06820

samsja mentioned this pull request Jun 4, 2026

[codex] Remove vLLM layerwise reload alias patch #2706

Draft

samsja wants to merge 2 commits into main from fix/external-lb-inference-config
