[TRTLLM-12714][feat] KV pool rebalance: gate for multi-GPU and coordinate attention-DP by thorjohnsen · Pull Request #15518 · NVIDIA/TensorRT-LLM

thorjohnsen · 2026-06-22T16:25:40Z

Summary

Broadens the KVCacheManagerV2 pool-rebalance hook (added in #14578, single-GPU
aggregated only) toward multi-GPU. This PR lands the first two staged steps:
make the opt-in flag safe under any parallelism, then enable it for attention DP.

The rebalance hook decides (need_adjustment) and derives target pool ratios
from rank-local state only. Any multi-rank parallelism welds the executor
loop together with collectives (TP/CP all-reduce in the forward, attention-DP
tp_allgather of batch sizes, PP send/recv ring), so a rank that pauses to
adjust() while its peers proceed deadlocks NCCL.

Changes

1. Gate the flag off for multi-GPU (safety fix).
_can_pause_for_rebalance previously guarded only pp_size > 1, so setting
enable_kv_pool_rebalance=True under TP/CP/attention-DP passed the gate and hit
the collective deadlock. Added the missing guards + rationale docstring.

2. Coordinate rebalance across attention-DP ranks.
With attention DP each rank owns its own KVCacheManagerV2 pools and request
stream, so the decision and resulting ratios legitimately stay rank-local — only
the pause must be synchronized. The gate now permits attention DP (plain
TP/CP and PP stay gated off; they need a bit-identical pool layout or a
full-pipeline drain that is not wired up yet). _maybe_rebalance_kv_pools takes
a tp_allreduce(MAX) (logical-OR) vote on need_adjustment; if any rank wants
in, all ranks drain → suspend (skipping dummy requests) → adjust() only if
locally needed → resume → tp_barrier().

Test coverage

tests/unittest/_torch/executor/test_kv_pool_rebalance.py — 23 unit tests pass,
including the new gate cases and three collective-path tests (vote-yes fires the
full cycle and barrier, vote-no is a no-op, a rank with no local need still
pauses for a peer but does not adjust its own pools).

Note: the unit tests are mocked and verify the call chain / vote logic; they
cannot exercise the real NCCL lockstep. A multi-GPU attention-DP integration
run is still needed before merge.

Not in this PR

Plain TP/CP (ratio broadcast + agreed rollback) and PP (full-pipeline drain)
remain gated off and are tracked as follow-ups under TRTLLM-12714.

Summary by CodeRabbit

Bug Fixes
- Improved KV-pool rebalancing coordination in multi-rank distributed training scenarios with enhanced deadlock prevention mechanisms.
- Extended eligibility checks for KV-pool rebalancing across tensor-parallel and pipeline-parallel configurations.
Tests
- Expanded test coverage for KV-pool rebalancing across distributed configurations, including attention-DP mode and edge cases.

…ttention-DP) The KVCacheManagerV2 pool-rebalance hook makes its decision (need_adjustment) and derives target pool ratios from rank-local state only. Any multi-rank parallelism welds the executor loop together with collectives (TP/CP all-reduce in the forward, attention-DP tp_allgather of batch sizes, PP send/recv ring), so a rank that pauses to adjust() while its peers proceed deadlocks NCCL. _can_pause_for_rebalance previously guarded only pp_size > 1, leaving tp_size > 1, cp_size > 1, and enable_attention_dp able to pass the gate and hit the deadlock once enable_kv_pool_rebalance was set. Add the missing early-return guards and expand the docstring with the rationale. Add unit tests covering the three new gates and extend _make_executor with tp_size/cp_size/enable_attention_dp params. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

… ranks With attention data parallelism each rank owns its KVCacheManagerV2 pools and its own request stream, so the rebalance decision (need_adjustment) and the resulting per-rank pool ratios are legitimately rank-local. Only the pause has to be synchronized, because every TP rank is welded to the others by the per-iteration tp_allgather of batch sizes in scheduling: a rank that paused while its peers proceeded -- or skipped a pause its peers took -- would deadlock NCCL. Allow attention DP through _can_pause_for_rebalance (plain TP/CP and PP stay gated off, since those need a bit-identical pool layout or a full-pipeline drain that is not wired up yet). In _maybe_rebalance_kv_pools take a MAX (logical-OR) vote on need_adjustment over the TP group; if any rank wants to rebalance, all ranks drain and suspend in lockstep, each adjusts only its own pools and only when it locally needs to, then all resume and re-converge at a tp_barrier. Skip dummy requests (e.g. the attention-DP padding request) in the suspend/resume loop. Add unit tests for the new gate behavior and the collective vote paths (vote yes fires the full cycle and barrier, vote no is a no-op, a rank with no local need still pauses for a peer but does not adjust). Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

…-multigpu-gate

thorjohnsen · 2026-06-22T16:29:31Z

/bot run --disable-fail-fast

coderabbitai · 2026-06-22T16:34:08Z

📝 Walkthrough

Walkthrough

Extends KV pool rebalance eligibility in _can_pause_for_rebalance to block rebalance when cp_size > 1 and when tp_size > 1 without attention-DP enabled. Rewrites _maybe_rebalance_kv_pools to perform TP-wide MAX voting, skip dummy requests during pause/resume, conditionally invoke adjust() per-rank, and insert TP barriers for rank reconvergence. Tests are extended with tp_size, cp_size, enable_attention_dp, and is_dummy configuration knobs to cover all new paths.

Changes

TP-Coordinated KV Pool Rebalance

Layer / File(s)	Summary
`_can_pause_for_rebalance` eligibility expansion and gate tests `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tests/unittest/_torch/executor/test_kv_pool_rebalance.py`	Added `cp_size > 1` (always disable) and `tp_size > 1` without `enable_attention_dp` (disable) guards to `_can_pause_for_rebalance`. Extended `_make_executor` to accept and mock `tp_size`, `cp_size`, `enable_attention_dp`; extended `_make_request` with `is_dummy`; added gate tests for all new TP/CP/attention-DP combinations.
TP-collective `_maybe_rebalance_kv_pools` and attention-DP tests `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tests/unittest/_torch/executor/test_kv_pool_rebalance.py`	Rewrote `_maybe_rebalance_kv_pools` to broadcast `need_adjustment` via `tp_allreduce(MAX)`, skip dummy requests during pause/resume, call `adjust()` only on locally-voting ranks, and insert `tp_barrier` calls for lockstep synchronization. Added tests for vote-yes full cycle, vote-no no-op (no barrier/CUDA sync), dummy-request exclusion, and peer-voted lockstep pause/resume without local `adjust`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: adding gates for multi-GPU safety and enabling coordination for attention-DP in KV pool rebalancing.
Description check	✅ Passed	The description provides a clear summary of the issue, changes, and test coverage. All key sections are well-documented, explaining the safety fix and attention-DP coordination.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3213-3214: The call to `_consume_previous_batch_for_rebalance()`
after torch.cuda.current_stream().synchronize() can cause TP collective deadlock
under attention-DP because ranks with a pending previous_batch will enter
collective operations in `_flush_pending_transfer_responses()` or
`_handle_responses()`, while ranks where previous_batch is None will skip
straight to tp_barrier(), creating a collective-order mismatch. Add a TP-wide
vote (using allreduce) before the `_consume_previous_batch_for_rebalance()` call
to check if any rank has a pending previous_batch, and only proceed with
draining when all ranks collectively agree to participate, ensuring all ranks
execute the same sequence of collective operations regardless of their
individual previous_batch state.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ed7ab62-67a3-45b1-a039-51fda26b1952

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7bf0d and 7371b4e.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/unittest/_torch/executor/test_kv_pool_rebalance.py

tensorrt-cicd · 2026-06-22T16:35:29Z

PR_Github #55051 [ run ] triggered by Bot. Commit: d2eb230 Link to invocation

tensorrt-cicd · 2026-06-23T01:09:22Z

PR_Github #55051 [ run ] completed with state FAILURE. Commit: d2eb230
/LLM/main/L0_MergeRequest_PR pipeline #44042 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

thorjohnsen · 2026-06-25T15:28:42Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T15:35:33Z

PR_Github #55811 [ run ] triggered by Bot. Commit: 0730062 Link to invocation

… divergent previous_batch _consume_previous_batch_for_rebalance() runs two attention-DP tp_gathers (in _flush_pending_transfer_responses and _process_previous_batch -> _handle_responses) but skips them when previous_batch is None. previous_batch legitimately diverges across attention-DP ranks (a rank whose local batch was empty clears it), so a draining rank would block in tp_gather while a peer with nothing to drain races ahead to tp_barrier() -> NCCL deadlock. Add a tp_allreduce(SUM) vote on previous_batch presence after the need_adjustment MAX vote; only drain when every rank agrees, otherwise defer the whole rebalance. need_adjustment is sticky, so nothing is lost. Update the stale docstring that claimed the gate already excluded multi-rank divergence. Add a regression test for the divergent-previous_batch deferral and update the existing attention-DP tests for the second collective. Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

tensorrt-cicd · 2026-06-25T23:07:57Z

PR_Github #55811 [ run ] completed with state FAILURE. Commit: 0730062
/LLM/main/L0_MergeRequest_PR pipeline #44707 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

thorjohnsen · 2026-06-25T23:25:10Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T23:31:21Z

PR_Github #55906 [ run ] triggered by Bot. Commit: 8d1795b Link to invocation

tensorrt-cicd · 2026-06-26T07:56:47Z

PR_Github #55906 [ run ] completed with state SUCCESS. Commit: 8d1795b
/LLM/main/L0_MergeRequest_PR pipeline #44795 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

thorjohnsen · 2026-06-29T14:11:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T14:18:04Z

PR_Github #56387 [ run ] triggered by Bot. Commit: 7d89756 Link to invocation

tensorrt-cicd · 2026-06-29T19:29:25Z

PR_Github #56387 [ run ] completed with state SUCCESS. Commit: 7d89756
/LLM/main/L0_MergeRequest_PR pipeline #45230 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

thorjohnsen · 2026-06-29T21:34:40Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-29T21:46:10Z

PR_Github #56467 [ run ] triggered by Bot. Commit: a95dd48 Link to invocation

tensorrt-cicd · 2026-06-30T03:47:55Z

PR_Github #56467 [ run ] completed with state FAILURE. Commit: a95dd48
/LLM/main/L0_MergeRequest_PR pipeline #45310 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

thorjohnsen · 2026-07-01T17:39:03Z

CI failure triage (builds 44707 / 44795 / 45230 / 45310)

I went through the failure analyses for the recent /bot run attempts. The failures are pre-existing main-flaky tests and infra issues, not regressions from this PR — and the changed code path is provably dormant in CI.

Why this PR cannot be the cause

The feature is opt-in, default off. Every new line lives behind _can_pause_for_rebalance() → _maybe_rebalance_kv_pools(); both call sites are gated by self._is_kv_manager_v2 and self._can_pause_for_rebalance(), whose first statement is if not self.enable_kv_pool_rebalance: return False. No CI test sets enable_kv_pool_rebalance=True, so the new code never executes.
Empirically confirmed dormant — build #44707's analysis notes the failing DeepSeek run logged enable_kv_pool_rebalance=False; the new path was not entered.
The PR's own tests pass — build #45230 shows test_kv_pool_rebalance.py PASSED and test_kv_pool_rebalance_accuracy.py::…test_rebalance_matches_baseline[no_overlap / overlap] PASSED.

What actually failed

Run	Failures	Category
#44707	`TestDeepSeekR1::test_nvfp4_multi_gpus[throughput]` (executor worker error), Ray update-weights `part3`, DGX_B200 8-GPU Slurm FAILED	flaky/infra (feature off)
#44795	`Sampling failed` (CUDA OOM), `TestDeepSeekR1::test_nvfp4_multi_gpus[throughput]`, visual-gen 8-GPU, single-GPU ABORTED	OOM/infra
#45230	`FileNotFoundError: /root/.triton/cache` across many unrelated tests; Slurm cleanup	infra
#45310	`TestDeepSeekV32Exp::test_auto_dtype_with_helix[pp1tp2cp2 / pp1tp1cp4]` (accuracy), MNNVL NVFP4 regex, GB200 perf-sanity NCCL/OOM, `test_disagg_server_restart` timeout, Ray update-weights	flaky/base-regression/infra

These are the same failures currently hitting other open PRs — DeepSeekV32Exp helix accuracy, the MNNVL test_mnnvl_nvfp4_rejects_fp32_before_launch regex mismatch (a base-branch error-message change, not a file this PR touches), GB200 perf-sanity OOM/NCCL, Triton-cache FileNotFoundError, Ray orchestrator, visual-gen. Several map to open nvbugs on main (helix → 6322076 / 6120535; Ray inner cases skipped under 6372690). None involve enable_kv_pool_rebalance, _maybe_rebalance_kv_pools, or the rebalance functions in py_executor.py.

Re-triggering CI to get a clean run. (Separately, the multi-GPU attention-DP integration validation called out in the PR description is still the real pre-merge item — unrelated to these failures.)

thorjohnsen · 2026-07-01T17:39:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-01T17:44:53Z

PR_Github #56987 [ run ] triggered by Bot. Commit: 5160ae3 Link to invocation

tensorrt-cicd · 2026-07-01T23:21:00Z

PR_Github #56987 [ run ] completed with state SUCCESS. Commit: 5160ae3
/LLM/main/L0_MergeRequest_PR pipeline #45790 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

thorjohnsen · 2026-07-02T15:27:56Z

/bot run

tensorrt-cicd · 2026-07-02T15:33:47Z

PR_Github #57205 [ run ] triggered by Bot. Commit: 5160ae3 Link to invocation

tensorrt-cicd · 2026-07-02T22:04:17Z

PR_Github #57205 [ run ] completed with state SUCCESS. Commit: 5160ae3
/LLM/main/L0_MergeRequest_PR pipeline #45981 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

thorjohnsen · 2026-07-02T22:48:57Z

/bot run

thorjohnsen added 2 commits June 22, 2026 16:21

thorjohnsen requested a review from a team as a code owner June 22, 2026 16:25

thorjohnsen requested a review from joyang-nv June 22, 2026 16:25

github-actions Bot assigned thorjohnsen Jun 22, 2026

Merge remote-tracking branch 'origin/main' into fix/kv-pool-rebalance…

d2eb230

…-multigpu-gate

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py

Merge branch 'main' into fix/kv-pool-rebalance-multigpu-gate

0730062

Merge branch 'main' into fix/kv-pool-rebalance-multigpu-gate

8d1795b

Merge branch 'main' into fix/kv-pool-rebalance-multigpu-gate

7d89756

Merge branch 'main' into fix/kv-pool-rebalance-multigpu-gate

a95dd48

Merge branch 'main' into fix/kv-pool-rebalance-multigpu-gate

5160ae3

Uh oh!

Conversation

thorjohnsen commented Jun 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test coverage

Not in this PR

Summary by CodeRabbit

Uh oh!

thorjohnsen commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

thorjohnsen commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

thorjohnsen commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 26, 2026

Uh oh!

thorjohnsen commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

thorjohnsen commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 29, 2026

Uh oh!

tensorrt-cicd commented Jun 30, 2026

Uh oh!

thorjohnsen commented Jul 1, 2026

CI failure triage (builds 44707 / 44795 / 45230 / 45310)

Why this PR cannot be the cause

What actually failed

Uh oh!

thorjohnsen commented Jul 1, 2026

Uh oh!

tensorrt-cicd commented Jul 1, 2026

Uh oh!

tensorrt-cicd commented Jul 1, 2026

Uh oh!

thorjohnsen commented Jul 2, 2026

Uh oh!

tensorrt-cicd commented Jul 2, 2026

Uh oh!

tensorrt-cicd commented Jul 2, 2026

Uh oh!

thorjohnsen commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

thorjohnsen commented Jun 22, 2026 •

edited by coderabbitai Bot

Loading