Skip to content

[TRTLLM-12714][feat] KV pool rebalance: gate for multi-GPU and coordinate attention-DP#15518

Open
thorjohnsen wants to merge 9 commits into
NVIDIA:mainfrom
thorjohnsen:fix/kv-pool-rebalance-multigpu-gate
Open

[TRTLLM-12714][feat] KV pool rebalance: gate for multi-GPU and coordinate attention-DP#15518
thorjohnsen wants to merge 9 commits into
NVIDIA:mainfrom
thorjohnsen:fix/kv-pool-rebalance-multigpu-gate

Conversation

@thorjohnsen

@thorjohnsen thorjohnsen commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

Broadens the KVCacheManagerV2 pool-rebalance hook (added in #14578, single-GPU
aggregated only) toward multi-GPU. This PR lands the first two staged steps:
make the opt-in flag safe under any parallelism, then enable it for attention DP.

The rebalance hook decides (need_adjustment) and derives target pool ratios
from rank-local state only. Any multi-rank parallelism welds the executor
loop together with collectives (TP/CP all-reduce in the forward, attention-DP
tp_allgather of batch sizes, PP send/recv ring), so a rank that pauses to
adjust() while its peers proceed deadlocks NCCL.

Changes

1. Gate the flag off for multi-GPU (safety fix).
_can_pause_for_rebalance previously guarded only pp_size > 1, so setting
enable_kv_pool_rebalance=True under TP/CP/attention-DP passed the gate and hit
the collective deadlock. Added the missing guards + rationale docstring.

2. Coordinate rebalance across attention-DP ranks.
With attention DP each rank owns its own KVCacheManagerV2 pools and request
stream, so the decision and resulting ratios legitimately stay rank-local — only
the pause must be synchronized. The gate now permits attention DP (plain
TP/CP and PP stay gated off; they need a bit-identical pool layout or a
full-pipeline drain that is not wired up yet). _maybe_rebalance_kv_pools takes
a tp_allreduce(MAX) (logical-OR) vote on need_adjustment; if any rank wants
in, all ranks drain → suspend (skipping dummy requests) → adjust() only if
locally needed
→ resume → tp_barrier().

Test coverage

tests/unittest/_torch/executor/test_kv_pool_rebalance.py — 23 unit tests pass,
including the new gate cases and three collective-path tests (vote-yes fires the
full cycle and barrier, vote-no is a no-op, a rank with no local need still
pauses for a peer but does not adjust its own pools).

Note: the unit tests are mocked and verify the call chain / vote logic; they
cannot exercise the real NCCL lockstep. A multi-GPU attention-DP integration
run is still needed before merge.

Not in this PR

Plain TP/CP (ratio broadcast + agreed rollback) and PP (full-pipeline drain)
remain gated off and are tracked as follow-ups under TRTLLM-12714.

Summary by CodeRabbit

  • Bug Fixes

    • Improved KV-pool rebalancing coordination in multi-rank distributed training scenarios with enhanced deadlock prevention mechanisms.
    • Extended eligibility checks for KV-pool rebalancing across tensor-parallel and pipeline-parallel configurations.
  • Tests

    • Expanded test coverage for KV-pool rebalancing across distributed configurations, including attention-DP mode and edge cases.

…ttention-DP)

The KVCacheManagerV2 pool-rebalance hook makes its decision
(need_adjustment) and derives target pool ratios from rank-local state
only. Any multi-rank parallelism welds the executor loop together with
collectives (TP/CP all-reduce in the forward, attention-DP tp_allgather
of batch sizes, PP send/recv ring), so a rank that pauses to adjust()
while its peers proceed deadlocks NCCL.

_can_pause_for_rebalance previously guarded only pp_size > 1, leaving
tp_size > 1, cp_size > 1, and enable_attention_dp able to pass the gate
and hit the deadlock once enable_kv_pool_rebalance was set. Add the
missing early-return guards and expand the docstring with the rationale.

Add unit tests covering the three new gates and extend _make_executor
with tp_size/cp_size/enable_attention_dp params.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
… ranks

With attention data parallelism each rank owns its KVCacheManagerV2 pools
and its own request stream, so the rebalance decision (need_adjustment)
and the resulting per-rank pool ratios are legitimately rank-local. Only
the pause has to be synchronized, because every TP rank is welded to the
others by the per-iteration tp_allgather of batch sizes in scheduling: a
rank that paused while its peers proceeded -- or skipped a pause its peers
took -- would deadlock NCCL.

Allow attention DP through _can_pause_for_rebalance (plain TP/CP and PP
stay gated off, since those need a bit-identical pool layout or a
full-pipeline drain that is not wired up yet). In _maybe_rebalance_kv_pools
take a MAX (logical-OR) vote on need_adjustment over the TP group; if any
rank wants to rebalance, all ranks drain and suspend in lockstep, each
adjusts only its own pools and only when it locally needs to, then all
resume and re-converge at a tp_barrier. Skip dummy requests (e.g. the
attention-DP padding request) in the suspend/resume loop.

Add unit tests for the new gate behavior and the collective vote paths
(vote yes fires the full cycle and barrier, vote no is a no-op, a rank
with no local need still pauses for a peer but does not adjust).

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
@thorjohnsen thorjohnsen requested a review from a team as a code owner June 22, 2026 16:25
@thorjohnsen thorjohnsen requested a review from joyang-nv June 22, 2026 16:25
@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Extends KV pool rebalance eligibility in _can_pause_for_rebalance to block rebalance when cp_size > 1 and when tp_size > 1 without attention-DP enabled. Rewrites _maybe_rebalance_kv_pools to perform TP-wide MAX voting, skip dummy requests during pause/resume, conditionally invoke adjust() per-rank, and insert TP barriers for rank reconvergence. Tests are extended with tp_size, cp_size, enable_attention_dp, and is_dummy configuration knobs to cover all new paths.

Changes

TP-Coordinated KV Pool Rebalance

Layer / File(s) Summary
_can_pause_for_rebalance eligibility expansion and gate tests
tensorrt_llm/_torch/pyexecutor/py_executor.py, tests/unittest/_torch/executor/test_kv_pool_rebalance.py
Added cp_size > 1 (always disable) and tp_size > 1 without enable_attention_dp (disable) guards to _can_pause_for_rebalance. Extended _make_executor to accept and mock tp_size, cp_size, enable_attention_dp; extended _make_request with is_dummy; added gate tests for all new TP/CP/attention-DP combinations.
TP-collective _maybe_rebalance_kv_pools and attention-DP tests
tensorrt_llm/_torch/pyexecutor/py_executor.py, tests/unittest/_torch/executor/test_kv_pool_rebalance.py
Rewrote _maybe_rebalance_kv_pools to broadcast need_adjustment via tp_allreduce(MAX), skip dummy requests during pause/resume, call adjust() only on locally-voting ranks, and insert tp_barrier calls for lockstep synchronization. Added tests for vote-yes full cycle, vote-no no-op (no barrier/CUDA sync), dummy-request exclusion, and peer-voted lockstep pause/resume without local adjust.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding gates for multi-GPU safety and enabling coordination for attention-DP in KV pool rebalancing.
Description check ✅ Passed The description provides a clear summary of the issue, changes, and test coverage. All key sections are well-documented, explaining the safety fix and attention-DP coordination.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3213-3214: The call to `_consume_previous_batch_for_rebalance()`
after torch.cuda.current_stream().synchronize() can cause TP collective deadlock
under attention-DP because ranks with a pending previous_batch will enter
collective operations in `_flush_pending_transfer_responses()` or
`_handle_responses()`, while ranks where previous_batch is None will skip
straight to tp_barrier(), creating a collective-order mismatch. Add a TP-wide
vote (using allreduce) before the `_consume_previous_batch_for_rebalance()` call
to check if any rank has a pending previous_batch, and only proceed with
draining when all ranks collectively agree to participate, ensuring all ranks
execute the same sequence of collective operations regardless of their
individual previous_batch state.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ed7ab62-67a3-45b1-a039-51fda26b1952

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7bf0d and 7371b4e.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/unittest/_torch/executor/test_kv_pool_rebalance.py

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55051 [ run ] triggered by Bot. Commit: d2eb230 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55051 [ run ] completed with state FAILURE. Commit: d2eb230
/LLM/main/L0_MergeRequest_PR pipeline #44042 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55811 [ run ] triggered by Bot. Commit: 0730062 Link to invocation

… divergent previous_batch

_consume_previous_batch_for_rebalance() runs two attention-DP tp_gathers
(in _flush_pending_transfer_responses and _process_previous_batch ->
_handle_responses) but skips them when previous_batch is None. previous_batch
legitimately diverges across attention-DP ranks (a rank whose local batch was
empty clears it), so a draining rank would block in tp_gather while a peer with
nothing to drain races ahead to tp_barrier() -> NCCL deadlock.

Add a tp_allreduce(SUM) vote on previous_batch presence after the
need_adjustment MAX vote; only drain when every rank agrees, otherwise defer the
whole rebalance. need_adjustment is sticky, so nothing is lost. Update the stale
docstring that claimed the gate already excluded multi-rank divergence.

Add a regression test for the divergent-previous_batch deferral and update the
existing attention-DP tests for the second collective.

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55811 [ run ] completed with state FAILURE. Commit: 0730062
/LLM/main/L0_MergeRequest_PR pipeline #44707 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55906 [ run ] triggered by Bot. Commit: 8d1795b Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55906 [ run ] completed with state SUCCESS. Commit: 8d1795b
/LLM/main/L0_MergeRequest_PR pipeline #44795 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56387 [ run ] triggered by Bot. Commit: 7d89756 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56387 [ run ] completed with state SUCCESS. Commit: 7d89756
/LLM/main/L0_MergeRequest_PR pipeline #45230 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56467 [ run ] triggered by Bot. Commit: a95dd48 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56467 [ run ] completed with state FAILURE. Commit: a95dd48
/LLM/main/L0_MergeRequest_PR pipeline #45310 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

CI failure triage (builds 44707 / 44795 / 45230 / 45310)

I went through the failure analyses for the recent /bot run attempts. The failures are pre-existing main-flaky tests and infra issues, not regressions from this PR — and the changed code path is provably dormant in CI.

Why this PR cannot be the cause

  1. The feature is opt-in, default off. Every new line lives behind _can_pause_for_rebalance()_maybe_rebalance_kv_pools(); both call sites are gated by self._is_kv_manager_v2 and self._can_pause_for_rebalance(), whose first statement is if not self.enable_kv_pool_rebalance: return False. No CI test sets enable_kv_pool_rebalance=True, so the new code never executes.
  2. Empirically confirmed dormant — build #44707's analysis notes the failing DeepSeek run logged enable_kv_pool_rebalance=False; the new path was not entered.
  3. The PR's own tests pass — build #45230 shows test_kv_pool_rebalance.py PASSED and test_kv_pool_rebalance_accuracy.py::…test_rebalance_matches_baseline[no_overlap / overlap] PASSED.

What actually failed

Run Failures Category
#44707 TestDeepSeekR1::test_nvfp4_multi_gpus[throughput] (executor worker error), Ray update-weights part3, DGX_B200 8-GPU Slurm FAILED flaky/infra (feature off)
#44795 Sampling failed (CUDA OOM), TestDeepSeekR1::test_nvfp4_multi_gpus[throughput], visual-gen 8-GPU, single-GPU ABORTED OOM/infra
#45230 FileNotFoundError: /root/.triton/cache across many unrelated tests; Slurm cleanup infra
#45310 TestDeepSeekV32Exp::test_auto_dtype_with_helix[pp1tp2cp2 / pp1tp1cp4] (accuracy), MNNVL NVFP4 regex, GB200 perf-sanity NCCL/OOM, test_disagg_server_restart timeout, Ray update-weights flaky/base-regression/infra

These are the same failures currently hitting other open PRs — DeepSeekV32Exp helix accuracy, the MNNVL test_mnnvl_nvfp4_rejects_fp32_before_launch regex mismatch (a base-branch error-message change, not a file this PR touches), GB200 perf-sanity OOM/NCCL, Triton-cache FileNotFoundError, Ray orchestrator, visual-gen. Several map to open nvbugs on main (helix → 6322076 / 6120535; Ray inner cases skipped under 6372690). None involve enable_kv_pool_rebalance, _maybe_rebalance_kv_pools, or the rebalance functions in py_executor.py.

Re-triggering CI to get a clean run. (Separately, the multi-GPU attention-DP integration validation called out in the PR description is still the real pre-merge item — unrelated to these failures.)

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56987 [ run ] triggered by Bot. Commit: 5160ae3 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56987 [ run ] completed with state SUCCESS. Commit: 5160ae3
/LLM/main/L0_MergeRequest_PR pipeline #45790 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57205 [ run ] triggered by Bot. Commit: 5160ae3 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57205 [ run ] completed with state SUCCESS. Commit: 5160ae3
/LLM/main/L0_MergeRequest_PR pipeline #45981 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@thorjohnsen

Copy link
Copy Markdown
Collaborator Author

/bot run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants