[TRTLLM-12714][feat] KV pool rebalance: gate for multi-GPU and coordinate attention-DP#15518
[TRTLLM-12714][feat] KV pool rebalance: gate for multi-GPU and coordinate attention-DP#15518thorjohnsen wants to merge 9 commits into
Conversation
…ttention-DP) The KVCacheManagerV2 pool-rebalance hook makes its decision (need_adjustment) and derives target pool ratios from rank-local state only. Any multi-rank parallelism welds the executor loop together with collectives (TP/CP all-reduce in the forward, attention-DP tp_allgather of batch sizes, PP send/recv ring), so a rank that pauses to adjust() while its peers proceed deadlocks NCCL. _can_pause_for_rebalance previously guarded only pp_size > 1, leaving tp_size > 1, cp_size > 1, and enable_attention_dp able to pass the gate and hit the deadlock once enable_kv_pool_rebalance was set. Add the missing early-return guards and expand the docstring with the rationale. Add unit tests covering the three new gates and extend _make_executor with tp_size/cp_size/enable_attention_dp params. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
… ranks With attention data parallelism each rank owns its KVCacheManagerV2 pools and its own request stream, so the rebalance decision (need_adjustment) and the resulting per-rank pool ratios are legitimately rank-local. Only the pause has to be synchronized, because every TP rank is welded to the others by the per-iteration tp_allgather of batch sizes in scheduling: a rank that paused while its peers proceeded -- or skipped a pause its peers took -- would deadlock NCCL. Allow attention DP through _can_pause_for_rebalance (plain TP/CP and PP stay gated off, since those need a bit-identical pool layout or a full-pipeline drain that is not wired up yet). In _maybe_rebalance_kv_pools take a MAX (logical-OR) vote on need_adjustment over the TP group; if any rank wants to rebalance, all ranks drain and suspend in lockstep, each adjusts only its own pools and only when it locally needs to, then all resume and re-converge at a tp_barrier. Skip dummy requests (e.g. the attention-DP padding request) in the suspend/resume loop. Add unit tests for the new gate behavior and the collective vote paths (vote yes fires the full cycle and barrier, vote no is a no-op, a rank with no local need still pauses for a peer but does not adjust). Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughExtends KV pool rebalance eligibility in ChangesTP-Coordinated KV Pool Rebalance
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 3213-3214: The call to `_consume_previous_batch_for_rebalance()`
after torch.cuda.current_stream().synchronize() can cause TP collective deadlock
under attention-DP because ranks with a pending previous_batch will enter
collective operations in `_flush_pending_transfer_responses()` or
`_handle_responses()`, while ranks where previous_batch is None will skip
straight to tp_barrier(), creating a collective-order mismatch. Add a TP-wide
vote (using allreduce) before the `_consume_previous_batch_for_rebalance()` call
to check if any rank has a pending previous_batch, and only proceed with
draining when all ranks collectively agree to participate, ensuring all ranks
execute the same sequence of collective operations regardless of their
individual previous_batch state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7ed7ab62-67a3-45b1-a039-51fda26b1952
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/py_executor.pytests/unittest/_torch/executor/test_kv_pool_rebalance.py
|
PR_Github #55051 [ run ] triggered by Bot. Commit: |
|
PR_Github #55051 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55811 [ run ] triggered by Bot. Commit: |
… divergent previous_batch _consume_previous_batch_for_rebalance() runs two attention-DP tp_gathers (in _flush_pending_transfer_responses and _process_previous_batch -> _handle_responses) but skips them when previous_batch is None. previous_batch legitimately diverges across attention-DP ranks (a rank whose local batch was empty clears it), so a draining rank would block in tp_gather while a peer with nothing to drain races ahead to tp_barrier() -> NCCL deadlock. Add a tp_allreduce(SUM) vote on previous_batch presence after the need_adjustment MAX vote; only drain when every rank agrees, otherwise defer the whole rebalance. need_adjustment is sticky, so nothing is lost. Update the stale docstring that claimed the gate already excluded multi-rank divergence. Add a regression test for the divergent-previous_batch deferral and update the existing attention-DP tests for the second collective. Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
|
PR_Github #55811 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55906 [ run ] triggered by Bot. Commit: |
|
PR_Github #55906 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #56387 [ run ] triggered by Bot. Commit: |
|
PR_Github #56387 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #56467 [ run ] triggered by Bot. Commit: |
|
PR_Github #56467 [ run ] completed with state
|
CI failure triage (builds 44707 / 44795 / 45230 / 45310)I went through the failure analyses for the recent Why this PR cannot be the cause
What actually failed
These are the same failures currently hitting other open PRs — DeepSeekV32Exp helix accuracy, the MNNVL Re-triggering CI to get a clean run. (Separately, the multi-GPU attention-DP integration validation called out in the PR description is still the real pre-merge item — unrelated to these failures.) |
|
/bot run --disable-fail-fast |
|
PR_Github #56987 [ run ] triggered by Bot. Commit: |
|
PR_Github #56987 [ run ] completed with state
|
|
/bot run |
|
PR_Github #57205 [ run ] triggered by Bot. Commit: |
|
PR_Github #57205 [ run ] completed with state
|
|
/bot run |
Summary
Broadens the
KVCacheManagerV2pool-rebalance hook (added in #14578, single-GPUaggregated only) toward multi-GPU. This PR lands the first two staged steps:
make the opt-in flag safe under any parallelism, then enable it for attention DP.
The rebalance hook decides (
need_adjustment) and derives target pool ratiosfrom rank-local state only. Any multi-rank parallelism welds the executor
loop together with collectives (TP/CP all-reduce in the forward, attention-DP
tp_allgatherof batch sizes, PP send/recv ring), so a rank that pauses toadjust()while its peers proceed deadlocks NCCL.Changes
1. Gate the flag off for multi-GPU (safety fix).
_can_pause_for_rebalancepreviously guarded onlypp_size > 1, so settingenable_kv_pool_rebalance=Trueunder TP/CP/attention-DP passed the gate and hitthe collective deadlock. Added the missing guards + rationale docstring.
2. Coordinate rebalance across attention-DP ranks.
With attention DP each rank owns its own
KVCacheManagerV2pools and requeststream, so the decision and resulting ratios legitimately stay rank-local — only
the pause must be synchronized. The gate now permits attention DP (plain
TP/CP and PP stay gated off; they need a bit-identical pool layout or a
full-pipeline drain that is not wired up yet).
_maybe_rebalance_kv_poolstakesa
tp_allreduce(MAX)(logical-OR) vote onneed_adjustment; if any rank wantsin, all ranks drain → suspend (skipping dummy requests) →
adjust()only iflocally needed → resume →
tp_barrier().Test coverage
tests/unittest/_torch/executor/test_kv_pool_rebalance.py— 23 unit tests pass,including the new gate cases and three collective-path tests (vote-yes fires the
full cycle and barrier, vote-no is a no-op, a rank with no local need still
pauses for a peer but does not adjust its own pools).
Not in this PR
Plain TP/CP (ratio broadcast + agreed rollback) and PP (full-pipeline drain)
remain gated off and are tracked as follow-ups under TRTLLM-12714.
Summary by CodeRabbit
Bug Fixes
Tests