[https://nvbugs/6179661][fix] Fix disagg generation-side KV transfer timeout deadlocks and teardown crashes#15363
Conversation
📝 WalkthroughWalkthroughReplaces unbounded, timeout-driven generation KV transfer completion in ChangesDisaggregated KV Transfer Timeout & Polling Rework
UCX Environment Cleanup Fix
Sequence Diagram(s)sequenceDiagram
participant Exec as PyExecutor
participant Cancel as _cancel_timed_out_gen_transfers
participant TP as TP Collective
participant Transceiver as kv_cache_transceiver (C++)
participant Poll as checkGenTransferStatus
rect rgba(100, 149, 237, 0.5)
note over Exec,Poll: Per-iteration disagg gen transfer check
Exec->>Poll: checkGenTransferStatus()
Poll->>Poll: allgatherIntValue (rank-align ready count)
loop up to poll budget
Poll->>Poll: check futures (wait_for 0ms)
Poll->>Poll: sleep if not ready
end
Poll-->>Exec: completed requests erased from mRequesterFutures
end
rect rgba(255, 140, 0, 0.5)
note over Exec,Cancel: Centralized timeout cancellation (new)
Exec->>Cancel: _cancel_timed_out_gen_transfers()
Cancel->>TP: allreduce timed-out request IDs
TP-->>Cancel: aligned set across TP ranks
Cancel->>Transceiver: cancel_request(request)
Cancel->>Cancel: mark DISAGG_TRANS_ERROR
Cancel->>Exec: _handle_errors(charge_budget=False)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)
707-708: ⚡ Quick winUse TensorRT-LLM constant naming for the new poll constants.
The new constants should use camelCase with a
kprefix, e.g.kMaxPollIterationsandkPollInterval. As per coding guidelines, “Constants should follow naming convention: camelCase with prefix 'k'.”♻️ Proposed rename
- static constexpr int kMAX_POLL_ITERATIONS = 32; - static constexpr auto kPOLL_INTERVAL = std::chrono::milliseconds(2); + static constexpr int kMaxPollIterations = 32; + static constexpr auto kPollInterval = std::chrono::milliseconds(2);Also update the later references in this function accordingly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` around lines 707 - 708, The constants kMAX_POLL_ITERATIONS and kPOLL_INTERVAL do not follow the TensorRT-LLM naming convention. Rename kMAX_POLL_ITERATIONS to kMaxPollIterations and kPOLL_INTERVAL to kPollInterval to use camelCase with the 'k' prefix as per the coding guidelines. Then update all references to these constants throughout the function to use the new names.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 714-718: The checkGenTransferStatus() method in
cacheTransceiver.cpp with the default atLeastRequestNum parameter now only polls
for a limited time before returning, while transfers may still be in flight.
This breaks the destructor's assumption in trtGptModelInflightBatching.cpp that
checkGenTransferComplete() will be true after calling checkGenTransferStatus().
Either update the default behavior of checkGenTransferStatus() to fully complete
all transfers when atLeastRequestNum is nullopt (restoring backward
compatibility), or ensure all callers explicitly loop and reinvoke
checkGenTransferStatus() until checkGenTransferComplete() returns true instead
of relying on a single call with the default. Additionally, rename the constants
kMAX_POLL_ITERATIONS and kPOLL_INTERVAL at lines 707–708 to use camelCase naming
convention (kMaxPollIterations and kPollInterval) to match coding guidelines.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 4117-4120: The `_check_cache_transfer_errors()` method currently
calls `_handle_errors()` only when the local rank has errors, which can cause
rank inconsistency when transfer errors appear on a subset of ADP ranks. Both
the location at lines 4117-4120 and the similar location at lines 4442-4447 need
to be modified to ensure error handling occurs after TP consensus across all
ranks, consistent with how the timeout path handles errors. Restructure the
error handling in both locations so that `_handle_errors()` is called after TP
consensus has completed, rather than conditionally only for ranks with local
errors, ensuring all ranks remain synchronized when entering
`_enqueue_responses()` TP gather.
- Line 4046: The method _cancel_timed_out_gen_transfers() at line 4046 depends
on timeout flags being set by _check_kv_transfer_timeout(), but this check is
not being called unconditionally before the cancellation method. In pipeline
parallel generation-only transfer stalls, the py_kv_transfer_timed_out flag may
never be set, causing the cancellation logic to remain idle. Add a call to
_check_kv_transfer_timeout() immediately before the
_cancel_timed_out_gen_transfers() call to ensure timeout flags are refreshed and
the cancellation path can execute properly.
- Line 4051: Add a return type annotation of `-> None` to the method signature
of `_cancel_timed_out_gen_transfers` at line 4051, since the method has multiple
early returns and does not return a value. Additionally, locate the `zip()` call
at line 4106 that combines `global_ok` and `global_timed_out` (which are
explicitly constructed to have matching lengths) and add the `strict=True`
parameter to it to satisfy the Ruff B905 check and enforce that both iterables
have the same length during iteration.
---
Nitpick comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 707-708: The constants kMAX_POLL_ITERATIONS and kPOLL_INTERVAL do
not follow the TensorRT-LLM naming convention. Rename kMAX_POLL_ITERATIONS to
kMaxPollIterations and kPOLL_INTERVAL to kPollInterval to use camelCase with the
'k' prefix as per the coding guidelines. Then update all references to these
constants throughout the function to use the new names.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5d852be3-a13a-4c59-9480-0e4a898c264b
📒 Files selected for processing (6)
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.hcpp/tensorrt_llm/batch_manager/cacheTransceiver.cppcpp/tensorrt_llm/batch_manager/dataTransceiver.cppjenkins/scripts/perf/local/submit.pytensorrt_llm/_torch/pyexecutor/llm_request.pytensorrt_llm/_torch/pyexecutor/py_executor.py
… deadlocks and teardown crashes Signed-off-by: Tingfeng Xian <289617005+nv-xtf@users.noreply.github.com>
c94ad5a to
68ff9d8
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #54449 [ run ] triggered by Bot. Commit: |
|
PR_Github #54449 [ run ] completed with state
|
|
/bot rerun |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run See details below for each supported subcommand. Details
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request. skip
Skip testing for latest commit on pull request. reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
/bot run --disable-fail-fast --reuse-test |
|
PR_Github #54476 [ run ] triggered by Bot. Commit: |
|
PR_Github #54476 [ run ] completed with state
|
chienchunhung
left a comment
There was a problem hiding this comment.
Following up from the related #15356 thread: thanks for putting this together, and for calling out the ADP timeout/cancellation issue clearly there.
After reviewing this PR against the current cancellation PR chain, I plan to split the work across multiple PRs for better scoping and deduced risks:
- #15422 carries the teardown/lifetime hardening from this PR. That keeps shutdown safety independent and lets it land before #15238 if possible.
- #15238 should own the generation-side timeout/cancellation semantics, including rank-consistent agreement on the timeout/cancellable request set before
_handle_errors()/_enqueue_responses()can entertp_gather. - #15356 -> #15238 remains the bounded-polling / request-cancellation chain. #15356 itself should stay focused on V2/Python transceiver bounded context polling, not cancellation semantics.
tl;dr: the ADP tp_gather participant mismatch is a real gap, intentionally not addressed in #15356, and should be handled in #15238 through rank-consistent cancellation. The teardown hardening is independent from the dependency graph and is split into #15422. I’ll port/reimplement the relevant cancellation-consensus logic from this PR into #15238.
Summary by CodeRabbit
Bug Fixes
Chores
Description
In disaggregated serving, when a generation-side KV cache transfer does not complete in time (e.g. under heavy transport load), the PyTorch executor/cache-transceiver path could deadlock or crash the engine. This PR fixes the following issues:
Unbounded blocking in
checkGenTransferStatus(C++). To satisfyatLeastRequestNum, the generation-side status check force-selected not-yet-ready requests and then blocked infuture::get()with no bound. A transfer that never completes wedged that rank insideget()while sibling ranks advanced to the next collective, deadlocking the whole sync group. It is now a bounded poll: only futures ready on every rank of the sync group are completed, the wait foratLeastRequestNumis capped, and incomplete transfers are left in flight for later checks. All loop-exit decisions are derived from rank-identical data so every rank runs the same number of collectives.Generation-side transfer timeouts were not acted upon consistently (Python). The timeout flag was set but its only consumer was unreachable behind the blocking call above, and it was rank-local.
_cancel_timed_out_gen_transfersnow runs every iteration on every rank and reaches a TP-wide consensus before changing request state: an OR-gate allreduce to skip when nothing timed out, an allgather union of the timed-out ids (so wall-clock skew between ranks does not diverge the set), and a second allgather on the per-rank cancel result so a request is only failed once every rank tracking it has cancelled its share.Collective gated on rank-local error state (Python).
_check_cache_transfer_errorsonly entered_handle_errors(→ _enqueue_responses → tp_gather) when the local rank had errored requests. Under attention DP a cancel succeeds on a subset of ranks, so some ranks entered the gather while others skipped it, deadlocking the TP group. The generation-side path now flushes errors on every rank unconditionally.Use-after-free on teardown (C++).
~CacheTransceiverrelied on member destruction order, butConnectionManageris declared after — and thus destroyed before —CacheSender/CacheReceiver, whose background threads still dereferenced it, causing a segfault during shutdown. The destructor now resets the sender and receiver first, while the manager is still alive.Two shutdown-path crashes in
CacheSender::response()(C++). (a)recvRequestInforeturning a null connection was only handled when the manager had already stopped; otherwise an emptyRequestInfoflowed into processing and threwbad_optional_access. (b) Terminating while waiting in the inner loop left the iterator atend(), which was then dereferenced insendResponse. Both now bail out cleanly when terminating.A bounded-poll metric (waited calls / total poll iterations / budget-exhaustion count) is logged on budget exhaustion to keep the throughput impact of the non-blocking rewrite observable.
Test Coverage
Validated with the disaggregated perf-sanity case
perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb0_mtp0_ccb-NIXL]on GB200 (gen TP16 / attention-DP, ctx TP4, NIXL)PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.