[https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup#14633
[https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup#14633liji-nv wants to merge 4 commits into
Conversation
📝 WalkthroughWalkthroughThis PR improves error resilience in disaggregated KV-cache transfer by preventing error-state overwrites in the C++ transceiver, introducing explicit pre-assigned buffer lifecycle management, and enhancing the Python executor to detect terminal transfer conditions and emit granular per-request error responses rather than failing all pending work at once. ChangesDisaggregated Transfer Error Handling and Buffer Management
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
1087-1112:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftException safety gap: pre-assigned buffers may leak if
receiveReadySignalorreceiveSyncthrows.If an exception is thrown at line 1096 (
receiveReadySignal) or line 1105 (receiveSync), theTransferSessiondestructor does not callreleasePreAssignedRecvBuffers(), leaving the pre-assigned buffer indices unreturned to their managers. Over time, this could exhaust the buffer pool.Consider either:
- Adding
releasePreAssignedRecvBuffers()to theTransferSessiondestructor for any non-emptymPreAssignedRecvBuffers, or- Wrapping the receive logic in a try-catch that releases on exception.
Option 1: RAII cleanup in destructor (preferred)
In
dataTransceiver.h, add a destructor toTransferSession:~TransferSession() { releasePreAssignedRecvBuffers(); }And change
clearPreAssignedRecvBufferssemantics if needed, or leave it as-is since calling release on an already-cleared vector is a no-op.Option 2: Try-catch in requestSync
void requestSync(LlmRequest& llmRequest) { TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(), "Start calling requestSync for request ID: %zu, context request ID: %zu.", llmRequest.mRequestId, llmRequest.getContextPhaseParams().value().getReqId()); llmRequest.setKvCacheTransferStart(std::chrono::steady_clock::now()); TLLM_CUDA_CHECK(cudaSetDevice(mDeviceId)); auto session = sendRequestInfo(llmRequest); session.setTime(TransferSession::kTimeRequestInfo); + try + { bool isReady = receiveReadySignal(session); if (!isReady) { // Reuse the error state for the cancelled request. llmRequest.setState(LlmRequestState::kDISAGG_TRANS_ERROR); llmRequest.setKvCacheTransferEnd(std::chrono::steady_clock::now()); session.releasePreAssignedRecvBuffers(); return; } receiveSync(session); session.clearPreAssignedRecvBuffers(); + } + catch (...) + { + session.releasePreAssignedRecvBuffers(); + throw; + } llmRequest.setKvCacheTransferEnd(std::chrono::steady_clock::now()); ... }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` around lines 1087 - 1112, The code in requestSync may leak pre-assigned buffers if receiveReadySignal or receiveSync throw because TransferSession::releasePreAssignedRecvBuffers is only called on the normal path; fix by making buffer release RAII: add a TransferSession destructor that calls releasePreAssignedRecvBuffers() (ensuring it's safe to call when the vector is empty) so any exception during receiveReadySignal or receiveSync returns buffers, or alternatively wrap the receiveReadySignal/receiveSync calls in a try-catch inside requestSync that calls session.releasePreAssignedRecvBuffers() before rethrowing; update clearPreAssignedRecvBuffers semantics if needed to remain idempotent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`:
- Around line 1087-1112: The code in requestSync may leak pre-assigned buffers
if receiveReadySignal or receiveSync throw because
TransferSession::releasePreAssignedRecvBuffers is only called on the normal
path; fix by making buffer release RAII: add a TransferSession destructor that
calls releasePreAssignedRecvBuffers() (ensuring it's safe to call when the
vector is empty) so any exception during receiveReadySignal or receiveSync
returns buffers, or alternatively wrap the receiveReadySignal/receiveSync calls
in a try-catch inside requestSync that calls
session.releasePreAssignedRecvBuffers() before rethrowing; update
clearPreAssignedRecvBuffers semantics if needed to remain idempotent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4851f820-b48d-496d-86e4-e9009fcb2c04
📒 Files selected for processing (4)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cppcpp/tensorrt_llm/batch_manager/dataTransceiver.cppcpp/tensorrt_llm/batch_manager/dataTransceiver.htensorrt_llm/_torch/pyexecutor/py_executor.py
cd5fb0a to
f4f4078
Compare
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #50679 [ run ] triggered by Bot. Commit: |
|
PR_Github #50679 [ run ] completed with state
|
Handle disaggregated KV cache transfer failures as terminal request states instead of allowing them to leave stale transfer state behind. Generation transfer completion now preserves DISAGG_TRANS_ERROR instead of overwriting it with DISAGG_GENERATION_TRANS_COMPLETE. Successful request cancellation also removes the corresponding raw request pointer from the sender/requester future lists, preventing later status polling from dereferencing a request that has already been terminated. The receiver path now tracks pre-assigned receive buffers and releases them when a ready signal reports that the transfer cannot proceed. This keeps cancelled or failed transfers from leaking receive buffer ownership into later requests. On the Python executor side, benchmark disaggregated fill gating now reaps terminal transfer requests before continuing to wait for the batch to become forwardable. Generation-side transfer errors are deferred into request-level error responses, so a single failed transfer can terminate that request cleanly without raising a process-wide error or hanging the benchmark. Validated with the 6217533 benchmark configuration: sbatch build_sqsh.sh completed successfully; python3 examples/disaggregated/slurm/benchmark/submit.py -c 6217533.yaml completed 2048 / 2048 requests with 0 failed requests; no free(), invalid pointer, UCX ERROR, NIXL crash, or Python traceback was observed in the clean benchmark run. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
bbffd2a to
524f102
Compare
|
/bot run --disable-fail-fast --add-multi-gpu-test |
1 similar comment
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #51023 [ run ] triggered by Bot. Commit: |
|
PR_Github #51023 [ run ] completed with state
|
|
Related fix: #14937 |
Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #55240 [ run ] triggered by Bot. Commit: |
|
PR_Github #55240 [ run ] completed with state
|
Handle disaggregated KV cache transfer failures as terminal request states instead of allowing them to leave stale transfer state behind.
Generation transfer completion now preserves DISAGG_TRANS_ERROR instead of overwriting it with DISAGG_GENERATION_TRANS_COMPLETE. Successful request cancellation also removes the corresponding raw request pointer from the sender/requester future lists, preventing later status polling from dereferencing a request that has already been terminated.
The receiver path now tracks pre-assigned receive buffers and releases them when a ready signal reports that the transfer cannot proceed. This keeps cancelled or failed transfers from leaking receive buffer ownership into later requests.
On the Python executor side, benchmark disaggregated fill gating now reaps terminal transfer requests before continuing to wait for the batch to become forwardable. Generation-side transfer errors are deferred into request-level error responses, so a single failed transfer can terminate that request cleanly without raising a process-wide error or hanging the benchmark.
Validated with the 6217533 benchmark configuration: sbatch build_sqsh.sh completed successfully; python3 examples/disaggregated/slurm/benchmark/submit.py -c 6217533.yaml completed 2048 / 2048 requests with 0 failed requests; no free(), invalid pointer, UCX ERROR, NIXL crash, or Python traceback was observed in the clean benchmark run.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.