Skip to content

[https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup#14633

Open
liji-nv wants to merge 4 commits into
NVIDIA:mainfrom
liji-nv:disagg-kv-transfer-timeout-cleanup
Open

[https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup#14633
liji-nv wants to merge 4 commits into
NVIDIA:mainfrom
liji-nv:disagg-kv-transfer-timeout-cleanup

Conversation

@liji-nv

@liji-nv liji-nv commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Handle disaggregated KV cache transfer failures as terminal request states instead of allowing them to leave stale transfer state behind.

Generation transfer completion now preserves DISAGG_TRANS_ERROR instead of overwriting it with DISAGG_GENERATION_TRANS_COMPLETE. Successful request cancellation also removes the corresponding raw request pointer from the sender/requester future lists, preventing later status polling from dereferencing a request that has already been terminated.

The receiver path now tracks pre-assigned receive buffers and releases them when a ready signal reports that the transfer cannot proceed. This keeps cancelled or failed transfers from leaking receive buffer ownership into later requests.

On the Python executor side, benchmark disaggregated fill gating now reaps terminal transfer requests before continuing to wait for the batch to become forwardable. Generation-side transfer errors are deferred into request-level error responses, so a single failed transfer can terminate that request cleanly without raising a process-wide error or hanging the benchmark.

Validated with the 6217533 benchmark configuration: sbatch build_sqsh.sh completed successfully; python3 examples/disaggregated/slurm/benchmark/submit.py -c 6217533.yaml completed 2048 / 2048 requests with 0 failed requests; no free(), invalid pointer, UCX ERROR, NIXL crash, or Python traceback was observed in the clean benchmark run.

Summary by CodeRabbit

  • Bug Fixes
    • Improved error state handling in disaggregated execution to prevent overwriting critical error information.
    • Enhanced timeout logging with more detailed timing information for better debugging of transfer issues.
    • Fixed response collection to handle edge cases and malformed payloads more robustly.
    • Improved cleanup of transfer requests to prevent resource leaks during cancellation.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liji-nv liji-nv requested review from a team as code owners May 27, 2026 14:10
@liji-nv liji-nv requested review from bo-nv and schetlur-nv May 27, 2026 14:10
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR improves error resilience in disaggregated KV-cache transfer by preventing error-state overwrites in the C++ transceiver, introducing explicit pre-assigned buffer lifecycle management, and enhancing the Python executor to detect terminal transfer conditions and emit granular per-request error responses rather than failing all pending work at once.

Changes

Disaggregated Transfer Error Handling and Buffer Management

Layer / File(s) Summary
PreAssignedRecvBuffer contract and TransferSession API
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
Adds PreAssignedRecvBuffer struct (containing manager pointer and optional buffer id) and three public TransferSession methods: setPreAssignedRecvBuffers, releasePreAssignedRecvBuffers, and clearPreAssignedRecvBuffers. Includes <optional> and updates copyright year.
TransferSession buffer management implementation
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h
Implements buffer lifecycle methods: setPreAssignedRecvBuffers stores assignments, releasePreAssignedRecvBuffers frees indices via manager pointers and clears the list, and clearPreAssignedRecvBuffers clears without freeing. Adds private mPreAssignedRecvBuffers member.
CacheReceiver buffer integration
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
sendRequestInfo constructs PreAssignedRecvBuffer entries with manager pointers and optional ids when agent connections exist. requestSync calls setPreAssignedRecvBuffers at session creation and releases/clears buffers on error and success paths.
CacheTransceiver error-state safeguards and cleanup
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
requestAndReceiveSync and checkGenTransferStatus now guard against overwriting kDISAGG_TRANS_ERROR with completion transitions. cancelRequest removes successfully cancelled requests from future-tracking vectors. Timeout logging expanded to include request id and milliseconds.
Python executor deferred error mode
tensorrt_llm/_torch/pyexecutor/py_executor.py
_check_cache_transfer_errors gains defer_response parameter; when enabled, attaches error messages to request objects instead of immediately failing them. Generation-side disagg cache-transfer status calls this with defer_response=True.
Python executor terminal condition detection and response handling
tensorrt_llm/_torch/pyexecutor/py_executor.py
Benchmark disagg forward gate detects terminal conditions (error state or KV timeout) and calls _handle_responses to emit failures. _enqueue_responses normalizes payloads across nested structures. _handle_responses converts DISAGG_TRANS_ERROR and KV-timeout requests into per-request error responses.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.90% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly identifies the fix as addressing disagg KV transfer timeout cleanup with the NVBugs ID 6217533, directly matching the main change objective.
Description check ✅ Passed The PR description explains the issue and solution clearly, with detailed validation on benchmark config 6217533, though some template sections lack complete fill-out.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

1087-1112: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Exception safety gap: pre-assigned buffers may leak if receiveReadySignal or receiveSync throws.

If an exception is thrown at line 1096 (receiveReadySignal) or line 1105 (receiveSync), the TransferSession destructor does not call releasePreAssignedRecvBuffers(), leaving the pre-assigned buffer indices unreturned to their managers. Over time, this could exhaust the buffer pool.

Consider either:

  1. Adding releasePreAssignedRecvBuffers() to the TransferSession destructor for any non-empty mPreAssignedRecvBuffers, or
  2. Wrapping the receive logic in a try-catch that releases on exception.
Option 1: RAII cleanup in destructor (preferred)

In dataTransceiver.h, add a destructor to TransferSession:

~TransferSession()
{
    releasePreAssignedRecvBuffers();
}

And change clearPreAssignedRecvBuffers semantics if needed, or leave it as-is since calling release on an already-cleared vector is a no-op.

Option 2: Try-catch in requestSync
 void requestSync(LlmRequest& llmRequest)
 {
     TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
         "Start calling requestSync for request ID: %zu, context request ID: %zu.", llmRequest.mRequestId,
         llmRequest.getContextPhaseParams().value().getReqId());
     llmRequest.setKvCacheTransferStart(std::chrono::steady_clock::now());
     TLLM_CUDA_CHECK(cudaSetDevice(mDeviceId));
     auto session = sendRequestInfo(llmRequest);
     session.setTime(TransferSession::kTimeRequestInfo);
+    try
+    {
         bool isReady = receiveReadySignal(session);
         if (!isReady)
         {
             // Reuse the error state for the cancelled request.
             llmRequest.setState(LlmRequestState::kDISAGG_TRANS_ERROR);
             llmRequest.setKvCacheTransferEnd(std::chrono::steady_clock::now());
             session.releasePreAssignedRecvBuffers();
             return;
         }
         receiveSync(session);
         session.clearPreAssignedRecvBuffers();
+    }
+    catch (...)
+    {
+        session.releasePreAssignedRecvBuffers();
+        throw;
+    }
     llmRequest.setKvCacheTransferEnd(std::chrono::steady_clock::now());
     ...
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` around lines 1087 - 1112,
The code in requestSync may leak pre-assigned buffers if receiveReadySignal or
receiveSync throw because TransferSession::releasePreAssignedRecvBuffers is only
called on the normal path; fix by making buffer release RAII: add a
TransferSession destructor that calls releasePreAssignedRecvBuffers() (ensuring
it's safe to call when the vector is empty) so any exception during
receiveReadySignal or receiveSync returns buffers, or alternatively wrap the
receiveReadySignal/receiveSync calls in a try-catch inside requestSync that
calls session.releasePreAssignedRecvBuffers() before rethrowing; update
clearPreAssignedRecvBuffers semantics if needed to remain idempotent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`:
- Around line 1087-1112: The code in requestSync may leak pre-assigned buffers
if receiveReadySignal or receiveSync throw because
TransferSession::releasePreAssignedRecvBuffers is only called on the normal
path; fix by making buffer release RAII: add a TransferSession destructor that
calls releasePreAssignedRecvBuffers() (ensuring it's safe to call when the
vector is empty) so any exception during receiveReadySignal or receiveSync
returns buffers, or alternatively wrap the receiveReadySignal/receiveSync calls
in a try-catch inside requestSync that calls
session.releasePreAssignedRecvBuffers() before rethrowing; update
clearPreAssignedRecvBuffers semantics if needed to remain idempotent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4851f820-b48d-496d-86e4-e9009fcb2c04

📥 Commits

Reviewing files that changed from the base of the PR and between 09f5958 and 73ab442.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

@liji-nv liji-nv changed the title • [https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup [https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup May 28, 2026
@liji-nv liji-nv force-pushed the disagg-kv-transfer-timeout-cleanup branch from cd5fb0a to f4f4078 Compare May 28, 2026 03:07
@liji-nv

liji-nv commented May 28, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50679 [ run ] triggered by Bot. Commit: f4f4078 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #50679 [ run ] completed with state SUCCESS. Commit: f4f4078
/LLM/main/L0_MergeRequest_PR pipeline #40167 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

liji-nv added 3 commits May 28, 2026 23:55
Handle disaggregated KV cache transfer failures as terminal request states instead of allowing them to leave stale transfer state behind.

Generation transfer completion now preserves DISAGG_TRANS_ERROR instead of overwriting it with DISAGG_GENERATION_TRANS_COMPLETE. Successful request cancellation also removes the corresponding raw request pointer from the sender/requester future lists, preventing later status polling from dereferencing a request that has already been terminated.

The receiver path now tracks pre-assigned receive buffers and releases them when a ready signal reports that the transfer cannot proceed. This keeps cancelled or failed transfers from leaking receive buffer ownership into later requests.

On the Python executor side, benchmark disaggregated fill gating now reaps terminal transfer requests before continuing to wait for the batch to become forwardable. Generation-side transfer errors are deferred into request-level error responses, so a single failed transfer can terminate that request cleanly without raising a process-wide error or hanging the benchmark.

Validated with the 6217533 benchmark configuration: sbatch build_sqsh.sh completed successfully; python3 examples/disaggregated/slurm/benchmark/submit.py -c 6217533.yaml completed 2048 / 2048 requests with 0 failed requests; no free(), invalid pointer, UCX ERROR, NIXL crash, or Python traceback was observed in the clean benchmark run.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the disagg-kv-transfer-timeout-cleanup branch from bbffd2a to 524f102 Compare May 29, 2026 06:55
@liji-nv

liji-nv commented May 29, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

1 similar comment
@liji-nv

liji-nv commented May 29, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #51023 [ run ] triggered by Bot. Commit: 524f102 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #51023 [ run ] completed with state SUCCESS. Commit: 524f102
/LLM/main/L0_MergeRequest_PR pipeline #40471 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Shixiaowei02

Copy link
Copy Markdown
Collaborator

Related fix: #14937

Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55240 [ run ] triggered by Bot. Commit: a5cd004 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55240 [ run ] completed with state SUCCESS. Commit: a5cd004
/LLM/main/L0_MergeRequest_PR pipeline #44198 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants