[https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup by liji-nv · Pull Request #14633 · NVIDIA/TensorRT-LLM

liji-nv · 2026-05-27T14:10:40Z

Handle disaggregated KV cache transfer failures as terminal request states instead of allowing them to leave stale transfer state behind.

Generation transfer completion now preserves DISAGG_TRANS_ERROR instead of overwriting it with DISAGG_GENERATION_TRANS_COMPLETE. Successful request cancellation also removes the corresponding raw request pointer from the sender/requester future lists, preventing later status polling from dereferencing a request that has already been terminated.

The receiver path now tracks pre-assigned receive buffers and releases them when a ready signal reports that the transfer cannot proceed. This keeps cancelled or failed transfers from leaking receive buffer ownership into later requests.

On the Python executor side, benchmark disaggregated fill gating now reaps terminal transfer requests before continuing to wait for the batch to become forwardable. Generation-side transfer errors are deferred into request-level error responses, so a single failed transfer can terminate that request cleanly without raising a process-wide error or hanging the benchmark.

Validated with the 6217533 benchmark configuration: sbatch build_sqsh.sh completed successfully; python3 examples/disaggregated/slurm/benchmark/submit.py -c 6217533.yaml completed 2048 / 2048 requests with 0 failed requests; no free(), invalid pointer, UCX ERROR, NIXL crash, or Python traceback was observed in the clean benchmark run.

Summary by CodeRabbit

Bug Fixes
- Improved error state handling in disaggregated execution to prevent overwriting critical error information.
- Enhanced timeout logging with more detailed timing information for better debugging of transfer issues.
- Fixed response collection to handle edge cases and malformed payloads more robustly.
- Improved cleanup of transfer requests to prevent resource leaks during cancellation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-27T14:16:54Z

📝 Walkthrough

Walkthrough

This PR improves error resilience in disaggregated KV-cache transfer by preventing error-state overwrites in the C++ transceiver, introducing explicit pre-assigned buffer lifecycle management, and enhancing the Python executor to detect terminal transfer conditions and emit granular per-request error responses rather than failing all pending work at once.

Changes

Disaggregated Transfer Error Handling and Buffer Management

Layer / File(s)	Summary
PreAssignedRecvBuffer contract and TransferSession API `cpp/tensorrt_llm/batch_manager/dataTransceiver.h`	Adds `PreAssignedRecvBuffer` struct (containing manager pointer and optional buffer id) and three public TransferSession methods: `setPreAssignedRecvBuffers`, `releasePreAssignedRecvBuffers`, and `clearPreAssignedRecvBuffers`. Includes `<optional>` and updates copyright year.
TransferSession buffer management implementation `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`, `cpp/tensorrt_llm/batch_manager/dataTransceiver.h`	Implements buffer lifecycle methods: `setPreAssignedRecvBuffers` stores assignments, `releasePreAssignedRecvBuffers` frees indices via manager pointers and clears the list, and `clearPreAssignedRecvBuffers` clears without freeing. Adds private `mPreAssignedRecvBuffers` member.
CacheReceiver buffer integration `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`	`sendRequestInfo` constructs `PreAssignedRecvBuffer` entries with manager pointers and optional ids when agent connections exist. `requestSync` calls `setPreAssignedRecvBuffers` at session creation and releases/clears buffers on error and success paths.
CacheTransceiver error-state safeguards and cleanup `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`	`requestAndReceiveSync` and `checkGenTransferStatus` now guard against overwriting `kDISAGG_TRANS_ERROR` with completion transitions. `cancelRequest` removes successfully cancelled requests from future-tracking vectors. Timeout logging expanded to include request id and milliseconds.
Python executor deferred error mode `tensorrt_llm/_torch/pyexecutor/py_executor.py`	`_check_cache_transfer_errors` gains `defer_response` parameter; when enabled, attaches error messages to request objects instead of immediately failing them. Generation-side disagg cache-transfer status calls this with `defer_response=True`.
Python executor terminal condition detection and response handling `tensorrt_llm/_torch/pyexecutor/py_executor.py`	Benchmark disagg forward gate detects terminal conditions (error state or KV timeout) and calls `_handle_responses` to emit failures. `_enqueue_responses` normalizes payloads across nested structures. `_handle_responses` converts `DISAGG_TRANS_ERROR` and KV-timeout requests into per-request error responses.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.90% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly identifies the fix as addressing disagg KV transfer timeout cleanup with the NVBugs ID 6217533, directly matching the main change objective.
Description check	✅ Passed	The PR description explains the issue and solution clearly, with detailed validation on benchmark config 6217533, though some template sections lack complete fill-out.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
1087-1112: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Exception safety gap: pre-assigned buffers may leak if receiveReadySignal or receiveSync throws.

If an exception is thrown at line 1096 (receiveReadySignal) or line 1105 (receiveSync), the TransferSession destructor does not call releasePreAssignedRecvBuffers(), leaving the pre-assigned buffer indices unreturned to their managers. Over time, this could exhaust the buffer pool.

Consider either:

Adding releasePreAssignedRecvBuffers() to the TransferSession destructor for any non-empty mPreAssignedRecvBuffers, or

Wrapping the receive logic in a try-catch that releases on exception.
Option 1: RAII cleanup in destructor (preferred)

In dataTransceiver.h, add a destructor to TransferSession:
~TransferSession()
{
    releasePreAssignedRecvBuffers();
}
And change clearPreAssignedRecvBuffers semantics if needed, or leave it as-is since calling release on an already-cleared vector is a no-op.
Option 2: Try-catch in requestSync
 void requestSync(LlmRequest& llmRequest)
 {
     TLLM_LOG_DEBUG(mpi::MpiComm::world().getRank(),
         "Start calling requestSync for request ID: %zu, context request ID: %zu.", llmRequest.mRequestId,
         llmRequest.getContextPhaseParams().value().getReqId());
     llmRequest.setKvCacheTransferStart(std::chrono::steady_clock::now());
     TLLM_CUDA_CHECK(cudaSetDevice(mDeviceId));
     auto session = sendRequestInfo(llmRequest);
     session.setTime(TransferSession::kTimeRequestInfo);
+    try
+    {
         bool isReady = receiveReadySignal(session);
         if (!isReady)
         {
             // Reuse the error state for the cancelled request.
             llmRequest.setState(LlmRequestState::kDISAGG_TRANS_ERROR);
             llmRequest.setKvCacheTransferEnd(std::chrono::steady_clock::now());
             session.releasePreAssignedRecvBuffers();
             return;
         }
         receiveSync(session);
         session.clearPreAssignedRecvBuffers();
+    }
+    catch (...)
+    {
+        session.releasePreAssignedRecvBuffers();
+        throw;
+    }
     llmRequest.setKvCacheTransferEnd(std::chrono::steady_clock::now());
     ...
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` around lines 1087 - 1112,
The code in requestSync may leak pre-assigned buffers if receiveReadySignal or
receiveSync throw because TransferSession::releasePreAssignedRecvBuffers is only
called on the normal path; fix by making buffer release RAII: add a
TransferSession destructor that calls releasePreAssignedRecvBuffers() (ensuring
it's safe to call when the vector is empty) so any exception during
receiveReadySignal or receiveSync returns buffers, or alternatively wrap the
receiveReadySignal/receiveSync calls in a try-catch inside requestSync that
calls session.releasePreAssignedRecvBuffers() before rethrowing; update
clearPreAssignedRecvBuffers semantics if needed to remain idempotent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`:
- Around line 1087-1112: The code in requestSync may leak pre-assigned buffers
if receiveReadySignal or receiveSync throw because
TransferSession::releasePreAssignedRecvBuffers is only called on the normal
path; fix by making buffer release RAII: add a TransferSession destructor that
calls releasePreAssignedRecvBuffers() (ensuring it's safe to call when the
vector is empty) so any exception during receiveReadySignal or receiveSync
returns buffers, or alternatively wrap the receiveReadySignal/receiveSync calls
in a try-catch inside requestSync that calls
session.releasePreAssignedRecvBuffers() before rethrowing; update
clearPreAssignedRecvBuffers semantics if needed to remain idempotent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4851f820-b48d-496d-86e4-e9009fcb2c04

📥 Commits

Reviewing files that changed from the base of the PR and between 09f5958 and 73ab442.

📒 Files selected for processing (4)

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
tensorrt_llm/_torch/pyexecutor/py_executor.py

liji-nv · 2026-05-28T03:08:38Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-05-28T03:15:11Z

PR_Github #50679 [ run ] triggered by Bot. Commit: f4f4078 Link to invocation

tensorrt-cicd · 2026-05-28T11:23:14Z

PR_Github #50679 [ run ] completed with state SUCCESS. Commit: f4f4078
/LLM/main/L0_MergeRequest_PR pipeline #40167 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Handle disaggregated KV cache transfer failures as terminal request states instead of allowing them to leave stale transfer state behind. Generation transfer completion now preserves DISAGG_TRANS_ERROR instead of overwriting it with DISAGG_GENERATION_TRANS_COMPLETE. Successful request cancellation also removes the corresponding raw request pointer from the sender/requester future lists, preventing later status polling from dereferencing a request that has already been terminated. The receiver path now tracks pre-assigned receive buffers and releases them when a ready signal reports that the transfer cannot proceed. This keeps cancelled or failed transfers from leaking receive buffer ownership into later requests. On the Python executor side, benchmark disaggregated fill gating now reaps terminal transfer requests before continuing to wait for the batch to become forwardable. Generation-side transfer errors are deferred into request-level error responses, so a single failed transfer can terminate that request cleanly without raising a process-wide error or hanging the benchmark. Validated with the 6217533 benchmark configuration: sbatch build_sqsh.sh completed successfully; python3 examples/disaggregated/slurm/benchmark/submit.py -c 6217533.yaml completed 2048 / 2048 requests with 0 failed requests; no free(), invalid pointer, UCX ERROR, NIXL crash, or Python traceback was observed in the clean benchmark run. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

liji-nv · 2026-05-29T06:55:07Z

/bot run --disable-fail-fast --add-multi-gpu-test

liji-nv · 2026-05-29T09:20:24Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-05-29T09:26:56Z

PR_Github #51023 [ run ] triggered by Bot. Commit: 524f102 Link to invocation

tensorrt-cicd · 2026-05-29T19:07:49Z

PR_Github #51023 [ run ] completed with state SUCCESS. Commit: 524f102
/LLM/main/L0_MergeRequest_PR pipeline #40471 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Shixiaowei02 · 2026-06-04T05:02:58Z

Related fix: #14937

Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 · 2026-06-23T12:03:22Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-06-23T12:09:44Z

PR_Github #55240 [ run ] triggered by Bot. Commit: a5cd004 Link to invocation

tensorrt-cicd · 2026-06-23T19:32:34Z

PR_Github #55240 [ run ] completed with state SUCCESS. Commit: a5cd004
/LLM/main/L0_MergeRequest_PR pipeline #44198 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

liji-nv requested review from a team as code owners May 27, 2026 14:10

liji-nv requested review from bo-nv and schetlur-nv May 27, 2026 14:10

github-actions Bot assigned liji-nv May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

liji-nv changed the title ~~• [https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup~~ [https://nvbugs/6217533][fix] Fix disagg KV transfer timeout cleanup May 28, 2026

liji-nv force-pushed the disagg-kv-transfer-timeout-cleanup branch from cd5fb0a to f4f4078 Compare May 28, 2026 03:07

liji-nv added 3 commits May 28, 2026 23:55

Release recv buffers on transfer exceptions

2c4bcdd

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

[https://nvbugs/6217533][test] Fix benchmark disagg mock state

524f102

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

liji-nv force-pushed the disagg-kv-transfer-timeout-cleanup branch from bbffd2a to 524f102 Compare May 29, 2026 06:55

Merge branch 'main' into disagg-kv-transfer-timeout-cleanup

a5cd004

Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>

Conversation

liji-nv commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

liji-nv commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

liji-nv commented May 29, 2026

Uh oh!

liji-nv commented May 29, 2026

Uh oh!

tensorrt-cicd commented May 29, 2026

Uh oh!

tensorrt-cicd commented May 29, 2026

Uh oh!

Shixiaowei02 commented Jun 4, 2026

Uh oh!

Shixiaowei02 commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

liji-nv commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading