[None][feat] Dis-agg transceiver mass integration from the DSV4 branch by Shixiaowei02 · Pull Request #15222 · NVIDIA/TensorRT-LLM

Shixiaowei02 · 2026-06-10T14:11:48Z

This pull request significantly improves the thread safety and lifecycle management of the NIXL-based transfer agent system in the TensorRT-LLM codebase. The main changes include introducing shared locking for concurrent access, adding explicit shutdown logic to safely clean up resources, using shared pointers for agent ownership, and enhancing status tracking and error handling. These changes make the transfer agents more robust in multi-threaded and multi-client scenarios, preventing race conditions and resource leaks.

Thread safety and resource management:

Introduced std::shared_mutex and added shared/unique locks (std::shared_lock/std::unique_lock) to all public methods of NixlTransferAgent and NixlLoopbackAgent that access shared state, ensuring thread-safe concurrent and exclusive access. Added checks to prevent operations after shutdown. [1] [2] [3] [4] [5] [6] [7] [8] [9]
Added explicit shutdown() methods to both NixlTransferAgent and NixlLoopbackAgent for safe, idempotent cleanup of resources and remote state, and ensured destructors call shutdown(). [1] [2]
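
The locking and shutdown discipline in the two bullets above can be illustrated with a minimal Python sketch. This is an analogy under stated assumptions, not the PR's C++ code: `SketchAgent`, `register_memory`, and `cleanup_count` are hypothetical names, and a plain `threading.Lock` stands in for `std::shared_mutex` because Python's standard library has no reader/writer lock.

```python
import threading


class SketchAgent:
    """Hypothetical stand-in for a transfer agent; not the real NixlTransferAgent."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for std::shared_mutex
        self._shut_down = False
        self._registered = set()       # shared state guarded by the lock
        self.cleanup_count = 0         # counts how often cleanup actually ran

    def register_memory(self, name):
        """Public operation: refuses to run once the agent is shut down."""
        with self._lock:
            if self._shut_down:
                raise RuntimeError("agent is shut down")
            self._registered.add(name)

    def shutdown(self):
        """Idempotent cleanup: only the first call releases resources."""
        with self._lock:
            if self._shut_down:
                return
            self._shut_down = True
            self._registered.clear()
            self.cleanup_count += 1

    def __del__(self):
        # Mirrors the "destructor calls shutdown()" rule described above.
        self.shutdown()
```

Because the shutdown flag is checked under the same lock that guards the shared state, an operation either completes before cleanup starts or observes the flag and fails cleanly.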

Ownership and lifecycle improvements:

Changed agent member mRawAgent from std::unique_ptr to std::shared_ptr to enable safe lifetime management across transfer status objects and agent instances. NixlTransferStatus now holds a std::weak_ptr to the agent. [1] [2] [3] [4] [5]
Updated NixlTransferStatus destructor and methods to safely release resources only if the agent is still alive, avoiding use-after-free errors. [1] [2] [3]
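
The ownership pattern above (a status object that observes the agent without keeping it alive) can be sketched in Python with `weakref`. The class and method names here (`RawAgent`, `SketchTransferStatus`, `release_request`) are hypothetical, and calling the weak reference plays roughly the role of `std::weak_ptr::lock()` in the C++ code.

```python
import weakref


class RawAgent:
    """Hypothetical resource owner; plays the role of the shared mRawAgent."""

    def __init__(self):
        self.released = []

    def release_request(self, handle):
        self.released.append(handle)


class SketchTransferStatus:
    """Holds only a weak reference, so it never keeps the agent alive."""

    def __init__(self, agent, handle):
        self._agent_ref = weakref.ref(agent)
        self._handle = handle

    def release(self):
        agent = self._agent_ref()  # None once the agent has been destroyed
        if agent is None:
            return False           # agent is gone: nothing left to release
        agent.release_request(self._handle)
        return True
```

A status that outlives its agent simply skips the release step instead of touching freed state, which is the use-after-free case the PR text describes.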

Status tracking and error reporting:

Added atomic status tracking (mLastStatus) to NixlTransferStatus, with new methods getLastStatus() and getLastStatusStr() for improved error diagnostics and reporting. [1] [2] [3]

Python bindings and GIL management:

Updated Python bindings to release the Global Interpreter Lock (GIL) for all potentially blocking transfer agent methods, preventing Python thread starvation during long-running operations.

Other improvements:

Refactored per-request parameters in submitTransferRequests() to avoid races during concurrent submissions. [1] [2] [3]

These changes collectively make the transfer agent system safer, more robust, and easier to use in concurrent and multi-client environments.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Bug Fixes
- Improved stability and thread-safety in disaggregated KV-cache transfer operations through enhanced resource lifecycle management.
- Richer error diagnostics for cache transfer failures, including detailed status and peer information.
- Prevented resource leaks and use-after-shutdown issues in transfer agents.
New Features
- Added context-manager support for transfer agents enabling cleaner resource cleanup.
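
As a rough illustration of the context-manager support mentioned above, the sketch below shows how `with` can guarantee that `shutdown()` runs on exit. `ManagedAgent` and `run_with_agent` are hypothetical; the names and behavior of the actual bindings may differ.

```python
class ManagedAgent:
    """Hypothetical agent wrapper; the real binding's class may differ."""

    def __init__(self):
        self.closed = False
        self.shutdown_calls = 0

    def shutdown(self):
        self.shutdown_calls += 1
        if self.closed:
            return  # idempotent: repeated calls are harmless
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()
        return False  # do not swallow exceptions raised in the with-body


def run_with_agent():
    """Use the agent inside a with-block and return it for inspection."""
    with ManagedAgent() as agent:
        pass  # transfers would be issued here
    return agent
```

Returning `False` from `__exit__` lets errors from the body propagate while still running cleanup.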

Shixiaowei02 · 2026-06-10T14:23:07Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-10T14:29:14Z

PR_Github #53318 [ run ] triggered by Bot. Commit: 00fe5f1 Link to invocation

tensorrt-cicd · 2026-06-10T23:45:19Z

PR_Github #53318 [ run ] completed with state FAILURE. Commit: 00fe5f1
/LLM/main/L0_MergeRequest_PR pipeline #42504 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Shixiaowei02 · 2026-06-11T00:28:35Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-11T00:34:31Z

PR_Github #53408 [ run ] triggered by Bot. Commit: 0404ecd Link to invocation

Shixiaowei02 · 2026-06-11T02:31:45Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-11T02:38:06Z

PR_Github #53447 [ run ] triggered by Bot. Commit: 7cfa0c7 Link to invocation

tensorrt-cicd · 2026-06-11T02:42:36Z

PR_Github #53408 [ run ] completed with state ABORTED. Commit: 0404ecd

Link to invocation

tensorrt-cicd · 2026-06-11T04:03:08Z

PR_Github #53447 [ run ] completed with state ABORTED. Commit: 7cfa0c7

Link to invocation

Shixiaowei02 · 2026-06-11T04:10:51Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-11T04:17:32Z

PR_Github #53468 [ run ] triggered by Bot. Commit: 014c379 Link to invocation

tensorrt-cicd · 2026-06-16T08:23:02Z

PR_Github #54547 [ run ] triggered by Bot. Commit: a8cff5d Link to invocation

tensorrt-cicd · 2026-06-16T19:52:57Z

PR_Github #54547 [ run ] completed with state FAILURE. Commit: a8cff5d
/LLM/main/L0_MergeRequest_PR pipeline #43598 completed with status: 'FAILURE'

Shixiaowei02 · 2026-06-17T06:04:44Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-17T06:10:29Z

PR_Github #54774 [ run ] triggered by Bot. Commit: 86bb5ce Link to invocation

tensorrt-cicd · 2026-06-17T12:58:32Z

PR_Github #54774 [ run ] completed with state SUCCESS. Commit: 86bb5ce
/LLM/main/L0_MergeRequest_PR pipeline #43792 completed with status: 'FAILURE'

Shixiaowei02 · 2026-06-21T03:45:46Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-21T03:51:28Z

PR_Github #54998 [ run ] triggered by Bot. Commit: 4b94d7a Link to invocation

tensorrt-cicd · 2026-06-21T14:00:42Z

PR_Github #54998 [ run ] completed with state SUCCESS. Commit: 4b94d7a
/LLM/main/L0_MergeRequest_PR pipeline #43990 completed with status: 'FAILURE'

Shixiaowei02 · 2026-06-22T09:19:55Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-22T09:26:10Z

PR_Github #55033 [ run ] triggered by Bot. Commit: c383543 Link to invocation

tensorrt-cicd · 2026-06-22T17:21:15Z

PR_Github #55033 [ run ] completed with state SUCCESS. Commit: c383543
/LLM/main/L0_MergeRequest_PR pipeline #44024 completed with status: 'FAILURE'

Shixiaowei02 · 2026-06-23T03:09:09Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-23T03:17:22Z

PR_Github #55141 [ run ] triggered by Bot. Commit: 8165df7 Link to invocation

Squash of three commits cherry-picked from NVIDIA#13650:

* Fix gen-only benchmark for KVCacheManager V2 + improve insufficient KVCache check

  In gen-only benchmark mode the executor calls _check_benchmark_disagg_gate every iteration and `continue`s when the fill phase is not yet complete. Before that the scheduler has already called try_allocate_generation, which grows each gen request's KV capacity by 1 (+draft_len). Without a matching revert the capacity drifts upward across the many retried iterations until it overflows the host page-index buffer, raising ValueError: User-provided base page indices is too short from KVCacheManagerV2._KVCache.resize. Revert the spurious growth on the should_retry path in both _executor_loop and _executor_loop_overlap, gated by _scheduler_manages_kv_suspend so V1 is unaffected.

  The previous "Insufficient KV cache for gen-only benchmark mode" guard compared the per-rank num_fetch_requests against the global benchmark_req_queues_size threshold. Under attention DP, prompts are routed across TP ranks via the ADP router, so per-rank fetch counts saturate well below the global threshold and the guard never fires -- deadlocks become silent walltime hangs. Replace it with a liveness watchdog that tracks the per-rank ready-to-forward gen request count (DISAGG_GENERATION_TRANS_COMPLETE + GENERATION_IN_PROGRESS); if the count does not change for >60s, log an explicit error, fail all active requests via _handle_errors, and return None to break the executor loop. The watchdog uses a local-only count so it does not introduce a collective at a point where rank participation can diverge under ADP.

* trim kv cache out of window after receive kv cache
* fix stall

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
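
The liveness watchdog described in this commit message can be sketched as follows. This is a minimal illustration, not the executor's code: `LivenessWatchdog`, `observe`, and the injectable `clock` parameter are hypothetical, and the real implementation also logs an error and fails active requests, which is omitted here.

```python
import time


class LivenessWatchdog:
    """Reports a stall when an observed count stays unchanged for too long."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock            # injectable so tests can fake time
        self._last_count = None
        self._last_change = clock()

    def observe(self, ready_count):
        """Return True if ready_count has not changed for more than timeout_s."""
        now = self._clock()
        if ready_count != self._last_count:
            # Progress was made: remember the new count and reset the timer.
            self._last_count = ready_count
            self._last_change = now
            return False
        return (now - self._last_change) > self._timeout_s
```

The sketch only ever reads a rank-local count, matching the commit's point that the check must not add a collective operation where ranks may diverge.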

…IDIA#13900) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Xiaowei Shi <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

…ed (NVIDIA#14042) Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>

Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>

- cache_reuse.py: always apply SWA stale_end clamp regardless of enable_block_reuse. The earlier enable_block_reuse=False short-circuit returned [0]*N, dropping the clamp and tripping the SWA assertion in _create_kv_slice on the gen side when reuse is disabled.
- router.py: add session kwarg to KvCacheAwareRouter.finish_request and thread it into KvCacheAwareServerState.poll_and_update as an override. Restores the contract the rest of the finish_request methods retained.
- test_router.py: update polls_events URL assertion to match _base_url which prepends http:// when the server string is not already a URL.

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
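
The `_base_url` behavior mentioned for test_router.py might look roughly like the sketch below. The function name `base_url` and the rule for deciding that a string is "already a URL" (an `http://` or `https://` prefix) are assumptions; the router's actual check may differ.

```python
def base_url(server: str) -> str:
    """Hypothetical helper: add an http:// scheme unless one is already present."""
    if server.startswith(("http://", "https://")):
        return server
    return f"http://{server}"
```

A test asserting on the polled URL would then need to expect the prefixed form for bare `host:port` strings.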

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

…VIDIA#14162) Signed-off-by: Xianjie Qiao <xqiao@nvidia.com> Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

tensorrt-cicd · 2026-06-23T11:49:23Z

PR_Github #55141 [ run ] completed with state SUCCESS. Commit: 8165df7
/LLM/main/L0_MergeRequest_PR pipeline #44121 completed with status: 'FAILURE'

…regation

Squash of the post-cherry-pick work layered on top of the 8 DeepSeek-V4 disaggregation cherry-picks.

Fixes:
- ADP disagg error path: restore per-request hang signal (_event_loop_error), scan all candidates + prefer CTX role for mixed-batch dummy padding, and keep charge_budget=False on KV-transfer timeouts so they don't exhaust the global error budget and shut down the executor.
- _count_schedulable_active_requests: gate the GENERATION_TO_COMPLETE exclusion on the V2 KV-cache manager. Only the V2 scheduler skips state >= GENERATION_TO_COMPLETE; the V1 scheduler still forwards those requests, so excluding them under V1 ADP undercounted and spuriously inserted an ADP dummy on top of a real request -- overflowing a small batch and tripping the mamba dummy-mask assert (n <= _dummy_request_mask_host.shape[0]) / "No free slots". Fixes test_ptp_quickstart_advanced_deepseek_v3_lite_4gpus_adp_balance.
- transceiver: only short-circuit the tp_allgather skip when pp_size==1 (_ctx_need_pp_sync) -- the PP>1 path asymmetrically flips send/recv markers across pipeline stages and deadlocks the _ctx_consensus pp_allgather.
- py_executor: restore main's immediate benchmark fail-fast guard.
- resource_manager: do NOT narrow trim_to_history's except (resize() can raise non-ValueError under v2 SWA + uneven-PP; narrowing leaked KV blocks).

Tests (added to existing files):
- test_py_executor.py: disagg cache-error sync + ADP no-op paths; ADP dummy-role and _pad_attention_dp_dummy_request V1/V2 GENERATION_TO_COMPLETE behavior (adp_balance regression).
- test_kv_cache_v2_scheduler.py: trim_to_history.
- test_cache_reuse_adapter.py: trim-to-prompt-history + transceiver ctx mgr.
- test_router.py: finish_request explicit-session forwarding.
- test_agent.py: BindingsNixlTransferStatus + shutdown idempotency (NVIDIA#14137).
- transferAgentTest.cpp: status-outlives-agent (weak_ptr UAF safety) + concurrent submitTransferRequests.
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
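
The V1/V2 gating described for `_count_schedulable_active_requests` can be illustrated with a small sketch. The `State` enum values, the function name `count_schedulable`, and the `uses_v2_manager` flag are hypothetical stand-ins; only the rule itself (exclude states at or beyond GENERATION_TO_COMPLETE under V2 only) comes from the commit message.

```python
from enum import IntEnum


class State(IntEnum):
    """Hypothetical ordering of request states, for illustration only."""
    GENERATION_IN_PROGRESS = 1
    GENERATION_TO_COMPLETE = 2
    GENERATION_COMPLETE = 3


def count_schedulable(states, uses_v2_manager):
    """Count requests the scheduler would still forward.

    Under V2, requests at or beyond GENERATION_TO_COMPLETE are skipped.
    Under V1, every active request is still counted.
    """
    if not uses_v2_manager:
        return len(states)
    return sum(1 for s in states if s < State.GENERATION_TO_COMPLETE)
```

With a single request in GENERATION_TO_COMPLETE, the V1 path counts one real request, so no dummy needs to be padded on top of it.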

Shixiaowei02 · 2026-06-23T13:04:53Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-06-23T13:11:27Z

PR_Github #55245 [ run ] triggered by Bot. Commit: cdef3d0 Link to invocation

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from f80b1da to 1a86f9e on June 10, 2026 14:12

github-actions Bot assigned Shixiaowei02 Jun 10, 2026

Shixiaowei02 changed the title from "[None][feat] Dis-agg mass integration from the DSV4 branch" to "[None][feat] Dis-agg transceiver mass integration from the DSV4 branch" on Jun 10, 2026

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch 2 times, most recently from 1629cd8 to 00fe5f1 on June 10, 2026 14:22

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch 2 times, most recently from d29ca88 to 7cfa0c7 on June 11, 2026 02:31

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from 7cfa0c7 to 014c379 on June 11, 2026 04:10

Shixiaowei02 marked this pull request as ready for review June 11, 2026 05:04

Shixiaowei02 requested review from a team as code owners June 11, 2026 05:04

Shixiaowei02 requested review from byshiue, chuangz0, pcastonguay, reasonsolo and syuoni June 11, 2026 05:04

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from a8cff5d to 86bb5ce on June 17, 2026 06:04

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from 86bb5ce to 4b94d7a on June 21, 2026 03:45

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from 4b94d7a to c383543 on June 22, 2026 09:19

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from c383543 to 8165df7 on June 23, 2026 03:08

chuangz0 and others added 8 commits June 23, 2026 04:22

[None][perf] Skip transceiver tp_allgather when no sessions ever open…

1c473ca

…ed (NVIDIA#14042) Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>

[None][test] Add dsv4 dis-agg module-level unit tests (NVIDIA#13936)

41984de

Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>

[None][fix] NIXL agent transfer and shutdown race (NVIDIA#14137)

963cd3a

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

[None][fix] Return 3-tuple from check_gen_transfer_status fast path (N…

a06fcf5

…VIDIA#14162) Signed-off-by: Xianjie Qiao <xqiao@nvidia.com> Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

[None][fix] fix token_range_end add extra_kv_num_tokens (NVIDIA#14258)

8b5e0d4

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

Shixiaowei02 force-pushed the cherry-pick-dsv4 branch from 8165df7 to cdef3d0 on June 23, 2026 12:39

7 participants
