
[None][feat] Dis-agg transceiver mass integration from the DSV4 branch#15222

Open
Shixiaowei02 wants to merge 9 commits into NVIDIA:main from Shixiaowei02:cherry-pick-dsv4

Conversation

@Shixiaowei02 commented Jun 10, 2026

Collaborator

This pull request significantly improves the thread safety and lifecycle management of the NIXL-based transfer agent system in the TensorRT-LLM codebase. The main changes include introducing shared locking for concurrent access, adding explicit shutdown logic to safely clean up resources, using shared pointers for agent ownership, and enhancing status tracking and error handling. These changes make the transfer agents more robust in multi-threaded and multi-client scenarios, preventing race conditions and resource leaks.

Thread safety and resource management:

  • Introduced a std::shared_mutex. Every public method of NixlTransferAgent and NixlLoopbackAgent that accesses shared state now takes a std::shared_lock for concurrent access or a std::unique_lock for exclusive access. Added checks that reject operations after shutdown.
  • Added explicit shutdown() methods to both NixlTransferAgent and NixlLoopbackAgent for safe, idempotent cleanup of resources and remote state. The destructors now call shutdown().

Ownership and lifecycle improvements:

  • Changed the agent member mRawAgent from std::unique_ptr to std::shared_ptr so that its lifetime can be managed safely across transfer status objects and agent instances. NixlTransferStatus now holds a std::weak_ptr to the agent.
  • Updated the NixlTransferStatus destructor and methods to release resources only if the agent is still alive, avoiding use-after-free errors.
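
The ownership pattern above can be illustrated with a minimal Python analogue. This is only a sketch: the PR itself uses std::shared_ptr and std::weak_ptr in C++, and the names RawAgent, TransferStatus, and release_request below are hypothetical, not taken from the codebase. The status object keeps a weak reference and releases its handle only when the agent still exists.

```python
import weakref


class RawAgent:
    """Hypothetical stand-in for the underlying transfer agent."""

    def __init__(self):
        self.released = []  # handles released through this agent

    def release_request(self, handle):
        self.released.append(handle)


class TransferStatus:
    """Holds a weak reference to the agent, so it never extends the agent's lifetime."""

    def __init__(self, agent, handle):
        self._agent_ref = weakref.ref(agent)
        self._handle = handle

    def release(self):
        agent = self._agent_ref()  # None once the agent has been destroyed
        if agent is None:
            return False  # agent is gone, so there is nothing to release
        agent.release_request(self._handle)
        return True
```

A status object that outlives its agent then degrades to a no-op instead of touching freed state, which is the property the status-outlives-agent test in this PR is described as checking.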

Status tracking and error reporting:

  • Added atomic status tracking (mLastStatus) to NixlTransferStatus, with new methods getLastStatus() and getLastStatusStr() for error diagnostics and reporting.

Python bindings and GIL management:

  • Updated Python bindings to release the Global Interpreter Lock (GIL) for all potentially blocking transfer agent methods, preventing Python thread starvation during long-running operations.

Other improvements:

  • Refactored per-request parameters in submitTransferRequests() to avoid races during concurrent submissions.

These changes collectively make the transfer agent system safer, more robust, and easier to use in concurrent and multi-client environments.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Bug Fixes

    • Improved stability and thread-safety in disaggregated KV-cache transfer operations through enhanced resource lifecycle management.
    • Richer error diagnostics for cache transfer failures, including detailed status and peer information.
    • Prevented resource leaks and use-after-shutdown issues in transfer agents.
  • New Features

    • Added context-manager support for transfer agents enabling cleaner resource cleanup.
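
As a rough illustration of the context-manager support and idempotent shutdown mentioned above, the sketch below shows one way such an agent could behave. It is an assumption-laden example, not the PR's implementation: the class TransferAgent, its submit() method, and the cleanup_calls counter are hypothetical, and a plain threading.Lock stands in for the reader/writer locking done in C++.

```python
import threading


class TransferAgent:
    """Hypothetical agent with idempotent shutdown and context-manager support."""

    def __init__(self):
        self._lock = threading.Lock()
        self._shut_down = False
        self.cleanup_calls = 0  # how many times real cleanup actually ran

    def submit(self, request):
        with self._lock:
            if self._shut_down:
                # Reject work after shutdown instead of touching freed state.
                raise RuntimeError("agent has been shut down")
            return f"submitted:{request}"

    def shutdown(self):
        with self._lock:
            if self._shut_down:
                return  # second and later calls are no-ops
            self._shut_down = True
            self.cleanup_calls += 1

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()
        return False  # do not swallow exceptions from the with-body
```

Used as `with TransferAgent() as agent: ...`, cleanup runs exactly once even if shutdown() is called again afterwards.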

@Shixiaowei02 changed the title from "[None][feat] Dis-agg mass integration from the DSV4 branch" to "[None][feat] Dis-agg transceiver mass integration from the DSV4 branch" on Jun 10, 2026
@Shixiaowei02 force-pushed the cherry-pick-dsv4 branch 2 times, most recently from 1629cd8 to 00fe5f1, on June 10, 2026 14:22
@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53318 [ run ] triggered by Bot. Commit: 00fe5f1 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53318 [ run ] completed with state FAILURE. Commit: 00fe5f1
/LLM/main/L0_MergeRequest_PR pipeline #42504 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53408 [ run ] triggered by Bot. Commit: 0404ecd Link to invocation

@Shixiaowei02 force-pushed the cherry-pick-dsv4 branch 2 times, most recently from d29ca88 to 7cfa0c7, on June 11, 2026 02:31
@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53447 [ run ] triggered by Bot. Commit: 7cfa0c7 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53408 [ run ] completed with state ABORTED. Commit: 0404ecd

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53447 [ run ] completed with state ABORTED. Commit: 7cfa0c7

Link to invocation

@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #53468 [ run ] triggered by Bot. Commit: 014c379 Link to invocation

@Shixiaowei02 marked this pull request as ready for review on June 11, 2026 05:04
@Shixiaowei02 requested review from a team as code owners on June 11, 2026 05:04
@tensorrt-cicd

Collaborator

PR_Github #54547 [ run ] triggered by Bot. Commit: a8cff5d Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54547 [ run ] completed with state FAILURE. Commit: a8cff5d
/LLM/main/L0_MergeRequest_PR pipeline #43598 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54774 [ run ] triggered by Bot. Commit: 86bb5ce Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54774 [ run ] completed with state SUCCESS. Commit: 86bb5ce
/LLM/main/L0_MergeRequest_PR pipeline #43792 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54998 [ run ] triggered by Bot. Commit: 4b94d7a Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54998 [ run ] completed with state SUCCESS. Commit: 4b94d7a
/LLM/main/L0_MergeRequest_PR pipeline #43990 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55033 [ run ] triggered by Bot. Commit: c383543 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55033 [ run ] completed with state SUCCESS. Commit: c383543
/LLM/main/L0_MergeRequest_PR pipeline #44024 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55141 [ run ] triggered by Bot. Commit: 8165df7 Link to invocation

chuangz0 and others added 8 commits June 23, 2026 04:22
Squash of three commits cherry-picked from
NVIDIA#13650:

* Fix gen-only benchmark for KVCacheManager V2 + improve insufficient
  KVCache check

  In gen-only benchmark mode the executor calls
  _check_benchmark_disagg_gate every iteration and `continue`s when the
  fill phase is not yet complete.  Before that the scheduler has already
  called try_allocate_generation, which grows each gen request's KV
  capacity by 1 (+draft_len).  Without a matching revert the capacity
  drifts upward across the many retried iterations until it overflows
  the host page-index buffer, raising

    ValueError: User-provided base page indices is too short

  from KVCacheManagerV2._KVCache.resize.  Revert the spurious growth on
  the should_retry path in both _executor_loop and _executor_loop_overlap,
  gated by _scheduler_manages_kv_suspend so V1 is unaffected.

  The previous "Insufficient KV cache for gen-only benchmark mode" guard
  compared the per-rank num_fetch_requests against the global
  benchmark_req_queues_size threshold.  Under attention DP, prompts are
  routed across TP ranks via the ADP router, so per-rank fetch counts
  saturate well below the global threshold and the guard never fires --
  deadlocks become silent walltime hangs.  Replace it with a liveness
  watchdog that tracks the per-rank ready-to-forward gen request count
  (DISAGG_GENERATION_TRANS_COMPLETE + GENERATION_IN_PROGRESS); if the
  count does not change for >60s, log an explicit error, fail all
  active requests via _handle_errors, and return None to break the
  executor loop.  The watchdog uses a local-only count so it does not
  introduce a collective at a point where rank participation can diverge
  under ADP.

* Trim the out-of-window KV cache after receiving the KV cache

* fix stall
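
The liveness watchdog described in the first item can be sketched as follows. This is a simplified illustration under stated assumptions: the class name LivenessWatchdog, the check() method, and the injectable clock are hypothetical, and the real executor additionally logs an error, fails active requests via _handle_errors, and returns None to leave the loop. Only the "count unchanged for more than 60 s" logic is shown.

```python
import time


class LivenessWatchdog:
    """Reports a stall when a locally observed count stops changing for too long."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_count = None
        self._last_change = clock()

    def check(self, ready_count):
        """Return True if ready_count has not changed for longer than timeout_s."""
        now = self._clock()
        if ready_count != self._last_count:
            # Progress was made: remember the new count and restart the timer.
            self._last_count = ready_count
            self._last_change = now
            return False
        return (now - self._last_change) > self._timeout_s
```

Because check() only reads a per-rank count, it needs no collective communication, which matches the reason the commit message gives for using a local-only count.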

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…IDIA#13900)

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Xiaowei Shi <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…ed (NVIDIA#14042)

Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
- cache_reuse.py: always apply SWA stale_end clamp regardless of
  enable_block_reuse. The earlier enable_block_reuse=False short-circuit
  returned [0]*N, dropping the clamp and tripping the SWA assertion in
  _create_kv_slice on the gen side when reuse is disabled.

- router.py: add session kwarg to KvCacheAwareRouter.finish_request and
  thread it into KvCacheAwareServerState.poll_and_update as an override.
  Restores the contract the rest of the finish_request methods retained.

- test_router.py: update polls_events URL assertion to match _base_url
  which prepends http:// when the server string is not already a URL.
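
For the test_router.py item, the URL normalization could look roughly like the sketch below. The function name base_url and the scheme check are assumptions made for illustration; the commit message only states that _base_url prepends http:// when the server string is not already a URL.

```python
def base_url(server: str) -> str:
    """Prepend http:// unless the server string already carries an HTTP(S) scheme."""
    # Assumption: "already a URL" is approximated by an http:// or https:// prefix.
    if server.startswith(("http://", "https://")):
        return server
    return f"http://{server}"
```

Under this assumption, a bare `host:port` string gains the scheme, while a full URL passes through unchanged.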

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
…VIDIA#14162)

Signed-off-by: Xianjie Qiao <xqiao@nvidia.com>
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
@tensorrt-cicd

Collaborator

PR_Github #55141 [ run ] completed with state SUCCESS. Commit: 8165df7
/LLM/main/L0_MergeRequest_PR pipeline #44121 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…regation

Squash of the post-cherry-pick work layered on top of the 8 DeepSeek-V4
disaggregation cherry-picks.

Fixes:
- ADP disagg error path: restore per-request hang signal (_event_loop_error),
  scan all candidates + prefer CTX role for mixed-batch dummy padding, and
  keep charge_budget=False on KV-transfer timeouts so they don't exhaust the
  global error budget and shut down the executor.
- _count_schedulable_active_requests: gate the GENERATION_TO_COMPLETE exclusion
  on the V2 KV-cache manager. Only the V2 scheduler skips state
  >= GENERATION_TO_COMPLETE; the V1 scheduler still forwards those requests, so
  excluding them under V1 ADP undercounted and spuriously inserted an ADP dummy
  on top of a real request -- overflowing a small batch and tripping the mamba
  dummy-mask assert (n <= _dummy_request_mask_host.shape[0]) / "No free slots".
  Fixes test_ptp_quickstart_advanced_deepseek_v3_lite_4gpus_adp_balance.
- transceiver: only short-circuit the tp_allgather skip when pp_size==1
  (_ctx_need_pp_sync) -- the PP>1 path asymmetrically flips send/recv markers
  across pipeline stages and deadlocks the _ctx_consensus pp_allgather.
- py_executor: restore main's immediate benchmark fail-fast guard.
- resource_manager: do NOT narrow trim_to_history's except (resize() can raise
  non-ValueError under v2 SWA + uneven-PP; narrowing leaked KV blocks).
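
The _count_schedulable_active_requests fix above can be pictured with the following sketch. It is a simplified model, not the actual function: the State enum values, the function name count_schedulable, and the uses_v2_kv_cache_manager flag are hypothetical, and any other filtering done by the real code is omitted. It shows only that the state >= GENERATION_TO_COMPLETE exclusion applies under V2 and not under V1.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical numeric values; only the relative ordering matters here.
    GENERATION_IN_PROGRESS = 1
    GENERATION_TO_COMPLETE = 2
    GENERATION_COMPLETE = 3


def count_schedulable(states, uses_v2_kv_cache_manager):
    """Count requests the scheduler would forward for the given manager version."""
    if not uses_v2_kv_cache_manager:
        # V1 still forwards requests at or past GENERATION_TO_COMPLETE.
        return len(states)
    # V2 skips every request whose state is >= GENERATION_TO_COMPLETE.
    return sum(1 for s in states if s < State.GENERATION_TO_COMPLETE)
```

With one in-progress request and one at GENERATION_TO_COMPLETE, the V1 count is 2 and the V2 count is 1, so applying the V2 rule under V1 would undercount by one, which is the undercount the fix describes.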

Tests (added to existing files):
- test_py_executor.py: disagg cache-error sync + ADP no-op paths; ADP dummy-role
  and _pad_attention_dp_dummy_request V1/V2 GENERATION_TO_COMPLETE behavior
  (adp_balance regression).
- test_kv_cache_v2_scheduler.py: trim_to_history.
- test_cache_reuse_adapter.py: trim-to-prompt-history + transceiver ctx mgr.
- test_router.py: finish_request explicit-session forwarding.
- test_agent.py: BindingsNixlTransferStatus + shutdown idempotency (NVIDIA#14137).
- transferAgentTest.cpp: status-outlives-agent (weak_ptr UAF safety) +
  concurrent submitTransferRequests.

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02

Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55245 [ run ] triggered by Bot. Commit: cdef3d0 Link to invocation

7 participants