[None][feat] Dis-agg transceiver mass integration from the DSV4 branch #15222
Shixiaowei02 wants to merge 9 commits into
f80b1da to 1a86f9e
1629cd8 to 00fe5f1
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #53318 [ run ] triggered by Bot. Commit:
PR_Github #53318 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #53408 [ run ] triggered by Bot. Commit:
d29ca88 to 7cfa0c7
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #53447 [ run ] triggered by Bot. Commit:
PR_Github #53408 [ run ] completed with state
PR_Github #53447 [ run ] completed with state
7cfa0c7 to 014c379
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #53468 [ run ] triggered by Bot. Commit:
PR_Github #54547 [ run ] triggered by Bot. Commit:
PR_Github #54547 [ run ] completed with state
a8cff5d to 86bb5ce
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #54774 [ run ] triggered by Bot. Commit:
PR_Github #54774 [ run ] completed with state
86bb5ce to 4b94d7a
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #54998 [ run ] triggered by Bot. Commit:
PR_Github #54998 [ run ] completed with state
4b94d7a to c383543
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #55033 [ run ] triggered by Bot. Commit:
PR_Github #55033 [ run ] completed with state
c383543 to 8165df7
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #55141 [ run ] triggered by Bot. Commit:
Squash of three commits cherry-picked from NVIDIA#13650:

* Fix gen-only benchmark for KVCacheManager V2 + improve insufficient KVCache check

  In gen-only benchmark mode the executor calls _check_benchmark_disagg_gate every iteration and `continue`s when the fill phase is not yet complete. Before that, the scheduler has already called try_allocate_generation, which grows each gen request's KV capacity by 1 (+draft_len). Without a matching revert, the capacity drifts upward across the many retried iterations until it overflows the host page-index buffer, raising ValueError: User-provided base page indices is too short from KVCacheManagerV2._KVCache.resize.

  Revert the spurious growth on the should_retry path in both _executor_loop and _executor_loop_overlap, gated by _scheduler_manages_kv_suspend so V1 is unaffected.

  The previous "Insufficient KV cache for gen-only benchmark mode" guard compared the per-rank num_fetch_requests against the global benchmark_req_queues_size threshold. Under attention DP, prompts are routed across TP ranks via the ADP router, so per-rank fetch counts saturate well below the global threshold and the guard never fires -- deadlocks become silent walltime hangs.

  Replace it with a liveness watchdog that tracks the per-rank ready-to-forward gen request count (DISAGG_GENERATION_TRANS_COMPLETE + GENERATION_IN_PROGRESS); if the count does not change for >60s, log an explicit error, fail all active requests via _handle_errors, and return None to break the executor loop. The watchdog uses a local-only count so it does not introduce a collective at a point where rank participation can diverge under ADP.

* trim kv cache out of window after receive kv cache

* fix stall

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
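The liveness watchdog described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `LivenessWatchdog`, its `stalled` method, and the injectable `clock` parameter are hypothetical, and the real executor additionally logs an error, fails active requests via _handle_errors, and returns None.

```python
import time


class LivenessWatchdog:
    """Hypothetical sketch: flag a stall when a rank-local count stops changing."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the sketch can be tested
        self._last_count = None        # last observed ready-to-forward count
        self._last_change = clock()    # time of the last observed change

    def stalled(self, ready_count):
        """Return True once ready_count has been unchanged for more than timeout_s."""
        now = self._clock()
        if ready_count != self._last_count:
            # Progress was made: remember the new count and restart the timer.
            self._last_count = ready_count
            self._last_change = now
            return False
        return now - self._last_change > self.timeout_s
```

Because the check only reads a count that each rank already holds, it needs no cross-rank communication, which matches the commit's point about avoiding a collective where rank participation can diverge.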
…IDIA#13900)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Xiaowei Shi <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…ed (NVIDIA#14042)
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
Signed-off-by: yuhangh <58161490+heyuhhh@users.noreply.github.com>
- cache_reuse.py: always apply SWA stale_end clamp regardless of enable_block_reuse. The earlier enable_block_reuse=False short-circuit returned [0]*N, dropping the clamp and tripping the SWA assertion in _create_kv_slice on the gen side when reuse is disabled.
- router.py: add session kwarg to KvCacheAwareRouter.finish_request and thread it into KvCacheAwareServerState.poll_and_update as an override. Restores the contract the rest of the finish_request methods retained.
- test_router.py: update polls_events URL assertion to match _base_url which prepends http:// when the server string is not already a URL.

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
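The _base_url behavior that the test_router.py assertion now matches could look something like the sketch below. The function name `base_url_sketch` is hypothetical, and treating "already a URL" as "contains a scheme separator" is an assumption; the actual check in router.py may differ.

```python
def base_url_sketch(server: str) -> str:
    """Hypothetical stand-in for _base_url: prepend http:// to bare host strings."""
    # Assumption: a string counts as "already a URL" if it contains "://".
    if "://" in server:
        return server
    return "http://" + server
```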
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
…VIDIA#14162)
Signed-off-by: Xianjie Qiao <xqiao@nvidia.com>
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
PR_Github #55141 [ run ] completed with state
…regation

Squash of the post-cherry-pick work layered on top of the 8 DeepSeek-V4 disaggregation cherry-picks.

Fixes:
- ADP disagg error path: restore per-request hang signal (_event_loop_error), scan all candidates + prefer CTX role for mixed-batch dummy padding, and keep charge_budget=False on KV-transfer timeouts so they don't exhaust the global error budget and shut down the executor.
- _count_schedulable_active_requests: gate the GENERATION_TO_COMPLETE exclusion on the V2 KV-cache manager. Only the V2 scheduler skips state >= GENERATION_TO_COMPLETE; the V1 scheduler still forwards those requests, so excluding them under V1 ADP undercounted and spuriously inserted an ADP dummy on top of a real request -- overflowing a small batch and tripping the mamba dummy-mask assert (n <= _dummy_request_mask_host.shape[0]) / "No free slots". Fixes test_ptp_quickstart_advanced_deepseek_v3_lite_4gpus_adp_balance.
- transceiver: only short-circuit the tp_allgather skip when pp_size==1 (_ctx_need_pp_sync) -- the PP>1 path asymmetrically flips send/recv markers across pipeline stages and deadlocks the _ctx_consensus pp_allgather.
- py_executor: restore main's immediate benchmark fail-fast guard.
- resource_manager: do NOT narrow trim_to_history's except (resize() can raise non-ValueError under v2 SWA + uneven-PP; narrowing leaked KV blocks).

Tests (added to existing files):
- test_py_executor.py: disagg cache-error sync + ADP no-op paths; ADP dummy-role and _pad_attention_dp_dummy_request V1/V2 GENERATION_TO_COMPLETE behavior (adp_balance regression).
- test_kv_cache_v2_scheduler.py: trim_to_history.
- test_cache_reuse_adapter.py: trim-to-prompt-history + transceiver ctx mgr.
- test_router.py: finish_request explicit-session forwarding.
- test_agent.py: BindingsNixlTransferStatus + shutdown idempotency (NVIDIA#14137).
- transferAgentTest.cpp: status-outlives-agent (weak_ptr UAF safety) + concurrent submitTransferRequests.
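The V1/V2 gating described for _count_schedulable_active_requests can be illustrated with a small sketch. Everything here is hypothetical: the `RequestState` values, the function name `count_schedulable`, and the `uses_v2_kv_cache_manager` flag are placeholders, and the sketch models only the GENERATION_TO_COMPLETE exclusion, not any other filtering the real function may perform.

```python
from enum import IntEnum


class RequestState(IntEnum):
    # Hypothetical values; only the relative ordering matters for this sketch.
    GENERATION_IN_PROGRESS = 1
    GENERATION_TO_COMPLETE = 2
    GENERATION_COMPLETE = 3


def count_schedulable(states, uses_v2_kv_cache_manager):
    """Count requests the scheduler would still forward this iteration."""
    if not uses_v2_kv_cache_manager:
        # V1 still forwards requests at or past GENERATION_TO_COMPLETE,
        # so excluding them here would undercount.
        return len(states)
    # V2 skips every request whose state is >= GENERATION_TO_COMPLETE.
    return sum(1 for s in states if s < RequestState.GENERATION_TO_COMPLETE)
```

With one in-progress request and one at GENERATION_TO_COMPLETE, the V1 path counts both while the V2 path counts only the first, which is the difference that decided whether an ADP dummy was inserted.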
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
8165df7 to cdef3d0
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #55245 [ run ] triggered by Bot. Commit:
This pull request significantly improves the thread safety and lifecycle management of the NIXL-based transfer agent system in the TensorRT-LLM codebase. The main changes include introducing shared locking for concurrent access, adding explicit shutdown logic to safely clean up resources, using shared pointers for agent ownership, and enhancing status tracking and error handling. These changes make the transfer agents more robust in multi-threaded and multi-client scenarios, preventing race conditions and resource leaks.
Thread safety and resource management:
- Introduced std::shared_mutex and added shared/unique locks (std::shared_lock/std::unique_lock) to all public methods of NixlTransferAgent and NixlLoopbackAgent that access shared state, ensuring thread-safe concurrent and exclusive access. Added checks to prevent operations after shutdown.
- Added shutdown() methods to both NixlTransferAgent and NixlLoopbackAgent for safe, idempotent cleanup of resources and remote state, and ensured destructors call shutdown().
Ownership and lifecycle improvements:
- Changed mRawAgent from std::unique_ptr to std::shared_ptr to enable safe lifetime management across transfer status objects and agent instances. NixlTransferStatus now holds a std::weak_ptr to the agent.
- Updated the NixlTransferStatus destructor and methods to safely release resources only if the agent is still alive, avoiding use-after-free errors.
Status tracking and error reporting:
- Added last-status tracking (mLastStatus) to NixlTransferStatus, with new methods getLastStatus() and getLastStatusStr() for improved error diagnostics and reporting.
Python bindings and GIL management:
Other improvements:
- Updated submitTransferRequests() to avoid races during concurrent submissions.
These changes collectively make the transfer agent system safer, more robust, and easier to use in concurrent and multi-client environments.
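The ownership pattern summarized above (idempotent shutdown, operations refused after shutdown, and a status object that holds only a weak reference to its agent) can be sketched in Python as a rough analogue of the C++ design. The class and method names below are hypothetical, `threading.Lock` stands in for std::shared_mutex because the Python standard library has no reader-writer lock, and `weakref.ref` stands in for std::weak_ptr.

```python
import threading
import weakref


class AgentSketch:
    """Hypothetical agent with idempotent shutdown guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._shut_down = False
        self.released = []  # handles released while the agent was live

    def shutdown(self):
        """Return True only for the call that actually performs the shutdown."""
        with self._lock:
            if self._shut_down:
                return False  # later calls are harmless no-ops
            self._shut_down = True
            return True

    def release(self, handle):
        """Release a handle, refusing the operation after shutdown."""
        with self._lock:
            if self._shut_down:
                return False
            self.released.append(handle)
            return True


class TransferStatusSketch:
    """Hypothetical status object that never keeps its agent alive."""

    def __init__(self, agent, handle):
        self._agent = weakref.ref(agent)  # analogue of std::weak_ptr
        self._handle = handle

    def release(self):
        agent = self._agent()  # None once the agent has been destroyed
        if agent is None:
            return False       # agent is gone: nothing to release, no dangling access
        return agent.release(self._handle)
```

The point of the weak reference is that a status object outliving its agent degrades to a no-op instead of touching freed state, which is the situation the status-outlives-agent test in this PR is described as covering.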
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.
Summary by CodeRabbit
Bug Fixes
New Features