[https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown #15422
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
NVIDIA#15422) Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com> Signed-off-by: GitLab CI Bot <gitlab-ci@nvidia.com>
Background
PR #15363 by @nv-xtf / Tingfeng Xian identified and prototyped fixes for several disaggregated serving failure modes while debugging generation-side KV transfer timeout deadlocks. This PR extracts only the shutdown/teardown lifetime-hardening pieces from that work into a standalone PR so they can be reviewed independently from bounded polling, timeout cancellation, and request-cancellation semantics.
Summary
This change hardens cache transceiver teardown by:
CacheSender and CacheReceiver while the connection manager and transfer plugin are still alive,
RequestInfo,
sendResponse() after the response thread wakes for termination without a valid response iterator.

The change is intentionally ungated because this is shutdown/lifetime correctness, not a new cancellation feature.
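The first and last points of the summary can be illustrated with a minimal sketch. It is not the code from this PR: `ConnectionManager`, `Sender`, `Transceiver`, and `runTeardownDemo` are hypothetical stand-ins, and the real `CacheSender`/`CacheReceiver` interfaces are not reproduced here. It only shows the general pattern, assuming a sender that owns a response thread and borrows a connection manager: release the dependent object first, and have the woken thread check its iterator before sending.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a connection manager; not the TensorRT-LLM class.
struct ConnectionManager
{
    explicit ConnectionManager(std::vector<std::string>& log) : mLog(log) {}
    ~ConnectionManager() { mLog.push_back("~ConnectionManager"); }
    void send(int id, std::string const&) { mLog.push_back("sendResponse(" + std::to_string(id) + ")"); }
    std::vector<std::string>& mLog;
};

// Hypothetical sender whose response thread borrows the manager.
class Sender
{
public:
    Sender(ConnectionManager& manager, std::vector<std::string>& log)
        : mManager(manager), mLog(log), mThread([this] { responseLoop(); })
    {
    }

    ~Sender()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mTerminate = true;
        }
        mCv.notify_all();
        mThread.join(); // the manager must still be alive until this returns
        assert(mReady.empty());
        mLog.push_back("~Sender");
    }

    void enqueue(int id)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mReady[id] = "response";
        }
        mCv.notify_all();
    }

private:
    void responseLoop()
    {
        std::unique_lock<std::mutex> lock(mMutex);
        while (true)
        {
            mCv.wait(lock, [this] { return mTerminate || !mReady.empty(); });
            auto it = mReady.begin();
            if (it == mReady.end())
            {
                // Woken for termination with nothing ready: exit without
                // dereferencing the iterator or sending anything.
                return;
            }
            mManager.send(it->first, it->second);
            mReady.erase(it);
        }
    }

    ConnectionManager& mManager;
    std::vector<std::string>& mLog;
    std::mutex mMutex;
    std::condition_variable mCv;
    std::map<int, std::string> mReady;
    bool mTerminate{false};
    std::thread mThread; // declared last so it starts after the other members exist
};

// Hypothetical owner that makes the teardown order explicit.
struct Transceiver
{
    explicit Transceiver(std::vector<std::string>& log)
        : mManager(std::make_unique<ConnectionManager>(log))
        , mSender(std::make_unique<Sender>(*mManager, log))
    {
    }

    ~Transceiver()
    {
        mSender.reset();  // dependent first, while the manager is still alive
        mManager.reset(); // dependency last
    }

    std::unique_ptr<ConnectionManager> mManager;
    std::unique_ptr<Sender> mSender;
};

// Builds and tears down a Transceiver, returning the recorded event order.
std::vector<std::string> runTeardownDemo(bool enqueueOne)
{
    std::vector<std::string> log;
    {
        Transceiver transceiver(log);
        if (enqueueOne)
        {
            transceiver.mSender->enqueue(7);
        }
    }
    return log;
}
```

Calling `runTeardownDemo(false)` exercises the termination wake-up with an empty queue, and `runTeardownDemo(true)` shows a pending response being sent before the sender is destroyed. Spelling out the `reset()` calls in the destructor, rather than relying on member declaration order alone, keeps the intended order visible if members are later reordered.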
Dependency graph
Arrows point from prerequisite to dependent. PR numbers in graph nodes are clickable.
This PR is based directly on main; it does not depend on #15181 or #15356. It is shown with a dashed edge into #15238 because it is preferred hardening before the gated cancellation PR, not because it is part of the bounded-polling stack.

graph TD
    PR15139["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15139'>#15139</a>: transfer state consensus (merged)"]
    PR15422["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15422'>#15422</a>: teardown hardening (this PR, open for review)"]
    PR15181["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15181'>#15181</a>: bounded C++ transfer status polling (inflight)"]
    PR15356["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15356'>#15356</a>: bounded V2 context transfer polling (inflight)"]
    PR15238["<a href='https://github.com/NVIDIA/TensorRT-LLM/pull/15238'>#15238</a>: in-flight cancellation + buffer poison (draft)"]
    WORK_BLOCKALL["blockAll / wait-all cancellation (planned)"]
    WORK_BUFFER["multi-slot buffers + unpoison recovery (planned)"]
    PR15139 -->|satisfied| PR15238
    PR15181 -->|blocking| PR15356
    PR15181 -->|blocking| PR15238
    PR15356 -->|blocking| PR15238
    PR15422 -.->|preferred hardening| PR15238
    PR15238 -.->|planned| WORK_BLOCKALL
    PR15238 -.->|planned| WORK_BUFFER
    classDef merged fill:#dcfce7,stroke:#16a34a,color:#14532d;
    classDef inflight fill:#dbeafe,stroke:#2563eb,color:#1e3a8a;
    classDef draft fill:#ffedd5,stroke:#f97316,color:#7c2d12;
    classDef current fill:#ede9fe,stroke:#7c3aed,color:#3b0764,stroke-width:3px;
    classDef downstream fill:#f3f4f6,stroke:#6b7280,color:#374151,stroke-dasharray:5 5;
    linkStyle 0 stroke:#16a34a,stroke-width:2px;
    linkStyle 1,2,3 stroke:#ea580c,stroke-width:3px;
    linkStyle 4 stroke:#64748b,stroke-width:2px,stroke-dasharray:3 3;
    linkStyle 5,6 stroke:#6b7280,stroke-width:2px,stroke-dasharray:5 5;
    class PR15139 merged;
    class PR15181,PR15356 inflight;
    class PR15422 current;
    class PR15238 draft;
    class WORK_BLOCKALL,WORK_BUFFER downstream;

Scope and relationship to related PRs
Validation
git diff --check
git commit -s pre-commit hooks passed, including clang-format, codespell, duplicate waive checks, and test-list validation. The first hook attempt used system Python 3.9 and failed on Python 3.10 union syntax in scripts/check_test_list.py; rerunning with bundled Python 3.12 on PATH passed.