[https://nvbugs/6342844][fix] Prevent cache sender lost wakeups by chienchunhung · Pull Request #15737 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-06-29T23:18:56Z

Summary

PR #15737 fixes the CacheSender lost wakeup and data race behind the reported generation-side KV-transfer stalls tracked in NVBug 6342844. Of the eleven active perf-sanity cases, nine have valid results against the final behavior of this PR: all nine finish, seven pass completely, and two fail only separate performance gates. The remaining two require current-head reruns; no valid final-behavior run has reproduced the original liveness hang.

Fix details

mReadyResponses was protected by mSenderMutex, but readiness was mirrored in mAnyReady under a different mutex. A remover could erase the last response, an enqueuer could insert and notify, and the remover could then overwrite mAnyReady=false; the map remained non-empty while the response worker slept indefinitely. Because the queue belongs to the shared C++ sender implementation, the race can strand both asynchronous and synchronous transfers and is not transport-specific.

Unify sender synchronization: Guard the response map, current request, cancellation state, and condition predicate with mSenderMutex; wait directly on mReadyResponses; and remove mAnyReady, mCondMutex, and the unused responder condition variable.
Keep blocking work outside the lock: Wait for the exact request ID and move or erase responses under the sender lock, while transport, CUDA, and promise operations remain outside the critical section.
Settle terminal work: Complete outstanding response-map and asynchronous-queue promises during failure and shutdown.
Preserve synchronous admission: Skip asynchronous generation-status polling and optional idle-progress collectives only for real synchronous transfer. Continue enforcing max_tokens_in_buffer; bypass transfer admission only for synthetic gen_only_no_context, which performs no KV transfer.
Preserve prior fixes: Retain PR #14979's shared request ownership ([https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr<LlmRequest> lifetime fix) and PR #15356's bounded synchronous admission semantics ([TRTLLM-12721][fix] Bound disagg transfer polling and admission).
Add focused qualification: Add unittest/others/test_kv_cache_transceiver.py::test_cpp_nixl_sync_transfer_stress for 64 sequential C++/NIXL synchronous transfers, plus unit coverage for synchronous admission, no-transfer bypass, asynchronous-poll bypass, rank-aligned error handling, and environment isolation.
Adjust one test capacity: Raise free_gpu_memory_fraction from 0.90 to 0.95 only for the GB200 DeepSeek con3072 benchmark. This does not change a production default.
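The "keep blocking work outside the lock" and "settle terminal work" items can be illustrated with a small sketch. It assumes a hypothetical PendingResponses class holding one std::promise<void> per request ID; the real sender's types and method names are not shown in this description and may differ. The entry for the exact request ID is moved out of the map under the mutex, the promise is fulfilled after the lock is released, and a failure or shutdown path drains every remaining promise so that no waiter is left blocked.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <future>
#include <map>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical container for outstanding per-request promises.
class PendingResponses
{
public:
    std::future<void> add(std::uint64_t requestId)
    {
        std::promise<void> promise;
        auto future = promise.get_future();
        std::lock_guard<std::mutex> lock(mSenderMutex);
        mPending.emplace(requestId, std::move(promise));
        return future;
    }

    // Moves the exact request out under the lock, then settles it outside.
    bool complete(std::uint64_t requestId)
    {
        std::promise<void> promise;
        {
            std::lock_guard<std::mutex> lock(mSenderMutex);
            auto it = mPending.find(requestId);
            if (it == mPending.end())
            {
                return false;
            }
            promise = std::move(it->second);
            mPending.erase(it);
        }
        // Transport or CUDA work would run here, outside the critical section.
        promise.set_value();
        return true;
    }

    // Failure or shutdown path: settle every outstanding promise with an error.
    std::size_t failAll(std::exception_ptr error)
    {
        std::map<std::uint64_t, std::promise<void>> drained;
        {
            std::lock_guard<std::mutex> lock(mSenderMutex);
            drained.swap(mPending);
        }
        for (auto& entry : drained)
        {
            entry.second.set_exception(error);
        }
        return drained.size();
    }

private:
    std::mutex mSenderMutex;
    std::map<std::uint64_t, std::promise<void>> mPending;
};
```

Swapping the map into a local variable keeps the critical section short and means promise callbacks never run while mSenderMutex is held.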

The change introduces no public API or configuration change. Normal asynchronous Python scheduling remains unchanged and benefits from the generic C++ sender fix. Benchmark fill-gate rank skew and unmatched-request UCX/MPI teardown remain out of scope.
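As a rough illustration of the synchronous admission rule described under "Preserve synchronous admission", the check might look like the sketch below. Everything here is an assumption made for illustration: AdmissionInput, admitSyncTransfer, and the way the budget is accounted (tokens already in the buffer plus the new request's tokens must not exceed max_tokens_in_buffer) are hypothetical, and the actual admission logic lives in the executor code rather than in a standalone function like this.

```cpp
#include <cstddef>

// Hypothetical inputs to a synchronous-transfer admission decision.
struct AdmissionInput
{
    bool genOnlyNoContext;         // synthetic mode that performs no KV transfer
    std::size_t requestTokens;     // tokens the new request would transfer
    std::size_t tokensInBuffer;    // tokens already admitted and in flight
    std::size_t maxTokensInBuffer; // configured max_tokens_in_buffer
};

constexpr bool admitSyncTransfer(AdmissionInput const& in)
{
    // Only the synthetic no-transfer mode bypasses admission entirely.
    if (in.genOnlyNoContext)
    {
        return true;
    }
    // Real synchronous transfers stay bounded by the buffer budget.
    // The comparison is written to avoid unsigned overflow.
    return in.requestTokens <= in.maxTokensInBuffer
        && in.tokensInBuffer <= in.maxTokensInBuffer - in.requestTokens;
}
```

With a budget of 1000 tokens and 900 already in flight, this sketch admits a 100-token request, rejects a 101-token request, and admits any gen_only_no_context request regardless of size.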

Test outcomes and accounting

🟢 Finishes and passes · 🟠 Finishes; full test fails · 🟡 No valid result for final behavior · ⚪ Out of scope

“Finishes” is the liveness result: the benchmark completes instead of stalling during KV transfer. “Passes” additionally requires all functional and performance gates. The first eleven rows are the active unwaive experiment; the remaining rows record additional reported, regression, or explicitly excluded cases.

Waiver status compares upstream before this PR with the intended state after it. Stage names are the exact assignments observed in the cited PR pipelines; duration-based shard suffixes may move when test lists are rebalanced. NVBug 6342844 is the umbrella regression and PR target; historical model-specific waivers remain listed.

Full test name	Stage	Observed outcome / next step	Waiver before → after	NVBug
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]`	`GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2`	🟢 Finishes and passes. Confirmed #14979 liveness case; PASS in #45622	Waived → unwaived	6342844 (waiver 6323889)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL]`	`GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1`	🟢 Finishes and passes. Intermittent #14979 liveness case; PASS in #45622	Waived → unwaived	6342844 (waiver 6323889)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL]`	`GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2`	🟢 Finishes and passes. Confirmed #14979 liveness case; PASS in #45622	Waived → unwaived	6342844 (waiver 6323889)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]`	`GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3`	🟢 Finishes and passes. PASS in #45816 after test-only KV-capacity adjustment	Waived → unwaived	6342844 (waiver 6323889)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]`	`GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2`	🟢 Finishes and passes. Confirmed #14979 liveness case; PASS in #45622	Waived → unwaived	6342844 (waiver 6323889)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_8k1k_con1024_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-NIXL]`	`GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3`	🟢 Finishes and passes. Confirmed #14979 liveness case; passed 3 substantive PR runs	Waived → unwaived	6342844 (waiver 6324123)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_kimi-k25-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]`	`GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8`	🟢 Finishes and passes. Historical pre/post-#14979 liveness regression; PASS in #45622	Waived → unwaived	6342844 (waiver 6323074)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-NIXL]`	`GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2`	🟠 Finishes; performance fails. 192.2s in #45622; 12.39% device-step regression. Follow up the ADP performance path; PR #15222's cache-error vote is an unconfirmed hypothesis.	Waived → waived (temporarily tested)	6342844 (report/waiver 6324123)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]`	`GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4`	🟠 Finishes; performance fails. Confirmed parent-pass/#14979-hang case; 488.4s in #45622, then a 14.98% device-step regression. Follow up performance separately.	Waived → waived (temporarily tested)	6342844 (waiver 6323074)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_1k1k_con4_ctx1_dep4_gen1_tep4_eplb0_mtp0_ccb-NIXL]`	`GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3`	🟡 Final behavior unverified. Passed 3 earlier PR revisions, but not after the material `b0cf1ce` synchronous-flow change. Rerun current head; unwaive if it still passes.	Waived → waived (temporarily tested)	6342844 (waiver 6323074)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb416_mtp3_ccb-NIXL]`	`GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2`	🟡 No valid completed run. Existing records are early wrapper terminations. Run the current-head stage; investigate liveness only if a substantive run still does not finish.	Waived → waived (temporarily tested)	6342844 (waiver 6323074)
`unittest/others/test_kv_cache_transceiver.py::test_cpp_nixl_sync_transfer_stress`	`B200_PCIe-PackageSanityCheck-PY312-DLFW` (also registered in l0_sanity_check and l0_b200)	🟢 Finishes and passes. Direct regression: 64 sequential C++/NIXL synchronous transfers PASS in #45579	Absent → added/unwaived	6342844 (PR target)
`disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]` (B200)	`LLM_FUNCTION_CLUSTER_TEST` (qa/llm_function_stress.txt, B200)	🟡 Pending QA result. Async C++ sender qualification; run the unwaived B200 workload	Waived → unwaived	6312828 (qualification for 6342844)
`perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-NIXL]`	`GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3`	🟡 Not run on this PR. #14979 A/B was inconclusive. Qualify separately before changing its waiver.	Waived → waived	6342844 (waiver 6324123)
`disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]` (H100)	`LLM_FUNCTION_CLUSTER_TEST` (qa/llm_function_stress.txt, H100)	🟡 Not qualified. Run separately if H100 closure is required	Waived → waived	6312828 (report/waiver)
`accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype_with_helix[fifo-cudagraph:with_padding-pp1tp1cp4]`	`DGX_B200-8_GPUs-PyTorch-3`	⚪ Out of scope. Model-forward/DSA issue outside the captured `CacheSender` wait path	Waived → waived (untouched)	6396413 (report/waiver)

Passing an unlisted test or another test sharing one of these NVBug IDs does not expand this PR's claimed scope.

Validation

L0 pipeline #45622 verified six passing perf-sanity rows. GPT-OSS con2048 and GB300 Kimi con4096 also completed in that pipeline and failed only their performance gates.
L0 pipeline #45816 verified the seventh retained gate, GB200 DeepSeek con3072, after its test-only KV-capacity adjustment.
Pipelines #45484, #45538, and #45568 passed GB300 Kimi con4, but those runs predate the material b0cf1ce synchronous-flow change and do not close current behavior.
Stacked child pipeline #45579, which contains this diff, verified test_cpp_nixl_sync_transfer_stress in B200_PCIe-PackageSanityCheck-PY312-DLFW.
Current head 629d47e74f passed changed-file pre-commit hooks, waiver duplicate checking, AST test-list validation, and git diff --check.

Next qualification steps are the current-head GB300 Kimi con4 and 8k1k con1024 stages, followed by the B200 Qwen asynchronous stress workload. The direct C++ stress test is single-process and probabilistic; existing multi-counterpart AsymmetricalCacheTest and AsymmetricalCacheTestWithDP coverage has not yet been rerun.

Dependency graph

Arrows point from prerequisite to dependent. #15238 and #15737 are parallel, independently mergeable inputs to #15794. All PRs target NVIDIA/main; stacked child diffs are cumulative until their parents merge.

#15798 and #15799 are production/default-on hardening follow-ups. They do not block the default-off merges of #15238, #15794, or #15795. #15798 is required for #15738. #15799 may be replaced only by an explicitly approved bounded fail/restart policy that satisfies the same memory-safety requirements.

flowchart LR
    PR15238["#15238<br/>gated cancellation<br/>(default off)"]
    PR15737["#15737<br/>CacheSender lost-wakeup fix"]
    PR15794["#15794<br/>C++ protocol and buffer safety<br/>(default off)"]
    PR15795["#15795<br/>PyExecutor lifecycle safety<br/>(default off)"]
    PR15798["#15798<br/>CTX↔GEN protocol/mode negotiation"]
    PR15799["#15799<br/>peer cancel/terminal protocol<br/>(design scaffold)"]
    PR15738["#15738<br/>qualified default-on policy"]

    PR15238 --> PR15794
    PR15737 --> PR15794
    PR15794 --> PR15795
    PR15794 --> PR15798
    PR15798 --> PR15799
    PR15795 --> PR15738
    PR15798 --> PR15738
    PR15799 -.->|or bounded fail/restart policy| PR15738

chienchunhung · 2026-06-30T00:12:02Z

/bot run --disable-fail-fast --stage-list "B200_PCIe-PackageSanityCheck-PY312-DLFW,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3"

chienchunhung · 2026-06-30T16:26:23Z

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

tensorrt-cicd · 2026-06-30T16:34:16Z

PR_Github #56660 [ run ] triggered by Bot. Commit: 3bea148 Link to invocation

Combine the current NVIDIA#15238, NVIDIA#15737, NVIDIA#15794, and NVIDIA#15795 snapshots into one parent commit for NVIDIA#15738. The source PRs remain independently reviewable and mergeable. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

tensorrt-cicd · 2026-06-30T19:07:57Z

PR_Github #56660 [ run ] completed with state FAILURE. Commit: 3bea148
/LLM/main/L0_MergeRequest_PR pipeline #45484 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-30T19:44:43Z

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

tensorrt-cicd · 2026-06-30T19:50:57Z

PR_Github #56713 [ run ] triggered by Bot. Commit: a38e7d9 Link to invocation

tensorrt-cicd · 2026-06-30T19:51:08Z

PR_Github #56712 [ run ] triggered by Bot. Commit: a38e7d9 Link to invocation

tensorrt-cicd · 2026-06-30T19:51:11Z

PR_Github #56713 [ run ] completed with state ABORTED. Commit: a38e7d9

Link to invocation

chienchunhung · 2026-06-30T19:52:31Z

/bot kill

tensorrt-cicd · 2026-06-30T19:58:31Z

PR_Github #56717 [ kill ] triggered by Bot. Commit: a38e7d9 Link to invocation

tensorrt-cicd · 2026-06-30T20:01:50Z

PR_Github #56712 [ run ] completed with state ABORTED. Commit: a38e7d9

Link to invocation

tensorrt-cicd · 2026-06-30T20:02:04Z

PR_Github #56717 [ kill ] completed with state SUCCESS. Commit: a38e7d9
Successfully killed previous jobs for commit a38e7d9

Link to invocation

chienchunhung · 2026-06-30T20:02:25Z

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

tensorrt-cicd · 2026-06-30T20:09:04Z

PR_Github #56718 [ run ] triggered by Bot. Commit: a38e7d9 Link to invocation

chienchunhung · 2026-06-30T22:44:33Z

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

tensorrt-cicd · 2026-06-30T22:48:19Z

PR_Github #56718 [ run ] completed with state FAILURE. Commit: a38e7d9
/LLM/main/L0_MergeRequest_PR pipeline #45538 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

tensorrt-cicd · 2026-06-30T22:50:26Z

PR_Github #56750 [ run ] triggered by Bot. Commit: 0d78feb Link to invocation

tensorrt-cicd · 2026-07-01T01:26:59Z

PR_Github #56750 [ run ] completed with state FAILURE. Commit: 0d78feb
/LLM/main/L0_MergeRequest_PR pipeline #45568 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-07-02T16:23:05Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32,GB300-4_GPUs-1_Node-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4"

tensorrt-cicd · 2026-07-02T16:29:07Z

PR_Github #57218 [ run ] triggered by Bot. Commit: 1617bfb Link to invocation

tensorrt-cicd · 2026-07-02T16:41:30Z

PR_Github #57218 [ run ] completed with state FAILURE. Commit: 1617bfb
/LLM/main/L0_MergeRequest_PR pipeline #45989 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-07-02T17:00:41Z

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

tensorrt-cicd · 2026-07-02T17:07:31Z

PR_Github #57225 [ run ] triggered by Bot. Commit: c70ed75 Link to invocation

tensorrt-cicd · 2026-07-02T18:16:39Z

PR_Github #57225 [ run ] completed with state SUCCESS. Commit: c70ed75
/LLM/main/L0_MergeRequest_PR pipeline #45993 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-07-02T18:46:02Z

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

tensorrt-cicd · 2026-07-02T18:51:50Z

PR_Github #57244 [ run ] triggered by Bot. Commit: 1b95bd0 Link to invocation

chienchunhung · 2026-07-02T20:42:12Z

/bot kill

chienchunhung · 2026-07-02T20:43:22Z

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

tensorrt-cicd · 2026-07-02T20:48:41Z

PR_Github #57254 [ kill ] triggered by Bot. Commit: 1b95bd0 Link to invocation

tensorrt-cicd · 2026-07-02T20:48:45Z

PR_Github #57255 [ ] completed with state ABORTED. Commit: ``

Link to invocation

tensorrt-cicd · 2026-07-02T20:52:23Z

PR_Github #57244 [ run ] completed with state ABORTED. Commit: 1b95bd0
LLM/main/L0_MergeRequest_PR #46011 (Blue Ocean) completed with status: ABORTED

Link to invocation

tensorrt-cicd · 2026-07-02T20:52:35Z

PR_Github #57254 [ kill ] completed with state SUCCESS. Commit: 1b95bd0
Successfully killed previous jobs for commit 1b95bd0

Link to invocation

chienchunhung · 2026-07-02T20:53:28Z

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

chienchunhung · 2026-07-02T21:00:55Z

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

tensorrt-cicd · 2026-07-02T21:07:21Z

PR_Github #57257 [ run ] triggered by Bot. Commit: 1b95bd0 Link to invocation

tensorrt-cicd · 2026-07-03T08:09:02Z

PR_Github #57257 [ run ] completed with state FAILURE. Commit: 1b95bd0
/LLM/main/L0_MergeRequest_PR pipeline #46022 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned chienchunhung Jun 29, 2026

chienchunhung mentioned this pull request Jun 29, 2026

[TRTLLM-12721][feat] Enable disaggregated in-flight cancellation by default #15738

Draft

chienchunhung changed the title ~~[https://nvbugs/6323889][fix] Prevent cache sender lost wakeups~~ [https://nvbugs/6342844][fix] Prevent cache sender lost wakeups Jun 29, 2026

This was referenced Jun 30, 2026

[TRTLLM-12721][fix] Harden disagg cancellation consensus and buffer ownership #15794

Draft

[TRTLLM-12721][fix] Add gated C++ disagg in-flight cancellation #15238

Open

This was referenced Jun 30, 2026

[TRTLLM-12721][fix] Harden PyExecutor disagg cancellation lifecycle #15795

Draft

[TRTLLM-12721][fix] Negotiate disaggregated cancellation protocol and mode #15798

Draft

[TRTLLM-12721][docs] Design disaggregated peer cancellation protocol #15799

Draft

chienchunhung added 9 commits July 2, 2026 09:21

[https://nvbugs/6342844][test] Unwaive reported gen-only failures

0c497bc

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6342844][fix] Separate synchronous KV transfer flow

6d06093

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6342844][fix] Bound synchronous KV transfer admission

51c75c2

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6342844][fix] Align synchronous disagg progress

7ac66fb

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6342844][test] Increase DeepSeek gen-only KV capacity

1e9b405

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6324123][test] Keep GPT-OSS con2048 perf test waived

2e74408

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6323074][test] Keep GB300 Kimi perf test waived

a39b31e

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6323074][test] Restore deferred Kimi waivers

6e59ea1

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6312828][test] Unwaive Qwen disagg stress reproducer

1617bfb

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung force-pushed the codex/disagg-sender-ready-race branch from 629d47e to 1617bfb Compare July 2, 2026 16:22

[None][chore] sort test waivers

c70ed75

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

[https://nvbugs/6342844][test] Unwaive remaining qualification cases

1b95bd0

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
