[https://nvbugs/6342844][fix] Prevent cache sender lost wakeups #15737

Draft
chienchunhung wants to merge 13 commits into NVIDIA:main from chienchunhung:codex/disagg-sender-ready-race


chienchunhung commented Jun 29, 2026

Collaborator

Summary

PR #15737 fixes the CacheSender lost wakeup and data race behind the reported generation-side KV-transfer stalls (NVBug 6342844). Nine of the eleven active perf-sanity cases have valid results on the final relevant behavior: all nine finish, seven pass completely, and two fail only their separate performance gates. The remaining two need current-head reruns. No valid final-behavior run has reproduced the original liveness hang.

Fix details

mReadyResponses was protected by mSenderMutex, but readiness was mirrored in mAnyReady under a different mutex. A remover could erase the last response, an enqueuer could insert and notify, and the remover could then overwrite mAnyReady=false; the map remained non-empty while the response worker slept indefinitely. Because the queue belongs to the shared C++ sender implementation, the race can strand both asynchronous and synchronous transfers and is not transport-specific.
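The losing interleaving can be replayed by hand. The following is a minimal single-threaded Python sketch, not the C++ sender code: `ready_responses` and `any_ready` are hypothetical stand-ins for mReadyResponses and mAnyReady, and the three steps are simply executed in the order described above.

```python
def replay_lost_wakeup():
    """Replay the losing interleaving step by step (illustrative only)."""
    ready_responses = {1: "response-1"}  # stand-in for mReadyResponses
    any_ready = True                     # stand-in for mAnyReady (separate mutex)

    # Step 1 (remover, holding the map mutex): erase the last response and
    # remember that the map looked empty at that moment.
    del ready_responses[1]
    remover_saw_empty = not ready_responses

    # Step 2 (enqueuer): insert a new response, set the flag, notify.
    ready_responses[2] = "response-2"
    any_ready = True

    # Step 3 (remover, now holding the flag mutex): publish the stale
    # observation from step 1, overwriting the enqueuer's update.
    if remover_saw_empty:
        any_ready = False

    return ready_responses, any_ready


responses, flag = replay_lost_wakeup()
# The map holds one response, but the mirrored flag claims nothing is ready.
print(len(responses), flag)
```

A worker that waits on the flag rather than on the map would keep sleeping in this state, which is the stall the PR describes.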

  • Unify sender synchronization: Guard the response map, current request, cancellation state, and condition predicate with mSenderMutex; wait directly on mReadyResponses; and remove mAnyReady, mCondMutex, and the unused responder condition variable.
  • Keep blocking work outside the lock: Wait for the exact request ID and move or erase responses under the sender lock, while transport, CUDA, and promise operations remain outside the critical section.
  • Settle terminal work: Complete outstanding response-map and asynchronous-queue promises during failure and shutdown.
  • Preserve synchronous admission: Skip asynchronous generation-status polling and optional idle-progress collectives only for real synchronous transfer. Continue enforcing max_tokens_in_buffer; bypass transfer admission only for synthetic gen_only_no_context, which performs no KV transfer.
  • Preserve prior fixes: Retain the shared request ownership from PR #14979 ([https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr<LlmRequest> lifetime fix) and the bounded synchronous admission semantics from PR #15356 ([TRTLLM-12721][fix] Bound disagg transfer polling and admission).
  • Add focused qualification: Add unittest/others/test_kv_cache_transceiver.py::test_cpp_nixl_sync_transfer_stress for 64 sequential C++/NIXL synchronous transfers, plus unit coverage for synchronous admission, no-transfer bypass, asynchronous-poll bypass, rank-aligned error handling, and environment isolation.
  • Adjust one test capacity: Raise free_gpu_memory_fraction from 0.90 to 0.95 only for the GB200 DeepSeek con3072 benchmark. This does not change a production default.
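The first three bullets follow a common pattern, sketched here in Python under stated assumptions. This is an analogy, not the C++ change itself: `ResponseQueue`, `send`, and the `transport` callable are hypothetical names, a single `threading.Condition` plays the role of mSenderMutex and its condition variable, and `concurrent.futures.Future` stands in for the promises.

```python
import threading
from concurrent.futures import Future


class ResponseQueue:
    """One lock guards the map, the shutdown flag, and the wait predicate."""

    def __init__(self):
        self._cond = threading.Condition()  # analogue of mSenderMutex + CV
        self._ready = {}                    # request_id -> (payload, Future)
        self._closed = False

    def enqueue(self, request_id, payload):
        promise = Future()
        with self._cond:
            if self._closed:
                promise.set_exception(RuntimeError("sender is shut down"))
                return promise
            self._ready[request_id] = (payload, promise)
            self._cond.notify_all()
        return promise

    def take(self, request_id):
        """Block until the exact request ID is present, then move it out."""
        with self._cond:
            # The predicate reads the map itself, so no mirrored flag can
            # disagree with it.
            self._cond.wait_for(
                lambda: self._closed or request_id in self._ready)
            return self._ready.pop(request_id, None)

    def shutdown(self):
        with self._cond:
            self._closed = True
            pending = list(self._ready.values())
            self._ready.clear()
            self._cond.notify_all()
        # Settle outstanding promises outside the lock.
        for _, promise in pending:
            promise.set_exception(RuntimeError("shut down before send"))


def send(queue, request_id, transport):
    item = queue.take(request_id)
    if item is None:                 # woken by shutdown, nothing to send
        return False
    payload, promise = item
    try:
        transport(payload)           # blocking work runs without the lock
        promise.set_result(request_id)
    except Exception as exc:         # settle the promise on failure as well
        promise.set_exception(exc)
    return True


queue = ResponseQueue()
sent = []
worker = threading.Thread(target=send, args=(queue, 7, sent.append))
worker.start()
done = queue.enqueue(7, "kv-block")
worker.join(timeout=5)
```

Because the waiter's predicate is evaluated on the map under the same lock the enqueuer holds, it does not matter whether the worker starts waiting before or after `enqueue` runs.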

The change introduces no public API or configuration change. Normal asynchronous Python scheduling remains unchanged and benefits from the generic C++ sender fix. Benchmark fill-gate rank skew and unmatched-request UCX/MPI teardown remain out of scope.

Test outcomes and accounting

🟢 Finishes and passes · 🟠 Finishes; full test fails · 🟡 No valid result for final behavior · ⚪ Out of scope

“Finishes” is the liveness result: the benchmark completes instead of stalling during KV transfer. “Passes” additionally requires all functional and performance gates. The first eleven rows are the active unwaive experiment; the remaining rows record additional reported, regression, or explicitly excluded cases.
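The two definitions can be read as a small decision rule. The sketch below is illustrative only: `RunRecord` and `classify` are hypothetical names, and the "does not finish" label is an assumption added for completeness rather than a status used in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    completed: bool       # benchmark ran to the end instead of stalling
    functional_ok: bool   # all functional gates passed
    performance_ok: bool  # all performance gates passed


def classify(record: Optional[RunRecord]) -> str:
    if record is None:
        return "no valid result"
    if not record.completed:
        return "does not finish"  # assumed label, not in the legend
    if record.functional_ok and record.performance_ok:
        return "finishes and passes"
    return "finishes; full test fails"


# A run that completes but misses a performance gate still counts as
# "finishes" for liveness purposes.
print(classify(RunRecord(completed=True, functional_ok=True,
                         performance_ok=False)))
```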

Waiver status compares upstream before this PR with the intended state after it. Stage names are the exact assignments observed in the cited PR pipelines; duration-based shard suffixes may move when test lists are rebalanced. NVBug 6342844 is the umbrella regression and PR target; historical model-specific waivers remain listed.

| Full test name | Stage | Observed outcome / next step | Waiver (before → after) | NVBug |
| --- | --- | --- | --- | --- |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]` | `GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2` | 🟢 Finishes and passes<br>Confirmed #14979 liveness case; PASS in #45622 | Waived → unwaived | 6342844<br>Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL]` | `GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1` | 🟢 Finishes and passes<br>Intermittent #14979 liveness case; PASS in #45622 | Waived → unwaived | 6342844<br>Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL]` | `GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2` | 🟢 Finishes and passes<br>Confirmed #14979 liveness case; PASS in #45622 | Waived → unwaived | 6342844<br>Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3` | 🟢 Finishes and passes<br>PASS in #45816 after test-only KV-capacity adjustment | Waived → unwaived | 6342844<br>Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]` | `GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2` | 🟢 Finishes and passes<br>Confirmed #14979 liveness case; PASS in #45622 | Waived → unwaived | 6342844<br>Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_8k1k_con1024_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3` | 🟢 Finishes and passes<br>Confirmed #14979 liveness case; passed 3 substantive PR runs | Waived → unwaived | 6342844<br>Waiver 6324123 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_kimi-k25-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]` | `GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8` | 🟢 Finishes and passes<br>Historical pre/post-#14979 liveness regression; PASS in #45622 | Waived → unwaived | 6342844<br>Waiver 6323074 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2` | 🟠 Finishes; performance fails<br>192.2s in #45622; 12.39% device-step regression. Follow up the ADP performance path; PR #15222's cache-error vote is an unconfirmed hypothesis. | Waived → waived<br>Temporarily tested | 6342844<br>Report/waiver 6324123 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]` | `GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4` | 🟠 Finishes; performance fails<br>Confirmed parent-pass/#14979-hang case; 488.4s in #45622, then a 14.98% device-step regression. Follow up performance separately. | Waived → waived<br>Temporarily tested | 6342844<br>Waiver 6323074 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_1k1k_con4_ctx1_dep4_gen1_tep4_eplb0_mtp0_ccb-NIXL]` | `GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3` | 🟡 Final behavior unverified<br>Passed 3 earlier PR revisions, but not after the material b0cf1ce synchronous-flow change. Rerun current head; unwaive if it still passes. | Waived → waived<br>Temporarily tested | 6342844<br>Waiver 6323074 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb416_mtp3_ccb-NIXL]` | `GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2` | 🟡 No valid completed run<br>Existing records are early wrapper terminations. Run the current-head stage; investigate liveness only if a substantive run still does not finish. | Waived → waived<br>Temporarily tested | 6342844<br>Waiver 6323074 |
| `unittest/others/test_kv_cache_transceiver.py::test_cpp_nixl_sync_transfer_stress` | `B200_PCIe-PackageSanityCheck-PY312-DLFW`<br>Also registered in l0_sanity_check and l0_b200 | 🟢 Finishes and passes<br>Direct regression: 64 sequential C++/NIXL synchronous transfers PASS in #45579 | Absent → added/unwaived | 6342844<br>PR target |
| `disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]`<br>B200 | LLM_FUNCTION_CLUSTER_TEST<br>qa/llm_function_stress.txt, B200 | 🟡 Pending QA result<br>Async C++ sender qualification; run the unwaived B200 workload | Waived → unwaived | 6312828<br>Qualification for 6342844 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3` | 🟡 Not run on this PR<br>#14979 A/B was inconclusive. Qualify separately before changing its waiver. | Waived → waived | 6342844<br>Waiver 6324123 |
| `disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]`<br>H100 | LLM_FUNCTION_CLUSTER_TEST<br>qa/llm_function_stress.txt, H100 | 🟡 Not qualified<br>Run separately if H100 closure is required | Waived → waived | 6312828<br>Report/waiver |
| `accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype_with_helix[fifo-cudagraph:with_padding-pp1tp1cp4]` | `DGX_B200-8_GPUs-PyTorch-3` | ⚪ Out of scope<br>Model-forward/DSA issue outside the captured CacheSender wait path | Waived → waived<br>Untouched | 6396413<br>Report/waiver |

Passing an unlisted test or another test sharing one of these NVBug IDs does not expand this PR's claimed scope.

Validation

  • L0 pipeline #45622 verified six passing perf-sanity rows. GPT-OSS con2048 and GB300 Kimi con4096 also completed in that pipeline and failed only their performance gates.
  • L0 pipeline #45816 verified the seventh retained gate, GB200 DeepSeek con3072, after its test-only KV-capacity adjustment.
  • Pipelines #45484, #45538, and #45568 passed GB300 Kimi con4, but those runs predate the material b0cf1ce synchronous-flow change and do not close current behavior.
  • Stacked child pipeline #45579, which contains this diff, verified test_cpp_nixl_sync_transfer_stress in B200_PCIe-PackageSanityCheck-PY312-DLFW.
  • Current head 629d47e74f passed changed-file pre-commit hooks, waiver duplicate checking, AST test-list validation, and git diff --check.

Next qualification steps are the current-head GB300 Kimi con4 and 8k1k con1024 stages, followed by the B200 Qwen asynchronous stress workload. The direct C++ stress test is single-process and probabilistic; existing multi-counterpart AsymmetricalCacheTest and AsymmetricalCacheTestWithDP coverage has not yet been rerun.

Dependency graph

Arrows point from prerequisite to dependent. #15238 and #15737 are parallel, independently mergeable inputs to #15794. All PRs target NVIDIA/main; stacked child diffs are cumulative until their parents merge.

#15798 and #15799 are production/default-on hardening follow-ups. They do not block the default-off merges of #15238, #15794, or #15795. #15798 is required for #15738. #15799 may be replaced only by an explicitly approved bounded fail/restart policy that satisfies the same memory-safety requirements.

flowchart LR
    PR15238["#15238<br/>gated cancellation<br/>(default off)"]
    PR15737["#15737<br/>CacheSender lost-wakeup fix"]
    PR15794["#15794<br/>C++ protocol and buffer safety<br/>(default off)"]
    PR15795["#15795<br/>PyExecutor lifecycle safety<br/>(default off)"]
    PR15798["#15798<br/>CTX↔GEN protocol/mode negotiation"]
    PR15799["#15799<br/>peer cancel/terminal protocol<br/>(design scaffold)"]
    PR15738["#15738<br/>qualified default-on policy"]

    PR15238 --> PR15794
    PR15737 --> PR15794
    PR15794 --> PR15795
    PR15794 --> PR15798
    PR15798 --> PR15799
    PR15795 --> PR15738
    PR15798 --> PR15738
    PR15799 -.->|or bounded fail/restart policy| PR15738
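One way to read the graph is as a merge order. The sketch below copies the flowchart edges into Python's standard `graphlib.TopologicalSorter` (Python 3.9+) and prints one valid ordering. The dictionary layout is an illustration only, and the dotted #15799 edge is treated as a hard prerequisite here, which ignores the "bounded fail/restart policy" alternative.

```python
from graphlib import TopologicalSorter

# PR number -> set of prerequisite PRs, taken from the flowchart edges.
prerequisites = {
    15794: {15238, 15737},
    15795: {15794},
    15798: {15794},
    15799: {15798},
    # The 15799 entry corresponds to the dotted (replaceable) edge.
    15738: {15795, 15798, 15799},
}

# static_order() yields every node after all of its prerequisites.
order = list(TopologicalSorter(prerequisites).static_order())
print(order)
```

Several orderings are valid, because #15238 and #15737 are parallel inputs and #15795 is independent of #15798 and #15799. In every one of them #15794 comes after both of its inputs and #15738 comes last.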

chienchunhung changed the title from "[https://nvbugs/6323889][fix] Prevent cache sender lost wakeups" to "[https://nvbugs/6342844][fix] Prevent cache sender lost wakeups" on Jun 29, 2026
@chienchunhung

Collaborator Author

/bot run --disable-fail-fast --stage-list "B200_PCIe-PackageSanityCheck-PY312-DLFW,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3"

@chienchunhung

Collaborator Author

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

@tensorrt-cicd

Collaborator

PR_Github #56660 [ run ] triggered by Bot. Commit: 3bea148 Link to invocation

chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Jun 30, 2026
Combine the current NVIDIA#15238, NVIDIA#15737, NVIDIA#15794, and NVIDIA#15795 snapshots into one parent commit for NVIDIA#15738. The source PRs remain independently reviewable and mergeable.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@tensorrt-cicd

Collaborator

PR_Github #56660 [ run ] completed with state FAILURE. Commit: 3bea148
/LLM/main/L0_MergeRequest_PR pipeline #45484 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Jun 30, 2026
Combine the current NVIDIA#15238, NVIDIA#15737, NVIDIA#15794, and NVIDIA#15795 snapshots into one parent commit for NVIDIA#15738. The source PRs remain independently reviewable and mergeable.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Jun 30, 2026
Combine the current NVIDIA#15238, NVIDIA#15737, NVIDIA#15794, and NVIDIA#15795 snapshots into one parent commit for NVIDIA#15738. The source PRs remain independently reviewable and mergeable.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung

Collaborator Author

/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

@tensorrt-cicd

Collaborator

PR_Github #56713 [ run ] triggered by Bot. Commit: a38e7d9 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56712 [ run ] triggered by Bot. Commit: a38e7d9 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56713 [ run ] completed with state ABORTED. Commit: a38e7d9

Link to invocation

@chienchunhung

Collaborator Author

/bot kill

@tensorrt-cicd

Collaborator

PR_Github #56717 [ kill ] triggered by Bot. Commit: a38e7d9 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56712 [ run ] completed with state ABORTED. Commit: a38e7d9

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56717 [ kill ] completed with state SUCCESS. Commit: a38e7d9
Successfully killed previous jobs for commit a38e7d9

Link to invocation

@chienchunhung

Collaborator Author

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

@tensorrt-cicd

Collaborator

PR_Github #56718 [ run ] triggered by Bot. Commit: a38e7d9 Link to invocation

@chienchunhung

Collaborator Author

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2"

@tensorrt-cicd

Collaborator

PR_Github #56718 [ run ] completed with state FAILURE. Commit: a38e7d9
/LLM/main/L0_MergeRequest_PR pipeline #45538 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56750 [ run ] triggered by Bot. Commit: 0d78feb Link to invocation

chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Jun 30, 2026
Combine the current NVIDIA#15238, NVIDIA#15737, NVIDIA#15794, and NVIDIA#15795 snapshots into one parent commit for NVIDIA#15738. The source PRs remain independently reviewable and mergeable.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@tensorrt-cicd

Collaborator

PR_Github #56750 [ run ] completed with state FAILURE. Commit: 0d78feb
/LLM/main/L0_MergeRequest_PR pipeline #45568 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
chienchunhung force-pushed the codex/disagg-sender-ready-race branch from 629d47e to 1617bfb on July 2, 2026 16:22
@chienchunhung

Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32,GB300-4_GPUs-1_Node-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4"

@tensorrt-cicd

Collaborator

PR_Github #57218 [ run ] triggered by Bot. Commit: 1617bfb Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57218 [ run ] completed with state FAILURE. Commit: 1617bfb
/LLM/main/L0_MergeRequest_PR pipeline #45989 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung

Collaborator Author

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

@tensorrt-cicd

Collaborator

PR_Github #57225 [ run ] triggered by Bot. Commit: c70ed75 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57225 [ run ] completed with state SUCCESS. Commit: c70ed75
/LLM/main/L0_MergeRequest_PR pipeline #45993 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung

Collaborator Author

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

@tensorrt-cicd

Collaborator

PR_Github #57244 [ run ] triggered by Bot. Commit: 1b95bd0 Link to invocation

@chienchunhung

Collaborator Author

/bot kill

@chienchunhung

Collaborator Author

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

@tensorrt-cicd

Collaborator

PR_Github #57254 [ kill ] triggered by Bot. Commit: 1b95bd0 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57255 [ ] completed with state ABORTED. Commit: ``

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57244 [ run ] completed with state ABORTED. Commit: 1b95bd0
LLM/main/L0_MergeRequest_PR #46011 (Blue Ocean) completed with status: ABORTED

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57254 [ kill ] completed with state SUCCESS. Commit: 1b95bd0
Successfully killed previous jobs for commit 1b95bd0

Link to invocation

@chienchunhung

Collaborator Author

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

@chienchunhung

Collaborator Author

/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3"

@tensorrt-cicd

Collaborator

PR_Github #57257 [ run ] triggered by Bot. Commit: 1b95bd0 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57257 [ run ] completed with state FAILURE. Commit: 1b95bd0
/LLM/main/L0_MergeRequest_PR pipeline #46022 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
