[https://nvbugs/6342844][fix] Prevent cache sender lost wakeups #15737
chienchunhung wants to merge 13 commits into
Conversation
|
/bot run --disable-fail-fast --stage-list "B200_PCIe-PackageSanityCheck-PY312-DLFW,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3" |
|
/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2" |
|
PR_Github #56660 [ run ] triggered by Bot. Commit: |
Combine the current NVIDIA#15238, NVIDIA#15737, NVIDIA#15794, and NVIDIA#15795 snapshots into one parent commit for NVIDIA#15738. The source PRs remain independently reviewable and mergeable. Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
|
PR_Github #56660 [ run ] completed with state
|
/bot run --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2" |
|
PR_Github #56713 [ run ] triggered by Bot. Commit: |
|
PR_Github #56712 [ run ] triggered by Bot. Commit: |
|
PR_Github #56713 [ run ] completed with state |
|
/bot kill |
|
PR_Github #56717 [ kill ] triggered by Bot. Commit: |
|
PR_Github #56712 [ run ] completed with state |
|
PR_Github #56717 [ kill ] completed with state |
|
/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2" |
|
PR_Github #56718 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8,GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2" |
|
PR_Github #56718 [ run ] completed with state
|
|
PR_Github #56750 [ run ] triggered by Bot. Commit: |
|
PR_Github #56750 [ run ] completed with state
|
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
629d47e to 1617bfb
|
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32,GB300-4_GPUs-1_Node-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4" |
|
PR_Github #57218 [ run ] triggered by Bot. Commit: |
|
PR_Github #57218 [ run ] completed with state
|
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
|
/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3" |
|
PR_Github #57225 [ run ] triggered by Bot. Commit: |
|
PR_Github #57225 [ run ] completed with state |
Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
|
/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3" |
|
PR_Github #57244 [ run ] triggered by Bot. Commit: |
|
/bot kill |
|
/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3" |
|
PR_Github #57254 [ kill ] triggered by Bot. Commit: |
|
PR_Github #57255 [ ] completed with state |
|
PR_Github #57244 [ run ] completed with state |
|
PR_Github #57254 [ kill ] completed with state |
|
/bot run --disable-reuse-test --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2,GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3" |
|
PR_Github #57257 [ run ] triggered by Bot. Commit: |
|
PR_Github #57257 [ run ] completed with state
|
Summary
PR #15737 fixes the `CacheSender` lost-wakeup and data race behind the reported generation-side KV-transfer stalls, targeting NVBug 6342844. With the final relevant behavior, nine of the eleven active perf-sanity cases have valid results: all nine finish, seven pass completely, and two fail only separate performance gates. The remaining two require current-head reruns; no valid final-behavior run has reproduced the original liveness hang.

Fix details
- `mReadyResponses` was protected by `mSenderMutex`, but readiness was mirrored in `mAnyReady` under a different mutex. A remover could erase the last response, an enqueuer could insert and notify, and the remover could then overwrite `mAnyReady=false`; the map remained non-empty while the response worker slept indefinitely. Because the queue belongs to the shared C++ sender implementation, the race can strand both asynchronous and synchronous transfers and is not transport-specific.
- [...] `mSenderMutex`; wait directly on `mReadyResponses`; and remove `mAnyReady`, `mCondMutex`, and the unused responder condition variable.
- [...] `max_tokens_in_buffer`; bypass transfer admission only for synthetic `gen_only_no_context`, which performs no KV transfer.
- [...] `unittest/others/test_kv_cache_transceiver.py::test_cpp_nixl_sync_transfer_stress` for 64 sequential C++/NIXL synchronous transfers, plus unit coverage for synchronous admission, no-transfer bypass, asynchronous-poll bypass, rank-aligned error handling, and environment isolation.
- [...] `free_gpu_memory_fraction` from 0.90 to 0.95 only for the GB200 DeepSeek con3072 benchmark. This does not change a production default.

The change introduces no public API or configuration change. Normal asynchronous Python scheduling remains unchanged and benefits from the generic C++ sender fix. Benchmark fill-gate rank skew and unmatched-request UCX/MPI teardown remain out of scope.
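For reviewers who want the shape of the fix without reading the C++ diff, the single-mutex pattern described under Fix details can be sketched as follows. This is only a Python analogue under stated assumptions: `ReadyQueue`, `wait_and_pop`, and `demo` are hypothetical names, and the code is not taken from the TensorRT-LLM sender. The point it illustrates is that the worker's wait predicate reads the response map itself, under the same lock that guards every mutation, so there is no separate readiness flag that a remover can overwrite after an enqueuer has already notified.

```python
import threading


class ReadyQueue:
    """Ready-response map guarded by exactly one lock.

    Hypothetical analogue of the pattern; not the actual C++ CacheSender.
    """

    def __init__(self):
        self._lock = threading.Lock()
        # The condition variable shares the lock that protects the map.
        self._ready_cv = threading.Condition(self._lock)
        self._responses = {}
        self._stopping = False

    def enqueue(self, request_id, response):
        with self._ready_cv:
            self._responses[request_id] = response
            self._ready_cv.notify()

    def remove(self, request_id):
        # Removal touches only the map; there is no mirrored flag to reset.
        with self._ready_cv:
            return self._responses.pop(request_id, None)

    def stop(self):
        with self._ready_cv:
            self._stopping = True
            self._ready_cv.notify_all()

    def wait_and_pop(self, timeout=None):
        """Block until a response exists, stop() is called, or timeout expires."""
        with self._ready_cv:
            # The predicate inspects the map directly while holding the lock.
            self._ready_cv.wait_for(
                lambda: self._responses or self._stopping, timeout
            )
            if not self._responses:
                return None
            request_id = next(iter(self._responses))
            return request_id, self._responses.pop(request_id)


def demo(num_items=100):
    """Drain num_items responses on a worker thread, then shut down."""
    queue = ReadyQueue()
    received = []

    def worker():
        while True:
            item = queue.wait_and_pop()
            if item is None:
                return
            received.append(item[0])

    thread = threading.Thread(target=worker)
    thread.start()
    for i in range(num_items):
        queue.enqueue(i, f"response-{i}")
    queue.stop()
    thread.join(timeout=5)
    return sorted(received), thread.is_alive()
```

Calling `demo()` should return every enqueued id and report that the worker exited. With a separately locked readiness flag, the remove-then-enqueue interleaving described above could instead leave the worker asleep over a non-empty map.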
Test outcomes and accounting
🟢 Finishes and passes · 🟠 Finishes; full test fails · 🟡 No valid result for final behavior · ⚪ Out of scope
“Finishes” is the liveness result: the benchmark completes instead of stalling during KV transfer. “Passes” additionally requires all functional and performance gates. The first eleven rows are the active unwaive experiment; the remaining rows record additional reported, regression, or explicitly excluded cases.
Waiver status compares upstream before this PR with the intended state after it. Stage names are the exact assignments observed in the cited PR pipelines; duration-based shard suffixes may move when test lists are rebalanced. NVBug 6342844 is the umbrella regression and PR target; historical model-specific waivers remain listed.
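The legend's distinction between "finishes" and "passes" can be made explicit with a small classifier. This is an illustrative sketch only: `Outcome`, `classify`, and the boolean inputs are hypothetical names, and the tally at the end simply restates the counts given in the Summary (seven full passes, two finish-only cases, two awaiting reruns).

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    PASSES = "finishes and passes"
    FINISHES_ONLY = "finishes; full test fails"
    NO_VALID_RESULT = "no valid result for final behavior"
    OUT_OF_SCOPE = "out of scope"


def classify(in_scope, has_valid_run, finished, gates_passed):
    """Map one test's observations onto the legend above."""
    if not in_scope:
        return Outcome.OUT_OF_SCOPE
    if not has_valid_run:
        return Outcome.NO_VALID_RESULT
    if finished and gates_passed:
        return Outcome.PASSES
    if finished:
        # Liveness holds, but a functional or performance gate failed.
        return Outcome.FINISHES_ONLY
    # The legend defines no marker for a valid run that stalls.
    raise ValueError("valid run did not finish: liveness regression")


# Observations for the eleven active cases, as counted in the Summary.
active_cases = (
    [(True, True, True, True)] * 7      # finish and pass every gate
    + [(True, True, True, False)] * 2   # finish, fail a performance gate
    + [(True, False, False, False)] * 2  # need a current-head rerun
)
tally = Counter(classify(*case) for case in active_cases)
```

Here `tally[Outcome.PASSES]` is 7, `tally[Outcome.FINISHES_ONLY]` is 2, and `tally[Outcome.NO_VALID_RESULT]` is 2, which matches the accounting in the Summary.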
| Test | Stage | Evidence | Waiver status (before → after) | NVBug report/waiver |
| --- | --- | --- | --- | --- |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]` | `GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2` | Confirmed #14979 liveness case; PASS in #45622 | | Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-NIXL]` | `GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1` | Intermittent #14979 liveness case; PASS in #45622 | | Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-NIXL]` | `GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2` | Confirmed #14979 liveness case; PASS in #45622 | | Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3` | PASS in #45816 after test-only KV-capacity adjustment | | Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]` | `GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2` | Confirmed #14979 liveness case; PASS in #45622 | | Waiver 6323889 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_8k1k_con1024_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3` | Confirmed #14979 liveness case; passed 3 substantive PR runs | | Waiver 6324123 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_kimi-k25-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]` | `GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-8` | Historical pre/post-#14979 liveness regression; PASS in #45622 | | Waiver 6323074 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2` | 192.2s in #45622; 12.39% device-step regression. Follow up the ADP performance path; PR #15222's cache-error vote is an unconfirmed hypothesis. | Temporarily tested | Report/waiver 6324123 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_1k1k_con4096_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-NIXL]` | `GB300-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-4` | Confirmed parent-pass/#14979-hang case; 488.4s in #45622, then a 14.98% device-step regression. Follow up performance separately. | Temporarily tested | Waiver 6323074 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_1k1k_con4_ctx1_dep4_gen1_tep4_eplb0_mtp0_ccb-NIXL]` | `GB300-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3` | Passed 3 earlier PR revisions, but not after the material `b0cf1ce` synchronous-flow change. Rerun current head; unwaive if it still passes. | Temporarily tested | Waiver 6323074 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb300_kimi-k25-thinking-fp4_8k1k_con1024_ctx1_dep4_gen1_dep32_eplb416_mtp3_ccb-NIXL]` | `GB300-36_GPUs-9_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE8-GPU32-Post-Merge-2` | Existing records are early wrapper terminations. Run the current-head stage; investigate liveness only if a substantive run still does not finish. | Temporarily tested | Waiver 6323074 |
| `unittest/others/test_kv_cache_transceiver.py::test_cpp_nixl_sync_transfer_stress` | `B200_PCIe-PackageSanityCheck-PY312-DLFW`; also registered in `l0_sanity_check` and `l0_b200` | Direct regression: 64 sequential C++/NIXL synchronous transfers PASS in #45579 | | PR target |
| `disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]` | B200 `LLM_FUNCTION_CLUSTER_TEST`; `qa/llm_function_stress.txt`, B200 | Async C++ sender qualification; run the unwaived B200 workload | | Qualification for 6342844 |
| `perf/test_perf_sanity.py::test_e2e[disagg_upload-gen-only-gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-NIXL]` | `GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3` | #14979 A/B was inconclusive. Qualify separately before changing its waiver. | | Waiver 6324123 |
| `disaggregated/test_disaggregated.py::test_disaggregated_stress_test[input8k-output1k-conc512-qwen3_32b_fp8_stress]` | H100 `LLM_FUNCTION_CLUSTER_TEST`; `qa/llm_function_stress.txt`, H100 | Run separately if H100 closure is required | | Report/waiver |
| `accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype_with_helix[fifo-cudagraph:with_padding-pp1tp1cp4]` | `DGX_B200-8_GPUs-PyTorch-3` | Model-forward/DSA issue outside the captured `CacheSender` wait path | Untouched | Report/waiver |
Passing an unlisted test or another test sharing one of these NVBug IDs does not expand this PR's claimed scope.
Validation
- [...] `b0cf1ce` synchronous-flow change and do not close current behavior.
- [...] `test_cpp_nixl_sync_transfer_stress` in `B200_PCIe-PackageSanityCheck-PY312-DLFW`.
- `629d47e74f` passed changed-file pre-commit hooks, waiver duplicate checking, AST test-list validation, and `git diff --check`.

Next qualification steps are the current-head GB300 Kimi con4 and 8k1k con1024 stages, followed by the B200 Qwen asynchronous stress workload. The direct C++ stress test is single-process and probabilistic; existing multi-counterpart `AsymmetricalCacheTest` and `AsymmetricalCacheTestWithDP` coverage has not yet been rerun.

Dependency graph
Arrows point from prerequisite to dependent. #15238 and #15737 are parallel, independently mergeable inputs to #15794. All PRs target `NVIDIA/main`; stacked child diffs are cumulative until their parents merge.

#15798 and #15799 are production/default-on hardening follow-ups. They do not block the default-off merges of #15238, #15794, or #15795. #15798 is required for #15738. #15799 may be replaced only by an explicitly approved bounded fail/restart policy that satisfies the same memory-safety requirements.
flowchart LR
    PR15238["#15238<br/>gated cancellation<br/>(default off)"]
    PR15737["#15737<br/>CacheSender lost-wakeup fix"]
    PR15794["#15794<br/>C++ protocol and buffer safety<br/>(default off)"]
    PR15795["#15795<br/>PyExecutor lifecycle safety<br/>(default off)"]
    PR15798["#15798<br/>CTX↔GEN protocol/mode negotiation"]
    PR15799["#15799<br/>peer cancel/terminal protocol<br/>(design scaffold)"]
    PR15738["#15738<br/>qualified default-on policy"]
    PR15238 --> PR15794
    PR15737 --> PR15794
    PR15794 --> PR15795
    PR15794 --> PR15798
    PR15798 --> PR15799
    PR15795 --> PR15738
    PR15798 --> PR15738
    PR15799 -.->|or bounded fail/restart policy| PR15738
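One way to read the graph is as a merge schedule. The sketch below feeds the edges from the flowchart to Python's standard `graphlib.TopologicalSorter` and groups the PRs into waves that can merge in parallel. `PREREQUISITES` and `merge_waves` are hypothetical names, and the dotted #15799 edge is treated as a hard prerequisite here, which is an assumption: the text above allows an approved bounded fail/restart policy in its place.

```python
from graphlib import TopologicalSorter

# Maps each PR to the set of PRs that must merge before it.
# Assumption: the dotted #15799 -> #15738 edge is modeled as required.
PREREQUISITES = {
    15794: {15238, 15737},
    15795: {15794},
    15798: {15794},
    15799: {15798},
    15738: {15795, 15798, 15799},
}


def merge_waves(prerequisites):
    """Return lists of PRs; each list can merge once the previous one has."""
    sorter = TopologicalSorter(prerequisites)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = merge_waves(PREREQUISITES)
```

Under that assumption the waves come out as #15238 and #15737 first, then #15794, then #15795 and #15798 together, then #15799, and finally #15738.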