fix(dsv4 topk_v2): honor cluster contract in fused kernel SMALL/TRIVIAL branches#25575
fix(dsv4 topk_v2): honor cluster contract in fused kernel SMALL/TRIVIAL branches#25575GavinZhu-GMI wants to merge 1 commit into
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
/tag-and-rerun-ci |
|
Pinging maintainers — the comment bot is at its daily quota so I can't trigger CI myself. Could a maintainer add the |
|
This makes no sense to me. I never heard of any document about such cluster contract. As long as there's no synchronization between cluster ctas, it's ok to early exit. Could you please provide some reference to this bug and a minimal reproduction? It might be caused by some other bug |
Hi I mean by contract: "if a clustered kernel is launched with PDL, ensure all cluster CTAs issue PDL primitives consistently (or hoist the PDL calls out of branches that early-exit) |
bbe8fae to
1efb3bd
Compare
|
Thanks @DarkSharpness for the pushback. You're right that "cluster contract" was a hand-wavy framing on my part — CUDA does permit cluster CTAs to early-exit as long as no Better hypothesis (also wrong, FYI)Looking at the same kernel more carefully, I noticed an asymmetry: the FUSED kernel is launched with What I triedHoist if (seq_len <= K) {
device::PDLWaitPrimary<true>();
if (cluster_rank != 0) return;
impl::trivial_transform(transform, seq_len, K);
} else if (seq_len <= Small::kMax1PassLength) {
device::PDLWaitPrimary<true>();
if (cluster_rank != 0) return;
Small::run(.., /*use_pdl=*/false);
Small::transform(transform);
}This is closer to what the LARGE branch does (modulo not running through ResultThis also crashes with the same CUDA_ERROR_ILLEGAL_ADDRESS at conc=32 ISL=8192 OSL=1024 TP=8+DP=8+EP=8+MegaMoE W4A4 on B300. So PDL symmetry alone is not the fix. Going back to Honest updateEmpirical evidence on B300/SM100 with this specific config (
I don't have a documented mechanism for why early-exit fails on SM100 specifically when I've amended the patch comment to drop the misleading "cluster contract" wording and describe what we actually observe (no speculation about the scheduler/PDL state machine internals). Minimal reproducerI don't have a kernel-isolated reproducer outside sglang yet — the full repro needs DSv4-Pro weights + DP-attention to drive c4_seq_lens into the SMALL branch with batch_size_per_rank≤kNumClusters=15. Repro instructions are in #25574. If a kernel-only repro using synthetic inputs would help, I'm happy to put one together. Also happy to defer this PR while we investigate the underlying SM100 behavior more deeply if you'd prefer. FinallyFYI after this patch, we can get single host B300 to achieve 2370.1 tok/s/gpu without MTP. |
|
I see. Thanks for the detailed explanation. This bug really looks strange to me. I actually never encountered that before. I will try to reproduce it locally first. But we can start the ci first anyway. BTW do you have any performance results on this? I'm curious about the performance regression (extra cluster sync vs no cluster sync). |
|
/tag-and-rerun-ci |
| if (cluster_rank == 0) { | ||
| impl::trivial_transform(transform, seq_len, K); | ||
| } | ||
| cooperative_groups::this_cluster().sync(); |
There was a problem hiding this comment.
- I wonder what's the overhead here. What if the cooperative_groups::this_cluster().sync(); is moved before PDLWaitPrimary. Will it be faster while resolve the crash?
- Do we need cluster sync guard for both code path (trivial/small)?
|
Thanks @DarkSharpness for triggering CI and the reproduction effort. Perf A/B answer: ran the current patch (Cand A,
Opposite-signed sub-0.3% deltas → perf-neutral within noise. Cluster.sync overhead in the SMALL branch is below the measurement floor. "Sync-before" is slightly cleaner conceptually (ranks 1-7 exit the SM as soon as they synchronize instead of sitting idle while rank 0 runs Sync-before snippet for reference (the two upper branches, LARGE unchanged): if (seq_len <= K) {
cooperative_groups::this_cluster().sync();
if (cluster_rank != 0) return;
impl::trivial_transform(transform, seq_len, K);
} else if (seq_len <= Small::kMax1PassLength) {
cooperative_groups::this_cluster().sync();
if (cluster_rank != 0) return;
Small::run(params.get_scores(batch_id), s_topk_indices, seq_len, smem, /*use_pdl=*/true);
Small::transform(transform);
} else { /* LARGE unchanged */ }Re minimal reproducer — happy to put together a kernel-only repro if it would help your local investigation; let me know. |
…AL branches `topk_fused_transform` is annotated `__cluster_dims__(1, kClusterSize=8, 1)` via FUSED_COMBINE_KERNEL, but its SMALL (`seq_len <= Small::kMax1PassLength`) and TRIVIAL (`seq_len <= K`) branches let 7 of every 8 cluster blocks `return` immediately without ever calling `cluster.sync()`. On SM90 and SM100 with TP=4/DP=4 (the configuration covered by CI) this is tolerated, but on B300 / SM100 at TP=8/DP=8 with DeepEP it surfaces as `CUDA_ERROR_ILLEGAL_ADDRESS` reported at downstream sync points (CUDA graph replay, dense FP8 GEMM, etc.) -- the actual failing kernel is async. Reproducer: deepseek-ai/DeepSeek-V4-Pro on 8x B300 SXM6, `--tp 8 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep` with `SGLANG_OPT_USE_TOPK_V2=1` (the default since sgl-project#25406). conc=32 ISL=8192 OSL=1024 triggers the fused 1-stage path (batch_size_per_rank=4 <= kNumClusters=15) and crashes within ~10 seconds of decode. Workaround: `SGLANG_OPT_USE_TOPK_V2=0` falls back to the v1 kernel. Fix: restructure both branches so rank-0 does the work and all 8 ranks meet at `cooperative_groups::this_cluster().sync()` before exit, honoring the implicit cluster contract. Validated end-to-end on a B300 box: SGLANG_OPT_USE_TOPK_V2=1 + the patch + the same workload completes cleanly at 958.5 tok/s/gpu, vs 963.7 with the TOPK_V2=0 fallback (well within run-to-run noise -- the 7 dummy ranks were already pinned to the cluster's SM allotment whether they early-returned or sat at the barrier).
1efb3bd to
95550cb
Compare
|
Updated the PR to a fuller PDL pattern (commit 95550cb). Force-pushed, ready for CI rerun. What changed
Updated the kernel comment to describe the empirical findings honestly and drop the "cluster contract" language. Perf — A/B/C/D matrixSame fb3 host (8× B300 SXM6, TP=8 + DP=8 + EP=8 + MegaMoE W4A4, ISL=8192 OSL=1024). Each row a fresh container start, May 17 PR knobs ( conc=32 (96 prompts)
conc=64 (192 prompts)
All four variants are perf-equivalent within ≤0.3% run-to-run noise. The fuller pattern (D) doesn't degrade and is slightly better at c=64. Why D vs A
The mid-bench non-determinism we noted yesterday ( |
| // kernel before reading params.seq_lens / scores. This both (1) ensures | ||
| // primary-produced data is visible and (2) gives every CTA a symmetric | ||
| // PDL state, sidestepping the SM100 cluster crash described below. | ||
| device::PDLWaitPrimary<true>(); |
There was a problem hiding this comment.
This wait primiary can be moved after loading seq_len. This is because seq_len is metadata, which is not related to the input. We can safely prefetch this part.
Summary
Fixes #25574. This is a corner case that I came across when achieving peak performance on B300.
Another fix candidate would require a cache of seq length, I believe it is a bigger patch, so I've chosen to follow current C++ only 15 line patch.
topk_fused_transform(inpython/sglang/jit_kernel/csrc/deepseek_v4/topk_v2.cuh) is launched with__cluster_dims__(1, kClusterSize=8, 1)via theFUSED_COMBINE_KERNELmacro, but its SMALL (seq_len <= Small::kMax1PassLength) and TRIVIAL (seq_len <= K) branches let 7 of every 8 cluster blocksreturnimmediately without ever callingcluster.sync(). On SM90 / SM100-TP=4 this is tolerated; on B300 / SM100 at TP=8 + DP-attention + DeepEP (so 4 clusters × 8 blocks per launch in the fused 1-stage path) the cluster scheduler / PDL state machine tripsCUDA_ERROR_ILLEGAL_ADDRESS, reported at downstream sync points.Modifications
Single-file, 15-line diff in
topk_v2.cuh: rank-0 does the work inside anif (cluster_rank == 0) { ... }block, then all 8 ranks meet atcooperative_groups::this_cluster().sync()before exit. Applied symmetrically to both the SMALL and TRIVIAL branches. The LARGE branch is unchanged (it already uses cluster ops correctly).Performance
The added cluster barrier is a hardware-fast operation. The 7 dummy ranks were already pinned to the cluster's SM allotment whether they early-returned or sat at the barrier — CUDA can't reclaim cluster SM slots block-by-block, the cluster only releases after all blocks exit. So no observable cost.
Validated on B300 (8× SXM6 single-host) at the configuration that previously crashed (TP=8 + DP=8 + EP=8 + MegaMoE W4A4, conc=32 ISL=8192 OSL=1024 num_prompts=320):
SGLANG_OPT_USE_TOPK_V2=1(upstream, unpatched)SGLANG_OPT_USE_TOPK_V2=0(v1 kernel fallback)SGLANG_OPT_USE_TOPK_V2=1, patched v2)Within 0.5% of the fallback (run-to-run noise).
Accuracy Tests
The kernel output is logically unchanged — same
Small::run+Small::transform(orimpl::trivial_transform) by rank 0; the only behavioral delta is that ranks 1-7 now sit at a cluster barrier before exit instead of returning instantly. No data path changes.Test plan
SGLANG_OPT_USE_TOPK_V2=1It would be a good follow-up to add a B300 (or at least TP=8) CI lane covering DSv4-Pro with
SGLANG_OPT_USE_TOPK_V2=1+ DP-attention + DeepEP, sincetest/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.pyonly exercises B200/TP=4 today.Checklist
CI States
Latest PR Test (Base): 🚫 Run #26147082776
Latest PR Test (Extra): ❌ Run #26147082664