Skip to content

fix(dsv4 topk_v2): honor cluster contract in fused kernel SMALL/TRIVIAL branches#25575

Open
GavinZhu-GMI wants to merge 1 commit into
sgl-project:mainfrom
GavinZhu-GMI:fix-dsv4-topk-v2-cluster-contract
Open

fix(dsv4 topk_v2): honor cluster contract in fused kernel SMALL/TRIVIAL branches#25575
GavinZhu-GMI wants to merge 1 commit into
sgl-project:mainfrom
GavinZhu-GMI:fix-dsv4-topk-v2-cluster-contract

Conversation

@GavinZhu-GMI
Copy link
Copy Markdown
Contributor

@GavinZhu-GMI GavinZhu-GMI commented May 18, 2026

Summary

Fixes #25574. This is a corner case that I came across when achieving peak performance on B300.
Another fix candidate would require a cache of seq length, I believe it is a bigger patch, so I've chosen to follow current C++ only 15 line patch.

topk_fused_transform (in python/sglang/jit_kernel/csrc/deepseek_v4/topk_v2.cuh) is launched with __cluster_dims__(1, kClusterSize=8, 1) via the FUSED_COMBINE_KERNEL macro, but its SMALL (seq_len <= Small::kMax1PassLength) and TRIVIAL (seq_len <= K) branches let 7 of every 8 cluster blocks return immediately without ever calling cluster.sync(). On SM90 / SM100-TP=4 this is tolerated; on B300 / SM100 at TP=8 + DP-attention + DeepEP (so 4 clusters × 8 blocks per launch in the fused 1-stage path) the cluster scheduler / PDL state machine trips CUDA_ERROR_ILLEGAL_ADDRESS, reported at downstream sync points.

Modifications

Single-file, 15-line diff in topk_v2.cuh: rank-0 does the work inside an if (cluster_rank == 0) { ... } block, then all 8 ranks meet at cooperative_groups::this_cluster().sync() before exit. Applied symmetrically to both the SMALL and TRIVIAL branches. The LARGE branch is unchanged (it already uses cluster ops correctly).

Performance

The added cluster barrier is a hardware-fast operation. The 7 dummy ranks were already pinned to the cluster's SM allotment whether they early-returned or sat at the barrier — CUDA can't reclaim cluster SM slots block-by-block, the cluster only releases after all blocks exit. So no observable cost.

Validated on B300 (8× SXM6 single-host) at the configuration that previously crashed (TP=8 + DP=8 + EP=8 + MegaMoE W4A4, conc=32 ISL=8192 OSL=1024 num_prompts=320):

Config tput/gpu mean TPOT mean E2E Outcome
SGLANG_OPT_USE_TOPK_V2=1 (upstream, unpatched) Crashes with CUDA_ERROR_ILLEGAL_ADDRESS
SGLANG_OPT_USE_TOPK_V2=0 (v1 kernel fallback) 963.7 30.74 ms 38.22 s OK
This PR (SGLANG_OPT_USE_TOPK_V2=1, patched v2) 958.5 30.95 ms 38.43 s OK

Within 0.5% of the fallback (run-to-run noise).

Accuracy Tests

The kernel output is logically unchanged — same Small::run + Small::transform (or impl::trivial_transform) by rank 0; the only behavioral delta is that ranks 1-7 now sit at a cluster barrier before exit instead of returning instantly. No data path changes.

Test plan

  • Manual repro of the original crash on B300 with SGLANG_OPT_USE_TOPK_V2=1
  • Manual confirmation that the patched kernel completes the same workload cleanly
  • CI green (this PR)

It would be a good follow-up to add a B300 (or at least TP=8) CI lane covering DSv4-Pro with SGLANG_OPT_USE_TOPK_V2=1 + DP-attention + DeepEP, since test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py only exercises B200/TP=4 today.

Checklist

  • Format your code according to the Format code with pre-commit. (single-file C++ patch matching surrounding style)
  • Add unit tests according to the Run and add unit tests. (this is a kernel correctness fix that requires multi-rank cluster execution on SM100; adding meaningful coverage requires expanding the existing dsv4 CI matrix — happy to follow up if maintainers prefer this in this PR)
  • Update documentation according to Write documentations. (kernel-level comment added explaining the cluster contract requirement)
  • Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed. (see table above)
  • Follow the SGLang code style guidance.

CI States

Latest PR Test (Base): 🚫 Run #26147082776
Latest PR Test (Extra): ❌ Run #26147082664

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@GavinZhu-GMI
Copy link
Copy Markdown
Contributor Author

/tag-and-rerun-ci

@GavinZhu-GMI
Copy link
Copy Markdown
Contributor Author

Pinging maintainers — the comment bot is at its daily quota so I can't trigger CI myself. Could a maintainer add the run-ci label when convenient? The fix is single-file (15-line diff to topk_v2.cuh), validated end-to-end on B300 against the reproducer described in #25574. Happy to address any review comments.

@DarkSharpness
Copy link
Copy Markdown
Collaborator

This makes no sense to me. I never heard of any document about such cluster contract. As long as there's no synchronization between cluster ctas, it's ok to early exit. Could you please provide some reference to this bug and a minimal reproduction? It might be caused by some other bug

@GavinZhu-GMI
Copy link
Copy Markdown
Contributor Author

This makes no sense to me. I never heard of any document about such cluster contract. As long as there's no synchronization between cluster ctas, it's ok to early exit. Could you please provide some reference to this bug and a minimal reproduction? It might be caused by some other bug

Hi I mean by contract: "if a clustered kernel is launched with PDL, ensure all cluster CTAs issue PDL primitives consistently (or hoist the PDL calls out of branches that early-exit)

@GavinZhu-GMI GavinZhu-GMI force-pushed the fix-dsv4-topk-v2-cluster-contract branch from bbe8fae to 1efb3bd Compare May 18, 2026 08:14
@GavinZhu-GMI
Copy link
Copy Markdown
Contributor Author

GavinZhu-GMI commented May 18, 2026

Thanks @DarkSharpness for the pushback. You're right that "cluster contract" was a hand-wavy framing on my part — CUDA does permit cluster CTAs to early-exit as long as no cluster.sync() is called. I refined the analysis and tested a more principled alternative before responding. Here's what I now have:

Better hypothesis (also wrong, FYI)

Looking at the same kernel more carefully, I noticed an asymmetry: the FUSED kernel is launched with .enable_pdl(true) (line 475). In the LARGE branch (line 351) all 8 cluster CTAs explicitly call device::PDLWaitPrimary<true>() before any cluster.sync / early-exit. But in the SMALL branch only rank 0 reaches the Small::run(use_pdl=true) call, so only rank 0 emits griddepcontrol.wait. That asymmetry across cluster CTAs under PDL is undocumented territory, so I suspected that.

What I tried

Hoist PDLWaitPrimary out of Small::run and into the kernel body so all 8 cluster CTAs issue griddepcontrol.wait, and let the early-exit happen normally with no cluster.sync(). You can try it out by replacing the current PR diff with:

if (seq_len <= K) {
    device::PDLWaitPrimary<true>();
    if (cluster_rank != 0) return;
    impl::trivial_transform(transform, seq_len, K);
} else if (seq_len <= Small::kMax1PassLength) {
    device::PDLWaitPrimary<true>();
    if (cluster_rank != 0) return;
    Small::run(.., /*use_pdl=*/false);
    Small::transform(transform);
}

This is closer to what the LARGE branch does (modulo not running through Large::stage1_*).

Result

This also crashes with the same CUDA_ERROR_ILLEGAL_ADDRESS at conc=32 ISL=8192 OSL=1024 TP=8+DP=8+EP=8+MegaMoE W4A4 on B300. So PDL symmetry alone is not the fix.

Going back to cluster.sync() + early-exit-inside-if (the current PR diff) → bench completes cleanly. The cluster.sync is empirically the load-bearing bit, not the PDL hoist.

Honest update

Empirical evidence on B300/SM100 with this specific config (__cluster_dims__(1,8,1) + enable_pdl(true)):

Variant conc=32 result
Upstream (unmodified) crashes
PDL symmetric, no cluster.sync (8 ranks emit griddepcontrol.wait, then 7 early-exit) crashes
cluster.sync, PDL not hoisted (current PR — 8 ranks meet at cluster.sync before any exits) works
LARGE branch (already does cluster.sync at line 355 before early-exit) works (existing code)

I don't have a documented mechanism for why early-exit fails on SM100 specifically when __cluster_dims__ + enable_pdl(true) are both set — it's contrary to what plain cluster semantics say. But the LARGE branch already does cluster.sync-before-early-exit (line 355), and the SMALL/TRIVIAL branches don't; mirroring LARGE's pattern is consistent and empirically fixes the crash.

I've amended the patch comment to drop the misleading "cluster contract" wording and describe what we actually observe (no speculation about the scheduler/PDL state machine internals).

Minimal reproducer

I don't have a kernel-isolated reproducer outside sglang yet — the full repro needs DSv4-Pro weights + DP-attention to drive c4_seq_lens into the SMALL branch with batch_size_per_rank≤kNumClusters=15. Repro instructions are in #25574.

If a kernel-only repro using synthetic inputs would help, I'm happy to put one together. Also happy to defer this PR while we investigate the underlying SM100 behavior more deeply if you'd prefer.

Finally

FYI after this patch, we can get single host B300 to achieve 2370.1 tok/s/gpu without MTP.
So: documented CUDA says cluster CTAs can early-exit when no cluster.sync is called; documented PDL says it's a per-CTA stream-dependency thing. Their interaction on SM100 + cluster_dims + enable_pdl(true) is undocumented, and at least one corner of it requires the cluster.sync() barrier before early-exit to keep the scheduler happy — which is exactly what the patch does (and what the LARGE branch always did).

@DarkSharpness
Copy link
Copy Markdown
Collaborator

I see. Thanks for the detailed explanation. This bug really looks strange to me. I actually never encountered that before. I will try to reproduce it locally first. But we can start the ci first anyway. BTW do you have any performance results on this? I'm curious about the performance regression (extra cluster sync vs no cluster sync).

@DarkSharpness
Copy link
Copy Markdown
Collaborator

/tag-and-rerun-ci

if (cluster_rank == 0) {
impl::trivial_transform(transform, seq_len, K);
}
cooperative_groups::this_cluster().sync();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I wonder what's the overhead here. What if the cooperative_groups::this_cluster().sync(); is moved before PDLWaitPrimary. Will it be faster while resolve the crash?
  2. Do we need cluster sync guard for both code path (trivial/small)?

@GavinZhu-GMI
Copy link
Copy Markdown
Contributor Author

Thanks @DarkSharpness for triggering CI and the reproduction effort.

Perf A/B answer: ran the current patch (Cand A, cluster.sync() after the rank-0 work) against the "sync-before" variant you suggested (cluster.sync() moved to the top of the branch, then if (cluster_rank != 0) return;, then rank 0 alone does the work — no second sync). Same fb3 host (8x B300 SXM6, TP=8 + DP=8 + EP=8 + MegaMoE W4A4, ISL=8192 OSL=1024), same launch script except the topk_v2.cuh mount.

Workload Cand A (PR) Sync-before Δ
conc=32 (96 prompts) 914.43 tok/s/gpu 912.98 −0.16%
conc=64 (192 prompts) 1348.37 1351.67 +0.24%
TPOT c=32 27.22 ms 27.28 ms +0.23%
TPOT c=64 36.62 ms 36.65 ms +0.08%

Opposite-signed sub-0.3% deltas → perf-neutral within noise. Cluster.sync overhead in the SMALL branch is below the measurement floor.

"Sync-before" is slightly cleaner conceptually (ranks 1-7 exit the SM as soon as they synchronize instead of sitting idle while rank 0 runs Small::run). If you prefer that pattern for code style reasons, I'm happy to update the PR to it — they're functionally equivalent in our testing. Otherwise I'll keep the current Cand A version since the LARGE branch already uses a sync-after-work pattern (line 369 in your file) and matching the existing style there feels less disruptive.

Sync-before snippet for reference (the two upper branches, LARGE unchanged):

if (seq_len <= K) {
  cooperative_groups::this_cluster().sync();
  if (cluster_rank != 0) return;
  impl::trivial_transform(transform, seq_len, K);
} else if (seq_len <= Small::kMax1PassLength) {
  cooperative_groups::this_cluster().sync();
  if (cluster_rank != 0) return;
  Small::run(params.get_scores(batch_id), s_topk_indices, seq_len, smem, /*use_pdl=*/true);
  Small::transform(transform);
} else { /* LARGE unchanged */ }

Re minimal reproducer — happy to put together a kernel-only repro if it would help your local investigation; let me know.

…AL branches

`topk_fused_transform` is annotated `__cluster_dims__(1, kClusterSize=8, 1)`
via FUSED_COMBINE_KERNEL, but its SMALL (`seq_len <= Small::kMax1PassLength`)
and TRIVIAL (`seq_len <= K`) branches let 7 of every 8 cluster blocks
`return` immediately without ever calling `cluster.sync()`. On SM90 and
SM100 with TP=4/DP=4 (the configuration covered by CI) this is tolerated,
but on B300 / SM100 at TP=8/DP=8 with DeepEP it surfaces as
`CUDA_ERROR_ILLEGAL_ADDRESS` reported at downstream sync points (CUDA graph
replay, dense FP8 GEMM, etc.) -- the actual failing kernel is async.

Reproducer: deepseek-ai/DeepSeek-V4-Pro on 8x B300 SXM6,
`--tp 8 --dp-size 8 --enable-dp-attention --moe-a2a-backend deepep` with
`SGLANG_OPT_USE_TOPK_V2=1` (the default since sgl-project#25406). conc=32 ISL=8192
OSL=1024 triggers the fused 1-stage path (batch_size_per_rank=4 <=
kNumClusters=15) and crashes within ~10 seconds of decode. Workaround:
`SGLANG_OPT_USE_TOPK_V2=0` falls back to the v1 kernel.

Fix: restructure both branches so rank-0 does the work and all 8 ranks
meet at `cooperative_groups::this_cluster().sync()` before exit, honoring
the implicit cluster contract. Validated end-to-end on a B300 box:
SGLANG_OPT_USE_TOPK_V2=1 + the patch + the same workload completes
cleanly at 958.5 tok/s/gpu, vs 963.7 with the TOPK_V2=0 fallback (well
within run-to-run noise -- the 7 dummy ranks were already pinned to the
cluster's SM allotment whether they early-returned or sat at the barrier).
@GavinZhu-GMI GavinZhu-GMI force-pushed the fix-dsv4-topk-v2-cluster-contract branch from 1efb3bd to 95550cb Compare May 20, 2026 07:06
@GavinZhu-GMI
Copy link
Copy Markdown
Contributor Author

Updated the PR to a fuller PDL pattern (commit 95550cb). Force-pushed, ready for CI rerun.

What changed

  • PDLWaitPrimary hoisted to kernel entry — all 8 cluster CTAs now wait for the primary kernel before reading params.seq_lens[batch_id] / scores. Previously only rank 0 waited (inside Small::run(use_pdl=true)); the other CTAs read seq_lens with no PDL guarantee, which was a latent ordering bug.
  • Small::run(use_pdl=false) in the SMALL branch — wait is already hoisted, avoid double-wait.
  • Removed inline PDLWaitPrimary in LARGE — also covered by the hoisted entry-wait.
  • Added PDLTriggerSecondary<true>() before each cluster.sync() in TRIVIAL/SMALL, and after Large::stage1_prologue(...) in LARGE (this last bit matches the spirit of Fix DSV4 topk v2 fused PDL trigger #25785). Lets dependent kernels start earlier without waiting for full CTA completion.
  • cluster.sync() preserved — still the load-bearing fix for the SM100 crash. Hoisting PDL alone does not resolve it (confirmed by an earlier experiment).

Updated the kernel comment to describe the empirical findings honestly and drop the "cluster contract" language.

Perf — A/B/C/D matrix

Same fb3 host (8× B300 SXM6, TP=8 + DP=8 + EP=8 + MegaMoE W4A4, ISL=8192 OSL=1024). Each row a fresh container start, May 17 PR knobs (mem-frac=0.90 chunked-prefill=8192 max-running=768 cuda-graph-max-bs=512). Only the mounted topk_v2.cuh varies.

conc=32 (96 prompts)

Variant tput/gpu TPOT Δ vs A
A cluster.sync only (original PR) 914.43 27.22 ms
B sync-before-early-exit 912.98 27.28 ms −0.16%
C A + PDLTriggerSecondary before sync 912.15 27.39 ms −0.25%
D full pattern (this PR) 913.78 27.22 ms −0.07%

conc=64 (192 prompts)

Variant tput/gpu TPOT Δ vs A
A 1348.37 36.62 ms
B 1351.67 36.65 ms +0.24%
D 1351.97 36.64 ms +0.27%

All four variants are perf-equivalent within ≤0.3% run-to-run noise. The fuller pattern (D) doesn't degrade and is slightly better at c=64.

Why D vs A

  • D is more semantically correct (no UB on seq_lens read pre-wait).
  • D's trigger-before-sync placement lets dependent launches start as each CTA finishes its branch, instead of waiting for the implicit completion at CTA exit.
  • D subsumes Fix DSV4 topk v2 fused PDL trigger #25785's intent in the LARGE branch (PDLTriggerSecondary after stage1_prologue). Happy to coordinate with @parrot18 if there's a conflict on merge order — the change is in the same hunk so the conflict is trivial either way.
  • Same crash fix preserved (cluster.sync() at each branch tail).

The mid-bench non-determinism we noted yesterday (cudaErrorIllegalAddress ~9 min into 320- and 640-prompt runs) reproduces under all four kernel variants and is independent of this PR — likely a separate bug in the decode path we'll chase separately. The short-bench numbers above are stable.

// kernel before reading params.seq_lens / scores. This both (1) ensures
// primary-produced data is visible and (2) gives every CTA a symmetric
// PDL state, sidestepping the SM100 cluster crash described below.
device::PDLWaitPrimary<true>();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wait primiary can be moved after loading seq_len. This is because seq_len is metadata, which is not related to the input. We can safely prefetch this part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DSv4 topk_v2 fused-cluster kernel crashes on B300/SM100 with DP-attn + DeepEP

2 participants