cuda: disable sparse V skip (warp divergence regression) #105

Merged: TheTom merged 1 commit into feature/turboquant-kv-cache from fix/disable-sparse-v-cuda on Apr 24, 2026


@TheTom commented Apr 24, 2026

Summary

Disables sparse V dequant skip on CUDA. Per-lane branching causes warp divergence that costs more than the skipped dequants save.

Benchmarks (@sztlink, Qwen3-30B-A3B Q4_K_M)

RTX 4090 (SM89):

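| Context | Sparse V ON | OFF (baseline) | Delta |
|---------|-------------|----------------|-------|
| 512     | 59.05       | 59.51          | -0.8% |
| 4K      | 13.39       | 13.77          | -2.8% |
| 8K      | 7.73        | 7.77           | -0.5% |
| 16K     | 3.95        | 3.97           | -0.5% |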

RTX 3090 (SM86):

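| Context | Sparse V ON | OFF (baseline) | Delta |
|---------|-------------|----------------|-------|
| 512     | 32.49       | 32.19          | -0.9% |
| 4K      | 6.40        | 6.38           | -0.3% |
| 8K      | 2.56        | 2.54           | -0.8% |
| 16K     | 1.26        | 1.25           | -0.5% |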

Metal path unaffected (remains enabled, +4% to +23%).

TODO

Revisit with warp-level ballot skip (__ballot_sync + early exit when entire warp is below threshold).
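
A minimal sketch of what a warp-level ballot skip could look like, to make the TODO concrete. This is not the VEC FA kernel from this branch: `dequant_v`, `accumulate_v_ballot`, `run_ballot_demo`, the one-warp launch, and the 8-bit V layout are all hypothetical stand-ins chosen for illustration. The only point it shows is that the skip decision comes from `__ballot_sync`, so it is uniform across the warp and no lane takes a different path from its neighbours. Whether this actually beats the baseline is untested here.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real V dequantization (assumed scale of 0.25).
__device__ __forceinline__ float dequant_v(const unsigned char * v, int idx) {
    return (float) v[idx] * 0.25f;
}

// Illustrative kernel: launch with exactly one warp (32 threads), n >= 1.
// Each loop iteration covers a chunk of 32 KV positions, one per lane.
__global__ void accumulate_v_ballot(const float * p, const unsigned char * v, int n,
                                    float threshold, float * out, int * skipped_chunks) {
    const unsigned FULL_MASK = 0xffffffffu;
    const int lane = threadIdx.x;
    float acc     = 0.0f;
    int   skipped = 0;

    for (int base = 0; base < n; base += 32) {
        const int   i = base + lane;
        const float w = (i < n) ? p[i] : 0.0f;  // out-of-range lanes get weight 0

        // All 32 lanes vote; the resulting mask is identical in every lane.
        const unsigned needed = __ballot_sync(FULL_MASK, w >= threshold);
        if (needed == 0) {  // warp-uniform: the whole warp skips this chunk together
            ++skipped;
            continue;
        }

        // At least one lane is above threshold, so every lane dequantizes.
        // The index is clamped instead of branching; w == 0 cancels the tail lanes.
        const int j = min(i, n - 1);
        acc += w * dequant_v(v, j);
    }

    // Warp reduction; all lanes are converged here because the loop bounds are uniform.
    for (int offset = 16; offset > 0; offset >>= 1) {
        acc += __shfl_down_sync(FULL_MASK, acc, offset);
    }
    if (lane == 0) {
        *out            = acc;
        *skipped_chunks = skipped;
    }
}

// Host-side demo: 3 chunks of 32 weights, only one weight above the threshold.
bool run_ballot_demo(float * out_host, int * skipped_host) {
    const int n = 96;
    float * p = nullptr;
    unsigned char * v = nullptr;
    float * out = nullptr;
    int * skipped = nullptr;

    bool ok = cudaMallocManaged(&p, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&v, n * sizeof(unsigned char)) == cudaSuccess
           && cudaMallocManaged(&out, sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&skipped, sizeof(int)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) {
            p[i] = 0.0f;  // below threshold
            v[i] = 4;     // dequantizes to 1.0 with the assumed scale
        }
        p[32 + 3] = 0.5f;  // single significant weight, in the middle chunk

        accumulate_v_ballot<<<1, 32>>>(p, v, n, 1e-3f, out, skipped);
        ok = cudaDeviceSynchronize() == cudaSuccess;
        if (ok) {
            *out_host     = *out;
            *skipped_host = *skipped;
        }
    }
    cudaFree(p);  // cudaFree(nullptr) is a no-op
    cudaFree(v);
    cudaFree(out);
    cudaFree(skipped);
    return ok;
}
```

Calling `run_ballot_demo` from a small `main` (compiled with `nvcc`) should report two skipped chunks out of three. Unlike a per-lane skip, a chunk that is not skipped still accumulates its below-threshold weights, so the result only differs from the dense path when an entire warp's worth of weights falls under the threshold.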

Per-lane branching in the VEC FA kernel causes warp divergence that
costs more than the skipped dequants save. Benchmarked at -0.3% to
-2.8% on RTX 3090/4090 across all context lengths.

Metal path unaffected (remains enabled, +4% to +23%).

TODO: revisit with warp-level ballot skip (__ballot_sync + early
exit when entire warp is below threshold).

Data: @sztlink (Qwen3-30B-A3B Q4_K_M, CUDA SM86/SM89)
@TheTom TheTom merged commit 11a241d into feature/turboquant-kv-cache Apr 24, 2026
20 of 48 checks passed
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
cuda: disable sparse V skip (warp divergence regression)