Commit f2dc968
cuda: disable sparse V skip (warp divergence regression)
Per-lane branching in the VEC FA kernel causes warp divergence that
costs more than the skipped dequants save. Benchmarked at -0.3% to
-2.8% on RTX 3090/4090 across all context lengths.
Metal path unaffected (remains enabled, +4% to +23%).
TODO: revisit with warp-level ballot skip (__ballot_sync + early
exit when entire warp is below threshold).
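
A minimal sketch of the warp-level ballot skip named in the TODO, not the actual VEC FA kernel code (which is not shown in this commit). It assumes a toy layout where each lane owns one KV position per 32-wide tile; `dequant`, `weighted_sum_ballot`, `run_demo`, and the int8-times-scale format are hypothetical stand-ins. The point is that `__ballot_sync` gives every lane the same mask, so the skip branch is warp-uniform and cannot diverge.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int WARP = 32;
constexpr unsigned FULL_MASK = 0xffffffffu;

// Hypothetical stand-in for a quantized V dequant: int8 value times a scale.
__host__ __device__ inline float dequant(signed char q, float scale) {
    return scale * static_cast<float>(q);
}

// One warp computes sum_i w[i] * dequant(v[i]); n must be a multiple of 32.
__global__ void weighted_sum_ballot(const float* w, const signed char* v,
                                    float scale, float thr, int n,
                                    float* out, int* tiles_skipped) {
    const int lane = threadIdx.x;
    float acc = 0.0f;
    int skipped = 0;
    for (int base = 0; base < n; base += WARP) {
        const float wi = w[base + lane];
        // Every lane votes; the resulting mask is identical in all lanes.
        const unsigned above = __ballot_sync(FULL_MASK, wi > thr);
        // Warp-uniform early exit: taken only when the entire warp is below threshold.
        if (above == 0) { ++skipped; continue; }
        // Otherwise all lanes dequantize, with no per-lane branch.
        acc += wi * dequant(v[base + lane], scale);
    }
    // Reduce the per-lane partial sums into lane 0.
    for (int off = WARP / 2; off > 0; off >>= 1)
        acc += __shfl_down_sync(FULL_MASK, acc, off);
    if (lane == 0) { *out = acc; *tiles_skipped = skipped; }
}

// Builds four tiles (dense, all zero, one hot lane, all tiny) and checks the
// kernel against a host reference that applies the same per-tile skip rule.
bool run_demo() {
    const int n = 4 * WARP;
    const float scale = 0.5f, thr = 1e-6f;
    std::vector<float> w(n, 0.0f);
    std::vector<signed char> v(n);
    for (int i = 0; i < n; ++i) v[i] = static_cast<signed char>(i % 7 - 3);
    for (int i = 0; i < WARP; ++i) w[i] = 0.25f;          // tile 0: dense
    w[2 * WARP + 5] = 0.5f;                               // tile 2: one lane above
    for (int i = 3 * WARP; i < n; ++i) w[i] = 1e-7f;      // tile 3: all below

    float ref = 0.0f;
    int ref_skipped = 0;
    for (int base = 0; base < n; base += WARP) {
        bool any = false;
        for (int l = 0; l < WARP; ++l) any = any || (w[base + l] > thr);
        if (!any) { ++ref_skipped; continue; }
        for (int l = 0; l < WARP; ++l) ref += w[base + l] * dequant(v[base + l], scale);
    }

    float *d_w = nullptr, *d_out = nullptr;
    signed char* d_v = nullptr;
    int* d_skipped = nullptr;
    if (cudaMalloc(&d_w, n * sizeof(float)) != cudaSuccess) return false;
    cudaMalloc(&d_v, n);
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_skipped, sizeof(int));
    cudaMemcpy(d_w, w.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, v.data(), n, cudaMemcpyHostToDevice);
    weighted_sum_ballot<<<1, WARP>>>(d_w, d_v, scale, thr, n, d_out, d_skipped);
    float out = 0.0f;
    int skipped = -1;
    cudaMemcpy(&out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&skipped, d_skipped, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_w); cudaFree(d_v); cudaFree(d_out); cudaFree(d_skipped);
    return skipped == ref_skipped && std::fabs(out - ref) < 1e-5f;
}

int main() {
    const bool ok = run_demo();
    std::printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```

Compared with a per-lane `if (wi > thr)`, this skips less often (only when all 32 lanes agree), so whether it pays off depends on how clustered the below-threshold positions are; that is an open question for the revisit, not a measured result.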
Data: @sztlink (Qwen3-30B-A3B Q4_K_M, CUDA SM86/SM89)
1 parent 9e3fb40 commit f2dc968
1 file changed
Lines changed: 10 additions & 0 deletions
Added lines (new-file numbering): 336-339, 348, 381-384, 393