Commit a629ab1
fix: Add stream parameter to cgemm_nvfp4 for CUDA graph support
The kernel launch now uses the caller's stream via <<<grid, threads, 0,
stream>>>. The Python dispatch passes _get_tensor_stream(A_packed).
This enables CUDA graph capture for accurate benchmarking.

1 parent 9c96365, commit a629ab1
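
The stream change described in the commit message can be illustrated with a minimal, self-contained sketch. It is not the actual `cgemm_nvfp4` code: `scale_kernel` and `launch_scale` are hypothetical stand-ins, since the diff content is not available here. The sketch assumes only the pattern named in the message, a launcher that accepts a `cudaStream_t` and forwards it as the fourth launch parameter, and shows why that matters: stream capture records only work issued to the capturing stream, so a launcher that always uses the default stream cannot be captured into a graph this way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a failing CUDA runtime call and exit main with a non-zero status.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel (NOT the real cgemm_nvfp4 kernel).
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical launcher: the stream is a parameter, so the caller decides
// where the kernel runs instead of the launcher using the default stream.
void launch_scale(float* data, float factor, int n, cudaStream_t stream) {
    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);
    scale_kernel<<<grid, threads, 0, stream>>>(data, factor, n);
}

int main() {
    const int n = 1024;
    static float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev = nullptr;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CHECK(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice));

    // Capture the launch into a graph. The kernel is recorded, not executed,
    // because it is issued to the stream that is being captured.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    launch_scale(dev, 2.0f, n, stream);
    CHECK(cudaStreamEndCapture(stream, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Replay the captured graph three times: each replay doubles the data.
    for (int rep = 0; rep < 3; ++rep) CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("host[0] = %.1f\n", host[0]);
    int ok = (host[0] == 8.0f);  // 1.0 doubled three times

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaFree(dev));
    CHECK(cudaStreamDestroy(stream));
    return ok ? 0 : 1;
}
```

Replaying a captured graph removes most per-launch CPU overhead from the timed region, which is presumably what the commit message means by accurate benchmarking. On the Python side, the message says the dispatch obtains the stream via `_get_tensor_stream(A_packed)`; that helper is not reproduced here.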
2 files changed: 3 additions, 2 deletions.

File 1 (file name and diff content not preserved): one line added at new line 931.
File 2 (file name and diff content not preserved): line 488 and line 496 each replaced.