Commit 0ba4950
perf: Remove unnecessary workspace allocation from NVFP4 GEMM
CUTLASS get_workspace_size() returns 0 for our GEMM configuration
(no split-k, simple epilogue). Pass nullptr instead of constructing
a cutlass::device_memory::allocation which calls cudaMalloc/cudaFree
on every invocation. This makes the GEMM kernel CUDA-graph-capturable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9d2f452 commit 0ba4950
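The reasoning in the commit message (query the workspace size, and pass a null workspace only when that size is zero) can be sketched in plain C++. This is a minimal illustration and not the code from this commit: `MockGemm`, its `Arguments` struct, the `Status` enum, the 1024-byte slice size, and `launch_without_workspace` are hypothetical stand-ins that only mimic the shape of a CUTLASS device-level GEMM, so the sketch compiles without CUDA or CUTLASS.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes; the real CUTLASS Status enum is different.
enum class Status { kSuccess, kErrorWorkspaceNull };

// Hypothetical stand-in for a CUTLASS device-level GEMM (not the real API).
struct MockGemm {
  struct Arguments {
    int split_k_slices = 1;  // 1 means no split-k
  };

  // Assumption for illustration: scratch space is needed only when partial
  // results must be reduced across split-k slices (1024 bytes per slice).
  static std::size_t get_workspace_size(const Arguments& args) {
    return args.split_k_slices > 1
               ? 1024u * static_cast<std::size_t>(args.split_k_slices)
               : 0u;
  }

  // Rejects a null workspace when the configuration actually needs one.
  Status initialize(const Arguments& args, void* workspace) {
    if (get_workspace_size(args) > 0 && workspace == nullptr) {
      return Status::kErrorWorkspaceNull;
    }
    return Status::kSuccess;
  }
};

// Launch path with no per-call allocation. Passing nullptr is valid only
// while the reported workspace size stays zero, so check it up front.
inline Status launch_without_workspace(const MockGemm::Arguments& args) {
  if (MockGemm::get_workspace_size(args) != 0) {
    return Status::kErrorWorkspaceNull;  // configuration now needs scratch space
  }
  MockGemm gemm;
  return gemm.initialize(args, /*workspace=*/nullptr);
}
```

The up-front size check is a design choice of this sketch rather than something the commit message describes. It keeps the null-workspace path from silently breaking if the GEMM configuration later changes to one that needs scratch space, such as split-k.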
1 file changed: 2 additions, 5 deletions.
Diff (line content not preserved in this capture):
| Original file lines | Diff lines | Change |
|---|---|---|
| 101-103 | | 3 lines removed |
| 112 | 109 | 1 line removed, 1 line added |
| 118 | 115 | 1 line removed, 1 line added |