Commit 7f55f7c

TimDettmers and claude committed
bench: Add LinearNVFP4 end-to-end benchmarks
LinearNVFP4 vs FP16 nn.Linear on RTX PRO 6000: bs=1-128, hidden=4096, shapes include FFN (11008). ~10x slower than cuBLAS FP16, 3.6x memory savings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1e2dc09 commit 7f55f7c
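The commit message's "3.6x memory savings" figure is consistent with the standard NVFP4 layout: 4-bit values with one FP8 scale per 16-element block. A quick arithmetic sketch (treating that block size as this repo's layout is an assumption; the per-tensor FP32 scale amortizes to ~0 and is ignored):

```python
def bits_per_param_nvfp4(block_size=16, value_bits=4, scale_bits=8):
    """Effective storage per weight for block-scaled 4-bit quantization.

    block_size=16 with one FP8 (E4M3) scale per block is the usual NVFP4
    layout; assuming it matches this implementation.
    """
    return value_bits + scale_bits / block_size

fp16_bits = 16
nvfp4_bits = bits_per_param_nvfp4()      # 4 + 8/16 = 4.5 bits per param
print(round(fp16_bits / nvfp4_bits, 2))  # -> 3.56, i.e. ~3.6x savings
```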

File tree

1 file changed: +15 −0 lines changed


benchmarks/nvfp4_gemm_results.md

Lines changed: 15 additions & 0 deletions
@@ -72,3 +72,18 @@ Despite the performance gap, the implementation provides:
 with 0.000000 relative error (same quantized data, different only in FP32 rounding)
 - **Full Python API**: quantize/dequantize/GEMM/LinearNVFP4 all working end-to-end
 - **NVFP4 output epilogue**: GEMM → quantize chain for layer chaining
+
+## LinearNVFP4 End-to-End Benchmarks
+
+LinearNVFP4 includes activation quantization overhead on top of the GEMM kernel.
+
+| Config | NVFP4 (ms) | FP16 (ms) | Speedup |
+|--------|-----------|----------|---------|
+| bs=1, 4096→4096 (proj) | 0.120 | 0.010 | 0.09x |
+| bs=1, 4096→11008 (FFN) | 0.128 | 0.019 | 0.15x |
+| bs=8, 4096→4096 (proj) | 0.128 | 0.010 | 0.08x |
+| bs=8, 4096→11008 (FFN) | 0.143 | 0.019 | 0.13x |
+| bs=32, 4096→4096 (proj) | 0.147 | 0.013 | 0.08x |
+| bs=32, 4096→11008 (FFN) | 0.228 | 0.021 | 0.09x |
+| bs=128, 4096→4096 (proj) | 0.315 | 0.019 | 0.06x |
+| bs=128, 4096→11008 (FFN) | 0.710 | 0.041 | 0.06x |
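Latencies like those in the table can be reproduced with a simple wall-clock harness. A minimal sketch (hypothetical helper names, not the repo's benchmark script; real CUDA timing must synchronize the device before each clock read, since kernel launches return asynchronously):

```python
import time

def bench_ms(fn, warmup=10, iters=100):
    """Mean wall-clock latency of fn() in milliseconds.

    Hypothetical harness. For GPU kernels, call torch.cuda.synchronize()
    before reading the clock (omitted here to stay stdlib-only).
    """
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

def speedup(nvfp4_ms, fp16_ms):
    """Speedup column of the table: FP16 latency / NVFP4 latency."""
    return fp16_ms / nvfp4_ms

# bs=128, 4096->11008 (FFN) row: 0.041 / 0.710 rounds to 0.06x.
print(round(speedup(0.710, 0.041), 2))  # -> 0.06
```

A speedup below 1.0x means NVFP4 is slower than the FP16 baseline, matching the commit message's "~10x slower than cuBLAS FP16" summary.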

0 commit comments