Commit f4e423f

feat: cache-blocked SIMD MatMul achieving 14-17 GFLOPS
Add SimdMatMul.cs with GEBP (General Block Panel) algorithm:
- Cache blocking: MC=64, KC=256 tuned for L1/L2 cache
- 8x16 micro-kernel with 16 vector accumulators
- K-loop unrolled by 4 for better ILP
- FMA support when available
- Aligned memory allocation for packing buffers

Performance improvement (single-threaded):
- 256x256: 4.6 → 16.8 GFLOPS (3.7x)
- 512x512: 5.8 → 16.8 GFLOPS (2.9x)
- 1024x1024: 5.5 → 16.1 GFLOPS (2.9x)
- 2048x2048: 3.5 → 14.7 GFLOPS (4.2x)

This approaches OpenBLAS single-thread performance (~20-40 GFLOPS) without requiring parallelization.
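
To make the cache-blocking idea concrete, here is a minimal scalar sketch. It is not the SimdMatMul.cs added by this commit, whose contents are not shown on this page. The class name `BlockedMatMulSketch`, its methods, and the use of managed arrays are assumptions for illustration. Only the block sizes MC=64 and KC=256 come from the commit message, and the 8x16 SIMD micro-kernel is replaced by plain loops.

```csharp
using System;

// Hypothetical illustration only; not the code from this commit.
public static class BlockedMatMulSketch
{
    // Block sizes quoted in the commit message.
    const int MC = 64;   // rows of A handled per block
    const int KC = 256;  // slice of the K dimension handled per block

    // Computes C += A * B for row-major A (M x K), B (K x N), C (M x N).
    // The caller is expected to pass a zero-initialized C.
    public static void Multiply(float[] a, float[] b, float[] c, int M, int N, int K)
    {
        var packedA = new float[MC * KC]; // packing buffer, reused for every block

        for (int k0 = 0; k0 < K; k0 += KC)
        {
            int kc = Math.Min(KC, K - k0);
            for (int i0 = 0; i0 < M; i0 += MC)
            {
                int mc = Math.Min(MC, M - i0);

                // Pack the mc x kc block of A contiguously so it stays cache-resident.
                for (int i = 0; i < mc; i++)
                    Array.Copy(a, (i0 + i) * K + k0, packedA, i * kc, kc);

                // Scalar stand-in for the micro-kernel: block of A times a panel of B.
                for (int i = 0; i < mc; i++)
                {
                    int cRow = (i0 + i) * N;
                    for (int k = 0; k < kc; k++)
                    {
                        float aik = packedA[i * kc + k];
                        int bRow = (k0 + k) * N;
                        for (int j = 0; j < N; j++)
                            c[cRow + j] += aik * b[bRow + j];
                    }
                }
            }
        }
    }

    // An M x N x K multiply performs 2*M*N*K floating-point operations.
    public static double Gflops(int M, int N, int K, double seconds)
        => 2.0 * M * N * K / seconds / 1e9;
}
```

Under the 2*M*N*K convention assumed in `Gflops`, the reported 16.1 GFLOPS at 1024x1024 corresponds to roughly 0.13 s per multiply.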
1 parent e88710c commit f4e423f

2 files changed

Lines changed: 404 additions & 4 deletions

src/NumSharp.Core/Backends/Default/Math/BLAS/Default.MatMul.2D2D.cs

Lines changed: 2 additions & 4 deletions
@@ -81,14 +81,12 @@ private static unsafe bool TryMatMulSimd(NDArray left, NDArray right, NDArray re
 {
     case NPTypeCode.Single:
     {
-        var kernel = ILKernelGenerator.GetMatMulKernel<float>();
-        if (kernel == null) return false;
-
         float* a = (float*)left.Address;
         float* b = (float*)right.Address;
         float* c = (float*)result.Address;
 
-        kernel(a, b, c, M, N, K);
+        // Use cache-blocked implementation for better performance
+        SimdMatMul.MatMulFloat(a, b, c, M, N, K);
         return true;
     }
