Commit b92e6c4

perf: full panel packing for MatMul achieving 20+ GFLOPS

Optimize SimdMatMul with full panel packing for both A and B matrices,
improving from ~16 GFLOPS to 20+ GFLOPS (25% improvement).

Key optimizations:
- Pack A as [kc][MR] panels: 8 rows interleaved per k value
- Pack B as [kc][NR] panels: 16 columns contiguous per k value
- Micro-kernel accesses both as contiguous memory:
  - aPanel[k * 8 + row] instead of packA[row * kc + k]
  - bPanel[k * 16 + col] instead of packB[k * n + col]
- 4x k-loop unrolling with 16 Vector256 accumulators
- FMA (Fused Multiply-Add) for 2x FLOP throughput

Performance results (1024x1024, single-threaded):
- Before (row-major B): ~16 GFLOPS
- After (full panel): ~20-21 GFLOPS
- Theoretical peak: 96 GFLOPS (AVX2 @ 3GHz)
- Efficiency: ~21% (excellent for C#/.NET)

Why panel packing helps:
- Original: B access stride = N * 4 bytes (cache-unfriendly)
- Panel: B access stride = 64 bytes (one cache line)
- Both A and B now have optimal sequential access patterns

Algorithm: GEBP (General Block Panel) with cache blocking:
- MC=64 (rows per A panel, fits L2)
- KC=256 (K depth per block)
- MR=8 (micro-kernel rows)
- NR=16 (micro-kernel cols = 2 vectors)
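The panel layouts and index expressions named in the commit message can be illustrated with a minimal scalar sketch. This is not the code in SimdMatMul.cs: the class and method names (`PanelPackingSketch`, `PackA`, `PackB`, `MicroKernel`, `Multiply`) are hypothetical, the inner loops are plain scalar C# rather than Vector256/FMA, MC blocking and k-loop unrolling are omitted, and the sketch assumes row-major inputs with `m` a multiple of 8 and `n` a multiple of 16.

```csharp
using System;

// Hypothetical illustration of [kc][MR] / [kc][NR] panel packing.
public static class PanelPackingSketch
{
    public const int MR = 8;    // micro-kernel rows (from the commit message)
    public const int NR = 16;   // micro-kernel columns (from the commit message)
    public const int KC = 256;  // K depth per block (from the commit message)

    // Copy an MR x kc block of row-major A into [kc][MR] order:
    // the 8 row values for one k sit next to each other.
    public static float[] PackA(float[] a, int lda, int rowStart, int kStart, int kc)
    {
        var panel = new float[kc * MR];
        for (int k = 0; k < kc; k++)
            for (int row = 0; row < MR; row++)
                panel[k * MR + row] = a[(rowStart + row) * lda + kStart + k];
        return panel;
    }

    // Copy a kc x NR block of row-major B into [kc][NR] order:
    // the 16 column values for one k are contiguous (64 bytes of floats).
    public static float[] PackB(float[] b, int ldb, int kStart, int colStart, int kc)
    {
        var panel = new float[kc * NR];
        for (int k = 0; k < kc; k++)
            for (int col = 0; col < NR; col++)
                panel[k * NR + col] = b[(kStart + k) * ldb + colStart + col];
        return panel;
    }

    // Scalar stand-in for the micro-kernel: C[8 x 16] += aPanel * bPanel.
    // Both panels are read strictly front to back as k advances.
    public static void MicroKernel(float[] aPanel, float[] bPanel, int kc,
                                   float[] c, int ldc, int rowStart, int colStart)
    {
        for (int k = 0; k < kc; k++)
            for (int row = 0; row < MR; row++)
            {
                float aVal = aPanel[k * MR + row];
                for (int col = 0; col < NR; col++)
                    c[(rowStart + row) * ldc + colStart + col] += aVal * bPanel[k * NR + col];
            }
    }

    // C = A * B for row-major A (m x kDim) and B (kDim x n).
    // A is repacked for every column panel here, which a tuned kernel would avoid.
    public static float[] Multiply(float[] a, float[] b, int m, int n, int kDim)
    {
        if (m % MR != 0 || n % NR != 0)
            throw new ArgumentException("sketch assumes m % 8 == 0 and n % 16 == 0");
        var c = new float[m * n];
        for (int kStart = 0; kStart < kDim; kStart += KC)
        {
            int kc = Math.Min(KC, kDim - kStart);
            for (int colStart = 0; colStart < n; colStart += NR)
            {
                float[] bPanel = PackB(b, n, kStart, colStart, kc);
                for (int rowStart = 0; rowStart < m; rowStart += MR)
                {
                    float[] aPanel = PackA(a, kDim, rowStart, kStart, kc);
                    MicroKernel(aPanel, bPanel, kc, c, n, rowStart, colStart);
                }
            }
        }
        return c;
    }
}
```

In this layout the micro-kernel's reads advance by 8 floats of A and 16 floats of B per k step, which is the sequential access pattern the commit message credits for the speedup. A vectorized version would presumably hold each row of the 8x16 C tile in two 8-float vectors, which matches the 16 accumulators mentioned above.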

1 parent 0f6aac9 commit b92e6c4

1 file changed

src/NumSharp.Core/Backends/Kernels
- SimdMatMul.cs
