Commit b92e6c4
perf: full panel packing for MatMul achieving 20+ GFLOPS
Optimize SimdMatMul with full panel packing for both A and B matrices,
improving from ~16 GFLOPS to 20+ GFLOPS (25% improvement).
Key optimizations:
- Pack A as [kc][MR] panels: 8 rows interleaved per k value
- Pack B as [kc][NR] panels: 16 columns contiguous per k value
- Micro-kernel accesses both as contiguous memory:
  - aPanel[k * 8 + row] instead of packA[row * kc + k]
  - bPanel[k * 16 + col] instead of packB[k * n + col]
- 4x k-loop unrolling with 16 Vector256 accumulators
- FMA (Fused Multiply-Add): one instruction does the multiply and the add, so 2 FLOPs per lane
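
The packed layouts and indexing above can be sketched in scalar Python. This is a minimal illustration, not the SimdMatMul code: the function names (pack_a_panel, pack_b_panel, micro_kernel) and the row-major inputs with leading dimensions lda/ldb are assumptions, and the k-loop unrolling and Vector256 accumulators are replaced by a plain list.

```python
MR, NR = 8, 16  # micro-kernel tile size quoted in the commit message


def pack_a_panel(a, lda, row0, k0, kc):
    """Pack MR rows of row-major A into a [kc][MR] panel (hypothetical helper)."""
    panel = [0.0] * (kc * MR)
    for k in range(kc):
        for r in range(MR):
            # The MR row values for one k end up next to each other.
            panel[k * MR + r] = a[(row0 + r) * lda + (k0 + k)]
    return panel


def pack_b_panel(b, ldb, k0, col0, kc):
    """Pack NR columns of row-major B into a [kc][NR] panel (hypothetical helper)."""
    panel = [0.0] * (kc * NR)
    for k in range(kc):
        for c in range(NR):
            # The NR column values for one k end up next to each other.
            panel[k * NR + c] = b[(k0 + k) * ldb + (col0 + c)]
    return panel


def micro_kernel(a_panel, b_panel, kc):
    """Accumulate an MR x NR tile of C (row-major) from the two packed panels."""
    acc = [0.0] * (MR * NR)  # stands in for the 16 vector accumulators
    for k in range(kc):
        for r in range(MR):
            a_val = a_panel[k * MR + r]        # aPanel[k * 8 + row]
            for c in range(NR):
                acc[r * NR + c] += a_val * b_panel[k * NR + c]  # bPanel[k * 16 + col]
    return acc
```

For each k the kernel reads MR consecutive values from the A panel and NR consecutive values from the B panel, so both reads advance linearly through memory.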
Performance results (1024x1024, single-threaded):
- Before (row-major B): ~16 GFLOPS
- After (full panel): ~20-21 GFLOPS
- Theoretical peak: 96 GFLOPS (AVX2 @ 3GHz)
- Efficiency: ~21% (excellent for C#/.NET)
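
One way to arrive at the peak and efficiency figures above, as a hedged sketch: the 96 GFLOPS number is consistent with 8 float lanes per 256-bit vector, 2 FLOPs per FMA, and 2 FMA instructions issued per cycle at 3 GHz. The two-FMAs-per-cycle figure is an assumption about the CPU, not something the commit states, and the function names are hypothetical.

```python
def peak_gflops(lanes=8, flops_per_fma=2, fma_per_cycle=2, clock_ghz=3.0):
    """Single-core peak; fma_per_cycle=2 is an assumed hardware property."""
    return lanes * flops_per_fma * fma_per_cycle * clock_ghz


def matmul_gflops(n, seconds):
    """An n x n x n matmul performs 2 * n^3 FLOPs (one multiply and one add per term)."""
    return 2 * n ** 3 / seconds / 1e9


peak = peak_gflops()          # 96.0 under the assumptions above
efficiency = 20.0 / peak      # fraction of peak at 20 GFLOPS
```

At 1024x1024 the multiply is 2 * 1024^3 FLOPs, so 20 GFLOPS corresponds to a run time of roughly 0.107 s.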
Why panel packing helps:
- Original: B access stride = N * 4 bytes (cache-unfriendly)
- Panel: B access stride = 64 bytes (one cache line)
- Both A and B now have optimal sequential access patterns
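
The stride comparison above as arithmetic, assuming 4-byte floats and N = 1024 as in the benchmark. The helper names are hypothetical.

```python
FLOAT_BYTES = 4  # assumes single-precision elements


def row_major_b_stride(n):
    """Bytes between B[k][col] and B[k+1][col] when B is stored row-major."""
    return n * FLOAT_BYTES


def panel_b_stride(nr=16):
    """Bytes between consecutive k rows inside a packed [kc][NR] panel."""
    return nr * FLOAT_BYTES


# Byte offsets of the first four k steps in each layout, for N = 1024.
row_major_offsets = [k * row_major_b_stride(1024) for k in range(4)]
panel_offsets = [k * panel_b_stride() for k in range(4)]
```

With N = 1024 the row-major walk jumps 4096 bytes per k step, while the packed panel advances 64 bytes, so successive steps read adjacent memory.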
Algorithm: GEBP (general block-times-panel multiply) with cache blocking:
- MC=64 (rows per packed A block; 64 x 256 floats = 64 KB, fits L2)
- KC=256 (K depth per block)
- MR=8 (micro-kernel rows)
- NR=16 (micro-kernel cols = 2 vectors)
1 parent 0f6aac9 commit b92e6c4
1 file changed
Lines changed: 263 additions & 182 deletions