Commit f4e423f
feat: cache-blocked SIMD MatMul achieving 14-17 GFLOPS
Add SimdMatMul.cs with the GEBP (general block-times-panel) algorithm:
- Cache blocking: MC=64, KC=256 tuned for L1/L2 cache
- 8x16 micro-kernel with 16 vector accumulators
- K-loop unrolled by 4 for better ILP
- FMA support when available
- Aligned memory allocation for packing buffers
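
To illustrate only the cache-blocking item above, here is a minimal sketch, not the code in SimdMatMul.cs. It reuses the MC=64 and KC=256 block sizes from the list; everything else is an assumption made for illustration: the class and method names are hypothetical, the element type is assumed to be float, matrices are assumed row-major, and the portable `System.Numerics.Vector<float>` type stands in for whatever intrinsics the real kernel uses. Packing buffers, the 8x16 micro-kernel, K-loop unrolling, and FMA are deliberately left out.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration of loop blocking over M and K; not the commit's implementation.
static class BlockedMatMulSketch
{
    const int MC = 64;   // rows of A handled per block (value taken from the commit message)
    const int KC = 256;  // slice of the shared K dimension per block (value taken from the commit message)

    // Computes C += A * B for row-major arrays: A is m x k, B is k x n, C is m x n.
    public static void Multiply(float[] a, float[] b, float[] c, int m, int k, int n)
    {
        int w = Vector<float>.Count; // SIMD lanes available for float on this machine

        for (int kk = 0; kk < k; kk += KC)          // walk K in slices of at most KC
        {
            int kEnd = Math.Min(kk + KC, k);
            for (int ii = 0; ii < m; ii += MC)      // walk M in blocks of at most MC rows
            {
                int iEnd = Math.Min(ii + MC, m);
                for (int i = ii; i < iEnd; i++)
                {
                    for (int p = kk; p < kEnd; p++)
                    {
                        float aip = a[i * k + p];
                        var va = new Vector<float>(aip); // broadcast A[i, p] to all lanes

                        int j = 0;
                        for (; j <= n - w; j += w)       // vectorized update of row i of C
                        {
                            var vc = new Vector<float>(c, i * n + j);
                            var vb = new Vector<float>(b, p * n + j);
                            (vc + va * vb).CopyTo(c, i * n + j);
                        }
                        for (; j < n; j++)               // scalar tail when n is not a multiple of w
                        {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Calling `BlockedMatMulSketch.Multiply(a, b, c, m, k, n)` with a zero-filled `c` yields the plain product. The point of the blocking is that each MC x KC block of A is reused against the same KC-row slice of B while that data is still likely to be in cache; a GEBP implementation goes further by packing those blocks into contiguous buffers and feeding them to a register-tiled micro-kernel.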
Performance improvement (single-threaded):
- 256x256: 4.6 → 16.8 GFLOPS (3.7x)
- 512x512: 5.8 → 16.8 GFLOPS (2.9x)
- 1024x1024: 5.5 → 16.1 GFLOPS (2.9x)
- 2048x2048: 3.5 → 14.7 GFLOPS (4.2x)
This approaches OpenBLAS single-thread performance (~20-40 GFLOPS)
without requiring parallelization.
1 parent e88710c commit f4e423f
2 files changed
Lines changed: 404 additions & 4 deletions
Lines changed: 2 additions & 4 deletions
Diff hunk at lines 81-94 (old) / 81-92 (new): old lines 84-86 and 91 removed, new lines 88-89 added; line contents not captured.