- 2048x2048 matrix multiply is 650x slower than expected
- Performance: 0.13 GFLOPS instead of the expected ~130 GFLOPS (see the throughput sketch below)
- Smaller matrices work fine: 256x256 (23 GFLOPS), 512x512 (67 GFLOPS), 1024x1024 (84 GFLOPS)
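For context, a minimal sketch of how such GFLOPS figures are typically computed, assuming the standard 2·m·n·k flop count (the project's own counter may use a different convention):

```csharp
// Each output element of an m x n result with inner dimension k
// costs k multiplies + k adds = 2k flops.
static double GigaFlops(long m, long n, long k, double seconds) =>
    2.0 * m * n * k / seconds / 1e9;

// Under this convention, a 2048^3 multiply at 0.13 GFLOPS takes ~132 seconds.
```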
```csharp
(Index2D index,
 ArrayView2D<float, Stride2D.DenseX> a,
 ArrayView2D<float, Stride2D.DenseX> b,
 ArrayView2D<float, Stride2D.DenseX> result,
 int k) =>
{
    float sum = 0;
    for (int i = 0; i < k; i++)
        sum += a[index.X, i] * b[i, index.Y]; // PROBLEM HERE: strided walk down a column of B
    result[index] = sum;
}
```

Matrix B is stored in row-major order (DenseX with row stride = n; n = k = 2048 in the failing case).
When accessing b[i, index.Y] for a fixed column (e.g., column 0):
- i=0: accesses b[0,0] at memory offset 0
- i=1: accesses b[1,0] at memory offset 2048
- i=2: accesses b[2,0] at memory offset 4096
- ... (each access is 2048 elements apart)
- Cache line size is typically 64-128 bytes (16-32 float elements)
- Accessing elements 2048 floats (8 KB) apart means every single access misses the cache
- With 2048 iterations per thread, all 2048 accesses are L1 cache misses
- With 4,194,304 threads (2048x2048), this becomes 8.6 billion cache misses (arithmetic checked below)
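A quick sanity check of that miss count (a sketch; names are illustrative):

```csharp
// Row-major offset of b[i, j] with row stride n:
// consecutive i values land n floats (n * 4 bytes) apart.
static long Offset(long i, long j, long n) => i * n + j;

long threads = 2048L * 2048;                   // one thread per output element: 4,194,304
long missesPerThread = 2048;                   // k loop iterations, each touching a new row
long totalMisses = threads * missesPerThread;  // 8,589,934,592 ~ 8.6 billion
```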
| Size | Stride (elements) | Expected impact | Observed throughput |
|---|---|---|---|
| 256x256 | 256 | OK | 23 GFLOPS |
| 512x512 | 512 | OK | 67 GFLOPS |
| 1024x1024 | 1024 | Degraded | 84 GFLOPS |
| 2048x2048 | 2048 | Catastrophic | 0.13 GFLOPS |

The smaller sizes stride the same way; they most likely stay fast because B itself (0.25-4 MB) still fits in cache, while at 2048x2048 B is 16 MB and every pass misses.
- Change kernel to use B^T instead of B
- Kernel becomes: `sum += a[index.X, i] * b_transposed[index.Y, i]`
- This makes both accesses consecutive (coalesced) in memory; see the sketch after this list
- Expected improvement: 50-100x speedup for large matrices
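A sketch of what the Solution 1 kernel could look like, mirroring the naive kernel's ILGPU-style signature (the method name and comments are illustrative):

```csharp
static void MatMulTransposedKernel(
    Index2D index,                               // (row, col) of the output element
    ArrayView2D<float, Stride2D.DenseX> a,       // m x k
    ArrayView2D<float, Stride2D.DenseX> bT,      // n x k, i.e. B transposed
    ArrayView2D<float, Stride2D.DenseX> result,  // m x n
    int k)
{
    float sum = 0;
    for (int i = 0; i < k; i++)
        sum += a[index.X, i] * bT[index.Y, i];   // both reads walk along a row: unit stride
    result[index] = sum;
}
```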
- Load tiles of A and B into fast shared memory
- Compute using shared memory (1000+ GB/s effective bandwidth)
- Expected improvement: 10-20x speedup (a tiled sketch follows this list)
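A sketch of the Solution 2 kernel, following the tiling pattern from ILGPU's matrix-multiply sample; TILE_SIZE and the bounds handling are illustrative, and this explicitly grouped kernel would be loaded with LoadStreamKernel rather than an auto-grouped loader:

```csharp
const int TILE_SIZE = 16;

static void MatMulTiledKernel(
    ArrayView2D<float, Stride2D.DenseX> a,
    ArrayView2D<float, Stride2D.DenseX> b,
    ArrayView2D<float, Stride2D.DenseX> c)
{
    var global = Grid.GlobalIndex.XY;   // this thread's output element
    var x = Group.IdxX;                 // position inside the tile
    var y = Group.IdxY;

    var aTile = SharedMemory.Allocate2D<float, Stride2D.DenseX>(
        new Index2D(TILE_SIZE, TILE_SIZE), new Stride2D.DenseX(TILE_SIZE));
    var bTile = SharedMemory.Allocate2D<float, Stride2D.DenseX>(
        new Index2D(TILE_SIZE, TILE_SIZE), new Stride2D.DenseX(TILE_SIZE));

    float sum = 0;
    for (int i = 0; i < a.IntExtent.Y; i += TILE_SIZE)
    {
        // Cooperatively stage one tile of A and one tile of B in shared memory.
        aTile[x, y] = global.X < a.IntExtent.X && y + i < a.IntExtent.Y
            ? a[global.X, y + i] : 0f;
        bTile[x, y] = x + i < b.IntExtent.X && global.Y < b.IntExtent.Y
            ? b[x + i, global.Y] : 0f;
        Group.Barrier();

        // Every inner-loop read now hits fast shared memory.
        for (int t = 0; t < TILE_SIZE; t++)
            sum += aTile[x, t] * bTile[t, y];
        Group.Barrier();
    }

    if (global.X < c.IntExtent.X && global.Y < c.IntExtent.Y)
        c[global] = sum;
}
```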
- Add optimized kernel that expects B^T
- Transpose B on GPU before matmul (one-time cost)
- Use transposed B for the actual multiply
- For small matrices, skip transpose (overhead not worth it)
- Integer Overflow Bug (CRITICAL)
  - `int totalOps = m * n * p` overflows for 2048x2048 matrices (8.6B > int.MaxValue)
  - Fix: changed to `long totalOps = (long)m * n * p` (see the snippet after this list)
  - Impact: without this fix, large matrices silently fell back to CPU!
- Transpose-B Optimization
  - Added optimized kernel that expects B^T (transposed) for coalesced memory access
  - Transpose is performed on the GPU before the matmul
  - Threshold set to k >= 2048 (for smaller matrices the transpose overhead outweighs the gain)
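The overflow from the first finding is easy to reproduce in isolation; a minimal sketch (2048^3 is exactly 2^33, so the 32-bit product wraps to zero):

```csharp
int m = 2048, n = 2048, p = 2048;

int bad = m * n * p;              // 32-bit multiply wraps: 2^33 mod 2^32 == 0
long alsoBad = (long)(m * n * p); // still 0 -- the cast happens after the int multiply
long good = (long)m * n * p;      // promote first: 8,589,934,592

// With totalOps == 0, a "large enough for the GPU?" threshold check would fail --
// consistent with the silent CPU fallback described above.
```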
| Size | Before (GFLOPS) | After (GFLOPS) | Improvement |
|---|---|---|---|
| 256x256 | 23 | 22.50 | Same |
| 512x512 | 67 | 65.02 | Same |
| 1024x1024 | 84 | 86.16 | Same |
| 2048x2048 | 0.13 | 51.90 | 400x faster |
The combination of two defects caused the catastrophic slowdown:
- Integer overflow caused the GPU path to be skipped entirely for large matrices
- Even on the GPU, the naive kernel's strided memory access pattern caused:
  - 8.6 billion cache misses for 2048x2048 matrices
  - accesses 2048 floats (8 KB) apart within every thread
  - completely ineffective L1/L2 caching
For k >= 2048, we now:
- Allocate a buffer for B^T on the GPU
- Transpose B using a separate kernel: `B[k,n] -> B^T[n,k]`
- Use the optimized kernel: `sum += a[row, i] * bT[col, i]`
- Both memory accesses are now coalesced (consecutive in memory)

For k < 2048, we keep the naive kernel because the transpose overhead isn't worth it. A sketch of the transpose kernel and the dispatch logic follows.
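Putting it together, a sketch of the transpose kernel plus dispatch, assuming kernel delegates loaded once via LoadAutoGroupedStreamKernel; all names here are illustrative, not the repository's actual API:

```csharp
// One thread per element of the k x n source matrix B.
static void TransposeKernel(
    Index2D index,                           // (row, col) into B
    ArrayView2D<float, Stride2D.DenseX> b,   // k x n
    ArrayView2D<float, Stride2D.DenseX> bT)  // n x k
{
    bT[index.Y, index.X] = b[index.X, index.Y];
}

// Delegates assumed loaded once at startup via accelerator.LoadAutoGroupedStreamKernel<...>(...).
Action<Index2D, ArrayView2D<float, Stride2D.DenseX>,
       ArrayView2D<float, Stride2D.DenseX>> transposeKernel;
Action<Index2D, ArrayView2D<float, Stride2D.DenseX>, ArrayView2D<float, Stride2D.DenseX>,
       ArrayView2D<float, Stride2D.DenseX>, int> matMulTransposedKernel, matMulNaiveKernel;

void MatMulGpu(
    Accelerator acc,
    ArrayView2D<float, Stride2D.DenseX> a,   // m x k
    ArrayView2D<float, Stride2D.DenseX> b,   // k x n
    ArrayView2D<float, Stride2D.DenseX> c,   // m x n
    int m, int n, int k)
{
    if (k >= 2048)
    {
        using var bT = acc.Allocate2DDenseX<float>(new Index2D(n, k));
        transposeKernel(new Index2D(k, n), b, bT.View);              // one-time cost
        matMulTransposedKernel(new Index2D(m, n), a, bT.View, c, k); // coalesced kernel
        acc.Synchronize(); // finish before bT is disposed at the end of this block
    }
    else
    {
        matMulNaiveKernel(new Index2D(m, n), a, b, c, k); // transpose not worth it
        acc.Synchronize();
    }
}
```

The k >= 2048 cutoff mirrors the threshold from the findings above; in practice it would likely be tuned per device.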