- 2048x2048 matrix multiply is 650x slower than expected
- Performance: 0.13 GFLOPS instead of the expected ~130 GFLOPS (see the throughput sketch below)
- Smaller matrices work fine: 256x256 (23 GFLOPS), 512x512 (67 GFLOPS), 1024x1024 (84 GFLOPS)
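For context, a minimal sketch of how such GFLOPS figures are typically computed, assuming the standard 2·m·n·k flop count (the project's own counter may use a different convention):

```csharp
// Each output element of an m x n result with inner dimension k
// costs k multiplies + k adds = 2k flops.
static double GigaFlops(long m, long n, long k, double seconds) =>
    2.0 * m * n * k / seconds / 1e9;

// Under this convention, a 2048^3 multiply at 0.13 GFLOPS takes ~132 seconds.
```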
```csharp
(Index2D index,
 ArrayView2D<float, Stride2D.DenseX> a,
 ArrayView2D<float, Stride2D.DenseX> b,
 ArrayView2D<float, Stride2D.DenseX> result,
 int k) =>
{
    float sum = 0;
    for (int i = 0; i < k; i++)
        sum += a[index.X, i] * b[i, index.Y]; // PROBLEM HERE: strided walk down a column of B
    result[index] = sum;
}
```

Matrix B is stored in row-major order (DenseX with row stride = n; n = k = 2048 in the failing case).
When accessing b[i, index.Y] for a fixed column (e.g., column 0):
- i=0: accesses b[0,0] at memory offset 0
- i=1: accesses b[1,0] at memory offset 2048
- i=2: accesses b[2,0] at memory offset 4096
- ... (each access is 2048 elements apart)
- Cache line size is typically 64-128 bytes (16-32 float elements)
- Accessing elements 2048 floats (8 KB) apart means every single access misses the cache
- With 2048 iterations per thread, all 2048 accesses are L1 cache misses
- With 4,194,304 threads (2048x2048), this becomes 8.6 billion cache misses (arithmetic checked below)
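A quick sanity check of that miss count (a sketch; names are illustrative):

```csharp
// Row-major offset of b[i, j] with row stride n:
// consecutive i values land n floats (n * 4 bytes) apart.
static long Offset(long i, long j, long n) => i * n + j;

long threads = 2048L * 2048;                   // one thread per output element: 4,194,304
long missesPerThread = 2048;                   // k loop iterations, each touching a new row
long totalMisses = threads * missesPerThread;  // 8,589,934,592 ~ 8.6 billion
```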
| Size | Stride (elements) | Expected impact | Observed throughput |
|---|---|---|---|
| 256x256 | 256 | OK | 23 GFLOPS |
| 512x512 | 512 | OK | 67 GFLOPS |
| 1024x1024 | 1024 | Degraded | 84 GFLOPS |
| 2048x2048 | 2048 | Catastrophic | 0.13 GFLOPS |

The smaller sizes stride the same way; they most likely stay fast because B itself (0.25-4 MB) still fits in cache, while at 2048x2048 B is 16 MB and every pass misses.
- Change kernel to use B^T instead of B
- Kernel becomes: `sum += a[index.X, i] * b_transposed[index.Y, i]`
- This makes both accesses consecutive (coalesced) in memory; see the sketch after this list
- Expected improvement: 50-100x speedup for large matrices
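A sketch of what the Solution 1 kernel could look like, mirroring the naive kernel's ILGPU-style signature (the method name and comments are illustrative):

```csharp
static void MatMulTransposedKernel(
    Index2D index,                               // (row, col) of the output element
    ArrayView2D<float, Stride2D.DenseX> a,       // m x k
    ArrayView2D<float, Stride2D.DenseX> bT,      // n x k, i.e. B transposed
    ArrayView2D<float, Stride2D.DenseX> result,  // m x n
    int k)
{
    float sum = 0;
    for (int i = 0; i < k; i++)
        sum += a[index.X, i] * bT[index.Y, i];   // both reads walk along a row: unit stride
    result[index] = sum;
}
```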
- Load tiles of A and B into fast shared memory
- Compute using shared memory (1000+ GB/s effective bandwidth)
- Expected improvement: 10-20x speedup (a tiled sketch follows this list)
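A sketch of the Solution 2 kernel, following the tiling pattern from ILGPU's matrix-multiply sample; TILE_SIZE and the bounds handling are illustrative, and this explicitly grouped kernel would be loaded with LoadStreamKernel rather than an auto-grouped loader:

```csharp
const int TILE_SIZE = 16;

static void MatMulTiledKernel(
    ArrayView2D<float, Stride2D.DenseX> a,
    ArrayView2D<float, Stride2D.DenseX> b,
    ArrayView2D<float, Stride2D.DenseX> c)
{
    var global = Grid.GlobalIndex.XY;   // this thread's output element
    var x = Group.IdxX;                 // position inside the tile
    var y = Group.IdxY;

    var aTile = SharedMemory.Allocate2D<float, Stride2D.DenseX>(
        new Index2D(TILE_SIZE, TILE_SIZE), new Stride2D.DenseX(TILE_SIZE));
    var bTile = SharedMemory.Allocate2D<float, Stride2D.DenseX>(
        new Index2D(TILE_SIZE, TILE_SIZE), new Stride2D.DenseX(TILE_SIZE));

    float sum = 0;
    for (int i = 0; i < a.IntExtent.Y; i += TILE_SIZE)
    {
        // Cooperatively stage one tile of A and one tile of B in shared memory.
        aTile[x, y] = global.X < a.IntExtent.X && y + i < a.IntExtent.Y
            ? a[global.X, y + i] : 0f;
        bTile[x, y] = x + i < b.IntExtent.X && global.Y < b.IntExtent.Y
            ? b[x + i, global.Y] : 0f;
        Group.Barrier();

        // Every inner-loop read now hits fast shared memory.
        for (int t = 0; t < TILE_SIZE; t++)
            sum += aTile[x, t] * bTile[t, y];
        Group.Barrier();
    }

    if (global.X < c.IntExtent.X && global.Y < c.IntExtent.Y)
        c[global] = sum;
}
```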
- Add optimized kernel that expects B^T
- Transpose B on GPU before matmul (one-time cost)
- Use transposed B for the actual multiply
- For small matrices, skip transpose (overhead not worth it)
- Integer Overflow Bug (CRITICAL)
  - `int totalOps = m * n * p` overflows for 2048x2048 matrices (8.6B > int.MaxValue)
  - Fix: changed to `long totalOps = (long)m * n * p` (see the snippet after this list)
  - Impact: without this fix, large matrices silently fell back to CPU!
- Transpose-B Optimization
  - Added optimized kernel that expects B^T (transposed) for coalesced memory access
  - Transpose is performed on the GPU before the matmul
  - Threshold set to k >= 2048 (for smaller matrices the transpose overhead outweighs the gain)
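The overflow from the first finding is easy to reproduce in isolation; a minimal sketch (2048^3 is exactly 2^33, so the 32-bit product wraps to zero):

```csharp
int m = 2048, n = 2048, p = 2048;

int bad = m * n * p;              // 32-bit multiply wraps: 2^33 mod 2^32 == 0
long alsoBad = (long)(m * n * p); // still 0 -- the cast happens after the int multiply
long good = (long)m * n * p;      // promote first: 8,589,934,592

// With totalOps == 0, a "large enough for the GPU?" threshold check would fail --
// consistent with the silent CPU fallback described above.
```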
| Size | Before (GFLOPS) | After (GFLOPS) | Improvement |
|---|---|---|---|
| 256x256 | 23 | 22.50 | Same |
| 512x512 | 67 | 65.02 | Same |
| 1024x1024 | 84 | 86.16 | Same |
| 2048x2048 | 0.13 | 51.90 | 400x faster |
The combination of two defects caused the catastrophic slowdown:
- Integer overflow caused the GPU path to be skipped entirely for large matrices
- Even on the GPU, the naive kernel's strided memory access pattern caused:
  - 8.6 billion cache misses for 2048x2048 matrices
  - accesses 2048 floats (8 KB) apart within every thread
  - completely ineffective L1/L2 caching
For k >= 2048, we now:
- Allocate a buffer for B^T on the GPU
- Transpose B using a separate kernel: `B[k,n] -> B^T[n,k]`
- Use the optimized kernel: `sum += a[row, i] * bT[col, i]`
- Both memory accesses are now coalesced (consecutive in memory)

For k < 2048, we keep the naive kernel because the transpose overhead isn't worth it. A sketch of the transpose kernel and the dispatch logic follows.
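Putting it together, a sketch of the transpose kernel plus dispatch, assuming kernel delegates loaded once via LoadAutoGroupedStreamKernel; all names here are illustrative, not the repository's actual API:

```csharp
// One thread per element of the k x n source matrix B.
static void TransposeKernel(
    Index2D index,                           // (row, col) into B
    ArrayView2D<float, Stride2D.DenseX> b,   // k x n
    ArrayView2D<float, Stride2D.DenseX> bT)  // n x k
{
    bT[index.Y, index.X] = b[index.X, index.Y];
}

// Delegates assumed loaded once at startup via accelerator.LoadAutoGroupedStreamKernel<...>(...).
Action<Index2D, ArrayView2D<float, Stride2D.DenseX>,
       ArrayView2D<float, Stride2D.DenseX>> transposeKernel;
Action<Index2D, ArrayView2D<float, Stride2D.DenseX>, ArrayView2D<float, Stride2D.DenseX>,
       ArrayView2D<float, Stride2D.DenseX>, int> matMulTransposedKernel, matMulNaiveKernel;

void MatMulGpu(
    Accelerator acc,
    ArrayView2D<float, Stride2D.DenseX> a,   // m x k
    ArrayView2D<float, Stride2D.DenseX> b,   // k x n
    ArrayView2D<float, Stride2D.DenseX> c,   // m x n
    int m, int n, int k)
{
    if (k >= 2048)
    {
        using var bT = acc.Allocate2DDenseX<float>(new Index2D(n, k));
        transposeKernel(new Index2D(k, n), b, bT.View);              // one-time cost
        matMulTransposedKernel(new Index2D(m, n), a, bT.View, c, k); // coalesced kernel
        acc.Synchronize(); // finish before bT is disposed at the end of this block
    }
    else
    {
        matMulNaiveKernel(new Index2D(m, n), a, b, c, k); // transpose not worth it
        acc.Synchronize();
    }
}
```

The k >= 2048 cutoff mirrors the threshold from the findings above; in practice it would likely be tuned per device.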