Parent
Part of #1243 — DeepSeek NSA kernels for MI350
Description
Build performance benchmarks for NSA kernels on MI350, measuring against dense FlashAttention and the paper's reported speedups.
Benchmark suite
- Per-kernel microbenchmarks
  - Mean pooling: vary N, block_size
  - Compressed attention: vary N/block_size
  - Top-k selection: vary N/block_size, block_count
  - Selection attention forward: vary N, block_count, block_size
  - Selection attention backward: vary N, block_count, block_size
  - Sliding window: vary N, window_size
  - Gated combination: vary M, H, D
- End-to-end pipeline benchmarks
  - Full NSA forward vs dense FA forward
  - Full NSA forward+backward vs dense FA forward+backward
  - Prefill mode: M=N (long prompt)
  - Decode mode: M=1, varying N (context length)
- Scaling benchmarks
  - Context length scaling: N = 1k, 4k, 16k, 64k, 128k, 256k
  - Batch size scaling: B = 1, 2, 4, 8, 16
  - Head count scaling: H = 32, 64, 128
  - block_count sensitivity: T = 4, 8, 16, 32, 64
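One way to enumerate the scaling sweeps listed above is to vary a single axis at a time around a fixed baseline. The sketch below is illustrative only: `BenchConfig`, `BASELINE`, and `scaling_configs` are hypothetical names, the baseline values are placeholders this issue does not specify, and "1k" is assumed to mean 1024 tokens.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchConfig:
    """Hypothetical container for one benchmark point."""
    seq_len: int      # N, context length
    batch: int        # B
    heads: int        # H
    block_count: int  # T, number of selected blocks


# Assumed baseline; the issue does not fix one.
BASELINE = BenchConfig(seq_len=65536, batch=1, heads=64, block_count=16)

# Sweep values copied from the list above (1k assumed to be 1024).
SWEEPS = {
    "seq_len": [1024, 4096, 16384, 65536, 131072, 262144],
    "batch": [1, 2, 4, 8, 16],
    "heads": [32, 64, 128],
    "block_count": [4, 8, 16, 32, 64],
}


def scaling_configs(baseline=BASELINE, sweeps=SWEEPS):
    """Yield (axis, config) pairs, varying one axis at a time."""
    for axis, values in sweeps.items():
        for value in values:
            yield axis, replace(baseline, **{axis: value})
```

Varying one axis at a time keeps the run count at the sum of the sweep lengths rather than their product, which matters at 256k context.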
Metrics
- Wall-clock time (ms)
- Memory bandwidth utilization (% of MI350 peak)
- Compute utilization (TFLOPS achieved vs peak)
- Peak memory usage (MB)
- Speedup vs dense FlashAttention
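The derived metrics above follow from a measured time plus byte and FLOP counts for the kernel. A minimal sketch, assuming the caller supplies those counts and the MI350 peak figures (no peak values are hard-coded here because the issue does not state them); `derive_metrics` is a hypothetical helper name.

```python
def derive_metrics(elapsed_ms, bytes_moved, flops,
                   peak_bw_gbs, peak_tflops, dense_ms):
    """Turn raw measurements into the reported metrics.

    elapsed_ms   wall-clock time of the NSA kernel or pipeline
    bytes_moved  bytes read plus written (caller's estimate)
    flops        floating-point operations (caller's estimate)
    peak_bw_gbs  device peak memory bandwidth in GB/s (caller-supplied)
    peak_tflops  device peak compute in TFLOPS (caller-supplied)
    dense_ms     wall-clock time of the dense FlashAttention baseline
    """
    seconds = elapsed_ms * 1e-3
    achieved_gbs = bytes_moved / seconds / 1e9
    achieved_tflops = flops / seconds / 1e12
    return {
        "time_ms": elapsed_ms,
        "bandwidth_util_pct": 100.0 * achieved_gbs / peak_bw_gbs,
        "achieved_tflops": achieved_tflops,
        "compute_util_pct": 100.0 * achieved_tflops / peak_tflops,
        "speedup_vs_dense": dense_ms / elapsed_ms,
    }
```

Peak memory usage is left out of this sketch because it has to come from the allocator or profiler rather than from arithmetic on the timings.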
Target speedups (from paper, 64k context)
| Mode | Target speedup vs dense |
| --- | --- |
| Decode (M=1) | 11.6x |
| Forward (prefill) | 9.0x |
| Backward | 6.0x |
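A small helper could compare measured speedups against the targets in the table. This is a sketch under assumptions: `PAPER_TARGETS` and `compare_to_targets` are hypothetical names, and the timings are expected as per-mode dictionaries in milliseconds measured at 64k context.

```python
# Targets copied from the table above (64k context).
PAPER_TARGETS = {"decode": 11.6, "forward": 9.0, "backward": 6.0}


def compare_to_targets(nsa_ms, dense_ms, targets=PAPER_TARGETS):
    """Return the measured speedup and its ratio to the target per mode."""
    report = {}
    for mode, target in targets.items():
        speedup = dense_ms[mode] / nsa_ms[mode]
        report[mode] = {
            "speedup": speedup,
            "target": target,
            "fraction_of_target": speedup / target,
            "meets_target": speedup >= target,
        }
    return report
```

Reporting the fraction of target alongside the pass/fail flag makes regressions visible before a mode drops below its threshold.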
Infrastructure
- Use wave's existing benchmark infrastructure
- Output results as JSON for CI tracking
- Generate roofline plots comparing achieved vs theoretical performance
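For the JSON output and roofline comparison, each result could carry both the achieved throughput and the roofline bound for its arithmetic intensity. The sketch below uses only the standard library and does not touch wave's benchmark API; the record schema and the function names are assumptions, and the peak figures are supplied by the caller.

```python
import json


def roofline_bound_tflops(flops, bytes_moved, peak_tflops, peak_bw_gbs):
    """Attainable TFLOPS under the roofline model: min(peak, AI * bandwidth)."""
    intensity = flops / bytes_moved                  # FLOP per byte
    bandwidth_bound = intensity * peak_bw_gbs / 1e3  # GFLOP/s to TFLOP/s
    return min(peak_tflops, bandwidth_bound)


def result_record(name, params, elapsed_ms, flops, bytes_moved,
                  peak_tflops, peak_bw_gbs):
    """Build one JSON-serializable result entry (hypothetical schema)."""
    achieved = flops / (elapsed_ms * 1e-3) / 1e12
    return {
        "benchmark": name,
        "params": params,
        "time_ms": elapsed_ms,
        "achieved_tflops": achieved,
        "roofline_tflops": roofline_bound_tflops(
            flops, bytes_moved, peak_tflops, peak_bw_gbs),
    }


def dump_results(records):
    """Serialize records with stable key order so CI diffs stay small."""
    return json.dumps({"results": records}, indent=2, sort_keys=True)
```

Storing the roofline bound next to the achieved number lets the plotting step read the JSON directly without recomputing byte and FLOP counts.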
Depends on