Parent
Part of #1243 — DeepSeek NSA kernels for MI350
Description
Optimize wavefront scheduling and occupancy for the irregular memory access patterns in NSA's selection attention kernel on MI350 (CDNA4).
Problem
Selection attention gathers non-contiguous KV blocks based on per-query top-k indices. This creates:
- Irregular global memory access patterns (poor cache utilization)
- Variable work per wavefront (some queries select nearby blocks, others select distant ones)
- Potential wavefront stalls waiting for memory
Tasks
- Gather coalescing analysis
  - Profile actual memory access patterns for representative block_indices distributions
  - Measure achieved global memory bandwidth against the theoretical peak
  - Identify whether block_size=64 (tokens per KV block) aligns well with MI350 cache lines (quoted here as 64B; confirm against the CDNA4 documentation) and memory channels
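As a starting point for the profiling above, the following is a minimal host-side sketch, not the kernel itself. It counts how many contiguous runs a query's selected block ids form (adjacent blocks can be read as one longer stream) and how many cache lines one KV block covers. The helper names, `head_dim=128`, and 2-byte elements are assumptions for illustration; the 64B line size is the figure quoted in this issue and still needs confirming.

```python
from math import ceil

def contiguous_runs(block_indices):
    """Count maximal runs of adjacent block ids after sorting and deduplicating."""
    ids = sorted(set(block_indices))
    if not ids:
        return 0
    runs = 1
    for prev, cur in zip(ids, ids[1:]):
        if cur != prev + 1:  # a gap starts a new, separately fetched run
            runs += 1
    return runs

def cache_lines_per_block(block_size, head_dim, bytes_per_elem, line_bytes):
    """Cache lines covered by one KV block, assuming it is contiguous and line-aligned."""
    block_bytes = block_size * head_dim * bytes_per_elem
    return ceil(block_bytes / line_bytes)

# Hypothetical top-k selections for two queries.
near = [10, 11, 12, 13]   # neighbouring blocks: one long contiguous read
far = [3, 97, 250, 511]   # scattered blocks: four separate reads

print(contiguous_runs(near), contiguous_runs(far))
# head_dim=128 and 2-byte elements are assumed values, not taken from the issue.
print(cache_lines_per_block(64, 128, 2, 64))
```

Running this over real block_indices dumps gives a cheap distribution of runs per query before any on-device profiling is done.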
- Wavefront occupancy tuning
  - Determine the optimal number of wavefronts per CU for selection attention
  - Trade off register usage (more live state = fewer wavefronts) against memory latency hiding (more wavefronts)
  - Working figures for MI350: 64KB of VGPR storage per SIMD and 128 VGPRs per wavefront at full occupancy (confirm both against the CDNA4 ISA guide); find the sweet spot
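The register/occupancy trade-off can be sketched as simple arithmetic. This is a rough model under stated assumptions, not a statement of the hardware limits: the wave size of 64, 4-byte registers, a cap of 8 wavefronts per SIMD, and an allocation granule of 8 VGPRs are all assumed defaults, and `waves_per_simd` is a hypothetical helper. The register file size is passed in, so the figure quoted in this issue can be swapped for the confirmed one.

```python
def waves_per_simd(vgprs_per_wave, vgpr_file_bytes,
                   wave_size=64, bytes_per_reg=4,
                   max_waves=8, granule=8):
    """Wavefronts that fit on one SIMD when VGPRs are the only limiter.

    All keyword defaults are assumptions; check them against the ISA guide.
    """
    # Registers available to each lane across all resident wavefronts.
    regs_per_lane = vgpr_file_bytes // (wave_size * bytes_per_reg)
    # Round the kernel's VGPR count up to the allocation granule.
    allocated = -(-vgprs_per_wave // granule) * granule
    return min(max_waves, regs_per_lane // allocated)

# Taking the 64KB figure at face value.
for vgprs in (64, 96, 128, 256):
    print(vgprs, waves_per_simd(vgprs, 64 * 1024))
```

Under these assumptions 64KB works out to 256 registers per lane, so a 128-VGPR kernel fits two wavefronts per SIMD; if the confirmed register file is larger, the counts scale accordingly.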
- Prefetch strategies
  - Software prefetch: issue global loads for the next block's KV while computing attention on the current block
  - LDS staging: load entire selected KV blocks into LDS before attention computation
  - Double-buffering: overlap loads and computation across blocks
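To estimate what double-buffering could buy before writing the kernel, here is a small timing model. It assumes a single load pipeline, a single compute pipeline, two buffers, and a fixed cost per block; the function names and the example costs are hypothetical, and real gather latency will vary per block.

```python
def serial_time(n_blocks, t_load, t_compute):
    """Each block is loaded, then computed, with no overlap."""
    return n_blocks * (t_load + t_compute)

def double_buffered_time(n_blocks, t_load, t_compute):
    """Two buffers: the load of block i overlaps the compute of block i-1."""
    load_done = [0.0] * n_blocks
    compute_done = [0.0] * n_blocks
    for i in range(n_blocks):
        prev_load = load_done[i - 1] if i >= 1 else 0.0
        # Block i reuses the buffer of block i-2, so that compute must be finished.
        buffer_free = compute_done[i - 2] if i >= 2 else 0.0
        load_done[i] = max(prev_load, buffer_free) + t_load
        prev_compute = compute_done[i - 1] if i >= 1 else 0.0
        compute_done[i] = max(load_done[i], prev_compute) + t_compute
    return compute_done[-1] if n_blocks else 0.0

# Hypothetical costs in arbitrary time units for 16 selected blocks.
print(serial_time(16, 3.0, 2.0))
print(double_buffered_time(16, 3.0, 2.0))
```

In this model the pipelined total approaches `n_blocks * max(t_load, t_compute)`, so the benefit is largest when load and compute costs are balanced and shrinks when the kernel is strongly memory-bound.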
- Workgroup sizing
  - Current grid: (B, M, G), one program per query position per GQA group
  - Consider coarsening: multiple query positions per workgroup to amortize block index loads
  - Consider splitting: if T is large, split blocks across workgroups with a reduction step
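The issue does not say how the reduction step for the split variant would work. One common option, assumed here, is the running-max / sum-of-exponentials merge used by split-KV attention kernels: each workgroup returns an unnormalized partial result, and the partials are combined afterwards. The sketch below is a pure-Python reference for one query, with hypothetical helper names, intended only for checking a kernel's reduction against.

```python
import math

def partial_attention(scores, values):
    """Unnormalized softmax-weighted sum over one chunk: (max, sum_exp, acc)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), acc

def merge(a, b):
    """Combine two partial results by rescaling both to a shared maximum."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(acc_a, acc_b)]

def finalize(state):
    """Normalize the accumulated output by the total softmax denominator."""
    _, l, acc = state
    return [x / l for x in acc]

# Hypothetical scores and 2-dimensional values for one query over 6 keys.
scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0], [4.0, 1.0], [0.5, 0.5]]

full = finalize(partial_attention(scores, values))
split = finalize(merge(partial_attention(scores[:3], values[:3]),
                       partial_attention(scores[3:], values[3:])))
print(all(math.isclose(x, y) for x, y in zip(full, split)))
```

Because the merge is associative, the partials can be reduced in any order, which leaves the choice between a second kernel launch and an atomic-based reduction open.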
Depends on