[wave] NSA: CDNA4 wavefront scheduling for sparse gather patterns #1256

@harsh-nod

Parent

Part of #1243 — DeepSeek NSA kernels for MI350

Description

Optimize wavefront scheduling and occupancy for the irregular memory access patterns in NSA's selection attention kernel on MI350 (CDNA4).

Problem

Selection attention gathers non-contiguous KV blocks based on per-query top-k indices. This creates:

  • Irregular global memory access patterns (poor cache utilization)
  • Variable work per wavefront (some queries select nearby blocks, others select distant ones)
  • Potential wavefront stalls waiting for memory
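To make the access pattern concrete, here is a minimal NumPy reference for the gather described above. It is a sketch rather than the kernel itself: the `[T, D]` single-head layout, the function names, and the `contiguous_runs` locality metric are assumptions for illustration, and causal masking is omitted.

```python
import numpy as np

def gather_selected_kv(k, v, block_indices, block_size):
    """Gather the KV rows of the selected blocks for one query.

    k, v:          [T, D] keys/values for one head (assumed layout)
    block_indices: [S] block ids chosen by top-k for this query
    """
    # Start row of each selected block plus the row offsets inside a block.
    rows = (block_indices[:, None] * block_size
            + np.arange(block_size)[None, :]).reshape(-1)
    return k[rows], v[rows]

def selection_attention(q, k, v, block_indices, block_size):
    """Softmax attention of one query over its selected blocks only."""
    ks, vs = gather_selected_kv(k, v, block_indices, block_size)
    scores = ks @ q / np.sqrt(q.shape[-1])
    p = np.exp(scores - scores.max())   # subtract the max for stability
    return (p / p.sum()) @ vs

def contiguous_runs(block_indices):
    """Count maximal runs of adjacent block ids.

    One run means a single contiguous address range; S runs means every
    selected block is a separate, scattered read.
    """
    idx = np.unique(block_indices)      # sorted, duplicates removed
    return 1 + int(np.sum(np.diff(idx) != 1))

# Two queries with the same amount of work but very different locality.
near = np.array([4, 5, 6, 7])
far = np.array([0, 9, 17, 30])
print(contiguous_runs(near), contiguous_runs(far))
```

Running `contiguous_runs` over a real `block_indices` tensor gives a quick histogram of how scattered each query's reads are before any hardware profiling.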

Tasks

  1. Gather coalescing analysis

    • Profile actual memory access patterns for representative block_indices distributions
    • Measure global memory bandwidth utilization vs theoretical peak
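    • Identify whether block_size=64 (tokens per KV block) keeps each gathered block aligned to MI350 cache lines and memory channels. A block spans 64 × head_dim × dtype-size bytes, so alignment depends on the row stride. Also confirm the cache-line size: CDNA3 uses 128B lines rather than 64B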
  2. Wavefront occupancy tuning

    • Determine optimal number of wavefronts per CU for selection attention
    • Trade off register usage (more live state = fewer wavefronts) vs memory latency hiding (more wavefronts)
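    • The VGPR budget sets the occupancy ceiling. On CDNA3 each SIMD has a 128KB unified VGPR file (512 registers per lane), so a kernel using 128 VGPRs runs at most 4 wavefronts per SIMD and needs 64 or fewer to reach the maximum of 8. Confirm the CDNA4 figures, then find the sweet spot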
  3. Prefetch strategies

    • Software prefetch: issue global loads for the next block's KV while computing attention on the current block
    • LDS staging: load entire selected KV blocks into LDS before attention computation
    • Double-buffering: overlap loads and computation across blocks
  4. Workgroup sizing

    • Current grid: (B, M, G) — one program per query position per GQA group
    • Consider coarsening: multiple query positions per workgroup to amortize block index loads
    • Consider splitting: if T is large, split blocks across workgroups with a reduction step
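The splitting option relies on a reduction that can merge partial softmax results exactly. A minimal NumPy sketch of that reduction follows, assuming each workgroup handles a slice of one query's already-gathered selected blocks and emits an unnormalized output, a running max, and a running sum. The function names and the `[S, block_size, D]` layout are hypothetical, not taken from the wave codebase.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one slice: unnormalized output, max score, exp-sum."""
    s = k @ q / np.sqrt(q.shape[-1])
    m = s.max()
    p = np.exp(s - m)
    return p @ v, m, p.sum()

def merge_partials(partials):
    """Reduction step: rescale each partial to a common max, then normalize."""
    m_all = max(m for _, m, _ in partials)
    o = sum(o_i * np.exp(m_i - m_all) for o_i, m_i, _ in partials)
    l = sum(l_i * np.exp(m_i - m_all) for _, m_i, l_i in partials)
    return o / l

def split_attention(q, k_blocks, v_blocks, num_splits):
    """k_blocks, v_blocks: [S, block_size, D] selected blocks for one query.

    Each split stands in for one workgroup; empty splits are skipped.
    """
    d = k_blocks.shape[-1]
    partials = []
    for kb, vb in zip(np.array_split(k_blocks, num_splits),
                      np.array_split(v_blocks, num_splits)):
        if len(kb) == 0:
            continue
        partials.append(partial_attention(q, kb.reshape(-1, d), vb.reshape(-1, d)))
    return merge_partials(partials)

# Check the split result against a single pass over all selected blocks.
rng = np.random.default_rng(1)
S, BS, D = 16, 64, 32
kb = rng.standard_normal((S, BS, D))
vb = rng.standard_normal((S, BS, D))
q = rng.standard_normal(D)
o, _, l = partial_attention(q, kb.reshape(-1, D), vb.reshape(-1, D))
print(np.allclose(split_attention(q, kb, vb, 4), o / l))
```

The merge costs one extra pass over `num_splits` small triples per query, so it only pays off when the per-query block count is large enough that a single workgroup would otherwise serialize many gathers.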

Depends on

Labels: enhancement, nsa
