NCU showed 80 regs/thread from N_TILES_PER_WARP=4 accumulators.
Reduce to 2 N-tiles per warp (8 acc floats instead of 16) and
increase to 4×4=16 warps per block (m64×n64 tile, 512 threads).
Target 2 blocks/SM for better occupancy.
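
The register arithmetic behind the 2 blocks/SM target can be sketched as follows. This is a rough back-of-the-envelope check, not the kernel's code. It assumes a 65536-entry 32-bit register file per SM, an n8-wide MMA tile with 4 accumulator floats per thread per N-tile (inferred from "8 acc floats" for 2 tiles), and it ignores register allocation granularity and shared-memory limits. All constant and function names are hypothetical.

```cpp
#include <cstdint>

// Assumed hardware and tiling constants (hypothetical; check the target GPU).
constexpr int kRegsPerSM       = 65536; // 32-bit registers per SM (assumption)
constexpr int kWarpSize        = 32;
constexpr int kMmaN            = 8;     // N width of one MMA tile (assumption)
constexpr int kAccFloatsPerMma = 4;     // accumulator floats per thread per N-tile

// Threads per block for a warps_m x warps_n grid of warps.
constexpr int threads_per_block(int warps_m, int warps_n) {
    return warps_m * warps_n * kWarpSize;
}

// Accumulator floats held by each thread for a given N_TILES_PER_WARP.
constexpr int acc_floats(int n_tiles_per_warp) {
    return n_tiles_per_warp * kAccFloatsPerMma;
}

// Block tile width in N covered by warps_n warps.
constexpr int tile_n(int warps_n, int n_tiles_per_warp) {
    return warps_n * n_tiles_per_warp * kMmaN;
}

// Largest regs/thread that still lets `blocks_per_sm` blocks co-reside,
// considering the register file only.
constexpr int max_regs_per_thread(int threads, int blocks_per_sm) {
    return kRegsPerSM / (threads * blocks_per_sm);
}

// Resident blocks per SM as limited by registers only.
constexpr int blocks_by_regs(int threads, int regs_per_thread) {
    return kRegsPerSM / (threads * regs_per_thread);
}

// New configuration: 4x4 warps, 2 N-tiles per warp.
static_assert(threads_per_block(4, 4) == 512, "512 threads per block");
static_assert(acc_floats(2) == 8 && acc_floats(4) == 16, "8 vs 16 acc floats");
static_assert(tile_n(4, 2) == 64, "n64 block tile");

// 2 blocks/SM at 512 threads needs at most 64 regs/thread.
static_assert(max_regs_per_thread(512, 2) == 64, "register budget");

// At the previously measured 80 regs/thread, only 1 such block would fit.
static_assert(blocks_by_regs(512, 80) == 1, "80 regs/thread is too many");
static_assert(blocks_by_regs(512, 64) == 2, "64 regs/thread reaches the target");
```

Under these assumptions the 512-thread block reaches 2 blocks/SM only if the compiler lands at or below 64 regs/thread. Whether halving the accumulators achieves that should be confirmed with another NCU run.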
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>