fix: Correct CuTE thread decomposition in NVFP4 GEMM kernel
In CuTE layouts, Shape<_4,_8> means the first mode is fastest:
T = t0 + t1*4, so t0 = T%4, t1 = T/4. The kernel had the
opposite decomposition (t0 = T/8, t1 = T%8), which treats the
second mode as fastest and placed data in the wrong register
positions for the MMA instruction.
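The difference between the two decompositions can be sketched in plain host-side C++. This is an illustration only, not the kernel's code: the `Coord` struct and both function names are hypothetical, and the only facts taken from this commit are the formulas themselves.

```cpp
#include <cassert>

// Hypothetical helper type: the (t0, t1) coordinate inside Shape<_4,_8>.
struct Coord {
    int t0;  // first mode, extent 4
    int t1;  // second mode, extent 8
};

// First mode fastest: T = t0 + t1*4.
constexpr Coord first_mode_fastest(int T) { return Coord{T % 4, T / 4}; }

// Decomposition the kernel used before the fix: T = t0*8 + t1.
constexpr Coord second_mode_fastest(int T) { return Coord{T / 8, T % 8}; }

// Lane 5 maps to (1, 1) with the corrected formula but to (0, 5) with the old one.
static_assert(first_mode_fastest(5).t0 == 1 && first_mode_fastest(5).t1 == 1,
              "corrected mapping");
static_assert(second_mode_fastest(5).t0 == 0 && second_mode_fastest(5).t1 == 5,
              "previous mapping");
```

Both functions are bijections from 0..31 onto the 4x8 coordinate space, so the old code did not go out of bounds. It assigned each thread a different coordinate from the one the layout defines.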
Fixed all four layout mappings:
- ALayout: t0=lane%4, t1=lane/4 (was lane/8, lane%8)
- BLayout: same correction
- SFALayout: sf_idx=(lane%2)*8+(lane/4) (was (lane/16)*8+(lane%8))
- SFBLayout: sf_idx=lane/4 (was lane%8)
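For reference, the scale-factor index formulas listed above can be written as small host-side functions and compared lane by lane. This is a sketch under assumptions: the function names are hypothetical, `lane` is assumed to be the thread's index within a 32-thread warp, and what `sf_idx` indexes into is not stated in this commit.

```cpp
#include <cassert>

// Corrected formulas, as listed in this commit.
constexpr int sfa_index(int lane) { return (lane % 2) * 8 + (lane / 4); }
constexpr int sfb_index(int lane) { return lane / 4; }

// Formulas the kernel used before the fix, kept here only for comparison.
constexpr int sfa_index_old(int lane) { return (lane / 16) * 8 + (lane % 8); }
constexpr int sfb_index_old(int lane) { return lane % 8; }

// Lane 5: the corrected SFA index is 9 (was 5); the corrected SFB index is 1 (was 5).
static_assert(sfa_index(5) == 9 && sfa_index_old(5) == 5, "SFA mapping");
static_assert(sfb_index(5) == 1 && sfb_index_old(5) == 5, "SFB mapping");
```

With the corrected SFB formula, each group of four consecutive lanes shares one index (lanes 0..3 give 0, lanes 4..7 give 1, and so on), which matches the t1 = lane/4 coordinate used for ALayout and BLayout.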
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>