Add ldmatrix + XOR swizzle for A-fragment loading in production kernel

Replace 8 element-by-element shared memory reads per A fragment with a
single ldmatrix.sync.aligned.m8n8.x4.shared.b16 instruction. Add XOR
bank-conflict swizzle: col_group ^ (row % 8) at 8-half granularity.
Without the swizzle, the 8 row addresses of each 8x8 ldmatrix tile map to
the same 4 banks (8-way conflict): TILE_K=64 halfs gives a 128-byte row
stride, which is exactly the bank repeat distance (32 banks x 4 bytes).
The XOR swizzle spreads the 8 rows across 8 distinct 16-byte bank groups
(zero conflicts).
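
The bank arithmetic above can be checked on the host. The following is a minimal sketch, not the kernel code: it assumes a row-major fp16 tile with no padding, 32 banks of 4 bytes each, and one 16-byte (8-half) chunk per ldmatrix row address. The helper names (chunk_byte_offset, first_bank, max_conflict_degree) are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed layout constants (illustrative, not taken from the kernel source).
constexpr int TILE_K      = 64;  // halfs per shared-memory row
constexpr int HALF_BYTES  = 2;   // sizeof(fp16)
constexpr int CHUNK_HALFS = 8;   // one ldmatrix row = 8 halfs = 16 bytes
constexpr int NUM_BANKS   = 32;  // shared-memory banks
constexpr int BANK_BYTES  = 4;   // bytes per bank word

// Byte offset of the 8-half chunk (row, col_group) inside the tile.
// With swizzle enabled, the physical column group is col_group ^ (row % 8).
int chunk_byte_offset(int row, int col_group, bool swizzle) {
    assert(row >= 0);
    assert(col_group >= 0 && col_group < TILE_K / CHUNK_HALFS);
    const int phys_group = swizzle ? (col_group ^ (row % 8)) : col_group;
    return (row * TILE_K + phys_group * CHUNK_HALFS) * HALF_BYTES;
}

// First of the 4 consecutive banks covered by a 16-byte chunk.
int first_bank(int byte_offset) {
    return (byte_offset / BANK_BYTES) % NUM_BANKS;
}

// Worst-case number of rows (out of the 8 rows of one 8x8 tile) whose
// chunks start in the same bank, i.e. the N in "N-way conflict".
int max_conflict_degree(int col_group, bool swizzle) {
    std::array<int, NUM_BANKS> hits{};
    for (int row = 0; row < 8; ++row) {
        ++hits[first_bank(chunk_byte_offset(row, col_group, swizzle))];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, max_conflict_degree(g, false) is 8 for every column group g, because each row adds exactly 128 bytes, and max_conflict_degree(g, true) is 1, because XOR with row % 8 is a bijection on the 8 column groups.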
All 139 tests still pass. The fp16 path produces identical output to
the element-by-element version (verified by test_prod_fp16_matches_splitk).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>