Small Tile M BlockScaled GEMM + Grouped GEMM on SM12x #3196
besquared wants to merge 4 commits into NVIDIA:main
Conversation
Mirror the small-N scale-tensor padding pattern from PR NVIDIA#3176 for SFA so M=64 cooperative blockscaled kernels use a padded TMA box, broadcasted producer coordinates, and sliced consumer layouts. Add dense, active GemmUniversalAdapter, and grouped coverage for SM120 NVFP4 small-tile shapes.
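A rough host-side sketch of that broadcast-then-slice idea follows; it uses plain arrays and illustrative sizes (`kTileM`, `kPaddedM` are assumptions, not the PR's values) rather than the PR's cute layouts or TMA descriptors.

```cpp
#include <array>
#include <cstdio>

// Conceptual sketch only: plain arrays stand in for the SFA TMA box and smem
// staging. kTileM / kPaddedM are illustrative assumptions, not the PR's code.
constexpr int kTileM   = 64;    // logical SFA rows for the M=64 tile
constexpr int kPaddedM = 128;   // padded box extent the copy is issued against

int main() {
  // Global-memory scale factors for the logical tile.
  std::array<float, kTileM> sfa_gmem{};
  for (int m = 0; m < kTileM; ++m) sfa_gmem[m] = 1.0f + static_cast<float>(m);

  // Producer side: fill the padded box, broadcasting valid row coordinates into
  // the padding so every element of the box maps to in-bounds data.
  std::array<float, kPaddedM> sfa_smem{};
  for (int m = 0; m < kPaddedM; ++m) sfa_smem[m] = sfa_gmem[m % kTileM];

  // Consumer side: each warp slices only the rows it owns out of the padded box.
  for (int m = 16; m < 32; ++m) {
    std::printf("row %2d -> scale %.0f\n", m, sfa_smem[m]);
  }
  return 0;
}
```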
```diff
   //
-  auto [gA_mkl, gB_nkl, gSFA_mkl, gSFB_nkl] = load_inputs;
+  auto gA_mkl = get<0>(load_inputs);
```
Is there a reason that this needed to be changed to get?
Fixed, it was just a stylistic regression. I had gpt5.5 sweep for other stylistic and idiomatic regressions and things looked ok.
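For reference, the two forms are equivalent; here is a minimal `std::tuple` stand-in (not the kernel's cute tensors) showing the style difference that was reverted.

```cpp
#include <tuple>

int main() {
  auto load_inputs = std::make_tuple(1, 2, 3, 4);

  // Idiomatic form kept after the revert: unpack everything at once.
  auto [gA_mkl, gB_nkl, gSFA_mkl, gSFB_nkl] = load_inputs;

  // Element-wise form the review flagged: same value, different style.
  auto gA_mkl_alt = std::get<0>(load_inputs);

  return (gA_mkl == gA_mkl_alt) ? 0 : 1;  // exits 0: the forms agree
}
```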
```cpp
//
// Compute on k_tile
//
if constexpr (SingleCtaKBlock) {
```
For clarification, is this just intended as a constexpr branch for `TileShapeK == MMA_K`?
I updated this so that it conditions on the implication of the tile shape (K_BLOCK_MAX=1) rather than on the tile shape itself, which hopefully clarifies the semantics.
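A compilable toy of what that condition expresses, with hypothetical `TileShapeK`/`MmaK` parameters standing in for the mainloop's real types:

```cpp
#include <cstdio>

// Toy model: K_BLOCK_MAX is the number of MMA k-blocks a CTA k-tile decomposes
// into. Branching on K_BLOCK_MAX == 1 states the intent (the whole tile is one
// k-block) instead of hard-coding a particular TileShapeK value.
template <int TileShapeK, int MmaK>
inline constexpr int K_BLOCK_MAX = TileShapeK / MmaK;

template <int TileShapeK, int MmaK>
void compute_on_k_tile() {
  if constexpr (K_BLOCK_MAX<TileShapeK, MmaK> == 1) {
    std::printf("TileShapeK=%d: single k-block, no inner k-block loop\n", TileShapeK);
  }
  else {
    std::printf("TileShapeK=%d: %d k-blocks per tile\n",
                TileShapeK, K_BLOCK_MAX<TileShapeK, MmaK>);
  }
}

int main() {
  compute_on_k_tile<64, 64>();   // small-K tile from this PR: one k-block
  compute_on_k_tile<128, 64>();  // larger tile: two k-blocks
  return 0;
}
```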
Would it also be possible to add the tile sizes to the generator?
I mirrored the #3176 generator pattern and added the new small-M shapes to the SM120 blockscaled generator lists. I also added the analogous grouped pingpong smoke test for M=64: `SM120_Device_Gemm_e2m1t_e2m1n_e2m1t_tensorop_f32_epilogue_VS16_group_pingpong.row_sf_64x128x128`. That gives representative cooperative small-tile coverage plus a grouped pingpong smoke test for the generated pingpong schedule, exactly as in #3176.
Adds BlockScaled GEMM and Grouped GEMM support for small tile M = 64 and K = 64 on RTX / DGX Spark GPUs targeting SM120 and SM121. Completes the symmetric counterpart to #3176 (small N).
Motivation: fused attention. NVFP4 attention kernels at head_dim=256 benefit from M=64 because it enables 4-thread-per-row softmax distribution with 8 mma warps (vs 2 threads/row at M=128), removing warp imbalance during online softmax production. K=64 reduces the QK gemm's per-stage scale tensor footprint, freeing smem budget for the attention scaffold's score and output staging.
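The thread-per-row arithmetic behind that claim, as a quick sanity check (assuming all 8 mma warps participate in the row-wise softmax):

```cpp
#include <cstdio>

// Back-of-envelope check: 8 warps x 32 threads = 256 threads over the score
// tile's M rows. Warp and thread counts are the PR description's numbers.
int main() {
  constexpr int warps = 8;
  constexpr int threads = warps * 32;                           // 256
  std::printf("M=64:  %d threads per row\n", threads / 64);     // 4
  std::printf("M=128: %d threads per row\n", threads / 128);    // 2
  return 0;
}
```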
Changes:
- Mirror the small-N scale-tensor padding pattern from #3176 for SFA, so M=64 cooperative blockscaled kernels use a padded SFA TMA box with broadcast producer coordinates, paired with consumer-side slice partitioning.
- Add dense, GemmUniversalAdapter, and grouped test coverage for the SM120 NVFP4 small-tile shapes.
Verified on RTX 6000 Pro Blackwell (sm_120).
Note on scope: M = 32 is not included. The mma.sync NVFP4 atom is m16n8k64; the m = 16 dimension forces M = 64 as the minimum tile that keeps all 4 cooperative mma warps engaged with at least one atom each. M = 32 would leave 2 of 4 warps idle and is structurally degenerate. #3176's small-N support could go to N = 32 because n = 8 in the same atom permits full warp utilization at the smaller dim. The asymmetry between m and n in this PR's small-tile coverage reflects the asymmetry in the underlying mma atom, not a scope cut.
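The atom-count arithmetic behind that floor, assuming the four cooperative mma warps tile along the dimension being shrunk:

```cpp
#include <cstdio>

// m16n8k64 atom: shrinking M below 64 leaves fewer m-atoms than cooperative
// warps, while N=32 still yields one n-atom per warp. Warp count is the PR's.
int main() {
  constexpr int atom_m = 16, atom_n = 8, coop_warps = 4;
  std::printf("M=64 -> %d m-atoms (>= %d warps: all busy)\n", 64 / atom_m, coop_warps);
  std::printf("M=32 -> %d m-atoms (2 of %d warps idle)\n",    32 / atom_m, coop_warps);
  std::printf("N=32 -> %d n-atoms (small-N stays viable)\n",  32 / atom_n);
  return 0;
}
```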