> This is the sm_120 / sm_121 warp-specialized context FMHA that carries the
> per-warp **skip-softmax** optimization (hence the name). Only half of the
> Hopper warp-specialization recipe ports to consumer Blackwell: TMA-driven
> async loads survive, but async MMA does not (sm_120 / sm_121 have no
> `wgmma.async` equivalent), so the compute warps stay on `mma.sync` while a
> dedicated producer warp drives the loads with TMA.

This directory implements a warp-specialized context FMHA for the sm_120
family (sm_120 / sm_121). It targets BF16, causal mask, `head_dim ==
head_dim_v` in `{128, 256}`, and the PACKED_QKV layout. The kernel carries the
per-warp skip-softmax optimization into the warp-specialized design.

## Files

| File | Role |
|------|------|
| `kernel_traits.h` | `Kernel_traits_skip_softmax_sm120`: wraps `fmha::Kernel_traits_v2` for the LDGSTS-friendly `Smem_tile_*` types, then layers on the producer/consumer warp roles, the granular smem buffers, the circular-buffer barriers, and the V re-tile (see below). |
| `dma_sync_mma.h` | Producer (`DMA::run`). Issues `cp.async.bulk.tensor.3d.shared::cta.global.tile` for Q / K / V into the granular buffers. `DMA::Host::init_params` builds the three `CUtensorMap` descriptors with the driver-API `cuTensorMapEncodeTiled`. |
| `compute_sync_mma.h` | Consumer (`Compute::run`). Runs the kv-loop body: BMM1 (`fmha::gemm`), softmax, causal mask, per-warp skip-softmax vote, BMM2, and epilogue, reading the granular `Smem_tile_q/k/v` per ring slot. |

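The producer/consumer split in the table relies on a ring of shared-memory slots guarded by barriers. As a rough mental model only, the handoff can be sketched on the host in plain C++. Everything here (`RingModel`, `try_produce`, `try_consume`, `ring_demo`, the two-stage depth) is hypothetical and not taken from the kernel, which uses hardware barriers and TMA rather than flags and integers:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of a circular buffer shared by one producer
// and one consumer. A "tile" is just an int standing in for a Q/K/V tile.
template <std::size_t Stages>
class RingModel {
    struct Slot {
        int tile = 0;
        bool full = false;  // stands in for the slot's "data ready" barrier
    };
    std::array<Slot, Stages> slots_{};
    std::size_t head_ = 0;  // next slot the producer writes
    std::size_t tail_ = 0;  // next slot the consumer reads

public:
    // Producer side: succeeds only if the slot at head is free.
    bool try_produce(int tile) {
        Slot& s = slots_[head_ % Stages];
        if (s.full) return false;  // consumer has not released this slot yet
        s.tile = tile;
        s.full = true;
        ++head_;
        return true;
    }

    // Consumer side: succeeds only if the slot at tail has been filled.
    bool try_consume(int& tile_out) {
        Slot& s = slots_[tail_ % Stages];
        if (!s.full) return false;  // producer has not filled this slot yet
        tile_out = s.tile;
        s.full = false;  // release the slot back to the producer
        ++tail_;
        return true;
    }
};

// Pushes four tiles through a two-slot ring and checks FIFO order.
inline int ring_demo() {
    RingModel<2> ring;
    int consumed = 0, next = 0, tile = -1;
    while (consumed < 4) {
        while (next < 4 && ring.try_produce(next)) ++next;  // fill free slots
        bool ok = ring.try_consume(tile);
        assert(ok && tile == consumed);  // tiles come out in issue order
        ++consumed;
    }
    return consumed;
}
```

In this model the producer stalls as soon as every slot is full and resumes only after the consumer releases one, which is the back-pressure the barriers provide in a ring-buffered pipeline.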
The translation unit and the in-engine dispatch bridges live in