Attention Perf: Transpose blocked K right before QK instead of pre-transposing before the kernel #2374

Merged: AmesingFlank merged 1 commit into main from AmesingFlank/stack/50 on May 11, 2026
@AmesingFlank AmesingFlank commented May 9, 2026

Optimization found by Claude by comparing the current Helion kernel with this reference impl.

Previously, the Helion-generated kernel reshaped and transposed K before the kernel: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(0, 2, 1), producing shape [B*H, D, S]. The matmul (jax.lax.dot_general) then used a standard batched contraction, dimension_numbers=(((2,), (1,)), ((0,), (0,))), which contracts dimension 2 of Q with dimension 1 of the transposed K and batches over dimension 0.
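For reference, the pre-transposed formulation can be sketched in plain JAX, outside any kernel. This is only an illustration of the shapes and the contraction, not the Helion-generated code: the sizes, the random inputs, and the names q_in, q_view and scores_pre are assumptions, and n_dim is taken to be the sequence length S.

```python
import jax
import jax.numpy as jnp

# Hypothetical small sizes; the PR does not state the benchmark shapes.
batch, heads, n_dim, head_dim = 2, 3, 8, 4  # n_dim assumed to be the sequence length S

key_q, key_k = jax.random.split(jax.random.PRNGKey(0))
q_in = jax.random.normal(key_q, (batch, heads, n_dim, head_dim))
k_in = jax.random.normal(key_k, (batch, heads, n_dim, head_dim))

# Q stays as [B*H, S, D].
q_view = q_in.reshape((-1, n_dim, head_dim))

# K is reshaped and transposed up front, giving [B*H, D, S].
k_view = k_in.reshape((-1, n_dim, head_dim)).transpose(0, 2, 1)

# Standard batched matmul: contract q dim 2 (D) with k dim 1 (D), batch over dim 0.
scores_pre = jax.lax.dot_general(
    q_view,
    k_view,
    dimension_numbers=(((2,), (1,)), ((0,), (0,))),
)
```

The result has shape [B*H, S, S], one score matrix per batch-head pair.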

This causes suboptimal DMA patterns: the pipeline reads K in sequential blocks along the sequence dimension, and in the transposed [B*H, D, S] layout each such block is non-contiguous in HBM.
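The contiguity argument can be illustrated with a small offset calculation. This sketch assumes a flat row-major layout for a single batch-head slice; real HBM layouts on the target hardware may be tiled, so treat it as a simplified model rather than a description of the actual DMA. The sizes and helper names are hypothetical.

```python
def contiguous_runs(offsets):
    """Count maximal runs of consecutive element offsets in a sorted list."""
    runs = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

# Hypothetical sizes: sequence length S, head dim D, block size along S.
S, D, BLOCK = 16, 4, 8

def block_offsets_sd(s0):
    # K stored as [S, D] row-major: element (s, d) lives at s * D + d.
    return sorted(s * D + d for s in range(s0, s0 + BLOCK) for d in range(D))

def block_offsets_ds(s0):
    # K stored as [D, S] row-major (pre-transposed): element (d, s) lives at d * S + s.
    return sorted(d * S + s for d in range(D) for s in range(s0, s0 + BLOCK))

# One block of BLOCK sequence positions, starting at position 8.
runs_sd = contiguous_runs(block_offsets_sd(8))  # one contiguous range
runs_ds = contiguous_runs(block_offsets_ds(8))  # D separate ranges
```

Under this model a sequence block of contiguous K is a single range of BLOCK * D elements, while the same block of pre-transposed K splits into D ranges of BLOCK elements each.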

This PR keeps K contiguous in HBM and transposes each K block only right before the matmul. On top of the previous PR (#2373), this improves throughput from 633 to 652 TFLOPs.
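A minimal sketch of the idea in plain JAX, checking that transposing each K block just before the matmul gives the same scores as pre-transposing all of K. This is not the generated kernel: the blocking loop, block_k, qk_block and the sizes are assumptions for illustration only, and the performance effect described above depends on the hardware pipeline, which this sketch does not model.

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes; block_k is an illustrative block size along the sequence dimension.
batch_heads, n_dim, head_dim, block_k = 6, 8, 4, 4

key_q, key_k = jax.random.split(jax.random.PRNGKey(0))
q = jax.random.normal(key_q, (batch_heads, n_dim, head_dim))  # [B*H, S, D]
k = jax.random.normal(key_k, (batch_heads, n_dim, head_dim))  # [B*H, S, D], kept contiguous

STANDARD = (((2,), (1,)), ((0,), (0,)))  # contract lhs dim 2 with rhs dim 1, batch over dim 0

def qk_block(q_blk, k_blk):
    # k_blk arrives as a contiguous [B*H, block_k, D] slice and is
    # transposed to [B*H, D, block_k] only here, right before the matmul.
    k_blk_t = k_blk.transpose(0, 2, 1)
    return jax.lax.dot_general(q_blk, k_blk_t, dimension_numbers=STANDARD)

# Walk K in sequential blocks along the sequence dimension.
scores_blocked = jnp.concatenate(
    [qk_block(q, k[:, start:start + block_k, :]) for start in range(0, n_dim, block_k)],
    axis=2,
)

# Baseline: transpose all of K up front, as the kernel did previously.
scores_pre = jax.lax.dot_general(q, k.transpose(0, 2, 1), dimension_numbers=STANDARD)
```

Both paths compute the same [B*H, S, S] scores; only where the transpose happens differs.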

AmesingFlank added a commit that referenced this pull request May 9, 2026
…ansposing before the kernel

stack-info: PR: #2374, branch: AmesingFlank/stack/50
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/50 branch from 0bdb0b9 to 4577603 Compare May 9, 2026 03:42
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/49 branch from 1b0111f to cdfb1ce Compare May 9, 2026 03:42
@meta-cla meta-cla Bot added the CLA Signed label May 9, 2026
@AmesingFlank AmesingFlank marked this pull request as draft May 9, 2026 04:08
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/49 to main May 9, 2026 04:08
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/50 branch from 4577603 to a2abca6 Compare May 9, 2026 04:08
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/49 May 9, 2026 04:08
@AmesingFlank AmesingFlank marked this pull request as ready for review May 9, 2026 04:08
AmesingFlank added a commit that referenced this pull request May 9, 2026
…ansposing before the kernel

stack-info: PR: #2374, branch: AmesingFlank/stack/50
@AmesingFlank AmesingFlank marked this pull request as draft May 9, 2026 05:16
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/49 to main May 9, 2026 05:16
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/50 branch from a2abca6 to 6ad5991 Compare May 9, 2026 05:16
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/49 May 9, 2026 05:16
@AmesingFlank AmesingFlank marked this pull request as ready for review May 9, 2026 05:16
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/49 to main May 11, 2026 16:44
@AmesingFlank AmesingFlank changed the base branch from main to AmesingFlank/stack/49 May 11, 2026 16:45
…ansposing before the kernel

stack-info: PR: #2374, branch: AmesingFlank/stack/50
@AmesingFlank AmesingFlank marked this pull request as draft May 11, 2026 16:45
@AmesingFlank AmesingFlank changed the base branch from AmesingFlank/stack/49 to main May 11, 2026 16:45
@AmesingFlank AmesingFlank force-pushed the AmesingFlank/stack/50 branch from 6ad5991 to f9e9453 Compare May 11, 2026 16:45
@AmesingFlank AmesingFlank marked this pull request as ready for review May 11, 2026 16:45
@AmesingFlank AmesingFlank merged commit 2e0e236 into main May 11, 2026
23 checks passed
Labels

CLA Signed

2 participants