[CUDA] gather_mm by zcbenz · Pull Request #3414 · ml-explore/mlx

zcbenz · 2026-04-15T05:45:32Z

Implement gather_mm by fusing CUTLASS GEMM with a custom prologue.

Note that this implementation is completely unoptimized (e.g. no tensor cores, no vectorized reads) and reaches only 0.3x the performance of cuBLAS. We would need to write our own heuristics to dispatch to the proper main loops and alignments, which I'm postponing a bit until we have more CUTLASS kernels and I can work out a general design.
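For readers unfamiliar with the operation, here is a minimal CPU sketch of what a gather matmul computes. It is not the code in this PR. It assumes the usual semantics: each output batch picks one matrix from each operand by index and multiplies them. The function name `gather_mm_reference`, the row-major flat layout, and the parameter names are hypothetical and chosen only for illustration. The index lookup at the top of the batch loop stands in for the role a fused prologue would play: the operands are located through the indices directly, so no gathered copy of `a` or `b` is materialized first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for gather-matmul semantics (illustrative only).
// a holds num_a matrices of shape [M, K], b holds num_b matrices of
// shape [K, N], both row-major and stored back to back.
// For each output batch i:
//   out[i] = a[lhs_indices[i]] @ b[rhs_indices[i]]
std::vector<float> gather_mm_reference(
    const std::vector<float>& a,
    const std::vector<float>& b,
    const std::vector<uint32_t>& lhs_indices,
    const std::vector<uint32_t>& rhs_indices,
    size_t M,
    size_t N,
    size_t K) {
  assert(lhs_indices.size() == rhs_indices.size());
  const size_t batch = lhs_indices.size();
  std::vector<float> out(batch * M * N, 0.0f);

  for (size_t i = 0; i < batch; ++i) {
    // Resolve the indices to base pointers once per batch, instead of
    // copying the selected matrices into contiguous temporaries.
    const size_t a_offset = static_cast<size_t>(lhs_indices[i]) * M * K;
    const size_t b_offset = static_cast<size_t>(rhs_indices[i]) * K * N;
    assert(a_offset + M * K <= a.size());
    assert(b_offset + K * N <= b.size());
    const float* a_mat = a.data() + a_offset;
    const float* b_mat = b.data() + b_offset;
    float* out_mat = out.data() + i * M * N;

    // Plain triple-loop GEMM on the selected pair of matrices.
    for (size_t m = 0; m < M; ++m) {
      for (size_t n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k) {
          acc += a_mat[m * K + k] * b_mat[k * N + n];
        }
        out_mat[m * N + n] = acc;
      }
    }
  }
  return out;
}
```

A reference like this is mainly useful for checking a GPU kernel's output on small shapes. It says nothing about how the CUTLASS main loop in the PR is organized.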

angeloskath

Looks great. Should be fairly easy to extend to at least Ampere tensor cores.

[CUDA] gather_matmul

490a8a6

zcbenz force-pushed the cuda-gather-mm branch from 25f4d35 to 490a8a6 April 15, 2026 09:55

zcbenz changed the title ~~[CUDA] gather_matmul~~ [CUDA] gather_mm Apr 17, 2026

angeloskath approved these changes Apr 17, 2026

zcbenz merged commit d142de6 into ml-explore:main Apr 17, 2026
16 checks passed

zcbenz deleted the cuda-gather-mm branch April 17, 2026 07:53

