[Proposal] Upstream SM90 WGMMA HGEMM TN? #416

@lnxtree

Hi maintainers @DefTruth, I’d like to ask for feedback before opening a PR.

I implemented a Hopper-only HGEMM TN kernel using WGMMA + TMA + Warp Specialization (SM90, CUDA 12.x).
Current config: BM=128, BN=128, BK=64, stages=3, wgmma m64n128k16.
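
For reviewers who want to sanity-check the tile configuration, here is a rough sketch of the arithmetic it implies. It assumes FP16 operands, that each pipeline stage holds one A tile (BM x BK) and one B tile (BN x BK) in shared memory, and it ignores barriers and any epilogue staging buffer. The `TileConfig` class and its method names are hypothetical and only for illustration; they are not taken from the kernel.

```python
from dataclasses import dataclass

FP16_BYTES = 2
# Maximum dynamic shared memory per thread block on compute capability 9.0 (227 KB).
SM90_MAX_SMEM_PER_BLOCK = 227 * 1024


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical helper mirroring the config quoted in this issue."""
    bm: int = 128
    bn: int = 128
    bk: int = 64
    stages: int = 3
    wgmma_m: int = 64
    wgmma_n: int = 128
    wgmma_k: int = 16

    def smem_bytes_per_stage(self) -> int:
        # Assumption: one A tile (BM x BK) plus one B tile (BN x BK), both FP16.
        return (self.bm * self.bk + self.bn * self.bk) * FP16_BYTES

    def smem_bytes_total(self) -> int:
        # Operand buffers only; mbarriers and epilogue buffers are not counted.
        return self.stages * self.smem_bytes_per_stage()

    def wgmma_per_k_block(self) -> int:
        # Number of m64n128k16 instructions needed to cover one BM x BN x BK block.
        return (
            (self.bm // self.wgmma_m)
            * (self.bn // self.wgmma_n)
            * (self.bk // self.wgmma_k)
        )


cfg = TileConfig()
print(cfg.smem_bytes_per_stage())  # 32768
print(cfg.smem_bytes_total())      # 98304
print(cfg.wgmma_per_k_block())     # 8
print(cfg.smem_bytes_total() <= SM90_MAX_SMEM_PER_BLOCK)  # True
```

Under these assumptions the three-stage operand pipeline needs 96 KiB of shared memory, which fits within the per-block limit on SM90.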

Local status:
- Correctness: max error = 0 on tested square cases
- Performance: peak around 572.8 TFLOPS (local Hopper test)
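
For reference, the two metrics above are commonly computed as sketched below. This assumes the usual GEMM convention of 2·M·N·K floating-point operations and an elementwise maximum absolute difference against a reference result; the function names are hypothetical, and this is not necessarily how the numbers above were measured.

```python
def hgemm_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """Throughput in TFLOPS, counting one multiply and one add per MAC."""
    flops = 2.0 * m * n * k
    return flops / (elapsed_ms * 1e-3) / 1e12


def max_abs_error(out, ref) -> float:
    """Largest elementwise absolute difference between two flat sequences."""
    return max(abs(a - b) for a, b in zip(out, ref))


# Illustrative inputs only, not measurements from the kernel.
print(hgemm_tflops(1000, 1000, 1000, elapsed_ms=1.0))   # 2.0
print(max_abs_error([1.0, 2.0, 3.0], [1.0, 2.5, 3.0]))  # 0.5
```

In a real benchmark the elapsed time would typically come from CUDA events averaged over several iterations after warm-up.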

Before preparing the PR, could you please advise:
1. Is this direction welcome upstream?
2. What is the preferred target directory/structure for this kernel?
3. For the v1 PR, which do you prefer:
   A) kernel + benchmark + minimal docs
   B) A + PyTorch binding

If accepted, I will follow your preferred structure and submit a minimal, review-friendly PR first.

Thanks!
