[Proposal] Upstream SM90 WGMMA HGEMM TN? #416

Hi maintainers (@DefTruth), I’d like to ask for feedback before opening a PR.

I implemented a Hopper-only HGEMM TN kernel using WGMMA + TMA + Warp Specialization (SM90, CUDA 12.x).
Current config: BM=128, BN=128, BK=64, stages=3, wgmma m64n128k16.
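
For reviewers, a minimal host-side sketch of what this configuration implies for shared memory and the WGMMA loop structure. It is not the kernel code: the constant and function names are hypothetical, and it assumes FP16 A/B tiles (2 bytes per element) with one A tile and one B tile buffered per pipeline stage and no extra padding or epilogue buffers.

```cpp
#include <cstddef>

// Tile configuration quoted in the proposal.
constexpr int BM = 128, BN = 128, BK = 64, STAGES = 3;
// Shape of the wgmma m64n128k16 instruction.
constexpr int WGMMA_M = 64, WGMMA_N = 128, WGMMA_K = 16;
// Assumption: FP16 operands, 2 bytes per element.
constexpr std::size_t ELEM_BYTES = 2;

// Bytes for one A tile (BM x BK) plus one B tile (BN x BK) in a single stage.
constexpr std::size_t smem_bytes_per_stage() {
    return static_cast<std::size_t>(BM * BK + BN * BK) * ELEM_BYTES;
}

// Bytes for the whole multi-stage operand pipeline (barriers not counted).
constexpr std::size_t smem_bytes_total() {
    return STAGES * smem_bytes_per_stage();
}

// Number of 64-row WGMMA blocks needed to cover BM.
constexpr int wgmma_m_blocks = BM / WGMMA_M;
// Number of k16 WGMMA steps issued per BK-wide stage.
constexpr int wgmma_k_steps = BK / WGMMA_K;

static_assert(BM % WGMMA_M == 0, "BM must be a multiple of the WGMMA M size");
static_assert(BK % WGMMA_K == 0, "BK must be a multiple of the WGMMA K size");
static_assert(BN == WGMMA_N, "one WGMMA instruction spans the full BN here");
static_assert(smem_bytes_per_stage() == 32 * 1024, "32 KiB per stage");
static_assert(smem_bytes_total() == 96 * 1024, "96 KiB for three stages");
```

Under these assumptions the operand buffers take 96 KiB, which leaves headroom below the 227 KB per-block dynamic shared memory limit on compute capability 9.0. Each stage needs 4 k-steps, and BM=128 is covered by 2 blocks of 64 rows (for example, two consumer warpgroups, though that split is an assumption here).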

Local status:
    - Correctness: max error = 0 on tested square cases
    - Performance: peak around 572.8 TFLOPS (local Hopper test)
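
To make the two metrics above reproducible, here is a hedged sketch of how they could be computed on the host. The helper names `gemm_tflops` and `max_abs_error` are hypothetical and not taken from the kernel or the repo; the sketch assumes the usual 2·M·N·K FLOP count for GEMM and an element-wise comparison against a reference result converted to `float`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Throughput in TFLOPS for an M x N x K GEMM that took elapsed_ms milliseconds.
// Counts one multiply and one add per inner-product term (2*M*N*K FLOPs).
inline double gemm_tflops(long long m, long long n, long long k, double elapsed_ms) {
    const double flops = 2.0 * static_cast<double>(m) *
                         static_cast<double>(n) * static_cast<double>(k);
    return flops / (elapsed_ms * 1e-3) / 1e12;
}

// Largest absolute element-wise difference between kernel output and reference.
// Compares only the common prefix if the sizes differ.
inline float max_abs_error(const std::vector<float>& out,
                           const std::vector<float>& ref) {
    float err = 0.0f;
    const std::size_t n = std::min(out.size(), ref.size());
    for (std::size_t i = 0; i < n; ++i) {
        err = std::max(err, std::fabs(out[i] - ref[i]));
    }
    return err;
}
```

For example, `gemm_tflops(4096, 4096, 4096, 1.0)` evaluates to roughly 137.4 TFLOPS. Reporting the problem sizes and timing method alongside the peak number would make the result easier to compare.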

Before preparing the PR, could you please advise:
    1. Is this direction welcome upstream?
    2. What is the preferred target directory/structure for this kernel?
    3. For the v1 PR, do you prefer:
        A) kernel + benchmark + minimal docs
        B) A + PyTorch binding

If accepted, I will follow your preferred structure and submit a minimal, review-friendly PR first.

Thanks!
