Adding support for bmm_fp8 OP using cutlass by bhargaveede · Pull Request #108 · sgl-project/sgl-kernel-xpu

bhargaveede · 2026-02-05T13:43:54Z

Created a new PR in place of the closed PR #40.

@kareemshaik80 @mingfeima This will still be relevant for HW with native support, as we will not do dtype conversions there.
Also, right now we replicate the scales to support a stride of 1 (a current limitation of cutlass).

On top of the current PR, we can avoid that replication once support for a stride of 0 is available, which will improve perf further.
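
To illustrate the replication point, here is a minimal pure-Python sketch. It is not the cutlass or sgl-kernel API: `load_scale`, `batch`, and the scale value are hypothetical names chosen for illustration, and it assumes a single per-tensor scale shared by every batch. It only models how a kernel that indexes the scale buffer as `buf[batch_idx * stride]` needs one slot per batch with a stride of 1, but a single slot with a stride of 0.

```python
from array import array


def load_scale(buf, batch_idx, stride):
    """Read the scale for one batch the way a strided kernel would."""
    return buf[batch_idx * stride]


batch = 4    # hypothetical batch count
scale = 0.5  # hypothetical per-tensor scale

# Stride 1: the scalar is replicated so every batch index has its own slot.
replicated = array("f", [scale] * batch)
stride1 = [load_scale(replicated, b, 1) for b in range(batch)]

# Stride 0: every batch index maps back to element 0, so one slot is enough.
single = array("f", [scale])
stride0 = [load_scale(single, b, 0) for b in range(batch)]

# Both layouts yield the same scales; only the buffer size differs.
assert stride0 == stride1
print(len(replicated), len(single))  # prints "4 1"
```

With a stride of 0, the extra buffer and the copy that fills it both go away, which is presumably where the expected perf gain comes from.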

msinnha1 · 2026-02-24T04:36:16Z

    B_scale: torch.Tensor,
 ) -> None:
-    cublas_handle = torch.cuda.current_blas_handle()
+    # cublas_handle = torch.cuda.current_blas_handle()


The review comments from the earlier PR #40 are still not resolved.

Eede, Bhargav added 3 commits January 22, 2026 08:28

Addition of bmm_fp8 cutlass op.

21eec0d

Adding test to test suite

7b10c1c

Both inputs must have same dtype

3250807

msinnha1 reviewed Feb 24, 2026

bhargaveede wants to merge 3 commits into sgl-project:main from bhargaveede:bmm_fp8_cutlass
