
Add cuBLAS mm_out shim to eliminate libtorch runtime dependency#19360

Draft
digantdesai wants to merge 1 commit into main from cublas-mm-shim

Conversation

@digantdesai
Contributor

Implements aoti_torch_cuda_mm_out as a thin cuBLAS wrapper in the ExecuTorch AOTI CUDA shims. When Inductor picks cuBLAS over Triton templates for aten::mm (F.linear), the compiled .so requires this symbol at runtime. Without this shim, it resolves from libtorch_cuda.so, pulling in the full libtorch runtime.

In practice, Inductor's autotune on A100 picks Triton templates for the Qwen3.5 MoE dense projections (bf16 [M,2048]x[2048,N]), so the shim is not exercised for this model. It serves as a safety net for models or shapes where cuBLAS wins the autotune, ensuring fully libtorch-free AOTI CUDA deployment in all cases.
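For context on why such a shim can stay thin: cuBLAS assumes column-major storage while ATen tensors are row-major, and the standard way a `mm_out`-style wrapper bridges that is to call GEMM with the operands swapped, computing (A·B)ᵀ = Bᵀ·Aᵀ with no copies or explicit transposes. A minimal NumPy sketch of that identity (illustrative only; the function name is hypothetical and this is not ExecuTorch's actual shim code):

```python
import numpy as np

def mm_via_colmajor_gemm(a, b):
    """Compute C = A @ B for row-major A, B using only a column-major GEMM.

    A column-major routine handed A's row-major (m x k) buffer reads it
    as A^T (k x m). Calling GEMM with the operands swapped therefore
    computes B^T @ A^T = (A @ B)^T, whose column-major result buffer is
    bit-identical to C's row-major buffer -- no copies, no transposes.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # What a column-major GEMM sees in each row-major buffer:
    a_cm = a.ravel().reshape((k, m), order="F")  # == A^T
    b_cm = b.ravel().reshape((n, k), order="F")  # == B^T
    c_cm = b_cm @ a_cm                           # == (A @ B)^T
    # Reinterpret the column-major result buffer as row-major C (m x n).
    return c_cm.ravel(order="F").reshape((m, n))
```

The real shim delegates the `b_cm @ a_cm` step to a cuBLAS GEMM call on device memory; everything else is just this index bookkeeping.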


@pytorch-bot

pytorch-bot Bot commented May 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19360

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job, 4 Unrelated Failures

As of commit b6b4ad7 with merge base 8ae05c2:

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 7, 2026
@digantdesai digantdesai added the ciflow/cuda label and removed the CLA Signed label May 7, 2026
@github-actions

github-actions Bot commented May 7, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed label May 7, 2026
Implements aoti_torch_cuda_mm_out as a thin cuBLAS wrapper in the
ExecuTorch AOTI CUDA shims. When Inductor picks cuBLAS over Triton
templates for aten::mm (F.linear), the compiled .so requires this
symbol at runtime. Without this shim, it resolves from
libtorch_cuda.so, pulling in the full libtorch runtime.

In practice, Inductor's autotune on A100 picks Triton templates for
the Qwen3.5 MoE dense projections (bf16 [M,2048]x[2048,N]), so the
shim is not exercised for this model. It serves as a safety net for
models or shapes where cuBLAS wins the autotune, ensuring fully
libtorch-free AOTI CUDA deployment in all cases.

Co-authored-by: Claude <noreply@anthropic.com>
@Gasoonjia Gasoonjia changed the title Add cuBLAS mm_out shim to eliminate libtorch runtime dependency Add cuBLAS mm_out shim to cuda backend May 7, 2026
@Gasoonjia Gasoonjia changed the title Add cuBLAS mm_out shim to cuda backend Add cuBLAS mm_out shim to eliminate libtorch runtime dependency May 7, 2026
@Gasoonjia
Contributor

Can you help me update the title and summary a little? One thing: our CUDA backend has never depended on libtorch, but the current wording makes it sound like it does.


Labels

ciflow/cuda, CLA Signed
