[CUDA] Add sm_121/Blackwell to known target #24523

Open
Charlie-Tsai1123 wants to merge 3 commits into iree-org:main from Charlie-Tsai1123:add-sm121-cuda-target

@Charlie-Tsai1123

Summary

Add initial CUDA known-target support for sm_121 (NVIDIA GB10, Blackwell).

The CUDA execution limits are based on local cudaDeviceProp results from an sm_121 device. Existing NVIDIA MMA ops are reused as a conservative baseline until Blackwell-specific MMA intrinsics are modeled.

Related to #24477, which reports that IREE does not currently recognize newer Blackwell CUDA targets such as sm_120. This PR addresses the same target-enablement path for sm_121, the Blackwell target I can validate locally on an NVIDIA GB10.
This PR intentionally does not add sm_120 support because I do not have sm_120 hardware to confirm the device limits or runtime behavior.

Testing

Tested locally on an NVIDIA GB10 (sm_121). sm_121 requires PTX 8.8, so both smoke tests below pass --iree-cuda-target-features=+ptx88; with that flag, compilation succeeds.

Compiled and ran a local abs.mlir smoke test:

../iree-build/tools/iree-compile abs.mlir \
  --iree-hal-target-device=cuda \
  --iree-cuda-target=sm_121 \
  --iree-cuda-target-features=+ptx88 \
  -o abs_cuda.vmfb

../iree-build/tools/iree-run-module \
  --device=cuda \
  --module=abs_cuda.vmfb \
  --function=abs \
  --input=4xf32=-1,-2,3,-4

Results:

4xf32=1 2 3 4

Compiled and ran a local matmul.mlir smoke test:

../iree-build/tools/iree-compile matmul.mlir \
  --iree-hal-target-device=cuda \
  --iree-cuda-target=sm_121 \
  --iree-cuda-target-features=+ptx88 \
  -o matmul_cuda.vmfb

../iree-build/tools/iree-run-module \
  --device=cuda \
  --module=matmul_cuda.vmfb \
  --function=matmul \
  --input=128x256xf16=1 \
  --input=256x128xf16=1

Result: every element of the 128x128xf32 output is 256, as expected (each element sums 256 products of 1 x 1).
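
For reference, the expected value of 256 can be reproduced without IREE or a GPU. The following is a minimal Python sketch, assuming matmul.mlir computes a plain (128x256) x (256x128) matrix product; the file itself is not shown in this description, and `reference_matmul` is a hypothetical helper written only for this check.

```python
# Host-side reference for the matmul smoke test, independent of IREE.
# Assumption: matmul.mlir is a plain (M x K) @ (K x N) product with inputs all 1.
M, K, N = 128, 256, 128


def reference_matmul(a, b):
    """Naive row-by-column matrix product on nested lists."""
    cols = list(zip(*b))  # transpose b so each entry is one column
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


a = [[1.0] * K for _ in range(M)]  # mirrors --input=128x256xf16=1
b = [[1.0] * N for _ in range(K)]  # mirrors --input=256x128xf16=1
c = reference_matmul(a, b)

# Each output element is a sum of K products of 1 * 1, so it equals K.
assert all(value == float(K) for row in c for value in row)
```

Both 1 and 256 are exactly representable in f16 and f32, so an exact comparison against the device output is reasonable for this particular input.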

Add an initial NVIDIA GB10 / sm_121 CUDA target description.

Signed-off-by: Charlie-Tsai1123 <charlie1123tsai@gmail.com>

@AGindinson (Contributor) left a comment

A LIT test would be nice to have, as in PR #24525, once we agree on how to resolve the conflicts and on the order in which to merge these PRs.


@AGindinson (Contributor) left a comment

LGTM, thanks!
