Speed up CUDA CI build: split into per-arch OBJECT libraries, add --flash_nvcc_threads, and enable quick build mode by tianleiwu · Pull Request #28645 · microsoft/onnxruntime

tianleiwu · 2026-05-22T20:29:35Z

Description

Speed up CUDA CI build times by splitting the monolithic CUDA provider into architecture-specific OBJECT libraries with independent nvcc --threads control, and introducing a quick build mode (onnxruntime_QUICK_BUILD) that reduces kernel instantiations for CI validation.

Motivation and Context

CUDA builds were bottlenecked by --nvcc_threads 1 across all targets because flash attention (48 .cu files, SM80+) requires ~4 GB of memory per nvcc thread and ran out of memory (OOM) when compiled with higher thread counts. The old heuristic in build.py used psutil to auto-detect available memory, but it was unreliable and overly conservative.

By splitting flash attention into its own OBJECT library, the rest of the build can safely use --threads 4 while the flash attention thread count is set independently (for example, --threads 2). Combined with quick build mode (fewer kernel variants), this reduces CI wall-clock time by 11% to 32% in the comparison below.

CI Time Saving

N1F1: --nvcc_threads 1 (baseline). CI time is from the checks of PR 28607.
N4F2: --nvcc_threads 4 --flash_nvcc_threads 2. CI time is from this PR.
N8F4: --nvcc_threads 8 --flash_nvcc_threads 4. CI time is from this PR.
N4F4: --nvcc_threads 4 --flash_nvcc_threads 4. CI time is from this PR. This is the final candidate.
Saving = N1F1 - N4F4
Saving Ratio = (N1F1 - N4F4) / N1F1

CI time savings (each cell is Build + Test time, in minutes):

CI	N1F1	N4F2	N8F4	N4F4	Saved Minutes	Saving Ratio
Linux CI	35 + 38	35 + 32	35 + 32	36 + 27	10	14%
Windows CI	58 + 36	53 + 38	54 + 38	48 + 36	10	11%
Plugin Linux CI	53 + 26	38 + 17	39 + 39	39 + 15	25	32%
Plugin Windows CI	77 + 16	57 + 14	54 + 14	53 + 12	28	30%
Windows TRT CI	54 + 43	38 + 38	42 + 43	41 + 37	19	20%

Note that this is a single-run comparison. Caching may take effect over more runs and change these numbers. In this comparison, CI time dropped by 11% to 32% per pipeline, and the total saving across the five pipelines is more than 90 minutes.
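
As a quick check of the Saving and Saving Ratio columns, the arithmetic can be reproduced with a short Python sketch. The numbers are copied from the N1F1 and N4F4 columns of the table; the names `ci_minutes` and `saving` are illustrative and not part of the PR.

```python
# (build, test) minutes per pipeline, copied from the N1F1 and N4F4 columns.
ci_minutes = {
    "Linux CI": {"N1F1": (35, 38), "N4F4": (36, 27)},
    "Windows CI": {"N1F1": (58, 36), "N4F4": (48, 36)},
    "Plugin Linux CI": {"N1F1": (53, 26), "N4F4": (39, 15)},
    "Plugin Windows CI": {"N1F1": (77, 16), "N4F4": (53, 12)},
    "Windows TRT CI": {"N1F1": (54, 43), "N4F4": (41, 37)},
}


def saving(row):
    """Return (saved minutes, saving ratio) for one pipeline."""
    before = sum(row["N1F1"])  # build + test with --nvcc_threads 1
    after = sum(row["N4F4"])   # build + test with 4 / 4 threads
    saved = before - after
    return saved, saved / before


total_saved = 0
for name, row in ci_minutes.items():
    saved, ratio = saving(row)
    total_saved += saved
    print(f"{name}: saved {saved} min ({ratio:.0%})")
print(f"Total saved: {total_saved} min")
```

The per-pipeline results match the table, and the total comes to 92 minutes, consistent with "more than 90 minutes".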

Key Changes

1. CMake: Architecture-specific OBJECT Libraries

File	Change
`cmake/onnxruntime_cuda_source_filters.cmake`	New macros: `onnxruntime_extract_flash_attention_sources()`, `onnxruntime_extract_llm_sources()`, `onnxruntime_extract_sm_specific_cuda_sources()` to partition sources by SM arch
`cmake/onnxruntime_providers_cuda.cmake`	Create `flash_attention` (SM80+), `llm` (SM75+), `sm90_tma`, and `sm120_tma` OBJECT libraries with per-target `--threads`; merge fpA_intB SM90 launchers into SM90 TMA lib
`cmake/onnxruntime_providers_cuda_plugin.cmake`	Mirror OBJECT library pattern for plugin EP build; consolidate shared compile options into a variable; fix `-Xcudafe --diag_suppress=550,2810` and `--std c++20` for CUDA 12.8 compatibility
`cmake/onnxruntime_unittests.cmake`	Link new OBJECT libraries into test target

2. Build Script: `--flash_nvcc_threads` and Default 4

File	Change
`tools/ci_build/build.py`	Remove `psutil`-based memory heuristic; add `--flash_nvcc_threads` forwarding; default `nvcc_threads` to 4
`tools/ci_build/build_args.py`	Add `--flash_nvcc_threads` CLI argument (default: same as `--nvcc_threads`)
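
The following is a minimal sketch of the argument behavior described in the table, not the actual code in build.py or build_args.py. It assumes the "same as --nvcc_threads" default is resolved after parsing; the helper names `parse_thread_args` and `cmake_thread_defines` are hypothetical.

```python
import argparse


def parse_thread_args(argv):
    """Parse the two thread options; flash threads fall back to nvcc threads."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nvcc_threads", type=int, default=4)
    # None means "not given"; resolved to --nvcc_threads below (assumption).
    parser.add_argument("--flash_nvcc_threads", type=int, default=None)
    args = parser.parse_args(argv)
    if args.flash_nvcc_threads is None:
        args.flash_nvcc_threads = args.nvcc_threads
    return args


def cmake_thread_defines(args):
    """Build the CMake cache definitions named in this PR."""
    return [
        f"-Donnxruntime_NVCC_THREADS={args.nvcc_threads}",
        f"-Donnxruntime_FLASH_NVCC_THREADS={args.flash_nvcc_threads}",
    ]


print(cmake_thread_defines(parse_thread_args([])))
print(cmake_thread_defines(
    parse_thread_args(["--nvcc_threads", "4", "--flash_nvcc_threads", "2"])))
```

With no arguments both definitions resolve to 4; passing --flash_nvcc_threads 2 lowers only the flash attention value.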

3. Quick Build Mode (`onnxruntime_QUICK_BUILD`)

Reduces flash attention kernels to hdim128 fp16 only (skips hdim32/64/96/192/256)
Guards some MoE SM90 generated launchers with #ifndef ORT_QUICK_BUILD
Restricts CUTLASS SM80 tile configs to 3 instantiations
Skips test cases that depend on excluded kernel variants (e.g., test_gqa_fp8_fallback_unsupported_head_size needs hdim64)
Applied to all CI pipelines except Linux CUDA CI (full build) and packaging pipelines
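
To illustrate the head-dimension filtering, here is a small Python sketch that applies the same exclude pattern that appears in the CMake `list(FILTER ... EXCLUDE REGEX ...)` call quoted in this PR's review. The file names are made-up examples, and the sketch covers only the hdim part of the filter, not any data-type filtering.

```python
import re

# Pattern taken from the CMake diff in this PR; Python's re handles it the same way.
EXCLUDE = re.compile(r"flash_fwd.*hdim(32|64|96|192|256)")


def quick_build_filter(sources):
    """Drop flash attention kernels whose head dimension is excluded in quick build."""
    return [src for src in sources if not EXCLUDE.search(src)]


# Hypothetical file names, for illustration only.
sources = [
    "flash_fwd_hdim64_fp16_sm80.cu",
    "flash_fwd_hdim128_fp16_sm80.cu",
    "flash_fwd_split_hdim256_fp16_sm80.cu",
    "flash_api.cc",
]
kept = quick_build_filter(sources)
print(kept)
```

Only the hdim128 kernel and the non-kernel source survive, which is why tests that need other head sizes (such as hdim64) are skipped in quick build mode.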

4. CI and Packaging Pipeline Updates

All CUDA CI pipelines updated from --nvcc_threads 1 to --nvcc_threads 4 --flash_nvcc_threads 4:

.github/workflows/linux_cuda_ci.yml
.github/workflows/linux_cuda_plugin_ci.yml (+ QUICK_BUILD=ON)
.github/workflows/linux_tensorrt_ci.yml (+ QUICK_BUILD=ON)
.github/workflows/windows_cuda.yml (+ QUICK_BUILD=ON)
.github/workflows/windows_cuda_plugin.yml (+ QUICK_BUILD=ON)
.github/workflows/windows_tensorrt.yml (+ QUICK_BUILD=ON)

Packaging pipelines updated to use --nvcc_threads 4 --flash_nvcc_threads 2, except the CUDA plugin packaging build, which uses --nvcc_threads 2 --flash_nvcc_threads 1:

Azure Pipelines: custom-nuget-packaging-pipeline.yml, nuget-win-cuda-packaging-stage.yml, plugin-win-cuda-stage.yml, py-win-gpu-stage.yml
Linux scripts: build_cuda_plugin_package.sh, build_linux_python_package.sh

5. Bug Fix: CUTLASS Heuristic for SIMT Kernels

onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc: Fixed ORT_QUICK_BUILD path to return proper tile config for SIMT (float) gemm type instead of discarding the type info

Architecture Mapping

OBJECT Library	Min SM	Sources	Threads
`*_flash_attention`	SM80+	`bert/flash_attention/*.cu` (48 files)	`onnxruntime_FLASH_NVCC_THREADS` (default: same as nvcc_threads)
`*_llm`	SM75+	`contrib_ops/cuda/llm/*.cu` (excl. SM90/SM120 launchers)	`onnxruntime_NVCC_THREADS` (default 4)
`*_sm90_tma`	90a-real	MoE TMA + fpA_intB SM90 launchers	`onnxruntime_NVCC_THREADS`
`*_sm120_tma`	SM120+	MoE SM120 TMA generated files	`onnxruntime_NVCC_THREADS`
Parent target	All archs	Everything else	`onnxruntime_NVCC_THREADS`
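
The "Min SM" column can be read as a filter over the configured architecture list. The Python sketch below shows that idea only; how the CMake code in this PR actually derives each per-target list is not shown on this page, and `filter_min_sm` is a hypothetical helper.

```python
import re


def filter_min_sm(architectures, min_sm):
    """Keep entries of a CMAKE_CUDA_ARCHITECTURES-style list with SM >= min_sm."""
    kept = []
    for arch in architectures.split(";"):
        match = re.match(r"\d+", arch)  # leading digits, so "90a-real" parses as 90
        if match and int(match.group()) >= min_sm:
            kept.append(arch)
    return ";".join(kept)


archs = "75;80;86;89;90;100;120"  # the list used for local testing in this PR
print("flash_attention:", filter_min_sm(archs, 80))
print("llm:", filter_min_sm(archs, 75))
print("sm120_tma:", filter_min_sm(archs, 120))
```

Under this reading, SM75 is compiled only for the LLM library and the parent target, while the flash attention sources skip it.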

New Build Options

--nvcc_threads N (default 4) — threads for all CUDA targets except flash attention
--flash_nvcc_threads N (default: same as --nvcc_threads) — threads specifically for flash attention compilation

CMake cache variables: onnxruntime_NVCC_THREADS, onnxruntime_FLASH_NVCC_THREADS

Testing

Built locally with CMAKE_CUDA_ARCHITECTURES="75;80;86;89;90;100;120", --nvcc_threads 4 --flash_nvcc_threads 2
Verified flash attention .cu files compile only for SM80+ (checked build.ninja / VS project)
Verified LLM .cu files compile for SM75+
Ran onnxruntime_provider_test — all CUDA EP tests pass
Ran python test_qmoe_cuda.py (MoE kernels), flash attention / GQA tests
No link errors in either the in-tree provider build or the plugin EP build
No nvcc warnings about duplicate --threads flags
Plugin CI compile options verified: --std c++20, -Xcudafe --diag_suppress=550,2810, MSVC /bigobj all applied to OBJECT libraries

github-actions

You can commit the suggested changes from lintrunner.

Copilot

Pull request overview

This PR targets CUDA CI/build-time reductions by introducing per-architecture CUDA OBJECT libraries (enabling different nvcc --threads settings for flash-attention vs the rest), adding a dedicated --flash_nvcc_threads build option, and enabling a “quick build” mode to reduce kernel instantiations in CI.

Changes:

Split CUDA provider (and CUDA plugin EP) CUDA sources into arch-specific OBJECT libraries (flash-attn, LLM, SM90 TMA, SM120 TMA) and apply per-target --threads.
Build tooling updates: default --nvcc_threads to 4, add --flash_nvcc_threads, and forward both to CMake as onnxruntime_NVCC_THREADS / onnxruntime_FLASH_NVCC_THREADS.
CI/pipeline and tests updated to use the new thread flags and quick-build gating, plus a CUTLASS heuristic fix for ORT_QUICK_BUILD.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

File	Description
tools/ci_build/github/linux/build_tensorrt_c_api_package.sh	Use higher NVCC thread defaults and separate flash-attn thread count for packaging build.
tools/ci_build/github/linux/build_linux_python_package.sh	Update CUDA build args to new NVCC thread defaults + flash-attn thread count.
tools/ci_build/github/linux/build_cuda_plugin_package.sh	Adjust plugin packaging build to use separate (lower) flash-attn thread count.
tools/ci_build/github/linux/build_cuda_ci.sh	Remove legacy CUDA CI build script.
tools/ci_build/github/linux/build_cuda_c_api_package.sh	Use higher NVCC thread defaults and separate flash-attn thread count for packaging build.
tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml	Increase NVCC threads and add flash-attn threads for Windows Python GPU stage.
tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml	Increase NVCC threads and add flash-attn threads for plugin Windows CUDA stage.
tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml	Increase NVCC threads and add flash-attn threads in NuGet CUDA packaging.
tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml	Increase NVCC threads and add flash-attn threads for custom NuGet packaging.
tools/ci_build/build.py	Default NVCC threads behavior and propagate `onnxruntime_FLASH_NVCC_THREADS` to CMake.
tools/ci_build/build_args.py	Change default `--nvcc_threads` to 4 and add `--flash_nvcc_threads`.
onnxruntime/test/python/transformers/test_gqa.py	Skip a FP8 GQA case under quick-build due to missing head-size kernels.
onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc	Fix `ORT_QUICK_BUILD` path to return correct SIMT tile config.
cmake/onnxruntime_unittests.cmake	Link new arch-specific CUDA OBJECT libraries into CUDA UT shared module.
cmake/onnxruntime_providers_cuda.cmake	Implement arch-specific CUDA OBJECT libraries and per-target NVCC `--threads` settings.
cmake/onnxruntime_providers_cuda_plugin.cmake	Mirror provider OBJECT-library split and consolidate shared NVCC compile flags for plugin EP.
cmake/onnxruntime_cuda_source_filters.cmake	Add source-partitioning macros for flash-attn, LLM, and SM90/SM120 generated sources.
.github/workflows/windows_tensorrt.yml	Update NVCC threads and enable quick-build for Windows TensorRT CI build.
.github/workflows/windows_cuda.yml	Update NVCC threads and add flash-attn threads; quick-build is only applied in the test job.
.github/workflows/windows_cuda_plugin.yml	Update NVCC threads and enable quick-build for Windows CUDA plugin CI.
.github/workflows/linux_tensorrt_ci.yml	Update NVCC threads + flash-attn threads and enable quick-build for Linux TensorRT CI.
.github/workflows/linux_cuda_plugin_ci.yml	Update NVCC threads + flash-attn threads and enable quick-build for Linux CUDA plugin CI.
.github/workflows/linux_cuda_ci.yml	Update NVCC threads + flash-attn threads for Linux CUDA CI.

- Remove nargs="?" from --nvcc_threads and --flash_nvcc_threads to prevent TypeError when passed without a value
- Add onnxruntime_QUICK_BUILD=ON to Windows CUDA CI build job for consistency with the test job
- Update CMake cache default for onnxruntime_NVCC_THREADS from 1 to 4 to match the build.py default
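
The nargs="?" issue can be reproduced with plain argparse. This is a sketch under the assumption that the original option was declared with nargs="?" and no const; the exact declaration in build_args.py is not shown on this page, and `make_parser` is a hypothetical helper.

```python
import argparse


def make_parser(use_optional_nargs):
    """Build a parser with or without nargs="?" on --nvcc_threads."""
    parser = argparse.ArgumentParser()
    kwargs = {"nargs": "?"} if use_optional_nargs else {}
    parser.add_argument("--nvcc_threads", type=int, default=4, **kwargs)
    return parser


# With nargs="?" and no const, a bare flag silently parses to None ...
args = make_parser(True).parse_args(["--nvcc_threads"])
print(args.nvcc_threads)

# ... and any later numeric use of the value raises TypeError.
raised_type_error = False
try:
    _ = args.nvcc_threads > 0
except TypeError:
    raised_type_error = True

# Without nargs, argparse rejects the bare flag up front with a usage error.
exited = False
try:
    make_parser(False).parse_args(["--nvcc_threads"])
except SystemExit:
    exited = True
```

Dropping nargs="?" turns a late TypeError into an immediate, self-explanatory argparse error.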

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

  if(onnxruntime_QUICK_BUILD)
    message(STATUS "Quick build mode enabled: Only building hdim128 fp16 flash attention kernels")
-    list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")
+    list(FILTER _list EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")


+          PARENT onnxruntime_providers_cuda
+          CUDA_ARCHITECTURES "${_ort_flash_cuda_architectures}"
+          NVCC_THREADS "${onnxruntime_FLASH_NVCC_THREADS}"
+          SOURCES ${onnxruntime_cuda_flash_attention_srcs})


+      CUDA_ARCHITECTURES "${_plugin_flash_cuda_architectures}"
+      NVCC_THREADS "${onnxruntime_FLASH_NVCC_THREADS}"
+      COMPILE_OPTIONS ${_cuda_plugin_shared_compile_options}
+      SOURCES ${_cuda_plugin_flash_attention_srcs})


tianleiwu added 2 commits May 22, 2026 10:02

obj lib for moe gemm to speed up

207789f

object lib for flash attention and llm

19fb133

github-actions Bot reviewed May 22, 2026

Comment thread tools/ci_build/build.py

use --nvcc_threads 4 --flash_nvcc_threads 2 in pipelines

7861019

tianleiwu force-pushed the tlwu/cuda_build_speedup branch from 9091c9d to 7861019 Compare May 22, 2026 20:40

tianleiwu marked this pull request as draft May 22, 2026 22:06

tianleiwu added 4 commits May 22, 2026 20:21

exclude sm100 moe gemm kernels

21d7386

fix cuda 12.8 plugin CI

6a3bc28

use quick build in CI (except linux cuda CI).

41e2974

Fix plugin CI

d5c70d5

tianleiwu changed the title ~~Speed up CUDA build: split flash attention and LLM into per-arch OBJECT libraries with configurable nvcc_threads~~ Speed up CUDA CI build: split into per-arch OBJECT libraries, add --flash_nvcc_threads, and enable quick build mode May 23, 2026

CI --nvcc_threads 8 --flash_nvcc_threads 4

64f4acf

tianleiwu force-pushed the tlwu/cuda_build_speedup branch from 08d5248 to 64f4acf Compare May 24, 2026 04:45

tianleiwu added 3 commits May 24, 2026 01:21

--nvcc_threads 4 --flash_nvcc_threads 2 in packaging pipeline

d0afc5a

Fix Windows_Packaging_TensorRT

c4d32b8

cuda plugin packaging uses --nvcc_threads 2 --flash_nvcc_threads 1

f14741f

tianleiwu marked this pull request as ready for review May 24, 2026 22:28

CI --nvcc_threads 4 --flash_nvcc_threads 4

6ebd5e4

tianleiwu requested review from Copilot, edgchen1 and sanaa-hamel-microsoft May 25, 2026 21:15

Copilot started reviewing on behalf of tianleiwu May 25, 2026 21:15

Copilot AI reviewed May 25, 2026

Comment thread tools/ci_build/build_args.py

Comment thread cmake/onnxruntime_providers_cuda.cmake

Comment thread cmake/onnxruntime_providers_cuda_plugin.cmake

Comment thread .github/workflows/windows_cuda.yml Outdated

Comment thread cmake/onnxruntime_providers_cuda.cmake Outdated

tianleiwu enabled auto-merge (squash) May 26, 2026 18:00

sanaa-hamel-microsoft requested changes May 26, 2026

address feedbacks

b5aae5c

tianleiwu requested review from Copilot and sanaa-hamel-microsoft May 26, 2026 22:00

Copilot started reviewing on behalf of tianleiwu May 26, 2026 22:00

Copilot AI reviewed May 26, 2026

Comment thread cmake/onnxruntime_providers_cuda.cmake Outdated

tianleiwu added 2 commits May 26, 2026 15:23

refactoring

86c45a4

handle disable contrib ops nicely

6636c99

tianleiwu requested a review from Copilot May 26, 2026 22:30

Copilot started reviewing on behalf of tianleiwu May 26, 2026 22:30

kunal-vaishnavi approved these changes May 26, 2026

Copilot AI reviewed May 26, 2026

tianleiwu wants to merge 16 commits into main from tlwu/cuda_build_speedup
