Speed up CUDA CI build: split into per-arch OBJECT libraries, add --flash_nvcc_threads, and enable quick build mode #28645
Open
tianleiwu wants to merge 16 commits into
Pull request overview
This PR targets CUDA CI/build-time reductions by introducing per-architecture CUDA OBJECT libraries (enabling different nvcc --threads settings for flash-attention vs the rest), adding a dedicated --flash_nvcc_threads build option, and enabling a “quick build” mode to reduce kernel instantiations in CI.
Changes:
- Split CUDA provider (and CUDA plugin EP) CUDA sources into arch-specific OBJECT libraries (flash-attn, LLM, SM90 TMA, SM120 TMA) and apply per-target `--threads`.
- Build tooling updates: default `--nvcc_threads` to 4, add `--flash_nvcc_threads`, and forward both to CMake as `onnxruntime_NVCC_THREADS`/`onnxruntime_FLASH_NVCC_THREADS`.
- CI/pipeline and tests updated to use the new thread flags and quick-build gating, plus a CUTLASS heuristic fix for `ORT_QUICK_BUILD`.
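The build tooling change can be pictured with a minimal sketch. It assumes an `argparse`-based CLI; the flag names and CMake variable names come from this PR, while the helper functions `parse_cuda_thread_args` and `cmake_thread_defines` are hypothetical and are not the actual code in `build.py` or `build_args.py`. The fallback of `--flash_nvcc_threads` to `--nvcc_threads` follows the default stated in the PR description.

```python
import argparse


def parse_cuda_thread_args(argv):
    """Parse the two thread flags. Flag names mirror the PR; the rest is illustrative."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nvcc_threads", type=int, default=4)
    # None means "not given"; it is resolved to --nvcc_threads below.
    parser.add_argument("--flash_nvcc_threads", type=int, default=None)
    args = parser.parse_args(argv)
    if args.flash_nvcc_threads is None:
        args.flash_nvcc_threads = args.nvcc_threads
    return args


def cmake_thread_defines(args):
    """Build the -D options that carry both values to CMake."""
    return [
        f"-Donnxruntime_NVCC_THREADS={args.nvcc_threads}",
        f"-Donnxruntime_FLASH_NVCC_THREADS={args.flash_nvcc_threads}",
    ]


if __name__ == "__main__":
    # Only --nvcc_threads given: flash attention inherits the same value.
    print(cmake_thread_defines(parse_cuda_thread_args(["--nvcc_threads", "8"])))
    # Only --flash_nvcc_threads given: the general default of 4 still applies.
    print(cmake_thread_defines(parse_cuda_thread_args(["--flash_nvcc_threads", "2"])))
```

Resolving the fallback in one place keeps the CMake side simple: it always receives two concrete integers.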
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/ci_build/github/linux/build_tensorrt_c_api_package.sh | Use higher NVCC thread defaults and separate flash-attn thread count for packaging build. |
| tools/ci_build/github/linux/build_linux_python_package.sh | Update CUDA build args to new NVCC thread defaults + flash-attn thread count. |
| tools/ci_build/github/linux/build_cuda_plugin_package.sh | Adjust plugin packaging build to use separate (lower) flash-attn thread count. |
| tools/ci_build/github/linux/build_cuda_ci.sh | Remove legacy CUDA CI build script. |
| tools/ci_build/github/linux/build_cuda_c_api_package.sh | Use higher NVCC thread defaults and separate flash-attn thread count for packaging build. |
| tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml | Increase NVCC threads and add flash-attn threads for Windows Python GPU stage. |
| tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml | Increase NVCC threads and add flash-attn threads for plugin Windows CUDA stage. |
| tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml | Increase NVCC threads and add flash-attn threads in NuGet CUDA packaging. |
| tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml | Increase NVCC threads and add flash-attn threads for custom NuGet packaging. |
| tools/ci_build/build.py | Default NVCC threads behavior and propagate onnxruntime_FLASH_NVCC_THREADS to CMake. |
| tools/ci_build/build_args.py | Change default --nvcc_threads to 4 and add --flash_nvcc_threads. |
| onnxruntime/test/python/transformers/test_gqa.py | Skip a FP8 GQA case under quick-build due to missing head-size kernels. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc | Fix ORT_QUICK_BUILD path to return correct SIMT tile config. |
| cmake/onnxruntime_unittests.cmake | Link new arch-specific CUDA OBJECT libraries into CUDA UT shared module. |
| cmake/onnxruntime_providers_cuda.cmake | Implement arch-specific CUDA OBJECT libraries and per-target NVCC --threads settings. |
| cmake/onnxruntime_providers_cuda_plugin.cmake | Mirror provider OBJECT-library split and consolidate shared NVCC compile flags for plugin EP. |
| cmake/onnxruntime_cuda_source_filters.cmake | Add source-partitioning macros for flash-attn, LLM, and SM90/SM120 generated sources. |
| .github/workflows/windows_tensorrt.yml | Update NVCC threads and enable quick-build for Windows TensorRT CI build. |
| .github/workflows/windows_cuda.yml | Update NVCC threads and add flash-attn threads; quick-build is only applied in the test job. |
| .github/workflows/windows_cuda_plugin.yml | Update NVCC threads and enable quick-build for Windows CUDA plugin CI. |
| .github/workflows/linux_tensorrt_ci.yml | Update NVCC threads + flash-attn threads and enable quick-build for Linux TensorRT CI. |
| .github/workflows/linux_cuda_plugin_ci.yml | Update NVCC threads + flash-attn threads and enable quick-build for Linux CUDA plugin CI. |
| .github/workflows/linux_cuda_ci.yml | Update NVCC threads + flash-attn threads for Linux CUDA CI. |
- Remove nargs="?" from --nvcc_threads and --flash_nvcc_threads to prevent TypeError when passed without a value
- Add onnxruntime_QUICK_BUILD=ON to Windows CUDA CI build job for consistency with the test job
- Update CMake cache default for onnxruntime_NVCC_THREADS from 1 to 4 to match the build.py default
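The `nargs="?"` problem from the first item can be reproduced in isolation. This is a standalone sketch, not the code from `build_args.py`; the helper `build_parser` is hypothetical. With `nargs="?"` and no `const`, `argparse` stores `None` when the flag appears without a value, so any later comparison or arithmetic on it raises `TypeError`. Without `nargs`, `argparse` rejects the bare flag up front.

```python
import argparse


def build_parser(optional_value):
    """Return a parser with or without nargs='?' on --nvcc_threads."""
    parser = argparse.ArgumentParser()
    if optional_value:
        # nargs="?" lets the flag appear with no value; argparse then stores
        # `const`, which is None unless set explicitly.
        parser.add_argument("--nvcc_threads", nargs="?", type=int, default=4)
    else:
        parser.add_argument("--nvcc_threads", type=int, default=4)
    return parser


loose = build_parser(optional_value=True).parse_args(["--nvcc_threads"])
print(loose.nvcc_threads)  # None

try:
    loose.nvcc_threads > 0  # comparing None with an int
except TypeError as exc:
    print("TypeError:", exc)

try:
    # Without nargs="?", the bare flag is a usage error reported by argparse.
    build_parser(optional_value=False).parse_args(["--nvcc_threads"])
except SystemExit as exc:
    print("argparse exit code:", exc.code)
```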
sanaa-hamel-microsoft
requested changes
May 26, 2026
kunal-vaishnavi
approved these changes
May 26, 2026
Comment on lines 20 to +22

    if(onnxruntime_QUICK_BUILD)
      message(STATUS "Quick build mode enabled: Only building hdim128 fp16 flash attention kernels")
      list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")
      list(FILTER _list EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")
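To see which sources the exclusion pattern in the snippet above removes, here is a small Python sketch that applies the same regular expression. The file names are hypothetical and chosen only to exercise the pattern, and `quick_build_filter` is an illustrative helper. It uses `re.search` on the assumption that CMake's `list(FILTER ... EXCLUDE REGEX ...)` drops any item the pattern matches anywhere in the string.

```python
import re

# Same pattern as the CMake filter in the review snippet.
QUICK_BUILD_EXCLUDE = re.compile(r"flash_fwd.*hdim(32|64|96|192|256)")


def quick_build_filter(sources):
    """Keep only the sources that the exclusion pattern does not match."""
    return [s for s in sources if not QUICK_BUILD_EXCLUDE.search(s)]


# Hypothetical file names, for illustration only.
sources = [
    "flash_fwd_hdim64_fp16_sm80.cu",
    "flash_fwd_hdim128_fp16_sm80.cu",
    "flash_fwd_split_hdim256_fp16_sm80.cu",
    "flash_api.cc",
]
print(quick_build_filter(sources))
# ['flash_fwd_hdim128_fp16_sm80.cu', 'flash_api.cc']
```

Only the head-dimension part of the filter is modeled here. The pattern excludes by head size and leaves files that do not contain `flash_fwd` untouched.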
    PARENT onnxruntime_providers_cuda
    CUDA_ARCHITECTURES "${_ort_flash_cuda_architectures}"
    NVCC_THREADS "${onnxruntime_FLASH_NVCC_THREADS}"
    SOURCES ${onnxruntime_cuda_flash_attention_srcs})

    CUDA_ARCHITECTURES "${_plugin_flash_cuda_architectures}"
    NVCC_THREADS "${onnxruntime_FLASH_NVCC_THREADS}"
    COMPILE_OPTIONS ${_cuda_plugin_shared_compile_options}
    SOURCES ${_cuda_plugin_flash_attention_srcs})
Description
Speed up CUDA CI build times by splitting the monolithic CUDA provider into architecture-specific OBJECT libraries with independent `nvcc --threads` control, and introducing a quick build mode (`onnxruntime_QUICK_BUILD`) that reduces kernel instantiations for CI validation.

Motivation and Context
CUDA builds were bottlenecked by `--nvcc_threads 1` across all targets because flash attention (48 .cu files, SM80+) requires ~4GB per nvcc thread and caused OOM when compiled with higher thread counts. The old heuristic in `build.py` used `psutil` to auto-detect memory but was unreliable and always conservative.

By splitting flash attention into its own OBJECT library, the rest of the build can safely use `--threads 4` while flash attention stays at `--threads 2`. Combined with quick build mode (fewer kernel variants), this significantly reduces CI wall-clock time.

CI Time Saving
- `--nvcc_threads 1`: CI time is from the checks of PR 28607.
- `--nvcc_threads 4 --flash_nvcc_threads 2`: CI time is from this PR.
- `--nvcc_threads 8 --flash_nvcc_threads 4`: CI time is from this PR.
- `--nvcc_threads 4 --flash_nvcc_threads 4`: CI time is from this PR. This is the final candidate.

Here is the CI time saving (build + test time, in minutes):

Note that this is a single comparison. Caching may take effect over more runs and change the statistics. CI time is reduced by 11% to 32%, and the total CI time saving is more than 90 minutes.
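The memory argument from Motivation and Context can be made concrete with a rough estimate. The sketch below assumes that peak memory scales linearly with the number of parallel compile jobs and with `--threads`, using the ~4GB per nvcc thread figure quoted for flash attention. The function names, the agent size, and the job count are hypothetical, and real peak usage will vary by kernel and architecture list.

```python
import math


def flash_compile_peak_gb(parallel_jobs, nvcc_threads, gb_per_thread=4.0):
    """Rough peak memory if every parallel job is a flash attention file."""
    return parallel_jobs * nvcc_threads * gb_per_thread


def max_flash_threads(available_gb, parallel_jobs, gb_per_thread=4.0):
    """Largest --threads value whose estimated peak fits in available_gb (at least 1)."""
    return max(1, math.floor(available_gb / (parallel_jobs * gb_per_thread)))


# Hypothetical agent: 48 GB of RAM, 4 compile jobs running in parallel.
print(flash_compile_peak_gb(parallel_jobs=4, nvcc_threads=2))  # 32.0
print(flash_compile_peak_gb(parallel_jobs=4, nvcc_threads=4))  # 64.0
print(max_flash_threads(available_gb=48, parallel_jobs=4))     # 3
```

Under these assumptions, a separate and lower thread count for flash attention caps the worst case without slowing down the lighter targets.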
Key Changes
1. CMake: Architecture-specific OBJECT Libraries
- `cmake/onnxruntime_cuda_source_filters.cmake`: `onnxruntime_extract_flash_attention_sources()`, `onnxruntime_extract_llm_sources()`, `onnxruntime_extract_sm_specific_cuda_sources()` to partition sources by SM arch
- `cmake/onnxruntime_providers_cuda.cmake`: `flash_attention` (SM80+), `llm` (SM75+), `sm90_tma`, and `sm120_tma` OBJECT libraries with per-target `--threads`; merge fpA_intB SM90 launchers into SM90 TMA lib
- `cmake/onnxruntime_providers_cuda_plugin.cmake`: `-Xcudafe --diag_suppress=550,2810` and `--std c++20` for CUDA 12.8 compatibility
- `cmake/onnxruntime_unittests.cmake`

2. Build Script: `--flash_nvcc_threads` and Default 4
- `tools/ci_build/build.py`: remove `psutil`-based memory heuristic; add `--flash_nvcc_threads` forwarding; default `nvcc_threads` to 4
- `tools/ci_build/build_args.py`: `--flash_nvcc_threads` CLI argument (default: same as `--nvcc_threads`)

3. Quick Build Mode (`onnxruntime_QUICK_BUILD`)
- `#ifndef ORT_QUICK_BUILD`
- (`test_gqa_fp8_fallback_unsupported_head_size` needs hdim64)

4. CI and Packaging Pipeline Updates
All CUDA CI pipelines updated from `--nvcc_threads 1` to `--nvcc_threads 4 --flash_nvcc_threads 4`:
- `.github/workflows/linux_cuda_ci.yml`
- `.github/workflows/linux_cuda_plugin_ci.yml` (+QUICK_BUILD=ON)
- `.github/workflows/linux_tensorrt_ci.yml` (+QUICK_BUILD=ON)
- `.github/workflows/windows_cuda.yml` (+QUICK_BUILD=ON)
- `.github/workflows/windows_cuda_plugin.yml` (+QUICK_BUILD=ON)
- `.github/workflows/windows_tensorrt.yml` (+QUICK_BUILD=ON)

Packaging pipelines updated to use `--nvcc_threads 4 --flash_nvcc_threads 2`, except `--nvcc_threads 2 --flash_nvcc_threads 1` for the CUDA plugin:
- `custom-nuget-packaging-pipeline.yml`, `nuget-win-cuda-packaging-stage.yml`, `plugin-win-cuda-stage.yml`, `py-win-gpu-stage.yml`
- `build_cuda_plugin_package.sh`, `build_linux_python_package.sh`

5. Bug Fix: CUTLASS Heuristic for SIMT Kernels
- `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc`: Fixed `ORT_QUICK_BUILD` path to return proper tile config for SIMT (float) gemm type instead of discarding the type info

Architecture Mapping
- `*_flash_attention`: `bert/flash_attention/*.cu` (48 files); `onnxruntime_FLASH_NVCC_THREADS` (default: same as nvcc_threads)
- `*_llm`: `contrib_ops/cuda/llm/*.cu` (excl. SM90/SM120 launchers); `onnxruntime_NVCC_THREADS` (default 4)
- `*_sm90_tma`: `onnxruntime_NVCC_THREADS`
- `*_sm120_tma`: `onnxruntime_NVCC_THREADS`
- all other CUDA sources: `onnxruntime_NVCC_THREADS`

New Build Options
- `--nvcc_threads N` (default 4): threads for all CUDA targets except flash attention
- `--flash_nvcc_threads N` (default: same as `--nvcc_threads`): threads specifically for flash attention compilation

CMake cache variables: `onnxruntime_NVCC_THREADS`, `onnxruntime_FLASH_NVCC_THREADS`

Testing
- `CMAKE_CUDA_ARCHITECTURES="75;80;86;89;90;100;120"`, `--nvcc_threads 4 --flash_nvcc_threads 2`
- (`build.ninja` / VS project)
- `onnxruntime_provider_test`: all CUDA EP tests pass
- `python test_qmoe_cuda.py` (MoE kernels), flash attention / GQA tests
- `--threads` flags
- `--std c++20`, `-Xcudafe --diag_suppress=550,2810`, MSVC `/bigobj` all applied to OBJECT libraries