
Speed up CUDA CI build: split into per-arch OBJECT libraries, add --flash_nvcc_threads, and enable quick build mode #28645

Open: tianleiwu wants to merge 16 commits into main from tlwu/cuda_build_speedup


@tianleiwu tianleiwu commented May 22, 2026

Description

Speed up CUDA CI build times by splitting the monolithic CUDA provider into architecture-specific OBJECT libraries with independent nvcc --threads control, and introducing a quick build mode (onnxruntime_QUICK_BUILD) that reduces kernel instantiations for CI validation.

Motivation and Context

CUDA builds were bottlenecked by --nvcc_threads 1 across all targets because flash attention (48 .cu files, SM80+) requires ~4GB per nvcc thread and caused OOM when compiled with higher thread counts. The old heuristic in build.py used psutil to auto-detect memory but was unreliable and always conservative.

By splitting flash attention into its own OBJECT library, the rest of the build can safely use --threads 4 while flash attention stays at --threads 2. Combined with quick build mode (fewer kernel variants), this significantly reduces CI wall-clock time.

CI Time Saving

  • N1F1: --nvcc_threads 1. CI time is from the checks of PR 28607.
  • N4F2: --nvcc_threads 4 --flash_nvcc_threads 2. CI time is from this PR.
  • N8F4: --nvcc_threads 8 --flash_nvcc_threads 4. CI time is from this PR.
  • N4F4: --nvcc_threads 4 --flash_nvcc_threads 4. CI time is from this PR. This is the final candidate.
  • Saving = N1F1 - N4F4
  • Saving Ratio = (N1F1 - N4F4) / N1F1

CI time savings (each cell is build + test time, in minutes):

| CI | N1F1 | N4F2 | N8F4 | N4F4 | Saved Minutes | Saving Ratio |
|---|---|---|---|---|---|---|
| Linux CI | 35 + 38 | 35 + 32 | 35 + 32 | 36 + 27 | 10 | 14% |
| Windows CI | 58 + 36 | 53 + 38 | 54 + 38 | 48 + 36 | 10 | 11% |
| Plugin Linux CI | 53 + 26 | 38 + 17 | 39 + 39 | 39 + 15 | 25 | 32% |
| Plugin Windows CI | 77 + 16 | 57 + 14 | 54 + 14 | 53 + 12 | 28 | 30% |
| Windows TRT CI | 54 + 43 | 38 + 38 | 42 + 43 | 41 + 37 | 19 | 20% |

Note that this is a single-run comparison. Caching may take effect on later runs and change the numbers. Per-pipeline CI time is reduced by 11% to 32%, and the total saving across the five pipelines is more than 90 minutes.
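The Saving and Saving Ratio columns follow directly from the two formulas listed above. As a quick check, the sketch below recomputes them from the N1F1 and N4F4 columns of the table; the dictionary layout and function name are illustrative and not part of the PR.

```python
# (build, test) minutes per configuration, copied from the N1F1 and N4F4
# columns of the table in this description.
ci_times = {
    "Linux CI": {"N1F1": (35, 38), "N4F4": (36, 27)},
    "Windows CI": {"N1F1": (58, 36), "N4F4": (48, 36)},
    "Plugin Linux CI": {"N1F1": (53, 26), "N4F4": (39, 15)},
    "Plugin Windows CI": {"N1F1": (77, 16), "N4F4": (53, 12)},
    "Windows TRT CI": {"N1F1": (54, 43), "N4F4": (41, 37)},
}


def saving(times):
    """Return (saved minutes, saving ratio) as defined above:
    Saving = N1F1 - N4F4, Saving Ratio = (N1F1 - N4F4) / N1F1."""
    baseline = sum(times["N1F1"])
    candidate = sum(times["N4F4"])
    saved = baseline - candidate
    return saved, saved / baseline


savings = {name: saving(times) for name, times in ci_times.items()}
total_saved = sum(saved for saved, _ in savings.values())

for name, (saved, ratio) in savings.items():
    print(f"{name}: saved {saved} min ({ratio:.0%})")
print(f"Total saved: {total_saved} min")
```

Summing the per-pipeline savings gives 92 minutes, which matches the "more than 90 minutes" figure.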

Key Changes

1. CMake: Architecture-specific OBJECT Libraries

| File | Change |
|---|---|
| `cmake/onnxruntime_cuda_source_filters.cmake` | New macros: `onnxruntime_extract_flash_attention_sources()`, `onnxruntime_extract_llm_sources()`, `onnxruntime_extract_sm_specific_cuda_sources()` to partition sources by SM arch |
| `cmake/onnxruntime_providers_cuda.cmake` | Create flash_attention (SM80+), llm (SM75+), sm90_tma, and sm120_tma OBJECT libraries with per-target `--threads`; merge fpA_intB SM90 launchers into SM90 TMA lib |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | Mirror OBJECT library pattern for plugin EP build; consolidate shared compile options into a variable; fix `-Xcudafe --diag_suppress=550,2810` and `--std c++20` for CUDA 12.8 compatibility |
| `cmake/onnxruntime_unittests.cmake` | Link new OBJECT libraries into test target |

2. Build Script: --flash_nvcc_threads and Default 4

| File | Change |
|---|---|
| `tools/ci_build/build.py` | Remove psutil-based memory heuristic; add `--flash_nvcc_threads` forwarding; default nvcc_threads to 4 |
| `tools/ci_build/build_args.py` | Add `--flash_nvcc_threads` CLI argument (default: same as `--nvcc_threads`) |
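A minimal sketch of the behaviour described in this table, assuming a plain argparse setup: `--nvcc_threads` defaults to 4, `--flash_nvcc_threads` falls back to the same value, and both are forwarded as CMake cache definitions. The function names are hypothetical and the real `build.py` / `build_args.py` code may be structured differently; only the flag names and the CMake variable names come from this PR.

```python
import argparse


def parse_thread_args(argv):
    """Parse the two thread options (hypothetical helper, not the real build_args.py)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nvcc_threads", type=int, default=4)
    # None means "not given"; it is resolved to --nvcc_threads below.
    parser.add_argument("--flash_nvcc_threads", type=int, default=None)
    args = parser.parse_args(argv)
    if args.flash_nvcc_threads is None:
        args.flash_nvcc_threads = args.nvcc_threads
    return args


def cmake_thread_defines(args):
    """Build the -D definitions that forward both values to CMake."""
    return [
        f"-Donnxruntime_NVCC_THREADS={args.nvcc_threads}",
        f"-Donnxruntime_FLASH_NVCC_THREADS={args.flash_nvcc_threads}",
    ]


print(cmake_thread_defines(parse_thread_args([])))
print(cmake_thread_defines(parse_thread_args(["--nvcc_threads", "4", "--flash_nvcc_threads", "2"])))
```

Resolving the fallback after parsing, rather than hard-coding a second default, keeps the two options in sync when only `--nvcc_threads` is passed.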

3. Quick Build Mode (onnxruntime_QUICK_BUILD)

  • Reduces flash attention kernels to hdim128 fp16 only (skips hdim32/64/96/192/256)
  • Guards some MoE SM90 generated launchers with #ifndef ORT_QUICK_BUILD
  • Restricts CUTLASS SM80 tile configs to 3 instantiations
  • Skips test cases that depend on excluded kernel variants (e.g., test_gqa_fp8_fallback_unsupported_head_size needs hdim64)
  • Applied to all CI pipelines except Linux CUDA CI (full build) and packaging pipelines
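To illustrate the head-dimension filter, the sketch below applies the exclusion pattern `flash_fwd.*hdim(32|64|96|192|256)` (the regex quoted in a review comment later in this thread) to a list of file names. The file names are made up for illustration, and the sketch covers only the hdim filter, not whatever additional rule restricts the build to fp16.

```python
import re

# Exclusion pattern as quoted in the CMake review snippet in this PR.
QUICK_BUILD_EXCLUDE = re.compile(r"flash_fwd.*hdim(32|64|96|192|256)")


def filter_flash_sources(sources, quick_build):
    """Drop flash attention sources for excluded head dimensions in quick build mode."""
    if not quick_build:
        return list(sources)
    return [src for src in sources if not QUICK_BUILD_EXCLUDE.search(src)]


# Hypothetical file names; the real source list lives in the CMake files.
example_sources = [
    "flash_fwd_hdim64_fp16_sm80.cu",
    "flash_fwd_hdim128_fp16_sm80.cu",
    "flash_fwd_split_hdim256_bf16_sm80.cu",
    "flash_api.cc",
]

print(filter_flash_sources(example_sources, quick_build=True))
```

Only names that contain both `flash_fwd` and one of the listed head dimensions are removed, so hdim128 kernels and unrelated files pass through unchanged.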

4. CI and Packaging Pipeline Updates

All CUDA CI pipelines updated from --nvcc_threads 1 to --nvcc_threads 4 --flash_nvcc_threads 4:

  • .github/workflows/linux_cuda_ci.yml
  • .github/workflows/linux_cuda_plugin_ci.yml (+ QUICK_BUILD=ON)
  • .github/workflows/linux_tensorrt_ci.yml (+ QUICK_BUILD=ON)
  • .github/workflows/windows_cuda.yml (+ QUICK_BUILD=ON)
  • .github/workflows/windows_cuda_plugin.yml (+ QUICK_BUILD=ON)
  • .github/workflows/windows_tensorrt.yml (+ QUICK_BUILD=ON)

Packaging pipelines updated to use --nvcc_threads 4 --flash_nvcc_threads 2, except the CUDA plugin packaging builds, which use --nvcc_threads 2 --flash_nvcc_threads 1:

  • Azure Pipelines: custom-nuget-packaging-pipeline.yml, nuget-win-cuda-packaging-stage.yml, plugin-win-cuda-stage.yml, py-win-gpu-stage.yml
  • Linux scripts: build_cuda_plugin_package.sh, build_linux_python_package.sh

5. Bug Fix: CUTLASS Heuristic for SIMT Kernels

  • onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc: Fixed ORT_QUICK_BUILD path to return proper tile config for SIMT (float) gemm type instead of discarding the type info

Architecture Mapping

| OBJECT Library | Min SM | Sources | Threads |
|---|---|---|---|
| `*_flash_attention` | SM80+ | `bert/flash_attention/*.cu` (48 files) | `onnxruntime_FLASH_NVCC_THREADS` (default: same as nvcc_threads) |
| `*_llm` | SM75+ | `contrib_ops/cuda/llm/*.cu` (excl. SM90/SM120 launchers) | `onnxruntime_NVCC_THREADS` (default 4) |
| `*_sm90_tma` | 90a-real | MoE TMA + fpA_intB SM90 launchers | `onnxruntime_NVCC_THREADS` |
| `*_sm120_tma` | SM120+ | MoE SM120 TMA generated files | `onnxruntime_NVCC_THREADS` |
| Parent target | All archs | Everything else | `onnxruntime_NVCC_THREADS` |
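The mapping can be read as a filter over the requested architecture list. The sketch below shows that idea in Python for the architecture list used in the Testing section; the helper names are hypothetical, and the rule for `sm90_tma` (build for `90a-real` whenever SM90 is requested) is an assumption, since the actual CMake logic is not shown in this description.

```python
def parse_sm(arch):
    """Extract the numeric SM version from an entry such as "90" or "90a-real"."""
    digits = ""
    for ch in arch:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits)


def archs_at_least(archs, min_sm):
    """Keep only architectures whose SM version is at least min_sm."""
    return [arch for arch in archs if parse_sm(arch) >= min_sm]


requested = "75;80;86;89;90;100;120".split(";")

per_library = {
    "flash_attention": archs_at_least(requested, 80),
    "llm": archs_at_least(requested, 75),
    # Assumption: the SM90 TMA library targets 90a-real only when SM90 is requested.
    "sm90_tma": ["90a-real"] if any(parse_sm(a) == 90 for a in requested) else [],
    "sm120_tma": archs_at_least(requested, 120),
}

for library, archs in per_library.items():
    print(f"{library}: {';'.join(archs)}")
```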

New Build Options

  • --nvcc_threads N (default 4) — threads for all CUDA targets except flash attention
  • --flash_nvcc_threads N (default: same as --nvcc_threads) — threads specifically for flash attention compilation

CMake cache variables: onnxruntime_NVCC_THREADS, onnxruntime_FLASH_NVCC_THREADS

Testing

  • Built locally with CMAKE_CUDA_ARCHITECTURES="75;80;86;89;90;100;120", --nvcc_threads 4 --flash_nvcc_threads 2
  • Verified flash attention .cu files compile only for SM80+ (checked build.ninja / VS project)
  • Verified LLM .cu files compile for SM75+
  • Ran onnxruntime_provider_test — all CUDA EP tests pass
  • Ran python test_qmoe_cuda.py (MoE kernels), flash attention / GQA tests
  • No link errors in both in-tree provider and plugin EP builds
  • No nvcc warnings about duplicate --threads flags
  • Plugin CI compile options verified: --std c++20, -Xcudafe --diag_suppress=550,2810, MSVC /bigobj all applied to OBJECT libraries
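A check such as "no duplicate --threads flags" can be scripted against the generated compile commands. The sketch below counts `--threads` occurrences in a single command line. Both command strings are hypothetical rather than copied from a real build.ninja, and the parsing assumes POSIX-style quoting.

```python
import shlex


def count_threads_flags(command):
    """Count nvcc --threads options in one compile command line."""
    return sum(
        1
        for token in shlex.split(command)
        if token == "--threads" or token.startswith("--threads=")
    )


# Hypothetical compile commands for illustration only.
clean_command = "nvcc --threads 4 -std=c++20 -c flash_fwd_hdim128_fp16_sm80.cu -o flash.o"
duplicated_command = "nvcc --threads 4 --threads 2 -c flash_fwd_hdim128_fp16_sm80.cu -o flash.o"

for cmd in (clean_command, duplicated_command):
    n = count_threads_flags(cmd)
    status = "ok" if n <= 1 else "duplicate --threads"
    print(f"{n} flag(s): {status}")
```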


@github-actions github-actions Bot left a comment

You can commit the suggested changes from lintrunner.

Comment thread tools/ci_build/build.py
@tianleiwu tianleiwu force-pushed the tlwu/cuda_build_speedup branch from 9091c9d to 7861019 Compare May 22, 2026 20:40
@tianleiwu tianleiwu marked this pull request as draft May 22, 2026 22:06
@tianleiwu tianleiwu changed the title from "Speed up CUDA build: split flash attention and LLM into per-arch OBJECT libraries with configurable nvcc_threads" to "Speed up CUDA CI build: split into per-arch OBJECT libraries, add --flash_nvcc_threads, and enable quick build mode" May 23, 2026
@tianleiwu tianleiwu force-pushed the tlwu/cuda_build_speedup branch from 08d5248 to 64f4acf Compare May 24, 2026 04:45
@tianleiwu tianleiwu marked this pull request as ready for review May 24, 2026 22:28

Copilot AI left a comment

Pull request overview

This PR targets CUDA CI/build-time reductions by introducing per-architecture CUDA OBJECT libraries (enabling different nvcc --threads settings for flash-attention vs the rest), adding a dedicated --flash_nvcc_threads build option, and enabling a “quick build” mode to reduce kernel instantiations in CI.

Changes:

  • Split CUDA provider (and CUDA plugin EP) CUDA sources into arch-specific OBJECT libraries (flash-attn, LLM, SM90 TMA, SM120 TMA) and apply per-target --threads.
  • Build tooling updates: default --nvcc_threads to 4, add --flash_nvcc_threads, and forward both to CMake as onnxruntime_NVCC_THREADS / onnxruntime_FLASH_NVCC_THREADS.
  • CI/pipeline and tests updated to use the new thread flags and quick-build gating, plus a CUTLASS heuristic fix for ORT_QUICK_BUILD.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| `tools/ci_build/github/linux/build_tensorrt_c_api_package.sh` | Use higher NVCC thread defaults and separate flash-attn thread count for packaging build. |
| `tools/ci_build/github/linux/build_linux_python_package.sh` | Update CUDA build args to new NVCC thread defaults + flash-attn thread count. |
| `tools/ci_build/github/linux/build_cuda_plugin_package.sh` | Adjust plugin packaging build to use separate (lower) flash-attn thread count. |
| `tools/ci_build/github/linux/build_cuda_ci.sh` | Remove legacy CUDA CI build script. |
| `tools/ci_build/github/linux/build_cuda_c_api_package.sh` | Use higher NVCC thread defaults and separate flash-attn thread count for packaging build. |
| `tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml` | Increase NVCC threads and add flash-attn threads for Windows Python GPU stage. |
| `tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml` | Increase NVCC threads and add flash-attn threads for plugin Windows CUDA stage. |
| `tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml` | Increase NVCC threads and add flash-attn threads in NuGet CUDA packaging. |
| `tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml` | Increase NVCC threads and add flash-attn threads for custom NuGet packaging. |
| `tools/ci_build/build.py` | Default NVCC threads behavior and propagate `onnxruntime_FLASH_NVCC_THREADS` to CMake. |
| `tools/ci_build/build_args.py` | Change default `--nvcc_threads` to 4 and add `--flash_nvcc_threads`. |
| `onnxruntime/test/python/transformers/test_gqa.py` | Skip a FP8 GQA case under quick-build due to missing head-size kernels. |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc` | Fix `ORT_QUICK_BUILD` path to return correct SIMT tile config. |
| `cmake/onnxruntime_unittests.cmake` | Link new arch-specific CUDA OBJECT libraries into CUDA UT shared module. |
| `cmake/onnxruntime_providers_cuda.cmake` | Implement arch-specific CUDA OBJECT libraries and per-target NVCC `--threads` settings. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | Mirror provider OBJECT-library split and consolidate shared NVCC compile flags for plugin EP. |
| `cmake/onnxruntime_cuda_source_filters.cmake` | Add source-partitioning macros for flash-attn, LLM, and SM90/SM120 generated sources. |
| `.github/workflows/windows_tensorrt.yml` | Update NVCC threads and enable quick-build for Windows TensorRT CI build. |
| `.github/workflows/windows_cuda.yml` | Update NVCC threads and add flash-attn threads; quick-build is only applied in the test job. |
| `.github/workflows/windows_cuda_plugin.yml` | Update NVCC threads and enable quick-build for Windows CUDA plugin CI. |
| `.github/workflows/linux_tensorrt_ci.yml` | Update NVCC threads + flash-attn threads and enable quick-build for Linux TensorRT CI. |
| `.github/workflows/linux_cuda_plugin_ci.yml` | Update NVCC threads + flash-attn threads and enable quick-build for Linux CUDA plugin CI. |
| `.github/workflows/linux_cuda_ci.yml` | Update NVCC threads + flash-attn threads for Linux CUDA CI. |

Comment thread tools/ci_build/build_args.py
Comment thread cmake/onnxruntime_providers_cuda.cmake
Comment thread cmake/onnxruntime_providers_cuda_plugin.cmake
Comment thread .github/workflows/windows_cuda.yml Outdated
Comment thread cmake/onnxruntime_providers_cuda.cmake Outdated
- Remove nargs="?" from --nvcc_threads and --flash_nvcc_threads to prevent
  TypeError when passed without a value
- Add onnxruntime_QUICK_BUILD=ON to Windows CUDA CI build job for consistency
  with the test job
- Update CMake cache default for onnxruntime_NVCC_THREADS from 1 to 4 to match
  the build.py default
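The first item in this commit message concerns a standard argparse behaviour: with nargs="?", an option given without a value is stored as None (the default `const`), and later integer handling then fails with TypeError. The sketch below reproduces that with a standalone parser; the helper name is hypothetical and the comparison stands in for whatever arithmetic the real build script performs.

```python
import argparse


def make_parser(use_optional_nargs):
    """Build a parser with or without nargs="?" on --nvcc_threads (illustrative only)."""
    parser = argparse.ArgumentParser()
    kwargs = {"type": int, "default": 4}
    if use_optional_nargs:
        kwargs["nargs"] = "?"
    parser.add_argument("--nvcc_threads", **kwargs)
    return parser


# With nargs="?", a bare --nvcc_threads parses successfully but yields None.
old_style = make_parser(True).parse_args(["--nvcc_threads"])

try:
    old_style.nvcc_threads > 1  # stand-in for later integer handling
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Without nargs="?", argparse rejects the bare flag up front (exits with a usage error).
try:
    make_parser(False).parse_args(["--nvcc_threads"])
    rejected_by_argparse = False
except SystemExit:
    rejected_by_argparse = True

print(old_style.nvcc_threads, raised_type_error, rejected_by_argparse)
```

Dropping nargs="?" moves the failure to argument parsing, where the user gets a usage message instead of a traceback from deeper in the build script.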
@tianleiwu tianleiwu enabled auto-merge (squash) May 26, 2026 18:00
Comment thread .github/workflows/linux_cuda_plugin_ci.yml Outdated
Comment thread .github/workflows/windows_cuda_plugin.yml Outdated
Comment thread cmake/onnxruntime_cuda_source_filters.cmake Outdated
Comment thread cmake/onnxruntime_cuda_source_filters.cmake Outdated
Comment thread cmake/onnxruntime_cuda_source_filters.cmake Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml Outdated
Comment thread tools/ci_build/github/linux/build_cuda_plugin_package.sh Outdated
Comment thread tools/ci_build/github/linux/build_linux_python_package.sh Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Comment thread cmake/onnxruntime_providers_cuda.cmake Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

Comment on lines 20 to +22
if(onnxruntime_QUICK_BUILD)
message(STATUS "Quick build mode enabled: Only building hdim128 fp16 flash attention kernels")
list(FILTER ${CU_SRC_LIST} EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")
list(FILTER _list EXCLUDE REGEX "flash_fwd.*hdim(32|64|96|192|256)")
PARENT onnxruntime_providers_cuda
CUDA_ARCHITECTURES "${_ort_flash_cuda_architectures}"
NVCC_THREADS "${onnxruntime_FLASH_NVCC_THREADS}"
SOURCES ${onnxruntime_cuda_flash_attention_srcs})
CUDA_ARCHITECTURES "${_plugin_flash_cuda_architectures}"
NVCC_THREADS "${onnxruntime_FLASH_NVCC_THREADS}"
COMPILE_OPTIONS ${_cuda_plugin_shared_compile_options}
SOURCES ${_cuda_plugin_flash_attention_srcs})
4 participants