Honor disable_synchronize_execution_providers for CUDA graph replay #28686

Merged: tianleiwu merged 6 commits into main from tlwu/async-cuda-graph-replay on Jun 2, 2026

@tianleiwu tianleiwu commented May 27, 2026

Description

When using IO Binding with pre-allocated GPU buffers and disable_synchronize_execution_providers=1 in RunOptions, CUDA graph replay was the only remaining synchronization point that prevented fully async Session::Run(). This PR threads the sync flag through the ReplayGraph virtual so that CUDA graph replay respects the same run option.

Motivation

For latency-sensitive inference pipelines, users want to:

  1. Bind inputs/outputs to fixed GPU memory (IO Binding)
  2. Set a custom compute stream
  3. Use CUDA graph capture for reduced kernel launch overhead
  4. Run fully async — no host-side synchronization during Run()

Before this change, even with disable_synchronize_execution_providers=1, CUDA graph replay always called cudaStreamSynchronize after cudaGraphLaunch (hardcoded sync_status_flag=true). This forced a host-GPU sync on every replay, defeating the purpose of the async config.

Behavior Change

| Configuration | Before | After |
| --- | --- | --- |
| Default (disable_synchronize_execution_providers unset or "0") | cudaStreamSynchronize after graph launch | Same: cudaStreamSynchronize after graph launch |
| disable_synchronize_execution_providers = "1" | cudaStreamSynchronize after graph launch (ignored the config) | No sync: cudaGraphLaunch returns immediately, fully async |
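
The table reduces to a single boolean derived from the run config. A minimal Python sketch of that derivation follows; the real logic lives in C++ inside InferenceSession::RunImpl, and the helper name `should_sync_graph_replay` is hypothetical, as is the assumption that any value other than "1" keeps the sync.

```python
DISABLE_SYNC_KEY = "disable_synchronize_execution_providers"


def should_sync_graph_replay(run_config: dict) -> bool:
    """Return True when graph replay should block on the stream.

    Assumption for illustration: only the exact string "1" disables the sync;
    an unset key or "0" keeps the default blocking behavior.
    """
    return run_config.get(DISABLE_SYNC_KEY, "0") != "1"


# Walk through the rows of the behavior table.
for cfg in ({}, {DISABLE_SYNC_KEY: "0"}, {DISABLE_SYNC_KEY: "1"}):
    mode = "sync after launch" if should_sync_graph_replay(cfg) else "no sync"
    print(cfg, "->", mode)
```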

Key Changes

  • IExecutionProvider::ReplayGraph: Added bool sync = true parameter to the virtual method (backward-compatible default)
  • InferenceSession::RunImpl: Session-level graph replay path now reads disable_synchronize_execution_providers and passes sync=false when set
  • CUDAExecutionProvider::OnRunEnd: First-capture replay passes existing sync_stream flag (already derived from the run option)
  • CUDAExecutionProvider::ReplayGraph → PerThreadContext::ReplayGraph → CUDAGraphManager::Replay: sync flag threaded through the entire chain
  • Plugin CUDA EP: ReplayGraphImpl launches graph without sync; PluginExecutionProvider::ReplayGraph bridge calls Sync() only when sync=true
  • Other EPs (TensorRT, DML, JS, WebGPU, NV TensorRT RTX): Signature updated for compilation; sync parameter accepted but unused (these EPs have their own sync semantics)
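
The call chain in the list above can be modeled in a few lines of Python. This is only a sketch of the control flow, not the C++ implementation: `FakeStream`, `GraphManager`, `ExecutionProvider`, `run_impl`, and `trace` are hypothetical stand-ins, and the stream just records which CUDA calls it would have received.

```python
class FakeStream:
    """Records the CUDA calls a real stream would receive (no GPU needed)."""

    def __init__(self):
        self.calls = []

    def launch_graph(self):
        self.calls.append("cudaGraphLaunch")

    def synchronize(self):
        self.calls.append("cudaStreamSynchronize")


class GraphManager:
    """Stand-in for CUDAGraphManager::Replay."""

    def __init__(self, stream):
        self.stream = stream

    def replay(self, sync=True):
        self.stream.launch_graph()
        if sync:  # previously unconditional
            self.stream.synchronize()


class ExecutionProvider:
    """Stand-in for the EP-level ReplayGraph virtual with its defaulted flag."""

    def __init__(self, stream):
        self._manager = GraphManager(stream)

    def replay_graph(self, graph_annotation_id, sync=True):
        self._manager.replay(sync=sync)


def run_impl(ep, run_config):
    """Stand-in for the session-level replay path: derive sync, pass it down."""
    sync = run_config.get("disable_synchronize_execution_providers", "0") != "1"
    ep.replay_graph(graph_annotation_id=0, sync=sync)


def trace(run_config):
    """Return the CUDA calls issued for one replay under the given config."""
    stream = FakeStream()
    run_impl(ExecutionProvider(stream), run_config)
    return stream.calls


print(trace({}))
print(trace({"disable_synchronize_execution_providers": "1"}))
```

Keeping `sync=True` as the default at every level mirrors the backward-compatible default described above: callers that never pass the flag see the old blocking behavior.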

Usage Example

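import numpy as np
import onnxruntime as ort
import torch

# stream_ptr, shape, input_ptr, and output_ptr are supplied by the caller:
# the raw CUDA stream handle, the tensor shape, and device pointers to
# pre-allocated GPU buffers.
# user_compute_stream is the CUDA EP provider option for a caller-owned stream.
providers = [("CUDAExecutionProvider", {"user_compute_stream": str(stream_ptr)})]
session = ort.InferenceSession("model.onnx", providers=providers)
io_binding = session.io_binding()

# Bind pre-allocated GPU buffers
io_binding.bind_input("input", "cuda", 0, np.float16, shape, input_ptr)
io_binding.bind_output("output", "cuda", 0, np.float16, shape, output_ptr)

# Fully async run: no host sync during Run()
run_options = ort.RunOptions()
run_options.add_run_config_entry("disable_synchronize_execution_providers", "1")
session.run_with_iobinding(io_binding, run_options)

# Sync only when consuming output (assumes stream_ptr is torch's current stream)
torch.cuda.current_stream().synchronize()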

Notes

  • The plugin CUDA EP uses cudaDeviceSynchronize (via Sync()) for the default sync path instead of stream-level sync. This is because the C API OrtEp::ReplayGraph signature cannot be extended with a sync parameter without a versioned ABI change. Functionally correct; slightly broader than stream sync but only matters on the default (blocking) path.
  • CUDA graph capture-end replay in OnRunEnd was already gated by sync_stream, which is derived from the same run option — no additional change needed there beyond passing it through.
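
The plugin-EP note describes a common pattern: when a callee's signature is frozen, the caller-side bridge applies the new flag around the call. A minimal Python sketch of that pattern, with hypothetical `PluginEp` and `PluginBridge` classes standing in for the OrtEp implementation and PluginExecutionProvider:

```python
class PluginEp:
    """Stand-in for an EP behind a fixed C ABI: replay takes no sync argument."""

    def __init__(self):
        self.calls = []

    def replay_graph_impl(self, graph_annotation_id):
        # Always launches without waiting; the signature cannot carry a flag.
        self.calls.append("cudaGraphLaunch")

    def sync(self):
        # Device-wide sync, broader than a stream-level sync.
        self.calls.append("cudaDeviceSynchronize")


class PluginBridge:
    """Stand-in for the bridge that adapts the new flag to the old ABI."""

    def __init__(self, ep):
        self._ep = ep

    def replay_graph(self, graph_annotation_id, sync=True):
        self._ep.replay_graph_impl(graph_annotation_id)
        if sync:
            self._ep.sync()


blocking = PluginEp()
PluginBridge(blocking).replay_graph(0)
print(blocking.calls)

non_blocking = PluginEp()
PluginBridge(non_blocking).replay_graph(0, sync=False)
print(non_blocking.calls)
```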

Testing

Unit tests / existing coverage

  • Build passes with CUDA 13.0.
  • Existing CUDA graph tests continue to pass (default sync=true behavior unchanged).

Developer verification (nsys profiling)

A developer verification harness was added under onnxruntime/test/python/transformers/. It is not run in CI (requires a GPU, the nsys profiler, and the nvtx Python package), but lets anyone reproduce the async guarantee in one command:

  • profile_disable_sync.py — Builds a small cuBLAS-free elementwise model (so cuBLAS internal syncs cannot mask the EP-level sync), binds inputs/outputs on GPU via IO Binding (no host↔device copies during Run()), and wraps each Run() in an NVTX range. Toggles disable_synchronize_execution_providers via --sync on|off.
  • parse_nsys.py — Extended to parse CUDA runtime API calls that fall inside a named NVTX range from the nsys SQLite export. New flags: --cuda-api, --sync-apis-only (exits non-zero if any host-synchronization API is found in the range), --api-pattern, --list-cuda-apis, --skip-first-ranges N (skip warmup occurrences). Host-sync APIs detected include cudaStreamSynchronize, cudaDeviceSynchronize, cudaEventSynchronize, blocking cudaMemcpy/cudaMemset, and their driver-API equivalents.
  • run_disable_sync_check.sh — End-to-end driver: profiles sync=off and sync=on with nsys and reports host-synchronization CUDA APIs inside the per-run NVTX range.
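
The core check these scripts perform can be sketched as follows. This is not the code in parse_nsys.py: the `Span` tuple, the regular expressions, and the function names are hypothetical, the data is in-memory rather than read from the nsys SQLite export, and the version-suffix handling assumes API names of the form `cudaStreamSynchronize_v3020`. The pattern set covers only host-blocking runtime APIs and excludes *Async* variants.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """A named interval: an NVTX range or a CUDA runtime API call."""
    name: str
    start: int
    end: int


# Illustrative pattern set: host-blocking runtime APIs only.
SYNC_API = re.compile(
    r"^cuda(StreamSynchronize|DeviceSynchronize|EventSynchronize|Memcpy\w*|Memset\w*)$"
)
VERSION_SUFFIX = re.compile(r"_v\d+$")


def is_host_sync(api_name):
    """True for blocking sync/copy APIs; *Async* variants are excluded."""
    base = VERSION_SUFFIX.sub("", api_name)
    return "Async" not in base and SYNC_API.match(base) is not None


def sync_calls_in_ranges(ranges, api_calls, range_name="ort_run", skip_first=0):
    """Return host-sync API calls fully inside the selected NVTX ranges.

    skip_first drops the earliest occurrences of the range (warmup runs).
    """
    selected = sorted(
        (r for r in ranges if r.name == range_name), key=lambda r: r.start
    )[skip_first:]
    return [
        call
        for call in api_calls
        if is_host_sync(call.name)
        and any(r.start <= call.start and call.end <= r.end for r in selected)
    ]


ranges = [Span("ort_run", 0, 10), Span("ort_run", 20, 30), Span("ort_run", 40, 50)]
api_calls = [
    Span("cudaStreamSynchronize_v3020", 5, 6),    # inside the skipped warmup range
    Span("cudaLaunchKernel_v7000", 22, 23),       # not a sync API
    Span("cudaStreamSynchronize_v3020", 28, 29),  # counted
    Span("cudaMemcpyAsync_v3020", 42, 43),        # Async variant, ignored
    Span("cudaDeviceSynchronize_v3020", 60, 61),  # outside every range
]
found = sync_calls_in_ranges(ranges, api_calls, skip_first=1)
print(len(found), "host-sync call(s) inside measured ranges")
```

A check in the spirit of --sync-apis-only would then exit non-zero whenever `found` is non-empty.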

Results

Verified on an H100 (SM 90) with a Release CUDA 13.0 build (no node-IO dumping), 5 warmup + 50 measured runs, counting CUDA APIs inside the ort_run NVTX range (warmup occurrences skipped):

| Config | Host-sync APIs inside Run() | cudaLaunchKernel inside Run() | Result |
| --- | --- | --- | --- |
| disable_synchronize_execution_providers="1" (sync off) | 0 | 1200 (24 kernels × 50 runs) | PASS: fully async |
| default (sync on, baseline) | 100 cudaStreamSynchronize (2 × 50 runs), avg ~2.3 µs | 1200 | sync present as expected |

This confirms that when the option is set, no host-side stream/device synchronization occurs on the Run() path while kernels are still submitted normally; with the default config the synchronization remains in place.
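
The counts in the table follow directly from the per-run figures. A small sanity check of that arithmetic, where the `verdict` helper and its labels are hypothetical and not part of the harness:

```python
KERNELS_PER_RUN = 24        # kernels launched by one Run(), from the table
SYNCS_PER_RUN_DEFAULT = 2   # cudaStreamSynchronize calls per Run() by default
MEASURED_RUNS = 50          # warmup runs are excluded

expected_launches = KERNELS_PER_RUN * MEASURED_RUNS
expected_syncs_default = SYNCS_PER_RUN_DEFAULT * MEASURED_RUNS


def verdict(host_sync_calls, kernel_launches):
    """Classify one profile: kernels must still be submitted in full."""
    if kernel_launches != expected_launches:
        return "unexpected kernel count"
    return "fully async" if host_sync_calls == 0 else "sync present"


print(expected_launches, expected_syncs_default)
print(verdict(0, 1200))
print(verdict(100, 1200))
```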

Reproduce:

# Requires GPU + nsys + nvtx; point PYTHONPATH at a CUDA build of onnxruntime
bash onnxruntime/test/python/transformers/run_disable_sync_check.sh <python>

When disable_synchronize_execution_providers=1 is set in RunOptions,
CUDA graph replay now skips cudaStreamSynchronize after cudaGraphLaunch,
enabling fully async execution with IO Binding and pre-bound GPU buffers.

Previously, CUDA graph replay always called cudaStreamSynchronize
regardless of the disable_synchronize_execution_providers setting.
This was the only remaining synchronization point preventing fully
async Run() with IO Binding + CUDA graph.

Changes:
- Add bool sync parameter (default true) to IExecutionProvider::ReplayGraph
- Thread the parameter through CUDAExecutionProvider and plugin CUDA EP
- Session-level graph replay reads the run option to determine sync
- OnRunEnd capture-end replay uses the existing sync_stream flag
- All other EP overrides updated for signature compatibility

Copilot AI left a comment

Pull request overview

Threads a sync flag through IExecutionProvider::ReplayGraph so that CUDA graph replay honors the existing disable_synchronize_execution_providers RunOption. Previously, even when the option was set, the session-level replay path always synchronized the CUDA stream after cudaGraphLaunch, defeating fully async IO-binding workflows.

Changes:

  • Added a backward-compatible bool sync = true parameter to IExecutionProvider::ReplayGraph and its overrides across CUDA, plugin CUDA, TensorRT, NV TensorRT RTX, DML, JS, and WebGPU EPs.
  • InferenceSession::RunImpl now reads disable_synchronize_execution_providers and passes the derived flag to ReplayGraph; CUDA EP also forwards sync_stream when replaying after first-capture in OnRunEnd.
  • Plugin EP bridge launches the OrtEp graph without sync and then calls Sync() only when sync=true (note: device-wide sync, since the C API OrtEp::ReplayGraph cannot be ABI-extended).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| include/onnxruntime/core/framework/execution_provider.h | Adds sync parameter (default true) and doc to virtual ReplayGraph. |
| onnxruntime/core/session/inference_session.h | Forwards sync through cached-EP graph-replay helper. |
| onnxruntime/core/session/inference_session.cc | Derives sync_graph_replay from RunOptions and passes to ReplayGraph. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.{h,cc} | Threads sync through CUDA EP and PerThreadContext::ReplayGraph; uses sync_stream in OnRunEnd first-capture replay. |
| onnxruntime/core/providers/cuda/plugin/cuda_ep.cc | ReplayGraphImpl always launches without sync; bridge handles sync. |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.{h,cc} | Plugin bridge: launches via C API then calls Sync() when sync=true. |
| onnxruntime/core/providers/{tensorrt,nv_tensorrt_rtx,dml,js,webgpu}/... | Signature updates; sync parameter accepted but unused. |

Comment thread onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.cc
Comment thread onnxruntime/core/session/inference_session.cc
Comment thread include/onnxruntime/core/framework/execution_provider.h Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed
Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed
Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed
Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment thread onnxruntime/test/python/transformers/parse_nsys.py
Comment thread onnxruntime/test/python/transformers/parse_nsys.py
Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Outdated
Comment thread onnxruntime/test/python/transformers/run_disable_sync_check.sh Outdated
hariharans29 previously approved these changes Jun 2, 2026
- Move onnx/onnxruntime imports to top level (RUFF PLC0415) and use a
  single import form to fix CodeQL dual-import warning.
- Fix warmup NVTX comment and use io_binding.synchronize_outputs() for the
  final sync instead of an extra inference run.
- Restrict SYNC_API_PATTERNS to host-blocking runtime APIs, drop driver
  (cu*) and cudaStreamWaitEvent patterns, and exclude *Async* variants.
- Correct the sync=on baseline banner wording.
@tianleiwu tianleiwu enabled auto-merge (squash) June 2, 2026 17:00
@tianleiwu tianleiwu merged commit 312f524 into main Jun 2, 2026
86 checks passed
@tianleiwu tianleiwu deleted the tlwu/async-cuda-graph-replay branch June 2, 2026 17:19