Honor disable_synchronize_execution_providers for CUDA graph replay by tianleiwu · Pull Request #28686 · microsoft/onnxruntime

tianleiwu · 2026-05-27T06:06:18Z

Description

When using IO Binding with pre-allocated GPU buffers and disable_synchronize_execution_providers=1 in RunOptions, CUDA graph replay was the only remaining synchronization point that prevented fully async Session::Run(). This PR threads the sync flag through the ReplayGraph virtual so that CUDA graph replay respects the same run option.

Motivation

For latency-sensitive inference pipelines, users want to:

Bind inputs/outputs to fixed GPU memory (IO Binding)
Set a custom compute stream
Use CUDA graph capture for reduced kernel launch overhead
Run fully async — no host-side synchronization during Run()

Before this change, even with disable_synchronize_execution_providers=1, CUDA graph replay always called cudaStreamSynchronize after cudaGraphLaunch (hardcoded sync_status_flag=true). This forced a host-GPU sync on every replay, defeating the purpose of the async config.

Behavior Change

Configuration	Before	After
Default (`disable_synchronize_execution_providers` unset or `"0"`)	`cudaStreamSynchronize` after graph launch	Same — `cudaStreamSynchronize` after graph launch
`disable_synchronize_execution_providers = "1"`	`cudaStreamSynchronize` after graph launch (ignored the config)	No sync — `cudaGraphLaunch` returns immediately, fully async
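
The mapping in the table can be sketched as a single predicate. This is an illustrative Python model, not the ONNX Runtime source: `should_sync_graph_replay` is a hypothetical helper, the run config is modeled as a plain dict, and treating only the literal value `"1"` as "sync disabled" is an assumption taken from the table rows above.

```python
DISABLE_SYNC_KEY = "disable_synchronize_execution_providers"


def should_sync_graph_replay(run_config: dict) -> bool:
    """Hypothetical helper: True when graph replay should block on the stream.

    Unset or "0" keeps the default blocking behavior; "1" turns the sync off.
    """
    return run_config.get(DISABLE_SYNC_KEY, "0") != "1"


if __name__ == "__main__":
    for config in ({}, {DISABLE_SYNC_KEY: "0"}, {DISABLE_SYNC_KEY: "1"}):
        print(config, "-> sync:", should_sync_graph_replay(config))
```

Keeping the default at "sync on" means callers that never set the option see exactly the pre-change behavior.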

Key Changes

IExecutionProvider::ReplayGraph — Added bool sync = true parameter to the virtual method (backward-compatible default)
InferenceSession::RunImpl — Session-level graph replay path now reads disable_synchronize_execution_providers and passes sync=false when set
CUDAExecutionProvider::OnRunEnd — First-capture replay passes existing sync_stream flag (already derived from the run option)
CUDAExecutionProvider::ReplayGraph → PerThreadContext::ReplayGraph → CUDAGraphManager::Replay — sync flag threaded through the entire chain
Plugin CUDA EP — ReplayGraphImpl launches graph without sync; PluginExecutionProvider::ReplayGraph bridge calls Sync() only when sync=true
Other EPs (TensorRT, DML, JS, WebGPU, NV TensorRT RTX) — Signature updated for compilation; sync parameter accepted but unused (these EPs have their own sync semantics)
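
A rough model of how the flag travels from the session down to the graph launch may help when reading the diff. The sketch below is Python rather than the C++ in the PR, and every class and method name in it (`FakeStream`, `GraphManager`, `ExecutionProvider`, `run_impl`) is a stand-in invented for illustration. It only shows the shape of the change: a defaulted `sync` parameter at each layer, with the session deriving the value from the run option.

```python
class FakeStream:
    """Records calls in place of a real CUDA stream (illustrative only)."""

    def __init__(self):
        self.calls = []

    def launch_graph(self):
        self.calls.append("cudaGraphLaunch")

    def synchronize(self):
        self.calls.append("cudaStreamSynchronize")


class GraphManager:
    """Stand-in for the innermost layer that launches the captured graph."""

    def __init__(self, stream):
        self.stream = stream

    def replay(self, sync=True):
        self.stream.launch_graph()
        if sync:  # previously this branch was unconditional
            self.stream.synchronize()


class ExecutionProvider:
    """Stand-in for the EP; the default keeps existing callers blocking."""

    def __init__(self, stream):
        self._manager = GraphManager(stream)

    def replay_graph(self, sync=True):
        self._manager.replay(sync)


def run_impl(ep, run_config):
    """Session-level replay path: derive the flag from the run option."""
    disabled = run_config.get("disable_synchronize_execution_providers", "0") == "1"
    ep.replay_graph(sync=not disabled)


if __name__ == "__main__":
    stream = FakeStream()
    run_impl(ExecutionProvider(stream), {"disable_synchronize_execution_providers": "1"})
    print(stream.calls)
```

Because the parameter defaults to `True` at every layer, an override that ignores it (as the non-CUDA EPs do) still compiles and behaves as before.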

Usage Example

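import numpy as np
import onnxruntime as ort
import torch

# Placeholder setup: share PyTorch's current CUDA stream and use torch-owned GPU buffers
stream_ptr = torch.cuda.current_stream().cuda_stream
shape = (1, 1024)  # example shape; use the model's real input/output shape
input_tensor = torch.randn(shape, dtype=torch.float16, device="cuda")
output_tensor = torch.empty(shape, dtype=torch.float16, device="cuda")
input_ptr, output_ptr = input_tensor.data_ptr(), output_tensor.data_ptr()

# Run on the custom stream with CUDA graph capture enabled
providers = [("CUDAExecutionProvider", {"user_compute_stream": str(stream_ptr), "enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)
io_binding = session.io_binding()

# Bind pre-allocated GPU buffers
io_binding.bind_input("input", "cuda", 0, np.float16, shape, input_ptr)
io_binding.bind_output("output", "cuda", 0, np.float16, shape, output_ptr)

# Fully async run: no host sync during Run()
run_options = ort.RunOptions()
run_options.add_run_config_entry("disable_synchronize_execution_providers", "1")
session.run_with_iobinding(io_binding, run_options)

# Sync only when consuming output
torch.cuda.current_stream().synchronize()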

Notes

The plugin CUDA EP uses cudaDeviceSynchronize (via Sync()) for the default sync path instead of a stream-level sync, because the C API OrtEp::ReplayGraph signature cannot be extended with a sync parameter without a versioned ABI change. This is functionally correct; it is slightly broader than a stream sync, but that only matters on the default (blocking) path.
CUDA graph capture-end replay in OnRunEnd was already gated by sync_stream, which is derived from the same run option — no additional change needed there beyond passing it through.
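
The plugin-EP note describes a common workaround for a frozen callback signature: leave the callee unchanged and let the wrapper decide whether to synchronize. A minimal sketch in Python, with hypothetical names (`FakePluginEp`, `bridge_replay_graph`) standing in for the C API implementation and the bridge:

```python
class FakePluginEp:
    """Stand-in for a plugin EP whose replay entry point has a fixed signature."""

    def __init__(self):
        self.calls = []

    def replay_graph_impl(self):
        # No sync parameter can be added here without an ABI change,
        # so the implementation always launches without synchronizing.
        self.calls.append("cudaGraphLaunch")

    def sync(self):
        # Device-wide sync: broader than syncing only the replay stream.
        self.calls.append("cudaDeviceSynchronize")


def bridge_replay_graph(ep, sync=True):
    """Bridge owns the sync decision; the plugin callback stays unchanged."""
    ep.replay_graph_impl()
    if sync:
        ep.sync()


if __name__ == "__main__":
    ep = FakePluginEp()
    bridge_replay_graph(ep, sync=False)
    print(ep.calls)
```

The cost of this arrangement is the one the note calls out: on the blocking path the wrapper can only reach the device-wide sync, not the specific stream used for the launch.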

Testing

Unit tests / existing coverage

Build passes with CUDA 13.0.
Existing CUDA graph tests continue to pass (default sync=true behavior unchanged).

Developer verification (nsys profiling)

A developer verification harness was added under onnxruntime/test/python/transformers/. It is not run in CI (requires a GPU, the nsys profiler, and the nvtx Python package), but lets anyone reproduce the async guarantee in one command:

profile_disable_sync.py — Builds a small cuBLAS-free elementwise model (so cuBLAS internal syncs cannot mask the EP-level sync), binds inputs/outputs on GPU via IO Binding (no host↔device copies during Run()), and wraps each Run() in an NVTX range. Toggles disable_synchronize_execution_providers via --sync on|off.
parse_nsys.py — Extended to parse CUDA runtime API calls that fall inside a named NVTX range from the nsys SQLite export. New flags: --cuda-api, --sync-apis-only (exits non-zero if any host-synchronization API is found in the range), --api-pattern, --list-cuda-apis, --skip-first-ranges N (skip warmup occurrences). Host-sync APIs detected include cudaStreamSynchronize, cudaDeviceSynchronize, cudaEventSynchronize, blocking cudaMemcpy/cudaMemset, and their driver-API equivalents.
run_disable_sync_check.sh — End-to-end driver: profiles sync=off and sync=on with nsys and reports host-synchronization CUDA APIs inside the per-run NVTX range.
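
The core check these scripts perform, finding host-synchronization API calls that fall inside measured NVTX ranges, can be sketched without nsys. The code below is not taken from parse_nsys.py: the `Event` tuple, the pattern list, and both function names are assumptions made for illustration, and the pattern is restricted to runtime APIs for brevity. Timestamps are arbitrary integers.

```python
import re
from typing import NamedTuple


class Event(NamedTuple):
    name: str
    start: int
    end: int


# Assumed pattern set: blocking runtime APIs, matched by prefix so that
# versioned names such as "cudaStreamSynchronize_v3020" are also caught.
SYNC_API = re.compile(
    r"^(cudaStreamSynchronize|cudaDeviceSynchronize|cudaEventSynchronize"
    r"|cudaMemcpy|cudaMemset)"
)


def is_host_sync(api_name: str) -> bool:
    """True for host-blocking calls; *Async* variants are excluded."""
    return "Async" not in api_name and SYNC_API.match(api_name) is not None


def sync_calls_in_ranges(ranges, api_calls, skip_first=0):
    """Return host-sync calls fully contained in a measured (non-warmup) range."""
    measured = sorted(ranges, key=lambda r: r.start)[skip_first:]
    return [
        call
        for call in api_calls
        if is_host_sync(call.name)
        and any(r.start <= call.start and call.end <= r.end for r in measured)
    ]


if __name__ == "__main__":
    ranges = [Event("ort_run", 0, 100), Event("ort_run", 200, 300)]
    calls = [
        Event("cudaStreamSynchronize_v3020", 50, 60),   # inside the warmup range
        Event("cudaLaunchKernel_v7000", 210, 212),      # not a sync API
        Event("cudaMemcpyAsync_v3020", 220, 222),       # async, ignored
        Event("cudaStreamSynchronize_v3020", 250, 260), # counted
    ]
    hits = sync_calls_in_ranges(ranges, calls, skip_first=1)
    # A checker in the style of --sync-apis-only would exit non-zero if hits is non-empty.
    print(len(hits), "host-sync call(s) in measured ranges")
```

Skipping the first ranges matters because warmup runs include graph capture, where synchronization is expected regardless of the run option.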

Results

Verified on an H100 (SM 90) with a Release CUDA 13.0 build (no node-IO dumping), 5 warmup + 50 measured runs, counting CUDA APIs inside the ort_run NVTX range (warmup occurrences skipped):

Config	Host-sync APIs inside `Run()`	`cudaLaunchKernel` inside `Run()`	Result
`disable_synchronize_execution_providers="1"` (sync off)	0	1200 (24 kernels × 50 runs)	PASS — fully async
default (sync on, baseline)	100 `cudaStreamSynchronize` (2 × 50 runs), avg ~2.3 µs	1200	sync present as expected

This confirms that when the option is set, no host-side stream/device synchronization occurs on the Run() path while kernels are still submitted normally; with the default config the synchronization remains in place.

Reproduce:

# Requires GPU + nsys + nvtx; point PYTHONPATH at a CUDA build of onnxruntime
bash onnxruntime/test/python/transformers/run_disable_sync_check.sh <python>

When disable_synchronize_execution_providers=1 is set in RunOptions, CUDA graph replay now skips cudaStreamSynchronize after cudaGraphLaunch, enabling fully async execution with IO Binding and pre-bound GPU buffers.

Previously, CUDA graph replay always called cudaStreamSynchronize regardless of the disable_synchronize_execution_providers setting. This was the only remaining synchronization point preventing fully async Run() with IO Binding + CUDA graph.

Changes:
- Add bool sync parameter (default true) to IExecutionProvider::ReplayGraph
- Thread the parameter through CUDAExecutionProvider and plugin CUDA EP
- Session-level graph replay reads the run option to determine sync
- OnRunEnd capture-end replay uses the existing sync_stream flag
- All other EP overrides updated for signature compatibility

Copilot

Pull request overview

Threads a sync flag through IExecutionProvider::ReplayGraph so that CUDA graph replay honors the existing disable_synchronize_execution_providers RunOption. Previously, even when the option was set, the session-level replay path always synchronized the CUDA stream after cudaGraphLaunch, defeating fully async IO-binding workflows.

Changes:

Added a backward-compatible bool sync = true parameter to IExecutionProvider::ReplayGraph and its overrides across CUDA, plugin CUDA, TensorRT, NV TensorRT RTX, DML, JS, and WebGPU EPs.
InferenceSession::RunImpl now reads disable_synchronize_execution_providers and passes the derived flag to ReplayGraph; CUDA EP also forwards sync_stream when replaying after first-capture in OnRunEnd.
Plugin EP bridge launches the OrtEp graph without sync and then calls Sync() only when sync=true (note: device-wide sync, since the C API OrtEp::ReplayGraph cannot be ABI-extended).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
include/onnxruntime/core/framework/execution_provider.h	Adds `sync` parameter (default true) and doc to virtual `ReplayGraph`.
onnxruntime/core/session/inference_session.h	Forwards `sync` through cached-EP graph-replay helper.
onnxruntime/core/session/inference_session.cc	Derives `sync_graph_replay` from RunOptions and passes to `ReplayGraph`.
onnxruntime/core/providers/cuda/cuda_execution_provider.{h,cc}	Threads `sync` through CUDA EP and `PerThreadContext::ReplayGraph`; uses `sync_stream` in `OnRunEnd` first-capture replay.
onnxruntime/core/providers/cuda/plugin/cuda_ep.cc	`ReplayGraphImpl` always launches without sync; bridge handles sync.
onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.{h,cc}	Plugin bridge: launches via C API then calls `Sync()` when `sync=true`.
onnxruntime/core/providers/{tensorrt,nv_tensorrt_rtx,dml,js,webgpu}/...	Signature updates; sync parameter accepted but unused.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

- Move onnx/onnxruntime imports to top level (RUFF PLC0415) and use a single import form to fix CodeQL dual-import warning.
- Fix warmup NVTX comment and use io_binding.synchronize_outputs() for the final sync instead of an extra inference run.
- Restrict SYNC_API_PATTERNS to host-blocking runtime APIs, drop driver (cu*) and cudaStreamWaitEvent patterns, and exclude *Async* variants.
- Correct the sync=on baseline banner wording.

tianleiwu mentioned this pull request May 27, 2026

[Feature Request] Allow IO binding on run async #28539

Closed

tianleiwu requested review from Copilot and yuslepukhin May 27, 2026 06:14

Copilot started reviewing on behalf of tianleiwu May 27, 2026 06:14

Copilot AI reviewed May 27, 2026

tianleiwu requested review from edgchen1 and hariharans29 May 27, 2026 23:00

Merge main

bd3da4f

hariharans29 reviewed May 28, 2026

Comment thread onnxruntime/core/providers/dml/DmlExecutionProvider/src/ExecutionProvider.h

hariharans29 reviewed May 28, 2026

Comment thread onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.cc

hariharans29 reviewed May 28, 2026

Comment thread onnxruntime/core/session/inference_session.cc

edgchen1 reviewed May 29, 2026

Comment thread include/onnxruntime/core/framework/execution_provider.h Outdated

add comments where sync parameter is ignored.

14f05b5

tianleiwu requested review from Copilot, edgchen1 and hariharans29 May 29, 2026 20:06

Copilot started reviewing on behalf of tianleiwu May 29, 2026 20:06

Copilot AI reviewed May 29, 2026

tianleiwu added 2 commits May 29, 2026 13:18

update comments

f1555ce

Add tests

03ff634

tianleiwu requested a review from Copilot June 2, 2026 00:10

Copilot started reviewing on behalf of tianleiwu June 2, 2026 00:10

github-advanced-security AI found potential problems Jun 2, 2026

Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed

Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed

Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed

github-advanced-security AI found potential problems Jun 2, 2026

Comment thread onnxruntime/test/python/transformers/profile_disable_sync.py Fixed

Copilot AI reviewed Jun 2, 2026

hariharans29 previously approved these changes Jun 2, 2026

tianleiwu dismissed hariharans29’s stale review via 327717c June 2, 2026 00:25

tianleiwu requested a review from hariharans29 June 2, 2026 00:36

tianleiwu enabled auto-merge (squash) June 2, 2026 17:00

hariharans29 approved these changes Jun 2, 2026

tianleiwu merged commit 312f524 into main Jun 2, 2026
86 checks passed

tianleiwu deleted the tlwu/async-cuda-graph-replay branch June 2, 2026 17:19

tianleiwu merged 6 commits into main from tlwu/async-cuda-graph-replay

5 participants
