Remove unnecessary cuda sync for better perf #17315
Gasoonjia merged 26 commits into gh/gasoonjia/116/base from
Conversation
Right now we always do a CUDA sync before exiting CudaBackend.execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync. Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/) ghstack-source-id: 339552916 Pull Request resolved: #17315
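The reasoning above can be sketched with a toy model (plain Python, not real CUDA; all names are illustrative): a stream behaves like a FIFO queue, so kernels enqueued on the same stream already run in launch order, and an explicit synchronize is only needed when the host must read a result, i.e. the GPU-to-CPU copy case.

```python
from collections import deque

class ToyStream:
    """Toy model of a CUDA stream: a FIFO of pending ops (not real CUDA)."""
    def __init__(self):
        self.pending = deque()
        self.results = {}

    def launch(self, name, fn):
        # Ops on the same stream run in launch order, so no sync is
        # needed between them: each op sees its predecessors' effects.
        self.pending.append((name, fn))

    def synchronize(self):
        # Host-side sync: drain every pending op. Only required before
        # the host reads device results (the GPU -> CPU copy case).
        while self.pending:
            name, fn = self.pending.popleft()
            self.results[name] = fn()

stream = ToyStream()
stream.launch("matmul", lambda: 6 * 7)
stream.launch("relu", lambda: max(0, stream.results["matmul"]))
# No sync between matmul and relu: FIFO order already guarantees
# matmul completes first. Sync once, only when the host needs the value.
stream.synchronize()
print(stream.results["relu"])  # 42
```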
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17315
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit e76fba9 with merge base 2cad5db, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #17315 * __->__ #17324 The torchcodec version we are using (0.10.0.dev20251211) no longer exists in https://download.pytorch.org/whl/nightly/torchcodec/, which caused many CIs, including all whisper CIs, to crash. This diff bumps the torchcodec pin to bring CI back. Differential Revision: [D92797044](https://our.internmc.facebook.com/intern/diff/D92797044/)
```cpp
if (is_using_shared_cuda_stream()) {
  // Shared stream mode: set handle's stream to nullptr.
  // The stream will be retrieved from backend in execute().
  handle->cuda_stream = nullptr;
```
I think it's better to set handle->cuda_stream to the only cuda stream.
Based on an offline sync, we tried to create a new CudaHandle class that inherits from the current aoti_handle, putting a shared_ptr<cuda_stream> inside cuda_handle, to make sure there's only one CUDA stream in the whole pipeline.
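A minimal sketch of that idea (Python stand-in for the C++ classes; all names are hypothetical approximations of the handles discussed above): a derived CUDA handle holds a shared reference to a single stream object, so every handle in the pipeline points at the same stream.

```python
class AotiDelegateHandle:
    """Hypothetical base handle (stands in for the C++ aoti_delegate_handle)."""
    def __init__(self):
        self.so_path = None  # other AOTI bookkeeping would live here

class CudaDelegateHandle(AotiDelegateHandle):
    """CUDA-specific handle owning a shared reference to one stream."""
    def __init__(self, shared_stream):
        super().__init__()
        # Shared ownership (a shared_ptr in C++): every handle keeps the
        # same stream alive, so there is exactly one stream pipeline-wide.
        self.cuda_stream = shared_stream

the_only_stream = object()  # stands in for a cudaStream_t
h1 = CudaDelegateHandle(the_only_stream)
h2 = CudaDelegateHandle(the_only_stream)
print(h1.cuda_stream is h2.cuda_stream)  # True
```

The design point is that stream identity lives in one shared object rather than being duplicated per handle, which is what makes skipping per-call syncs safe.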
### Summary - Currently all tests push the same libs, which is redundant. With this PR they are pushed only once, reducing execution time by two to three times for operator tests. ### Test plan ``` python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointOperator --device <device> --host <host> --model SM8650 --build_folder build-android --executorch_root . --artifact all_artifact ``` without optimization <img width="429" height="136" alt="image" src="https://github.com/user-attachments/assets/f52a167c-b665-47da-97b2-02f836f4858e" /> with optimization <img width="442" height="130" alt="image" src="https://github.com/user-attachments/assets/46366237-ec7b-4d38-b577-5677b3dafb36" /> cc @cccclai @cbilgin
- A clean-up operation is required to create an available device - CI is complete, but the device is not aware of the situation, so it should clean up by itself Signed-off-by: jiseong.oh <jiseong.oh@samsung.com>
### Summary
Added message matchers to death tests in 2 test files to verify tests fail with the expected error messages, not just that they fail.

evalue_test.cpp (18 matchers):
- Type checks: "EValue is not an int", "EValue is not a"
- Null pointer checks: "Pointer is null", "pointer cannot be null"
- List pointer checks: "string/int/bool/double/tensor list pointer is null"
- BoxedEvalueList checks: "wrapped_vals/unwrapped_vals cannot be null"

tensor_util_test.cpp (29 matchers):
- Shape/dtype mismatches: "Tensors do not match"
- Dimension validation: "Ending/Starting dimension.*should be in the range"
- Empty matchers for stride checks (Windows regex limitations)

Note: Matchers use only cross-platform compatible regex features (no brackets, unions, or grouping, which fail on Windows).

### Test plan
```
./test/run_oss_cpp_tests.sh
```
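The portability constraint above can be illustrated with plain Python `re` (the messages and matcher names here are hypothetical examples, not the actual test strings): patterns restricted to literals and `.*`, avoiding brackets, unions, and grouping, still pin down the expected failure text.

```python
import re

# Hypothetical death-test messages the code under test might emit.
messages = {
    "int_check": "EValue is not an int",
    "dim_check": "Ending dimension 5 should be in the range",
}

# Matchers using only portable regex features: literals and '.*'
# (no [], |, or () -- the features noted as failing on Windows).
matchers = {
    "int_check": "EValue is not an int",
    "dim_check": "Ending dimension.*should be in the range",
}

for name, pattern in matchers.items():
    # Each matcher must hit its message, so the death test verifies
    # not just that the code aborts, but that it aborts for the right reason.
    assert re.search(pattern, messages[name]), name
print("all matchers hit")
```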
### Summary
The validate_dim_order function only checked that values were in bounds,
allowing invalid inputs like {0, 0, 0} to pass. This caused
uninitialized memory access in dim_order_to_stride_nocheck.
Fix by using a bitmask to detect duplicates. Also adds test fixture with
runtime_init() for error logging and removes duplicate include.
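The bitmask duplicate check described above can be sketched like this (a Python stand-in for the C++ fix; the function name is taken from the description, the body is an assumed implementation):

```python
def validate_dim_order(dim_order):
    """Return True iff dim_order is a permutation of 0..len-1.

    Bounds checks alone are not enough: (0, 0, 0) is in bounds but
    invalid. A bitmask detects duplicates in one pass, O(1) extra space.
    """
    n = len(dim_order)
    seen = 0
    for d in dim_order:
        if d < 0 or d >= n:        # out of bounds
            return False
        bit = 1 << d
        if seen & bit:             # duplicate dimension
            return False
        seen |= bit
    return True

print(validate_dim_order((2, 0, 1)))  # True
print(validate_dim_order((0, 0, 0)))  # False: in bounds, but duplicated
```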
### Test plan
```
./test/run_oss_cpp_tests.sh
```
---------
Co-authored-by: Claude <noreply@anthropic.com>
Pull Request resolved: #17315 Right now we always do a CUDA sync before exiting CudaBackend.execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync. We also introduced a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy. ghstack-source-id: 340451972 @exported-using-ghexport Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
larryliu0820
left a comment
Thank you for splitting aoti_delegate_handle and cuda_delegate_handle, it's much cleaner this way.
This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: #17315 by @Gasoonjia ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/head Merge bot PR base: https://github.com/pytorch/executorch/tree/main Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/orig Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/) @diff-train-skip-merge Co-authored-by: gasoonjia <gasoonjia@icloud.com>