Remove unnecessary cuda sync for better perf #17315
Gasoonjia merged 26 commits into gh/gasoonjia/116/base from
Conversation
Right now we always do a CUDA sync before exiting CudaBackend.execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync. Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/) ghstack-source-id: 339552916 Pull Request resolved: #17315
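The reasoning above can be sketched with a toy model (plain Python, not real CUDA; all names are illustrative): a stream behaves like a FIFO queue, so kernels enqueued on the same stream already run in launch order, and an explicit synchronize is only needed when the host must read a result, i.e. the GPU-to-CPU copy case.

```python
from collections import deque

class ToyStream:
    """Toy model of a CUDA stream: a FIFO of pending ops (not real CUDA)."""
    def __init__(self):
        self.pending = deque()
        self.results = {}

    def launch(self, name, fn):
        # Ops on the same stream run in launch order, so no sync is
        # needed between them: each op sees its predecessors' effects.
        self.pending.append((name, fn))

    def synchronize(self):
        # Host-side sync: drain every pending op. Only required before
        # the host reads device results (the GPU -> CPU copy case).
        while self.pending:
            name, fn = self.pending.popleft()
            self.results[name] = fn()

stream = ToyStream()
stream.launch("matmul", lambda: 6 * 7)
stream.launch("relu", lambda: max(0, stream.results["matmul"]))
# No sync between matmul and relu: FIFO order already guarantees
# matmul completes first. Sync once, only when the host needs the value.
stream.synchronize()
print(stream.results["relu"])  # 42
```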
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17315
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit e76fba9 with merge base 2cad5db, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #17315 * __->__ #17324 The torchcodec version we are using (0.10.0.dev20251211) no longer exists in https://download.pytorch.org/whl/nightly/torchcodec/, which caused many CIs, including all whisper CIs, to crash. This diff bumps the torchcodec pin to bring CI back. Differential Revision: [D92797044](https://our.internmc.facebook.com/intern/diff/D92797044/)
```cpp
if (is_using_shared_cuda_stream()) {
  // Shared stream mode: set handle's stream to nullptr.
  // The stream will be retrieved from backend in execute().
  handle->cuda_stream = nullptr;
```
I think it's better to set handle->cuda_stream to the only cuda stream.
Based on an offline sync, we tried to create a new CudaHandle class that inherits from the current aoti_handle, putting a shared_ptr<cuda_stream> inside cuda_handle, to make sure there's only one CUDA stream in the whole pipeline.
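A minimal sketch of that idea (Python stand-in for the C++ classes; all names are hypothetical approximations of the handles discussed above): a derived CUDA handle holds a shared reference to a single stream object, so every handle in the pipeline points at the same stream.

```python
class AotiDelegateHandle:
    """Hypothetical base handle (stands in for the C++ aoti_delegate_handle)."""
    def __init__(self):
        self.so_path = None  # other AOTI bookkeeping would live here

class CudaDelegateHandle(AotiDelegateHandle):
    """CUDA-specific handle owning a shared reference to one stream."""
    def __init__(self, shared_stream):
        super().__init__()
        # Shared ownership (a shared_ptr in C++): every handle keeps the
        # same stream alive, so there is exactly one stream pipeline-wide.
        self.cuda_stream = shared_stream

the_only_stream = object()  # stands in for a cudaStream_t
h1 = CudaDelegateHandle(the_only_stream)
h2 = CudaDelegateHandle(the_only_stream)
print(h1.cuda_stream is h2.cuda_stream)  # True
```

The design point is that stream identity lives in one shared object rather than being duplicated per handle, which is what makes skipping per-call syncs safe.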
### Summary - Currently all tests push the same libs, which is redundant. With this PR they are pushed only once, reducing execution time by two to three times for operator tests. ### Test plan ``` python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointOperator --device <device> --host <host> --model SM8650 --build_folder build-android --executorch_root . --artifact all_artifact ``` without optimization <img width="429" height="136" alt="image" src="https://github.com/user-attachments/assets/f52a167c-b665-47da-97b2-02f836f4858e" /> with optimization <img width="442" height="130" alt="image" src="https://github.com/user-attachments/assets/46366237-ec7b-4d38-b577-5677b3dafb36" /> cc @cccclai @cbilgin
- A clean-up operation is required to create an available device - CI is complete, but the device is not aware of the situation, so it should clean up by itself Signed-off-by: jiseong.oh <jiseong.oh@samsung.com>
### Summary
Added message matchers to death tests in 2 test files to verify tests fail with the expected error messages, not just that they fail.

evalue_test.cpp (18 matchers):
- Type checks: "EValue is not an int", "EValue is not a"
- Null pointer checks: "Pointer is null", "pointer cannot be null"
- List pointer checks: "string/int/bool/double/tensor list pointer is null"
- BoxedEvalueList checks: "wrapped_vals/unwrapped_vals cannot be null"

tensor_util_test.cpp (29 matchers):
- Shape/dtype mismatches: "Tensors do not match"
- Dimension validation: "Ending/Starting dimension.*should be in the range"
- Empty matchers for stride checks (Windows regex limitations)

Note: Matchers use only cross-platform compatible regex features (no brackets, unions, or grouping, which fail on Windows).

### Test plan
```
./test/run_oss_cpp_tests.sh
```
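The portability constraint above can be illustrated with plain Python `re` (the messages and matcher names here are hypothetical examples, not the actual test strings): patterns restricted to literals and `.*`, avoiding brackets, unions, and grouping, still pin down the expected failure text.

```python
import re

# Hypothetical death-test messages the code under test might emit.
messages = {
    "int_check": "EValue is not an int",
    "dim_check": "Ending dimension 5 should be in the range",
}

# Matchers using only portable regex features: literals and '.*'
# (no [], |, or () -- the features noted as failing on Windows).
matchers = {
    "int_check": "EValue is not an int",
    "dim_check": "Ending dimension.*should be in the range",
}

for name, pattern in matchers.items():
    # Each matcher must hit its message, so the death test verifies
    # not just that the code aborts, but that it aborts for the right reason.
    assert re.search(pattern, messages[name]), name
print("all matchers hit")
```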
### Summary
The validate_dim_order function only checked that values were in bounds,
allowing invalid inputs like {0, 0, 0} to pass. This caused
uninitialized memory access in dim_order_to_stride_nocheck.
Fix by using a bitmask to detect duplicates. Also adds test fixture with
runtime_init() for error logging and removes duplicate include.
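The bitmask duplicate check described above can be sketched like this (a Python stand-in for the C++ fix; the function name is taken from the description, the body is an assumed implementation):

```python
def validate_dim_order(dim_order):
    """Return True iff dim_order is a permutation of 0..len-1.

    Bounds checks alone are not enough: (0, 0, 0) is in bounds but
    invalid. A bitmask detects duplicates in one pass, O(1) extra space.
    """
    n = len(dim_order)
    seen = 0
    for d in dim_order:
        if d < 0 or d >= n:        # out of bounds
            return False
        bit = 1 << d
        if seen & bit:             # duplicate dimension
            return False
        seen |= bit
    return True

print(validate_dim_order((2, 0, 1)))  # True
print(validate_dim_order((0, 0, 0)))  # False: in bounds, but duplicated
```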
### Test plan
```
./test/run_oss_cpp_tests.sh
```
---------
Co-authored-by: Claude <noreply@anthropic.com>
Pull Request resolved: #17315 Right now we always do a CUDA sync before exiting CudaBackend.execute(). However, we only need that when copying data from GPU to CPU; operations within the same stream do not need an explicit sync. We also introduced a new cuda_delegate_handle to remove CUDA-specific information from aoti_delegate_handle for a better hierarchy. ghstack-source-id: 340451972 @exported-using-ghexport Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/)
larryliu0820
left a comment
Thank you for splitting aoti_delegate_handle and cuda_delegate_handle, it's much cleaner this way.
This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: #17315 by @Gasoonjia ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/head Merge bot PR base: https://github.com/pytorch/executorch/tree/main Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/gasoonjia/116/orig Differential Revision: [D92193164](https://our.internmc.facebook.com/intern/diff/D92193164/) @diff-train-skip-merge Co-authored-by: gasoonjia <gasoonjia@icloud.com>