Usability of non-default CUDA streams, per-stream synchronization, memory safety guards#7348
Usability of non-default CUDA streams, per-stream synchronization, memory safety guards#7348TwentyPast4 wants to merge 11 commits into
Conversation
|
Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes. |
8cac804 to
f2176f1
Compare
…nchronization per-stream instead of for the entire device
a2ce2c0 to
f06bf67
Compare
There was a problem hiding this comment.
Pull request overview
This PR aims to make Open3D safer and more usable with non-default CUDA streams by introducing an explicit CUDAStream abstraction, switching many CUDA operations to use the current stream, and adding guardrails for host↔device memcpy semantics to avoid uninitialized host reads / lifetime hazards.
Changes:
- Introduces
core::CUDAStream+ stream-scoped behavior (including host↔device memcpy policies) and updatesCUDAScopedStreamaccordingly. - Replaces multiple device-wide synchronizations and default-stream usage with per-stream synchronization and “current stream” usage across kernels, nns, hashmap, and linalg CUDA backends.
- Adds a CUDA non-default stream regression test and includes various formatting-only cleanups.
Reviewed changes
Copilot reviewed 35 out of 51 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| cpp/tests/core/ParallelFor.cu | Adds a non-default CUDA stream test for ParallelFor. |
| cpp/tests/core/HashMap.cpp | Formatting-only ctor cleanup in a test helper type. |
| cpp/tests/core/CUDAUtils.cpp | Updates tests to validate new CUDAStream / CUDAScopedStream behavior. |
| cpp/pybind/core/cuda_utils.cpp | Adjusts Python bindings around CUDA synchronize behavior and adds synchronize_stream. |
| cpp/open3d/visualization/webrtc_server/PeerConnectionManager.h | Formatting-only ctor cleanup. |
| cpp/open3d/visualization/gui/WindowSystem.h | Formatting-only destructor cleanup. |
| cpp/open3d/visualization/gui/PickPointsInteractor.cpp | Formatting-only ctor cleanup. |
| cpp/open3d/visualization/gui/ImguiFilamentBridge.cpp | Formatting-only ctor cleanup. |
| cpp/open3d/visualization/gui/Gui.h | Formatting-only destructor cleanup. |
| cpp/open3d/utility/Optional.h | Formatting-only constructor formatting adjustments. |
| cpp/open3d/t/pipelines/registration/TransformationEstimation.h | Formatting-only destructor cleanup. |
| cpp/open3d/t/pipelines/kernel/TransformationConverter.cu | Uses current CUDAStream for kernel launches. |
| cpp/open3d/t/pipelines/kernel/RegistrationCUDA.cu | Uses current CUDAStream for launches + per-stream synchronization. |
| cpp/open3d/t/pipelines/kernel/RGBDOdometryCUDA.cu | Uses current CUDAStream for launches + per-stream synchronization. |
| cpp/open3d/t/geometry/kernel/VoxelBlockGridImpl.h | Replaces synchronize guards with per-stream sync under CUDA module builds. |
| cpp/open3d/t/geometry/kernel/PointCloudImpl.h | Replaces device sync w/ per-stream sync under CUDA module builds. |
| cpp/open3d/t/geometry/kernel/NPPImage.cpp | Wires NPP stream context to the current CUDAStream. |
| cpp/open3d/t/geometry/RGBDImage.h | Formatting-only destructor cleanup. |
| cpp/open3d/pipelines/registration/ColoredICP.h | Formatting-only destructor cleanup. |
| cpp/open3d/ml/impl/misc/Voxelize.cuh | Adds multiple stream synchronizations and changes cudaMemset → cudaMemsetAsync. |
| cpp/open3d/ml/impl/misc/InvertNeighborsList.cuh | Adds stream synchronization after scan-like operations. |
| cpp/open3d/ml/contrib/RoiPoolKernel.cu | Uses per-stream synchronization in debug builds. |
| cpp/open3d/io/sensor/RGBDSensor.h | Formatting-only destructor cleanup. |
| cpp/open3d/io/rpc/ConnectionBase.h | Formatting-only ctor/dtor cleanup. |
| cpp/open3d/io/PointCloudIO.h | Formatting-only ctor cleanup. |
| cpp/open3d/core/nns/NearestNeighborSearch.cpp | Formatting-only destructor cleanup. |
| cpp/open3d/core/nns/NanoFlannIndex.cpp | Formatting-only ctor/dtor cleanup. |
| cpp/open3d/core/nns/KnnSearchOps.cu | Routes thrust / ops to the current CUDAStream and removes per-iteration stream creation. |
| cpp/open3d/core/nns/FixedRadiusSearchOps.cu | Uses current CUDAStream and passes stream into sort implementation. |
| cpp/open3d/core/nns/FixedRadiusSearchImpl.cuh | Adds stream parameter to SortPairs and introduces additional stream syncs. |
| cpp/open3d/core/nns/FixedRadiusIndex.cpp | Formatting-only ctor/dtor cleanup. |
| cpp/open3d/core/linalg/SolveCUDA.cpp | Sets cuSolver handle stream to current CUDAStream. |
| cpp/open3d/core/linalg/SVDCUDA.cpp | Sets cuSolver handle stream to current CUDAStream (+ include). |
| cpp/open3d/core/linalg/MatmulCUDA.cpp | Sets cuBLAS handle stream to current CUDAStream (+ include). |
| cpp/open3d/core/linalg/LeastSquaresCUDA.cpp | Sets cuSolver + cuBLAS handle streams to current CUDAStream. |
| cpp/open3d/core/linalg/LUCUDA.cpp | Sets cuSolver handle stream to current CUDAStream (+ include). |
| cpp/open3d/core/linalg/InverseCUDA.cpp | Sets cuSolver handle stream to current CUDAStream (+ include). |
| cpp/open3d/core/linalg/AddMMCUDA.cpp | Sets cuBLAS handle stream to current CUDAStream (+ include). |
| cpp/open3d/core/kernel/ReductionCUDA.cu | Uses cudaMemsetAsync + per-stream kernel launch/sync. |
| cpp/open3d/core/kernel/NonZeroCUDA.cu | Routes thrust algorithms to current CUDAStream and synchronizes per-stream. |
| cpp/open3d/core/hashmap/CUDA/StdGPUHashBackend.h | Uses current CUDAStream for kernels, thrust copies, and synchronization. |
| cpp/open3d/core/hashmap/CUDA/SlabNodeManager.h | Uses async memset, stream-aware thrust, and async D2H copy with conditional sync. |
| cpp/open3d/core/hashmap/CUDA/SlabHashBackend.h | Switches to async memset + per-stream launches and synchronizations; stream-aware thrust usage. |
| cpp/open3d/core/hashmap/CUDA/CUDAHashBackendBufferAccessor.h | Switches to cudaMemsetAsync and per-stream sync. |
| cpp/open3d/core/hashmap/CUDA/CUDAHashBackendBuffer.cu | Routes thrust sequence to current CUDAStream and synchronizes. |
| cpp/open3d/core/ParallelFor.h | Launches kernels on current CUDAStream. |
| cpp/open3d/core/MemoryManagerCUDA.cpp | Uses current CUDAStream + adds per-stream host↔device memcpy safety policies. |
| cpp/open3d/core/Indexer.h | Formatting-only ctor cleanup. |
| cpp/open3d/core/CUDAUtils.h | Introduces CUDAStream, memcpy policy enum, updates CUDAScopedStream, adds stream Synchronize. |
| cpp/open3d/core/CUDAUtils.cpp | Implements CUDAStream, stream synchronize function, updates scoped stream handling, and enhances CUDA error logging. |
| cpp/benchmarks/core/HashMap.cpp | Formatting-only ctor cleanup in a benchmark helper type. |
Comments suppressed due to low confidence (3)
cpp/pybind/core/cuda_utils.cpp:1
- The docstring for the existing binding now says “Synchronizes a CUDA stream.” but the implementation shown still calls
cuda::Synchronize();with no stream parameter (different semantics). Also, this snippet introducesm_cuda.def(...)butm_cudais not in scope in the provided hunk (potential compile error if it isn’t defined earlier in this function). Consider (1) restoring the original docstring/behavior for the no-arg synchronize binding, and (2) ensuring the newsynchronize_streamis defined on a module object that is definitely in scope here (e.g.,m.def(...)or a clearly definedm_cudasubmodule).
cpp/open3d/ml/impl/misc/Voxelize.cuh:1 - This introduces a mandatory
cudaStreamSynchronizein the middle of a CUDA algorithm, which can significantly reduce overlap/concurrency and hurt performance on non-default streams. If the goal is to make host-visible scalar results available, prefer copying only the required scalar to host (viacudaMemcpyAsyncto pinned memory) and synchronizing only when the CPU actually consumes it, or use CUDA events to establish ordering across streams instead of a full stream sync. If the downstream use is on the same stream (device-side consumption), the sync should be unnecessary because stream operations are already ordered.
cpp/tests/core/ParallelFor.cu:1 - This test iterates and asserts over 10,000,000 elements on the CPU, which is likely to be slow and may cause CI timeouts/flakiness (especially in debug builds or on shared runners). Consider reducing
Nsubstantially, and/or sampling a subset of indices (e.g., first/last/strided checks) while still validating the non-default stream correctness property this test is meant to cover.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| OPEN3D_CUDA_CHECK(cudaMemcpyAsync( | ||
| result.data(), | ||
| thrust::raw_pointer_cast(slabs_per_superblock.data()), | ||
| num_super_blocks * sizeof(int), cudaMemcpyDeviceToHost, | ||
| CUDAStream::GetInstance().Get())); | ||
| if (!CUDAStream::GetInstance().IsDefaultStream()) { | ||
| cuda::Synchronize(CUDAStream::GetInstance()); | ||
| } | ||
| return result; |
There was a problem hiding this comment.
This returns result immediately after cudaMemcpyAsync without guaranteeing completion for the default stream case. Relying on pageable-host-memory behavior here is brittle and can lead to reading partially-copied data. Additionally, the copy uses sizeof(int) even though the device buffer is uint32_t (thrust::device_vector<uint32_t>), which is a type/size mismatch risk across platforms. Prefer copying into a std::vector<uint32_t> (or std::vector<int64_t> if that’s the intended API) and synchronizing the stream unconditionally (or switching to a synchronous cudaMemcpy if that is acceptable here).
| const std::unordered_set<cudaError_t> kProcessEndingErrors = { | ||
| cudaErrorAssert, | ||
| cudaErrorLaunchTimeout, | ||
| cudaErrorHardwareStackError, | ||
| cudaErrorIllegalInstruction, | ||
| cudaErrorMisalignedAddress, | ||
| cudaErrorInvalidAddressSpace, | ||
| cudaErrorInvalidPc, | ||
| cudaErrorTensorMemoryLeak, | ||
| cudaErrorMpsClientTerminated, | ||
| cudaErrorExternalDevice, | ||
| cudaErrorContained, | ||
| cudaErrorIllegalAddress, | ||
| cudaErrorLaunchFailure, | ||
| cudaErrorECCUncorrectable, | ||
| cudaErrorUnknown}; |
There was a problem hiding this comment.
Several cudaError_* enumerators here are not available in all CUDA toolkit versions/headers (Open3D often supports a range of CUDA versions). Unconditionally referencing newer error codes can break compilation for older CUDART versions. Consider guarding entries with #if CUDART_VERSION >= ... checks, or building this set from only broadly-available error codes (and treating unknown enums via runtime string matching / a fallback path).
| Sync = 0, | ||
| // Asynchronous memory copies. Unmanaged. | ||
| // No memory safety at all - you are responsible for your own actions. | ||
| // There are no guaranteed about the lifetime of memory copied between the |
There was a problem hiding this comment.
Correct the grammar in this comment: “There are no guaranteed” → “There are no guarantees”.
| // There are no guaranteed about the lifetime of memory copied between the | |
| // There are no guarantees about the lifetime of memory copied between the |
| /// ```cpp | ||
| /// void pokingTheBear() { | ||
| /// CUDAScopedStream scoped_stream(CUDAStream::CreateNew(), true); | ||
| /// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged); |
There was a problem hiding this comment.
The example references CUDAMemoryCopyPolicy::AsyncUnmanaged, but the enum introduced in this PR defines Async (not AsyncUnmanaged). Since this is header documentation for a newly introduced API, it should compile conceptually and match the actual identifier to avoid confusing users. Update the example to use the correct enum value (or rename the enum value if AsyncUnmanaged is the intended public API spelling).
| /// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged); | |
| /// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::Async); |
Type
Motivation and Context
In my experience, working with non-default CUDA streams is very dangerous in the current state of O3D.
A simple example that produces very unexpected behavior:
While a very similar looking piece of code doesn't have this issue:
The problem boils down to how the CUDA memory manager is implemented.
Issues are masked by the behavior of the CUDA default stream (different from non-default streams): https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html
The CUDA memory manager Memcpy operation is implemented to perform async memcpy operations regardless of the direction of copy (device->device, device->host, host->device). This was a huge red flag when I first saw it, but the library worked, so I just assumed my imagined world was a lie and all is okay somehow.
My initial thinking was, if device->host is an async memcpy, wouldn't the library break pretty much all over the place, at every point where any operations are done on tensors of the sort
gpu_tensor.To("CPU:0").AnyOperationHere(). Seems logical, as in such a case host-native operations would be done on top of memory which was yet to be filled. But in practice, this was not the case, so I went on with my life. Until eventually in my multi-threaded workflows, I was annoyed that the threads were implicitly being synchronized, because they shared the same CUDA stream, and thus were waiting on each other's work every time the stream needed to be synchronized. I then tried to use the scoped stream to create a new CUDA stream on each thread, so that they would not have to interfere with each other.To my horror, that is when the exact behavior I initially pictured when looking at the memory manager manifested itself.
All memory accesses on tensors moved to the host device were done on uninitialized memory and the figurative explosions are quite spectacular.
This issue is also present in other cases too (eg. host->device async copy immediately followed by a host free).
Besides this, I later noticed the library does device-wide synchronization in many cases, instead of per-stream synchronization. This causes performance critical applications (where proper stream management is key) to suffer greatly.
I decided to fix my problems by tweaking the memory manager and adding the necessary stream management functionality to achieve my goals. I must stress that this was done in an effort to solve my problems , and not in the proper "contribute to the library" way. There are several reasons for this, primary of course being time, but secondary also being that my knowledge of O3D is nowhere near the level I'd want (yet) to be able to contribute in the way I'd like.
I'm providing this PR as a blueprint of what I believe needs to be done, and as an outline of the issue at hand.
I'm using the backing branch of the PR as a way to test solutions, and would welcome any input from O3D devs on how best to solve the limitations.
I understand the view point that in the end you could say this is a user-facing issue, and the user should make sure any memory copies go though by taking care of object lifetimes themselves. However, O3D has many cases where it just can't work in multi-cuda-stream mode, as these async memcpy operations break many other things.
It is also my opinion that there should be no way for users to introduce heavily undefined behavior into their applications that heavily warps the intuition of host data always being synchronous and safe to work with (it is not in the outlined use cases). So by this I mean, there should be no way to change the CUDA stream at all, as the library gives absolutely no guarantees that it will be able to live with that (the vast majority of workflows do not survive this change).
Checklist:
python util/check_style.py --applyto apply Open3D code styleto my code.
updated accordingly.
results (e.g. screenshots or numbers) here.
Description