Usability of non-default CUDA streams, per-stream synchronization, memory safety guards by TwentyPast4 · Pull Request #7348 · isl-org/Open3D

TwentyPast4 · 2025-10-29T14:55:30Z

Type

Bug fix (non-breaking change which fixes an issue): Fixes #
New feature (non-breaking change which adds functionality). Resolves #
Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #

Motivation and Context

In my experience, working with non-default CUDA streams is very dangerous in the current state of O3D.
A simple example that produces very unexpected behavior:

ScopedCUDAStream scope(ScopedCUDAStream::CreateNew);
Tensor test = Tensor ::Init<float>({0.f}, "CUDA:0");
std::cout << test.ToString() << std::endl; // Uninitialized memory access!!

While a very similar looking piece of code doesn't have this issue:

ScopedCUDAStream scope(cuda::GetDefaultStream());
Tensor test = Tensor ::Init<float>({0.f}, "CUDA:0");
std::cout << test.ToString() << std::endl; // OK!

The problem boils down to how the CUDA memory manager is implemented.
Issues are masked by the behavior of the CUDA default stream (different from non-default streams): https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html

The CUDA memory manager Memcpy operation is implemented to perform async memcpy operations regardless of the direction of copy (device->device, device->host, host->device). This was a huge red flag when I first saw it, but the library worked, so I just assumed my imagined world was a lie and all is okay somehow.
My initial thinking was, if device->host is an async memcpy, wouldn't the library break pretty much all over the place, at every point where any operations are done on tensors of the sort gpu_tensor.To("CPU:0").AnyOperationHere(). Seems logical, as in such a case host-native operations would be done on top of memory which was yet to be filled. But in practice, this was not the case, so I went on with my life. Until eventually in my multi-threaded workflows, I was annoyed that the threads were implicitly being synchronized, because they shared the same CUDA stream, and thus were waiting on each other's work every time the stream needed to be synchronized. I then tried to use the scoped stream to create a new CUDA stream on each thread, so that they would not have to interfere with each other.
To my horror, that is when the exact behavior I initially pictured when looking at the memory manager manifested itself.
All memory accesses on tensors moved to the host device were done on uninitialized memory and the figurative explosions are quite spectacular.
This issue is also present in other cases too (eg. host->device async copy immediately followed by a host free).

Besides this, I later noticed the library does device-wide synchronization in many cases, instead of per-stream synchronization. This causes performance critical applications (where proper stream management is key) to suffer greatly.

I decided to fix my problems by tweaking the memory manager and adding the necessary stream management functionality to achieve my goals. I must stress that this was done in an effort to solve my problems , and not in the proper "contribute to the library" way. There are several reasons for this, primary of course being time, but secondary also being that my knowledge of O3D is nowhere near the level I'd want (yet) to be able to contribute in the way I'd like.

I'm providing this PR as a blueprint of what I believe needs to be done, and as an outline of the issue at hand.
I'm using the backing branch of the PR as a way to test solutions, and would welcome any input from O3D devs on how best to solve the limitations.

I understand the view point that in the end you could say this is a user-facing issue, and the user should make sure any memory copies go though by taking care of object lifetimes themselves. However, O3D has many cases where it just can't work in multi-cuda-stream mode, as these async memcpy operations break many other things.

It is also my opinion that there should be no way for users to introduce heavily undefined behavior into their applications that heavily warps the intuition of host data always being synchronous and safe to work with (it is not in the outlined use cases). So by this I mean, there should be no way to change the CUDA stream at all, as the library gives absolutely no guarantees that it will be able to live with that (the vast majority of workflows do not survive this change).

Checklist:

I have run python util/check_style.py --apply to apply Open3D code style
to my code.
This PR changes Open3D behavior or adds new functionality.
- Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
  updated accordingly.
- I have added or updated C++ and / or Python unit tests OR included test
  results (e.g. screenshots or numbers) here.
I will follow up and update the code if CI fails.
For fork PRs, I have selected Allow edits from maintainers.

Description

Added a memory synchronization guard for non-default CUDA streams.
Swapped-out all CUDA synchronization with per-stream synchronization
changed cudaMemset to cudaMemsetAsync

update-docs · 2025-10-29T14:55:35Z

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

…nchronization per-stream instead of for the entire device

…tions

Copilot

Pull request overview

This PR aims to make Open3D safer and more usable with non-default CUDA streams by introducing an explicit CUDAStream abstraction, switching many CUDA operations to use the current stream, and adding guardrails for host↔device memcpy semantics to avoid uninitialized host reads / lifetime hazards.

Changes:

Introduces core::CUDAStream + stream-scoped behavior (including host↔device memcpy policies) and updates CUDAScopedStream accordingly.
Replaces multiple device-wide synchronizations and default-stream usage with per-stream synchronization and “current stream” usage across kernels, nns, hashmap, and linalg CUDA backends.
Adds a CUDA non-default stream regression test and includes various formatting-only cleanups.

Reviewed changes

Copilot reviewed 35 out of 51 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
cpp/tests/core/ParallelFor.cu	Adds a non-default CUDA stream test for `ParallelFor`.
cpp/tests/core/HashMap.cpp	Formatting-only ctor cleanup in a test helper type.
cpp/tests/core/CUDAUtils.cpp	Updates tests to validate new `CUDAStream` / `CUDAScopedStream` behavior.
cpp/pybind/core/cuda_utils.cpp	Adjusts Python bindings around CUDA synchronize behavior and adds `synchronize_stream`.
cpp/open3d/visualization/webrtc_server/PeerConnectionManager.h	Formatting-only ctor cleanup.
cpp/open3d/visualization/gui/WindowSystem.h	Formatting-only destructor cleanup.
cpp/open3d/visualization/gui/PickPointsInteractor.cpp	Formatting-only ctor cleanup.
cpp/open3d/visualization/gui/ImguiFilamentBridge.cpp	Formatting-only ctor cleanup.
cpp/open3d/visualization/gui/Gui.h	Formatting-only destructor cleanup.
cpp/open3d/utility/Optional.h	Formatting-only constructor formatting adjustments.
cpp/open3d/t/pipelines/registration/TransformationEstimation.h	Formatting-only destructor cleanup.
cpp/open3d/t/pipelines/kernel/TransformationConverter.cu	Uses current `CUDAStream` for kernel launches.
cpp/open3d/t/pipelines/kernel/RegistrationCUDA.cu	Uses current `CUDAStream` for launches + per-stream synchronization.
cpp/open3d/t/pipelines/kernel/RGBDOdometryCUDA.cu	Uses current `CUDAStream` for launches + per-stream synchronization.
cpp/open3d/t/geometry/kernel/VoxelBlockGridImpl.h	Replaces synchronize guards with per-stream sync under CUDA module builds.
cpp/open3d/t/geometry/kernel/PointCloudImpl.h	Replaces device sync w/ per-stream sync under CUDA module builds.
cpp/open3d/t/geometry/kernel/NPPImage.cpp	Wires NPP stream context to the current `CUDAStream`.
cpp/open3d/t/geometry/RGBDImage.h	Formatting-only destructor cleanup.
cpp/open3d/pipelines/registration/ColoredICP.h	Formatting-only destructor cleanup.
cpp/open3d/ml/impl/misc/Voxelize.cuh	Adds multiple stream synchronizations and changes `cudaMemset` → `cudaMemsetAsync`.
cpp/open3d/ml/impl/misc/InvertNeighborsList.cuh	Adds stream synchronization after scan-like operations.
cpp/open3d/ml/contrib/RoiPoolKernel.cu	Uses per-stream synchronization in debug builds.
cpp/open3d/io/sensor/RGBDSensor.h	Formatting-only destructor cleanup.
cpp/open3d/io/rpc/ConnectionBase.h	Formatting-only ctor/dtor cleanup.
cpp/open3d/io/PointCloudIO.h	Formatting-only ctor cleanup.
cpp/open3d/core/nns/NearestNeighborSearch.cpp	Formatting-only destructor cleanup.
cpp/open3d/core/nns/NanoFlannIndex.cpp	Formatting-only ctor/dtor cleanup.
cpp/open3d/core/nns/KnnSearchOps.cu	Routes thrust / ops to the current `CUDAStream` and removes per-iteration stream creation.
cpp/open3d/core/nns/FixedRadiusSearchOps.cu	Uses current `CUDAStream` and passes stream into sort implementation.
cpp/open3d/core/nns/FixedRadiusSearchImpl.cuh	Adds stream parameter to `SortPairs` and introduces additional stream syncs.
cpp/open3d/core/nns/FixedRadiusIndex.cpp	Formatting-only ctor/dtor cleanup.
cpp/open3d/core/linalg/SolveCUDA.cpp	Sets cuSolver handle stream to current `CUDAStream`.
cpp/open3d/core/linalg/SVDCUDA.cpp	Sets cuSolver handle stream to current `CUDAStream` (+ include).
cpp/open3d/core/linalg/MatmulCUDA.cpp	Sets cuBLAS handle stream to current `CUDAStream` (+ include).
cpp/open3d/core/linalg/LeastSquaresCUDA.cpp	Sets cuSolver + cuBLAS handle streams to current `CUDAStream`.
cpp/open3d/core/linalg/LUCUDA.cpp	Sets cuSolver handle stream to current `CUDAStream` (+ include).
cpp/open3d/core/linalg/InverseCUDA.cpp	Sets cuSolver handle stream to current `CUDAStream` (+ include).
cpp/open3d/core/linalg/AddMMCUDA.cpp	Sets cuBLAS handle stream to current `CUDAStream` (+ include).
cpp/open3d/core/kernel/ReductionCUDA.cu	Uses `cudaMemsetAsync` + per-stream kernel launch/sync.
cpp/open3d/core/kernel/NonZeroCUDA.cu	Routes thrust algorithms to current `CUDAStream` and synchronizes per-stream.
cpp/open3d/core/hashmap/CUDA/StdGPUHashBackend.h	Uses current `CUDAStream` for kernels, thrust copies, and synchronization.
cpp/open3d/core/hashmap/CUDA/SlabNodeManager.h	Uses async memset, stream-aware thrust, and async D2H copy with conditional sync.
cpp/open3d/core/hashmap/CUDA/SlabHashBackend.h	Switches to async memset + per-stream launches and synchronizations; stream-aware thrust usage.
cpp/open3d/core/hashmap/CUDA/CUDAHashBackendBufferAccessor.h	Switches to `cudaMemsetAsync` and per-stream sync.
cpp/open3d/core/hashmap/CUDA/CUDAHashBackendBuffer.cu	Routes thrust sequence to current `CUDAStream` and synchronizes.
cpp/open3d/core/ParallelFor.h	Launches kernels on current `CUDAStream`.
cpp/open3d/core/MemoryManagerCUDA.cpp	Uses current `CUDAStream` + adds per-stream host↔device memcpy safety policies.
cpp/open3d/core/Indexer.h	Formatting-only ctor cleanup.
cpp/open3d/core/CUDAUtils.h	Introduces `CUDAStream`, memcpy policy enum, updates `CUDAScopedStream`, adds stream `Synchronize`.
cpp/open3d/core/CUDAUtils.cpp	Implements `CUDAStream`, stream synchronize function, updates scoped stream handling, and enhances CUDA error logging.
cpp/benchmarks/core/HashMap.cpp	Formatting-only ctor cleanup in a benchmark helper type.

Comments suppressed due to low confidence (3)

cpp/pybind/core/cuda_utils.cpp:1

The docstring for the existing binding now says “Synchronizes a CUDA stream.” but the implementation shown still calls cuda::Synchronize(); with no stream parameter (different semantics). Also, this snippet introduces m_cuda.def(...) but m_cuda is not in scope in the provided hunk (potential compile error if it isn’t defined earlier in this function). Consider (1) restoring the original docstring/behavior for the no-arg synchronize binding, and (2) ensuring the new synchronize_stream is defined on a module object that is definitely in scope here (e.g., m.def(...) or a clearly defined m_cuda submodule).
cpp/open3d/ml/impl/misc/Voxelize.cuh:1
This introduces a mandatory cudaStreamSynchronize in the middle of a CUDA algorithm, which can significantly reduce overlap/concurrency and hurt performance on non-default streams. If the goal is to make host-visible scalar results available, prefer copying only the required scalar to host (via cudaMemcpyAsync to pinned memory) and synchronizing only when the CPU actually consumes it, or use CUDA events to establish ordering across streams instead of a full stream sync. If the downstream use is on the same stream (device-side consumption), the sync should be unnecessary because stream operations are already ordered.
cpp/tests/core/ParallelFor.cu:1
This test iterates and asserts over 10,000,000 elements on the CPU, which is likely to be slow and may cause CI timeouts/flakiness (especially in debug builds or on shared runners). Consider reducing N substantially, and/or sampling a subset of indices (e.g., first/last/strided checks) while still validating the non-default stream correctness property this test is meant to cover.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-23T21:55:23Z

+        OPEN3D_CUDA_CHECK(cudaMemcpyAsync(
+                result.data(),
+                thrust::raw_pointer_cast(slabs_per_superblock.data()),
+                num_super_blocks * sizeof(int), cudaMemcpyDeviceToHost,
+                CUDAStream::GetInstance().Get()));
+        if (!CUDAStream::GetInstance().IsDefaultStream()) {
+            cuda::Synchronize(CUDAStream::GetInstance());
+        }
        return result;


This returns result immediately after cudaMemcpyAsync without guaranteeing completion for the default stream case. Relying on pageable-host-memory behavior here is brittle and can lead to reading partially-copied data. Additionally, the copy uses sizeof(int) even though the device buffer is uint32_t (thrust::device_vector<uint32_t>), which is a type/size mismatch risk across platforms. Prefer copying into a std::vector<uint32_t> (or std::vector<int64_t> if that’s the intended API) and synchronizing the stream unconditionally (or switching to a synchronous cudaMemcpy if that is acceptable here).

Copilot · 2026-02-23T21:55:23Z

+const std::unordered_set<cudaError_t> kProcessEndingErrors = {
+        cudaErrorAssert,
+        cudaErrorLaunchTimeout,
+        cudaErrorHardwareStackError,
+        cudaErrorIllegalInstruction,
+        cudaErrorMisalignedAddress,
+        cudaErrorInvalidAddressSpace,
+        cudaErrorInvalidPc,
+        cudaErrorTensorMemoryLeak,
+        cudaErrorMpsClientTerminated,
+        cudaErrorExternalDevice,
+        cudaErrorContained,
+        cudaErrorIllegalAddress,
+        cudaErrorLaunchFailure,
+        cudaErrorECCUncorrectable,
+        cudaErrorUnknown};


Several cudaError_* enumerators here are not available in all CUDA toolkit versions/headers (Open3D often supports a range of CUDA versions). Unconditionally referencing newer error codes can break compilation for older CUDART versions. Consider guarding entries with #if CUDART_VERSION >= ... checks, or building this set from only broadly-available error codes (and treating unknown enums via runtime string matching / a fallback path).

Copilot · 2026-02-23T21:55:23Z

+    Sync = 0,
+    // Asynchronous memory copies. Unmanaged.
+    // No memory safety at all - you are responsible for your own actions.
+    // There are no guaranteed about the lifetime of memory copied between the


Correct the grammar in this comment: “There are no guaranteed” → “There are no guarantees”.

Suggested change

// There are no guaranteed about the lifetime of memory copied between the

// There are no guarantees about the lifetime of memory copied between the

Copilot · 2026-02-23T21:55:24Z

+    /// ```cpp
+    /// void pokingTheBear() {
+    ///     CUDAScopedStream scoped_stream(CUDAStream::CreateNew(), true);
+    ///     CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged);


The example references CUDAMemoryCopyPolicy::AsyncUnmanaged, but the enum introduced in this PR defines Async (not AsyncUnmanaged). Since this is header documentation for a newly introduced API, it should compile conceptually and match the actual identifier to avoid confusing users. Update the example to use the correct enum value (or rename the enum value if AsyncUnmanaged is the intended public API spelling).

Suggested change

/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged);

/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::Async);

TwentyPast4 force-pushed the cuda-streams branch from 8cac804 to f2176f1 Compare October 30, 2025 09:23

TwentyPast4 marked this pull request as draft October 30, 2025 12:48

Dan Toškan added 5 commits November 3, 2025 11:10

Added CUDA stream management, async device->host memcopy handling, sy…

2edc51f

…nchronization per-stream instead of for the entire device

Refactored usages of old GetDefaultStream() API

b1c178f

Fixed compile error in tests

27d2ccc

Implemented stream synchronization policy for memcopies in both direc…

bbfb696

…tions

Implemented stream execution policy for thrust use cases

f06bf67

TwentyPast4 force-pushed the cuda-streams branch from a2ce2c0 to f06bf67 Compare November 3, 2025 10:10

Dan Toškan added 6 commits November 4, 2025 17:08

Fixed memory-corrupting use cases of CUB functions

3daa3cc

Fixed a compile error and typo

22da83e

Missed another CUB call without an assigned stream

271eb1e

Fixed radius search not using correct CUDA stream

8b4935c

Questionable: removed starting new stream in KnnSearch due to overhead

3ca5175

Added stream handles to other cuda library usages

2d05634

ssheorey requested a review from Copilot February 23, 2026 21:48

Copilot AI reviewed Feb 23, 2026

View reviewed changes

Copilot started reviewing on behalf of ssheorey February 23, 2026 21:59 View session

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Usability of non-default CUDA streams, per-stream synchronization, memory safety guards#7348

Usability of non-default CUDA streams, per-stream synchronization, memory safety guards#7348
TwentyPast4 wants to merge 11 commits into
isl-org:mainfrom
TwentyPast4:cuda-streams

TwentyPast4 commented Oct 29, 2025 •

edited

Loading

Uh oh!

update-docs Bot commented Oct 29, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 23, 2026

Uh oh!

Copilot AI Feb 23, 2026

Uh oh!

Copilot AI Feb 23, 2026

Uh oh!

Copilot AI Feb 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	// There are no guaranteed about the lifetime of memory copied between the
	// There are no guarantees about the lifetime of memory copied between the

	/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged);
	/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::Async);

Conversation

TwentyPast4 commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Type

Motivation and Context

Checklist:

Description

Uh oh!

update-docs Bot commented Oct 29, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TwentyPast4 commented Oct 29, 2025 •

edited

Loading