Skip to content

Usability of non-default CUDA streams, per-stream synchronization, memory safety guards#7348

Draft
TwentyPast4 wants to merge 11 commits into
isl-org:mainfrom
TwentyPast4:cuda-streams
Draft

Usability of non-default CUDA streams, per-stream synchronization, memory safety guards#7348
TwentyPast4 wants to merge 11 commits into
isl-org:mainfrom
TwentyPast4:cuda-streams

Conversation

@TwentyPast4
Copy link
Copy Markdown
Contributor

@TwentyPast4 TwentyPast4 commented Oct 29, 2025

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality). Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #

Motivation and Context

In my experience, working with non-default CUDA streams is very dangerous in the current state of O3D.
A simple example that produces very unexpected behavior:

ScopedCUDAStream scope(ScopedCUDAStream::CreateNew);
Tensor test = Tensor ::Init<float>({0.f}, "CUDA:0");
std::cout << test.ToString() << std::endl; // Uninitialized memory access!!

While a very similar looking piece of code doesn't have this issue:

ScopedCUDAStream scope(cuda::GetDefaultStream());
Tensor test = Tensor ::Init<float>({0.f}, "CUDA:0");
std::cout << test.ToString() << std::endl; // OK!

The problem boils down to how the CUDA memory manager is implemented.
Issues are masked by the behavior of the CUDA default stream (different from non-default streams): https://docs.nvidia.com/cuda/cuda-driver-api/stream-sync-behavior.html

The CUDA memory manager Memcpy operation is implemented to perform async memcpy operations regardless of the direction of copy (device->device, device->host, host->device). This was a huge red flag when I first saw it, but the library worked, so I just assumed my imagined world was a lie and all is okay somehow.
My initial thinking was, if device->host is an async memcpy, wouldn't the library break pretty much all over the place, at every point where any operations are done on tensors of the sort gpu_tensor.To("CPU:0").AnyOperationHere(). Seems logical, as in such a case host-native operations would be done on top of memory which was yet to be filled. But in practice, this was not the case, so I went on with my life. Until eventually in my multi-threaded workflows, I was annoyed that the threads were implicitly being synchronized, because they shared the same CUDA stream, and thus were waiting on each other's work every time the stream needed to be synchronized. I then tried to use the scoped stream to create a new CUDA stream on each thread, so that they would not have to interfere with each other.
To my horror, that is when the exact behavior I initially pictured when looking at the memory manager manifested itself.
All memory accesses on tensors moved to the host device were done on uninitialized memory and the figurative explosions are quite spectacular.
This issue is also present in other cases too (eg. host->device async copy immediately followed by a host free).

Besides this, I later noticed the library does device-wide synchronization in many cases, instead of per-stream synchronization. This causes performance critical applications (where proper stream management is key) to suffer greatly.

I decided to fix my problems by tweaking the memory manager and adding the necessary stream management functionality to achieve my goals. I must stress that this was done in an effort to solve my problems , and not in the proper "contribute to the library" way. There are several reasons for this, primary of course being time, but secondary also being that my knowledge of O3D is nowhere near the level I'd want (yet) to be able to contribute in the way I'd like.

I'm providing this PR as a blueprint of what I believe needs to be done, and as an outline of the issue at hand.
I'm using the backing branch of the PR as a way to test solutions, and would welcome any input from O3D devs on how best to solve the limitations.

I understand the view point that in the end you could say this is a user-facing issue, and the user should make sure any memory copies go though by taking care of object lifetimes themselves. However, O3D has many cases where it just can't work in multi-cuda-stream mode, as these async memcpy operations break many other things.

It is also my opinion that there should be no way for users to introduce heavily undefined behavior into their applications that heavily warps the intuition of host data always being synchronous and safe to work with (it is not in the outlined use cases). So by this I mean, there should be no way to change the CUDA stream at all, as the library gives absolutely no guarantees that it will be able to live with that (the vast majority of workflows do not survive this change).

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

  • Added a memory synchronization guard for non-default CUDA streams.
  • Swapped-out all CUDA synchronization with per-stream synchronization
  • changed cudaMemset to cudaMemsetAsync

@update-docs
Copy link
Copy Markdown

update-docs Bot commented Oct 29, 2025

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@TwentyPast4 TwentyPast4 marked this pull request as draft October 30, 2025 12:48
@ssheorey ssheorey requested a review from Copilot February 23, 2026 21:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make Open3D safer and more usable with non-default CUDA streams by introducing an explicit CUDAStream abstraction, switching many CUDA operations to use the current stream, and adding guardrails for host↔device memcpy semantics to avoid uninitialized host reads / lifetime hazards.

Changes:

  • Introduces core::CUDAStream + stream-scoped behavior (including host↔device memcpy policies) and updates CUDAScopedStream accordingly.
  • Replaces multiple device-wide synchronizations and default-stream usage with per-stream synchronization and “current stream” usage across kernels, nns, hashmap, and linalg CUDA backends.
  • Adds a CUDA non-default stream regression test and includes various formatting-only cleanups.

Reviewed changes

Copilot reviewed 35 out of 51 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cpp/tests/core/ParallelFor.cu Adds a non-default CUDA stream test for ParallelFor.
cpp/tests/core/HashMap.cpp Formatting-only ctor cleanup in a test helper type.
cpp/tests/core/CUDAUtils.cpp Updates tests to validate new CUDAStream / CUDAScopedStream behavior.
cpp/pybind/core/cuda_utils.cpp Adjusts Python bindings around CUDA synchronize behavior and adds synchronize_stream.
cpp/open3d/visualization/webrtc_server/PeerConnectionManager.h Formatting-only ctor cleanup.
cpp/open3d/visualization/gui/WindowSystem.h Formatting-only destructor cleanup.
cpp/open3d/visualization/gui/PickPointsInteractor.cpp Formatting-only ctor cleanup.
cpp/open3d/visualization/gui/ImguiFilamentBridge.cpp Formatting-only ctor cleanup.
cpp/open3d/visualization/gui/Gui.h Formatting-only destructor cleanup.
cpp/open3d/utility/Optional.h Formatting-only constructor formatting adjustments.
cpp/open3d/t/pipelines/registration/TransformationEstimation.h Formatting-only destructor cleanup.
cpp/open3d/t/pipelines/kernel/TransformationConverter.cu Uses current CUDAStream for kernel launches.
cpp/open3d/t/pipelines/kernel/RegistrationCUDA.cu Uses current CUDAStream for launches + per-stream synchronization.
cpp/open3d/t/pipelines/kernel/RGBDOdometryCUDA.cu Uses current CUDAStream for launches + per-stream synchronization.
cpp/open3d/t/geometry/kernel/VoxelBlockGridImpl.h Replaces synchronize guards with per-stream sync under CUDA module builds.
cpp/open3d/t/geometry/kernel/PointCloudImpl.h Replaces device sync w/ per-stream sync under CUDA module builds.
cpp/open3d/t/geometry/kernel/NPPImage.cpp Wires NPP stream context to the current CUDAStream.
cpp/open3d/t/geometry/RGBDImage.h Formatting-only destructor cleanup.
cpp/open3d/pipelines/registration/ColoredICP.h Formatting-only destructor cleanup.
cpp/open3d/ml/impl/misc/Voxelize.cuh Adds multiple stream synchronizations and changes cudaMemsetcudaMemsetAsync.
cpp/open3d/ml/impl/misc/InvertNeighborsList.cuh Adds stream synchronization after scan-like operations.
cpp/open3d/ml/contrib/RoiPoolKernel.cu Uses per-stream synchronization in debug builds.
cpp/open3d/io/sensor/RGBDSensor.h Formatting-only destructor cleanup.
cpp/open3d/io/rpc/ConnectionBase.h Formatting-only ctor/dtor cleanup.
cpp/open3d/io/PointCloudIO.h Formatting-only ctor cleanup.
cpp/open3d/core/nns/NearestNeighborSearch.cpp Formatting-only destructor cleanup.
cpp/open3d/core/nns/NanoFlannIndex.cpp Formatting-only ctor/dtor cleanup.
cpp/open3d/core/nns/KnnSearchOps.cu Routes thrust / ops to the current CUDAStream and removes per-iteration stream creation.
cpp/open3d/core/nns/FixedRadiusSearchOps.cu Uses current CUDAStream and passes stream into sort implementation.
cpp/open3d/core/nns/FixedRadiusSearchImpl.cuh Adds stream parameter to SortPairs and introduces additional stream syncs.
cpp/open3d/core/nns/FixedRadiusIndex.cpp Formatting-only ctor/dtor cleanup.
cpp/open3d/core/linalg/SolveCUDA.cpp Sets cuSolver handle stream to current CUDAStream.
cpp/open3d/core/linalg/SVDCUDA.cpp Sets cuSolver handle stream to current CUDAStream (+ include).
cpp/open3d/core/linalg/MatmulCUDA.cpp Sets cuBLAS handle stream to current CUDAStream (+ include).
cpp/open3d/core/linalg/LeastSquaresCUDA.cpp Sets cuSolver + cuBLAS handle streams to current CUDAStream.
cpp/open3d/core/linalg/LUCUDA.cpp Sets cuSolver handle stream to current CUDAStream (+ include).
cpp/open3d/core/linalg/InverseCUDA.cpp Sets cuSolver handle stream to current CUDAStream (+ include).
cpp/open3d/core/linalg/AddMMCUDA.cpp Sets cuBLAS handle stream to current CUDAStream (+ include).
cpp/open3d/core/kernel/ReductionCUDA.cu Uses cudaMemsetAsync + per-stream kernel launch/sync.
cpp/open3d/core/kernel/NonZeroCUDA.cu Routes thrust algorithms to current CUDAStream and synchronizes per-stream.
cpp/open3d/core/hashmap/CUDA/StdGPUHashBackend.h Uses current CUDAStream for kernels, thrust copies, and synchronization.
cpp/open3d/core/hashmap/CUDA/SlabNodeManager.h Uses async memset, stream-aware thrust, and async D2H copy with conditional sync.
cpp/open3d/core/hashmap/CUDA/SlabHashBackend.h Switches to async memset + per-stream launches and synchronizations; stream-aware thrust usage.
cpp/open3d/core/hashmap/CUDA/CUDAHashBackendBufferAccessor.h Switches to cudaMemsetAsync and per-stream sync.
cpp/open3d/core/hashmap/CUDA/CUDAHashBackendBuffer.cu Routes thrust sequence to current CUDAStream and synchronizes.
cpp/open3d/core/ParallelFor.h Launches kernels on current CUDAStream.
cpp/open3d/core/MemoryManagerCUDA.cpp Uses current CUDAStream + adds per-stream host↔device memcpy safety policies.
cpp/open3d/core/Indexer.h Formatting-only ctor cleanup.
cpp/open3d/core/CUDAUtils.h Introduces CUDAStream, memcpy policy enum, updates CUDAScopedStream, adds stream Synchronize.
cpp/open3d/core/CUDAUtils.cpp Implements CUDAStream, stream synchronize function, updates scoped stream handling, and enhances CUDA error logging.
cpp/benchmarks/core/HashMap.cpp Formatting-only ctor cleanup in a benchmark helper type.
Comments suppressed due to low confidence (3)

cpp/pybind/core/cuda_utils.cpp:1

  • The docstring for the existing binding now says “Synchronizes a CUDA stream.” but the implementation shown still calls cuda::Synchronize(); with no stream parameter (different semantics). Also, this snippet introduces m_cuda.def(...) but m_cuda is not in scope in the provided hunk (potential compile error if it isn’t defined earlier in this function). Consider (1) restoring the original docstring/behavior for the no-arg synchronize binding, and (2) ensuring the new synchronize_stream is defined on a module object that is definitely in scope here (e.g., m.def(...) or a clearly defined m_cuda submodule).
    cpp/open3d/ml/impl/misc/Voxelize.cuh:1
  • This introduces a mandatory cudaStreamSynchronize in the middle of a CUDA algorithm, which can significantly reduce overlap/concurrency and hurt performance on non-default streams. If the goal is to make host-visible scalar results available, prefer copying only the required scalar to host (via cudaMemcpyAsync to pinned memory) and synchronizing only when the CPU actually consumes it, or use CUDA events to establish ordering across streams instead of a full stream sync. If the downstream use is on the same stream (device-side consumption), the sync should be unnecessary because stream operations are already ordered.
    cpp/tests/core/ParallelFor.cu:1
  • This test iterates and asserts over 10,000,000 elements on the CPU, which is likely to be slow and may cause CI timeouts/flakiness (especially in debug builds or on shared runners). Consider reducing N substantially, and/or sampling a subset of indices (e.g., first/last/strided checks) while still validating the non-default stream correctness property this test is meant to cover.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +271 to 279
OPEN3D_CUDA_CHECK(cudaMemcpyAsync(
result.data(),
thrust::raw_pointer_cast(slabs_per_superblock.data()),
num_super_blocks * sizeof(int), cudaMemcpyDeviceToHost,
CUDAStream::GetInstance().Get()));
if (!CUDAStream::GetInstance().IsDefaultStream()) {
cuda::Synchronize(CUDAStream::GetInstance());
}
return result;
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns result immediately after cudaMemcpyAsync without guaranteeing completion for the default stream case. Relying on pageable-host-memory behavior here is brittle and can lead to reading partially-copied data. Additionally, the copy uses sizeof(int) even though the device buffer is uint32_t (thrust::device_vector<uint32_t>), which is a type/size mismatch risk across platforms. Prefer copying into a std::vector<uint32_t> (or std::vector<int64_t> if that’s the intended API) and synchronizing the stream unconditionally (or switching to a synchronous cudaMemcpy if that is acceptable here).

Copilot uses AI. Check for mistakes.
Comment on lines +323 to +338
const std::unordered_set<cudaError_t> kProcessEndingErrors = {
cudaErrorAssert,
cudaErrorLaunchTimeout,
cudaErrorHardwareStackError,
cudaErrorIllegalInstruction,
cudaErrorMisalignedAddress,
cudaErrorInvalidAddressSpace,
cudaErrorInvalidPc,
cudaErrorTensorMemoryLeak,
cudaErrorMpsClientTerminated,
cudaErrorExternalDevice,
cudaErrorContained,
cudaErrorIllegalAddress,
cudaErrorLaunchFailure,
cudaErrorECCUncorrectable,
cudaErrorUnknown};
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several cudaError_* enumerators here are not available in all CUDA toolkit versions/headers (Open3D often supports a range of CUDA versions). Unconditionally referencing newer error codes can break compilation for older CUDART versions. Consider guarding entries with #if CUDART_VERSION >= ... checks, or building this set from only broadly-available error codes (and treating unknown enums via runtime string matching / a fallback path).

Copilot uses AI. Check for mistakes.
Sync = 0,
// Asynchronous memory copies. Unmanaged.
// No memory safety at all - you are responsible for your own actions.
// There are no guaranteed about the lifetime of memory copied between the
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct the grammar in this comment: “There are no guaranteed” → “There are no guarantees”.

Suggested change
// There are no guaranteed about the lifetime of memory copied between the
// There are no guarantees about the lifetime of memory copied between the

Copilot uses AI. Check for mistakes.
/// ```cpp
/// void pokingTheBear() {
/// CUDAScopedStream scoped_stream(CUDAStream::CreateNew(), true);
/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged);
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example references CUDAMemoryCopyPolicy::AsyncUnmanaged, but the enum introduced in this PR defines Async (not AsyncUnmanaged). Since this is header documentation for a newly introduced API, it should compile conceptually and match the actual identifier to avoid confusing users. Update the example to use the correct enum value (or rename the enum value if AsyncUnmanaged is the intended public API spelling).

Suggested change
/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::AsyncUnmanaged);
/// CUDAStream::GetInstance().SetDeviceToHostMemcpyPolicy(CUDAMemoryCopyPolicy::Async);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants