Skip to content

Fix AllreduceV with CUDA stream.#12171

Merged
trivialfis merged 9 commits into
dmlc:masterfrom
trivialfis:fix-nccl-allreduce-v
Apr 18, 2026
Merged

Fix AllreduceV with CUDA stream.#12171
trivialfis merged 9 commits into
dmlc:masterfrom
trivialfis:fix-nccl-allreduce-v

Conversation

@trivialfis
Copy link
Copy Markdown
Member

@trivialfis trivialfis commented Apr 17, 2026

We use async NCCL to implement timeout. However, when NCCL is in async mode, it uses a thread pool, and cannot work with per-thread CUDA stream, which resolves to the wrong per-thread stream in its internal pool.

The existing allreduce implementation works because we use a custom CUDA stream in the NCCL coll wrapper.

This PR moves that stream into NCCLComm, to expose it to the allreduce V implementation. In addition, to correctly synchronize with call stream, we pass the Context object through the stack.

ref #12122

We use async NCCL to implement timeout. However, when NCCL is in async mode, it uses a
thread pool, and cannot work with per-thread CUDA stream, which resolves to the wrong
per-thread stream in its pool.

The existing allreduce implementation works because we use a custom CUDA stream in the
NCCL coll wrapper.

This PR moves that stream into NCCLComm, to expose it to the allreduce V implementation.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses incorrect CUDA stream usage when NCCL is configured in async/non-blocking mode (thread pool), by moving ownership of a dedicated CUDA stream into NCCLComm and updating AllreduceV to correctly synchronize between the caller’s per-thread stream and the NCCL stream.

Changes:

  • Move the NCCL communication stream from NCCLColl into NCCLComm and expose it via NCCLComm::Stream().
  • Update NCCL collective launches to use the communicator-owned stream (removing the prior coll-owned stream plumbing).
  • Fix AllreduceV by introducing event-based stream bracketing between the user stream and the NCCL stream.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/collective/comm.cuh Make NCCLComm own a curt::Stream and expose a StreamRef view.
src/collective/comm.cu Update NCCLComm construction/destruction to use the owned stream and sync it on teardown.
src/collective/coll.cuh Remove the coll-owned CUDA stream member.
src/collective/coll.cu Route async NCCL launches through NCCLComm::Stream() and update call sites accordingly.
src/collective/allreduce_v.cuh Add event bracketing to safely interop between user stream and NCCL stream; update AllreduceV signature.
src/collective/allreduce.h Pass an explicit user stream into gpu_impl::AllreduceV (ctx-stream when available).
Comments suppressed due to low confidence (1)

src/collective/comm.cu:57

  • stream_ is constructed before curt::SetDevice(ctx->Ordinal()) is called. Since curt::Stream creates a cudaStream_t on the current device, this can create the NCCL communicator stream on the wrong device when the caller thread isn’t already on ctx->Ordinal(), leading to invalid-handle errors or silent mis-synchronization. Consider setting the CUDA device before constructing stream_ (e.g., delay stream creation until after SetDevice, or add a device-guard member that runs before stream_ is constructed).
NCCLComm::NCCLComm(Context const* ctx, Comm const& root, std::shared_ptr<Coll> pimpl,
                   StringView nccl_path)
    : Comm{root.TrackerInfo().host, root.TrackerInfo().port, root.Timeout(), root.Retry(),
           root.TaskID()},
      stream_{} {
  this->world_ = root.World();
  this->rank_ = root.Rank();
  this->domain_ = root.Domain();
  if (!root.IsDistributed()) {
    return;
  }

  curt::SetDevice(ctx->Ordinal());

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/collective/coll.cu
Comment thread src/collective/comm.cuh Outdated
Comment thread src/collective/comm.cu Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/common/cuda_stream.h
Comment thread tests/cpp/plugin/federated/test_federated_coll.cu Outdated
Comment thread src/collective/comm.cuh Outdated
Comment thread src/collective/comm.cu
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/common/cuda_stream.h
Comment thread src/collective/comm.cu
Comment thread src/collective/comm.cu
@trivialfis trivialfis requested a review from RAMitchell April 18, 2026 10:51
@trivialfis trivialfis merged commit abb1270 into dmlc:master Apr 18, 2026
78 checks passed
@trivialfis trivialfis deleted the fix-nccl-allreduce-v branch April 18, 2026 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants