Fix AllreduceV with CUDA stream. by trivialfis · Pull Request #12171 · dmlc/xgboost

trivialfis · 2026-04-17T14:27:21Z

We use async NCCL to implement timeouts. However, in async mode NCCL runs its work on an internal thread pool, so it cannot work with the per-thread CUDA stream: the handle resolves to the per-thread stream of a thread in that pool instead of the caller's stream.

The existing allreduce implementation works because we use a custom CUDA stream in the NCCL coll wrapper.

This PR moves that stream into NCCLComm to expose it to the AllreduceV implementation. In addition, to synchronize correctly with the caller's stream, we pass the Context object through the call stack.
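The per-thread stream problem described above can be pictured with a plain C++ analogy. This is only a sketch with no CUDA or NCCL calls: `FakeStream`, `PerThreadStream`, and both helper functions are hypothetical names invented for illustration, and `std::async` stands in for the async backend's worker threads. It shows that a handle resolved per thread inside a worker is not the caller's handle, while an explicitly passed handle is.

```cpp
#include <future>
#include <thread>

// Hypothetical stand-in for a stream handle; real code would hold a cudaStream_t.
struct FakeStream {
  std::thread::id owner;
};

// Analogue of a per-thread stream: every thread resolves its own instance.
inline FakeStream* PerThreadStream() {
  thread_local FakeStream stream{std::this_thread::get_id()};
  return &stream;
}

// The work runs on a worker thread, so resolving the per-thread handle there
// yields the worker's stream, not the caller's. Returns false.
inline bool ImplicitStreamMatchesCaller() {
  FakeStream* caller = PerThreadStream();
  auto fut = std::async(std::launch::async,
                        [caller] { return PerThreadStream() == caller; });
  return fut.get();
}

// Passing an explicitly owned handle: the worker sees exactly the stream the
// caller created. Returns true.
inline bool ExplicitStreamMatchesCaller() {
  FakeStream owned{std::this_thread::get_id()};
  FakeStream* handle = &owned;
  auto caller_id = std::this_thread::get_id();
  auto fut = std::async(std::launch::async,
                        [handle, caller_id] { return handle->owner == caller_id; });
  return fut.get();
}
```

In this model, owning the stream in one object and handing it to the worker explicitly plays the role that the communicator-owned stream plays in the PR.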

ref #12122


Copilot

Pull request overview

This PR addresses incorrect CUDA stream usage when NCCL is configured in async/non-blocking mode (thread pool), by moving ownership of a dedicated CUDA stream into NCCLComm and updating AllreduceV to correctly synchronize between the caller’s per-thread stream and the NCCL stream.

Changes:

- Move the NCCL communication stream from NCCLColl into NCCLComm and expose it via NCCLComm::Stream().
- Update NCCL collective launches to use the communicator-owned stream (removing the prior coll-owned stream plumbing).
- Fix AllreduceV by introducing event-based stream bracketing between the user stream and the NCCL stream.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| src/collective/comm.cuh | Make `NCCLComm` own a `curt::Stream` and expose a `StreamRef` view. |
| src/collective/comm.cu | Update `NCCLComm` construction/destruction to use the owned stream and sync it on teardown. |
| src/collective/coll.cuh | Remove the coll-owned CUDA stream member. |
| src/collective/coll.cu | Route async NCCL launches through `NCCLComm::Stream()` and update call sites accordingly. |
| src/collective/allreduce_v.cuh | Add event bracketing to safely interop between user stream and NCCL stream; update AllreduceV signature. |
| src/collective/allreduce.h | Pass an explicit user stream into `gpu_impl::AllreduceV` (ctx-stream when available). |

Comments suppressed due to low confidence (1)

src/collective/comm.cu:57

stream_ is constructed before curt::SetDevice(ctx->Ordinal()) is called. Since curt::Stream creates a cudaStream_t on the current device, this can create the NCCL communicator stream on the wrong device when the caller thread isn’t already on ctx->Ordinal(), leading to invalid-handle errors or silent mis-synchronization. Consider setting the CUDA device before constructing stream_ (e.g., delay stream creation until after SetDevice, or add a device-guard member that runs before stream_ is constructed).

```cpp
NCCLComm::NCCLComm(Context const* ctx, Comm const& root, std::shared_ptr<Coll> pimpl,
                   StringView nccl_path)
    : Comm{root.TrackerInfo().host, root.TrackerInfo().port, root.Timeout(), root.Retry(),
           root.TaskID()},
      stream_{} {
  this->world_ = root.World();
  this->rank_ = root.Rank();
  this->domain_ = root.Domain();
  if (!root.IsDistributed()) {
    return;
  }

  curt::SetDevice(ctx->Ordinal());
```
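
The ordering concern in this comment follows from a C++ rule: non-static members are constructed before the constructor body runs, in declaration order. The sketch below models it without CUDA. `CurrentDevice`, `FakeStream`, `DeviceSetter`, `CommLate`, and `CommGuarded` are hypothetical names for illustration only, and the guard-member layout is just one of the options the comment mentions, not necessarily what the PR ended up doing.

```cpp
// Hypothetical stand-in for the runtime's "current device" of a thread.
inline int& CurrentDevice() {
  static thread_local int device = 0;
  return device;
}

// Stand-in for a stream that is bound to whatever device is current
// at the moment it is constructed.
struct FakeStream {
  int device;
  FakeStream() : device{CurrentDevice()} {}
};

// Mirrors the flagged pattern: stream_ is built during member initialization,
// before the constructor body switches the device.
struct CommLate {
  FakeStream stream_;
  explicit CommLate(int ordinal) : stream_{} { CurrentDevice() = ordinal; }
};

// Guard member that switches the device as a side effect of construction.
struct DeviceSetter {
  explicit DeviceSetter(int ordinal) { CurrentDevice() = ordinal; }
};

// guard_ is declared before stream_, so it is constructed first and the
// stream is created with the requested device already current.
struct CommGuarded {
  DeviceSetter guard_;
  FakeStream stream_;
  explicit CommGuarded(int ordinal) : guard_{ordinal}, stream_{} {}
};
```

Starting from device 0, constructing `CommLate{1}` leaves `stream_.device` at 0, while `CommGuarded{1}` yields 1. A real guard would likely also restore the previous device afterwards, which this model omits.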


Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.


Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.


trivialfis added 2 commits April 17, 2026 22:24

lint.

af87a85

trivialfis requested a review from Copilot April 17, 2026 14:28

Copilot started reviewing on behalf of trivialfis April 17, 2026 14:28

Copilot AI reviewed Apr 17, 2026


Comment thread src/collective/coll.cu

Comment thread src/collective/comm.cuh Outdated

Comment thread src/collective/comm.cu Outdated

trivialfis added 4 commits April 18, 2026 01:04

thread through the context.

ec6e67e

more.

13ad8a7

more.

488241e

lint.

5a7f56c

trivialfis requested a review from Copilot April 17, 2026 17:20

Copilot started reviewing on behalf of trivialfis April 17, 2026 17:20

Copilot AI reviewed Apr 17, 2026


Comment thread src/common/cuda_stream.h

Comment thread tests/cpp/plugin/federated/test_federated_coll.cu Outdated

Comment thread src/collective/comm.cuh Outdated

Comment thread src/collective/comm.cu

fixes.

c1def2a

trivialfis requested a review from Copilot April 17, 2026 22:56

Copilot started reviewing on behalf of trivialfis April 17, 2026 22:56

Copilot AI reviewed Apr 17, 2026


Comment thread src/common/cuda_stream.h

Comment thread src/collective/comm.cu

Comment thread src/collective/comm.cu

trivialfis added 2 commits April 18, 2026 18:40

Restore device.

8b3e109

cpu build.

da4bb6a

trivialfis requested a review from RAMitchell April 18, 2026 10:51

RAMitchell approved these changes Apr 18, 2026


trivialfis merged commit abb1270 into dmlc:master Apr 18, 2026
78 checks passed

trivialfis deleted the fix-nccl-allreduce-v branch April 18, 2026 16:56

trivialfis merged 9 commits into dmlc:master from trivialfis:fix-nccl-allreduce-v