Skip to content

Replace CUDA.synchronize() with CuEvent-based synchronization#700

Open
victorcamaraa wants to merge 1 commit intoJuliaParallel:masterfrom
victorcamaraa:CuEvent-sync
Open

Replace CUDA.synchronize() with CuEvent-based synchronization#700
victorcamaraa wants to merge 1 commit intoJuliaParallel:masterfrom
victorcamaraa:CuEvent-sync

Conversation

@victorcamaraa
Copy link
Copy Markdown

Replaces all CUDA.synchronize() device-wide barriers in CUDAExt.jl with CuEvent-based synchronization.

The old approach issued a full cuDeviceSynchronize on every data movement and gpu_synchronize() call, stalling the CPU until the entire device was idle. The new approach records a CuEvent on the relevant stream and issues a GPU-side wait, so only the streams that actually share data are synchronized and the CPU is never blocked unnecessarily.

The one exception is DtoH transfers, where the CPU genuinely needs to wait for data to arrive in host memory, those use CUDA.synchronize(ev), which is a CPU wait scoped to a single event rather than the full device.

Changes

  • _sync_with_context: captures caller's stream before context switch, replaces CUDA.synchronize() with CuEvent record + GPU-side CUDA.wait
  • gpu_synchronize: captures user's ambient stream before context switch, same event pattern
  • HtoD move: drops CUDA.synchronize() entirely — stream ordering is sufficient
  • DtoH move: replaces full barrier with CUDA.synchronize(ev) scoped to the active stream
  • DtoD same-device move: drops barrier — task graph ordering guarantees upstream completion
  • DtoD cross-device move: replaces source-side barrier with a CuEvent that the destination stream GPU-waits on
  • gpu_synchronize(::Val{:CUDA}): refactored to call gpu_synchronize(proc) per device for consistency

Notes:

  • These implementations were tested on a single GTX 1060 6GB; therefore, I was not able to test DtoD
  • Debugging and partial implementation were assisted by Claude Sonnet 4.7 and Gemini

Copy link
Copy Markdown
Member

@jpsamaroo jpsamaroo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I have some tweaks for you to make, but this otherwise is looking good!

Comment thread ext/CUDAExt.jl
Comment thread ext/CUDAExt.jl
CUDA.record(ev, stream())
CUDA.synchronize(ev) # CPU waits ONLY for this stream to finish

return adapt(Array, x)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary - CUDA.synchronize() is a stream-local sync (it's the same as CUDA.synchronize(stream()).

Comment thread ext/CUDAExt.jl
function Dagger.move(from_proc::CPUProc, to_proc::CuArrayDeviceProc, x)
with_context(to_proc) do
arr = adapt(CuArray, x)
CUDA.synchronize()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are necessary to ensure that x isn't modified before it's read into arr, I believe.

Comment thread ext/CUDAExt.jl

_x = Array{T,N}(undef, size(x))
copyto!(_x, x)
CUDA.synchronize()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this function, this shouldn't be necessary.

Comment thread ext/CUDAExt.jl
Comment thread ext/CUDAExt.jl Outdated
Array(unwrap(x))
end
end
end
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
end
end

Comment thread ext/CUDAExt.jl Outdated
Comment on lines +269 to +271
ev = CUDA.CuEvent()
CUDA.record(ev, stream())
CUDA.synchronize(ev)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could actually just be done as CUDA.synchronize(), we don't need an event here.

Comment thread ext/CUDAExt.jl Outdated
if from_proc == to_proc
with_context(CUDA.synchronize, from_proc)
return x

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment thread ext/CUDAExt.jl Outdated
Comment thread ext/CUDAExt.jl Outdated
host_copy = with_context(from_proc) do
ev = CUDA.CuEvent()
CUDA.record(ev, stream())
CUDA.synchronize(ev)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one also is probably just CUDA.synchronize()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants