Replace CUDA.synchronize() with CuEvent-based synchronization #700
Open
victorcamaraa wants to merge 1 commit into JuliaParallel:master from
Conversation
jpsamaroo (Member) approved these changes on Apr 21, 2026 and left a comment:
Great work! I have some tweaks for you to make, but this otherwise is looking good!
```julia
CUDA.record(ev, stream())
CUDA.synchronize(ev) # CPU waits ONLY for this stream to finish

return adapt(Array, x)
```
Member
I don't think this is necessary: CUDA.synchronize() is a stream-local sync (it's the same as CUDA.synchronize(stream())).
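For reference, a minimal sketch of the simplification being suggested (hedged: this assumes CUDA.jl's task-local stream semantics and requires a CUDA-capable device; `to_host` is a hypothetical helper name, not part of the PR):

```julia
using CUDA, Adapt

# Hedged sketch: a device-to-host move relying on stream-local synchronization.
# CUDA.synchronize() with no arguments waits only on the current task's stream
# (equivalent to CUDA.synchronize(stream())), not on the whole device, so no
# explicit CuEvent is needed for this case.
function to_host(x::CuArray)
    CUDA.synchronize()      # wait for pending work on this task's stream
    return adapt(Array, x)  # copy the finished result into host memory
end
```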
```julia
function Dagger.move(from_proc::CPUProc, to_proc::CuArrayDeviceProc, x)
    with_context(to_proc) do
        arr = adapt(CuArray, x)
        CUDA.synchronize()
```
Member
These are necessary to ensure that x isn't modified before it's read into arr, I believe.
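A sketch of the hazard this guards against (hedged: `upload` is a hypothetical name, whether the copy is actually asynchronous depends on CUDA.jl's handling of the host array, and running it requires a GPU):

```julia
using CUDA, Adapt

# Hedged sketch: why a sync after a host-to-device upload can matter.
# adapt(CuArray, x) may enqueue an asynchronous copy that reads the host
# array `x`; synchronizing before returning ensures the copy has fully
# consumed `x`, so the caller is then free to mutate or free it.
function upload(x::Array)
    arr = adapt(CuArray, x)  # potentially asynchronous host-to-device copy
    CUDA.synchronize()       # make sure `x` has been completely read
    return arr
end
```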
```julia
_x = Array{T,N}(undef, size(x))
copyto!(_x, x)
CUDA.synchronize()
```
Member
Same with this function; this shouldn't be necessary.
```julia
            Array(unwrap(x))
        end
    end
end
```
Comment on lines +269 to +271
```julia
ev = CUDA.CuEvent()
CUDA.record(ev, stream())
CUDA.synchronize(ev)
```
Member
This could actually just be done as CUDA.synchronize(); we don't need an event here.
```julia
if from_proc == to_proc
    with_context(CUDA.synchronize, from_proc)
    return x

host_copy = with_context(from_proc) do
    ev = CUDA.CuEvent()
    CUDA.record(ev, stream())
    CUDA.synchronize(ev)
```
Member
This one is also probably just CUDA.synchronize().
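For contrast, the event pattern the PR does keep for cross-stream ordering can be sketched as follows (hedged: this is an illustrative idiom, not the PR's code; it assumes CUDA.jl's CuEvent/record/wait API, a CUDA-capable device, and hypothetical stream names):

```julia
using CUDA

# Hedged sketch: event-scoped ordering between two streams without a CPU stall.
# Work already enqueued on `producer` is guaranteed to complete before anything
# enqueued on `consumer` after the wait, yet the CPU is never blocked.
function order_streams!(producer::CuStream, consumer::CuStream)
    ev = CuEvent()
    CUDA.record(ev, producer)  # mark completion of prior work on `producer`
    CUDA.wait(ev, consumer)    # `consumer` GPU-waits on the event
    return nothing
end
```

This is the GPU-side alternative to CUDA.synchronize(ev): the latter makes the CPU wait for the event, while CUDA.wait makes another stream wait for it.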
victorcamaraa force-pushed from f45a3d2 to 7b6da27
Replaces all CUDA.synchronize() device-wide barriers in CUDAExt.jl with CuEvent-based synchronization.

The old approach issued a full cuDeviceSynchronize on every data movement and gpu_synchronize() call, stalling the CPU until the entire device was idle. The new approach records a CuEvent on the relevant stream and issues a GPU-side wait, so only the streams that actually share data are synchronized and the CPU is never blocked unnecessarily. The one exception is DtoH transfers, where the CPU genuinely needs to wait for data to arrive in host memory; those use CUDA.synchronize(ev), which is a CPU wait scoped to a single event rather than the full device.

Changes

- _sync_with_context: captures the caller's stream before the context switch; replaces CUDA.synchronize() with a CuEvent record + GPU-side CUDA.wait
- gpu_synchronize: captures the user's ambient stream before the context switch; same event pattern
- move: drops CUDA.synchronize() entirely, since stream ordering is sufficient
- move: replaces the full barrier with CUDA.synchronize(ev) scoped to the active stream
- move: drops the barrier, since task-graph ordering guarantees upstream completion
- move: replaces the source-side barrier with a CuEvent that the destination stream GPU-waits on
- gpu_synchronize(::Val{:CUDA}): refactored to call gpu_synchronize(proc) per device for consistency

Notes: