Enable copies between different devices #3135
Closed · 2 commits · +12 −2
The CUDA docs are a little unclear about how, or if, this is different from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer() doesn't; other than that they are the same. Or at least the manual doesn't call out any other difference.
The docs in gpu_data_abstract.h for this function should be updated in any case if we do this, since this change would make them wrong, i.e. with regard to the blocking behavior.
I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit. In API synchronization behavior, "synchronous" memory copies include: "For transfers from device memory to device memory, no host-side synchronization is performed."
What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but will still synchronize the device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating they exhibit "synchronous" behavior for most use cases.
I think the asynchronous note in the description of cudaMemcpyPeer is reminding you that the function falls into the same category of behavior as cudaMemcpyDeviceToDevice, and that achieving fully asynchronous behavior requires the async variant of the function.
The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
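For reference, the fully asynchronous variant mentioned above is cudaMemcpyPeerAsync. A minimal sketch of how it might be used (assuming two visible devices and valid allocations on each; error checking omitted, and the function name peer_copy_async is just illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Enqueue a cross-device copy on a stream so neither the host thread
// nor unrelated device work is synchronized by the call itself.
void peer_copy_async(void* dst, int dst_dev, const void* src, int src_dev, std::size_t bytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy proceeds in the background on the stream.
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, bytes, stream);

    // Overlap other work here; synchronize only when the data is needed.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```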
Hey @davisking, let me know if you are willing to merge this.
I put together an example program (with error checking removed for brevity) that demonstrates the API synchronization behavior described in the docs I linked in my previous comment. The program measures the function call time as well as the time to explicitly synchronize the host for the various methods.
I ran this on a machine with two RTX 4500s connected by an NVLink bridge. The output is below.
As you can see, the function call time and the time to sync are nearly the same for both host->device and device->host transfers. This is consistent with the expected host-blocking behavior.
However, cudaMemcpy for device->device and cudaMemcpyPeer return immediately, and the time to sync is much longer. I believe this supports the conclusion that cudaMemcpyDeviceToDevice does not block the host and that introducing cudaMemcpyPeer would not affect the current dlib behavior. This does mean that the docs for dlib::memcpy are already inaccurate with regard to host blocking.
I also want to note that I disabled peer access for the above test and cudaMemcpyPeer behaves the same, except the transfer time is much longer because it is not using the NVLink connection.
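The example program itself didn't survive the page load, but a minimal sketch of the kind of measurement described (not the original program; assumes at least two devices, error checking omitted) could look like this:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Milliseconds elapsed since t0, measured on the host.
static double ms_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const size_t bytes = size_t(1) << 30;  // 1 GiB
    void *h, *d0, *d1;
    cudaMallocHost(&h, bytes);
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);
    cudaSetDevice(0);

    // Host -> device: expected to block the host until the copy completes,
    // so the call time should dominate and the sync should be near zero.
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(d0, h, bytes, cudaMemcpyHostToDevice);
    double call = ms_since(t0);
    t0 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();
    std::printf("H->D  call %.2f ms, sync %.2f ms\n", call, ms_since(t0));

    // Device -> device across GPUs: expected to return immediately,
    // with the real transfer time showing up in the explicit sync.
    t0 = std::chrono::steady_clock::now();
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    call = ms_since(t0);
    t0 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();
    std::printf("P2P   call %.2f ms, sync %.2f ms\n", call, ms_since(t0));
    return 0;
}
```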
Yeah let me look this over tonight or tomorrow.
Yeah, the dlib doc is a little confusing. It does say this at the top of the gpu_data_abstract.h file (but I only looked at the comment on the memcpy() function and forgot about this detail).
Although I still don't understand the purpose of this PR. I've been reading the CUDA docs for a while and, as near as I can tell, cudaMemcpy(..., cudaMemcpyDeviceToDevice) should be the same as cudaMemcpyPeer() here. What am I missing?
I'm assuming there was something you were trying to do that doesn't work with the current dlib code but this makes it work? What is that thing?
I created this PR because my code uses dlib::memcpy to copy a tensor on device 0 to device 1. This resulted in an illegal memory access error (even with peer access enabled). My CUDA version is 12.6.
After researching the issue, I had found that cudaMemcpy(..., cudaMemcpyDeviceToDevice) only supports copies within a device and that you must use cudaMemcpyPeer if you want to perform copies to another device. However, it turns out that you shouldn't have to use cudaMemcpyPeer with Unified Addressing enabled (which I have). I now believe this was due to a driver bug, because I tried to reproduce the problem yesterday with an updated driver and there is no error.
The 12.6 documentation didn't mention the behavior of cudaMemcpyDeviceToDevice with Unified Addressing (it only mentions cudaMemcpyDefault). The 13.2 Programming Guide calls it out specifically in Explicit Memory Management.
So, the difference between the two functions is that cudaMemcpyPeer is required for systems (or applications) without Unified Addressing enabled. In my specific case, it was likely just a driver bug.
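For anyone hitting the same thing: a sketch of how a cross-device copy might be written without relying on Unified Addressing, checking and enabling peer access before using cudaMemcpyPeer (error checking omitted; the function name and fixed device numbers 0 and 1 are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Copy a buffer that lives on device 0 into a buffer that lives on device 1.
void copy_device0_to_device1(void* dst_on_dev1, const void* src_on_dev0, std::size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access) {
        // Let device 0 access device 1 directly (e.g. over NVLink) if supported.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
    }
    // Works whether or not peer access is enabled; without it the runtime
    // stages the transfer through host memory, which is much slower.
    cudaMemcpyPeer(dst_on_dev1, 1, src_on_dev0, 0, bytes);
}
```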
Yeah, it would only be for ancient setups (like 15 years old now maybe, I forget) where this would matter, since unified addressing has been around forever at this point.
And yeah, nvidia driver bugs are the worst. I broke my nvidia drivers updating a python package this weekend while testing some other dlib thing and then everything got wonky. You cough on the drivers and something breaks :|
Yeah, definitely. I need to start assuming it's their fault first instead of my code or dlib :)
I do have cards that are 15 years old, but I'm not currently running dlib on them, so I'll close this since my immediate problem is solved.
Thanks again!
No problem :D