Improved CUDA performance through pipelined reads#222
Open
rietmann-nv wants to merge 8 commits into
Open
Conversation
"DEFAULT" define messes with a CCCL internal define. Switched to BABEL_DEFAULT. The actual "DEFAULT" isn't used in the code, it just represents the "#else" case.
Comment on lines
+31
to
+39
| template <class T> | ||
| struct ThrustStream<T>::h_Impl{ | ||
| thrust::host_vector<T> a, b, c; | ||
| }; |
Contributor
There was a problem hiding this comment.
Suggestion: let's just add the new vectors to Impl:
Suggested change
| }; | |
| template <class T> | |
| struct ThrustStream<T>::Impl{ | |
| vector<T> a, b, c; | |
| #if !(defined(PAGEFAULT) || defined(MANAGED)) | |
| // we need separate host allocations to hold the data for get_arrays() | |
| thrust::host_vector<T> host_a, host_b, host_c; | |
| #endif | |
| }; |
Comment on lines
+23
to
+25
| struct h_Impl; | ||
| std::unique_ptr<Impl> impl; // avoid thrust vectors leaking into non-CUDA translation units | ||
| std::unique_ptr<h_Impl> h_impl; // If UVM is disabled, host arrays for verification purposes |
Contributor
There was a problem hiding this comment.
Not needed, if the host vectors are moved into Impl
Comment on lines
+6
to
+11
| # "DEFAULT" define causes a compile error in newer cuda CCCL, so we change to BABEL_DEFAULT | ||
| register_flag_optional(MEM "Device memory mode: | ||
| DEFAULT - allocate host and device memory pointers. | ||
| BABEL_DEFAULT - allocate host and device memory pointers. | ||
| MANAGED - use CUDA Managed Memory. | ||
| PAGEFAULT - shared memory, only host pointers allocated." | ||
| "DEFAULT") | ||
| "BABEL_DEFAULT") |
Contributor
There was a problem hiding this comment.
Important: That's a bug in CCCL, if we are senstive to that macro. Please file an issue. Maybe @miscco can have a look at that.
Contributor
There was a problem hiding this comment.
That's resolved now in upstream CCCL: NVIDIA/cccl#9406
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR improves performance of all CUDA stream examples implemented through pipelined reads, particularly on Blackwell compute GPUs like GB200. For large enough arrays, we see performance of well over 7TB/s, which is much closer to the theoretical bandwidth available. I increased the default array size, to better saturate modern devices.
In this PR, I also fixed the compilation of the thrust version, which didn't get updated when the Stream interface changed.
I also made a small fix for compilation on CUDA 13.2, which can't have
-DDEFAULTdue to a compatibility issue with CCCL. I changed it to-DBABEL_DEFAULT.Very open to feedback, thanks!