Skip to content

Improved CUDA performance through pipelined reads#222

Open
rietmann-nv wants to merge 8 commits into
UoB-HPC:developfrom
rietmann-nv:mr/cuda_pipeline
Open

Improved CUDA performance through pipelined reads#222
rietmann-nv wants to merge 8 commits into
UoB-HPC:developfrom
rietmann-nv:mr/cuda_pipeline

Conversation

@rietmann-nv

Copy link
Copy Markdown

This PR improves performance of all CUDA stream examples implemented through pipelined reads, particularly on Blackwell compute GPUs like GB200. For large enough arrays, we see performance of well over 7TB/s, which is much closer to the theoretical bandwidth available. I increased the default array size, to better saturate modern devices.

In this PR, I also fixed the compilation of the thrust version, which didn't get updated when the Stream interface changed.

I also made a small fix for compilation on CUDA 13.2, which can't have -DDEFAULT due to a compatibility issue with CCCL. I changed it to -DBABEL_DEFAULT.

Very open to feedback, thanks!

Comment thread src/thrust/ThrustStream.cu Outdated
Comment on lines +31 to +39
template <class T>
struct ThrustStream<T>::h_Impl{
thrust::host_vector<T> a, b, c;
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: let's just add the new vectors to Impl:

Suggested change
};
template <class T>
struct ThrustStream<T>::Impl{
vector<T> a, b, c;
#if !(defined(PAGEFAULT) || defined(MANAGED))
// we need separate host allocations to hold the data for get_arrays()
thrust::host_vector<T> host_a, host_b, host_c;
#endif
};

Comment thread src/thrust/ThrustStream.h Outdated
Comment on lines +23 to +25
struct h_Impl;
std::unique_ptr<Impl> impl; // avoid thrust vectors leaking into non-CUDA translation units
std::unique_ptr<h_Impl> h_impl; // If UVM is disabled, host arrays for verification purposes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, if the host vectors are moved into Impl

Comment thread src/cuda/model.cmake
Comment on lines +6 to +11
# "DEFAULT" define causes a compile error in newer cuda CCCL, so we change to BABEL_DEFAULT
register_flag_optional(MEM "Device memory mode:
DEFAULT - allocate host and device memory pointers.
BABEL_DEFAULT - allocate host and device memory pointers.
MANAGED - use CUDA Managed Memory.
PAGEFAULT - shared memory, only host pointers allocated."
"DEFAULT")
"BABEL_DEFAULT")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: That's a bug in CCCL, if we are senstive to that macro. Please file an issue. Maybe @miscco can have a look at that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's resolved now in upstream CCCL: NVIDIA/cccl#9406

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants