Cleanup CUDA implementation a bit by gonzalobg · Pull Request #199 · UoB-HPC/BabelStream

gonzalobg · 2024-05-30T14:55:57Z

Refactor all kernels into a generic "parallel for" algorithm
- Supports grid-stride and block-stride loops, configurable with a model flag
- Handles devices of different sizes via occupancy APIs
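For readers unfamiliar with the two loop shapes, here is a minimal host-side sketch in plain C++ (not the PR's CUDA code) of how a generic "parallel for" might map threads to indices. The `Launch` struct and all function names are hypothetical, and the block-stride variant is only one possible reading of the term: each block owns a contiguous chunk and its threads step through it by the block size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's gridDim.x and blockDim.x.
struct Launch {
    std::size_t blocks;   // number of thread blocks (assumed > 0)
    std::size_t threads;  // threads per block (assumed > 0)
};

// Grid-stride: each thread starts at its global index and advances by the
// total number of threads in the grid.
template <typename F>
void for_each_grid_stride(Launch l, std::size_t n, F f) {
    const std::size_t stride = l.blocks * l.threads;
    for (std::size_t b = 0; b < l.blocks; ++b)
        for (std::size_t t = 0; t < l.threads; ++t)
            for (std::size_t i = b * l.threads + t; i < n; i += stride)
                f(i);
}

// Block-stride (one possible reading): each block owns a contiguous chunk,
// and its threads advance through that chunk by the block size.
template <typename F>
void for_each_block_stride(Launch l, std::size_t n, F f) {
    const std::size_t chunk = (n + l.blocks - 1) / l.blocks;
    for (std::size_t b = 0; b < l.blocks; ++b) {
        const std::size_t begin = std::min(b * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        for (std::size_t t = 0; t < l.threads; ++t)
            for (std::size_t i = begin + t; i < end; i += l.threads)
                f(i);
    }
}

// Returns true if the chosen loop shape visits every index in [0, n)
// exactly once, which is the property a "parallel for" has to preserve.
bool covers_exactly_once(bool block_stride, Launch l, std::size_t n) {
    std::vector<int> hits(n, 0);
    auto mark = [&hits](std::size_t i) { ++hits[i]; };
    if (block_stride) {
        for_each_block_stride(l, n, mark);
    } else {
        for_each_grid_stride(l, n, mark);
    }
    return std::all_of(hits.begin(), hits.end(),
                       [](int h) { return h == 1; });
}
```

On a real device the two outer loops run in parallel as blocks and threads; the sequential emulation only shows which indices each thread would touch.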
Refactor memory allocation APIs
Prints more GPU details, in particular the theoretical peak bandwidth (in GB/s) of the current device, obtained via the NVML library (which is part of the CUDA Toolkit and therefore always available)
Fixes 2 bugs:
- Prints the "order" used to run the benchmarks (e.g. classic vs isolated)
- Fixes a division by zero bug in the solution checking
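The PR text does not say where the zero denominator came from. As a generic sketch of the kind of guard that avoids it, the hypothetical helper below computes a relative error and falls back to the absolute error when the expected value is zero, as could happen for an array that the selected benchmark never writes.

```cpp
#include <cmath>

// Hypothetical helper: relative error of `measured` against `expected`.
// Falls back to the absolute error when `expected` is exactly zero, so the
// result is always finite for finite inputs.
double checked_error(double measured, double expected) {
    const double abs_err = std::fabs(measured - expected);
    if (expected == 0.0) {
        return abs_err;  // avoid dividing by zero
    }
    return abs_err / std::fabs(expected);
}
```

Whether an absolute-error fallback is the right policy depends on the magnitudes involved; it is shown here only because it keeps the check well defined.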

gonzalobg · 2024-05-31T14:45:45Z

This was passing. It seems that this and other PRs are failing spuriously due to some cache issue @tom91136 @tomdeakin

gonzalobg · 2024-06-05T08:21:15Z

Closing in favor of #202

gonzalobg added 6 commits May 27, 2024 11:45

Refactor CUDA memory allocation

759f7b1

Refactor CUDA kernels and support block-stride loops

293ed77

Bugfix: division by zero in solution-check for individual benchmarks

13e870f

Print order used

46b6d41

Print device peak BW using NVML

51231ac

Capitalize order options for consistency with benchmarks

321ba62

gonzalobg force-pushed the cuda_cleanup branch from c2d9ef5 to 321ba62 on May 30, 2024 16:44

gonzalobg closed this Jun 5, 2024
