The Definitive Open-Source GPU Benchmarking Utility
A comprehensive CUDA-based tool for evaluating GPU performance across a variety of kernel configurations, memory access patterns and occupancy scenarios.
- Measures execution time over multiple trials for statistical significance
- Tests a range of thread block sizes and grid configurations
- Reports per-kernel occupancy and register utilisation
- Customisable benchmark parameters via BenchmarkConfig
- Outputs human-readable summaries to the console
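As an illustration of what measuring over multiple trials buys you, the sketch below reduces a set of per-trial timings to a mean, a sample standard deviation and a best-case time. It is a host-side assumption of how such a summary could look, not code taken from main.cu; the names TrialStats and summarise_trials are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical summary of per-trial timings; not taken from main.cu.
struct TrialStats {
    double mean_ms;    // arithmetic mean of all trials
    double stddev_ms;  // sample standard deviation (n - 1 denominator)
    double min_ms;     // fastest trial
};

// Summarise elapsed times (in milliseconds) collected over repeated runs.
TrialStats summarise_trials(const std::vector<double>& times_ms) {
    assert(!times_ms.empty());
    const double n = static_cast<double>(times_ms.size());

    double sum = 0.0;
    for (double t : times_ms) sum += t;
    const double mean = sum / n;

    double sq_dev = 0.0;
    for (double t : times_ms) sq_dev += (t - mean) * (t - mean);
    // With a single trial there is no spread to estimate, so report 0.
    const double stddev =
        times_ms.size() > 1 ? std::sqrt(sq_dev / (n - 1.0)) : 0.0;

    const double fastest = *std::min_element(times_ms.begin(), times_ms.end());
    return {mean, stddev, fastest};
}
```

Reporting the minimum alongside the mean is a common choice in benchmarking, since the fastest trial is the one least disturbed by background activity, while the standard deviation shows how noisy the runs were.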
- CUDA Toolkit (version 11.0 or later)
- NVCC compiler
- C++17-compatible standard library
- One or more CUDA-capable GPUs
- Ensure the CUDA environment variables are set (e.g. CUDA_HOME).
- Compile with NVCC:
nvcc -std=c++17 main.cu -o cubench -lcusparse -lcufft --extended-lambda -rdc=true
- (Optional) Add optimisation flags:
nvcc -O3 -std=c++17 main.cu -o cubench -lcusparse -lcufft --extended-lambda -rdc=true
Run the benchmark executable. By default, it uses device 0 and the settings in BenchmarkConfig:
./cubench
Sample output:
=== Rasterisation Benchmark ===
Triangles: 10000
Resolution: 1920x1080
Time: 158.24 ms (6.32 FPS)
Triangles/sec: 0.06 M
Pixels/sec: 13.10 M
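The derived figures in the sample output follow from the frame time alone. The sketch below reproduces that arithmetic, assuming each frame rasterises every triangle once and covers the full resolution; the names Throughput and derive_throughput are hypothetical and not taken from main.cu.

```cpp
#include <cassert>

// Hypothetical container for the derived figures shown in the summary.
struct Throughput {
    double fps;                 // frames per second
    double mtriangles_per_sec;  // millions of triangles per second
    double mpixels_per_sec;     // millions of pixels per second
};

// Derive throughput from one frame's time, assuming every frame processes
// all triangles and the full width x height pixel grid.
Throughput derive_throughput(double frame_ms, long long triangles,
                             int width, int height) {
    assert(frame_ms > 0.0);
    const double fps = 1000.0 / frame_ms;  // ms per frame -> frames per second
    const double pixels = static_cast<double>(width) * height;
    return {fps, triangles * fps / 1e6, pixels * fps / 1e6};
}
```

With the sample values (158.24 ms, 10000 triangles, 1920x1080), this gives roughly 6.32 FPS, 0.06 M triangles/sec and 13.10 M pixels/sec, matching the output above.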
Modify the BenchmarkConfig struct in main.cu to tweak:
- Number of trials per test
- Input sizes for occupancy and memory benchmarks
- Minimum and maximum block sizes
Recompile after changes.
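For orientation, a configuration struct covering the options listed above might look like the sketch below. The field names, types and defaults are assumptions for illustration only; check the actual BenchmarkConfig definition in main.cu before editing. The helper block_sizes is also hypothetical and simply shows one way a sweep between the minimum and maximum block size could be enumerated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: the real BenchmarkConfig in main.cu may use different
// field names, types, and defaults.
struct BenchmarkConfig {
    int trials = 10;                        // timed repetitions per test
    std::size_t input_elements = 1u << 20;  // input size for occupancy/memory tests
    int min_block_size = 32;                // smallest thread block to test
    int max_block_size = 1024;              // largest thread block to test
};

// List the block sizes a sweep would visit, doubling from min to max.
std::vector<int> block_sizes(const BenchmarkConfig& cfg) {
    assert(cfg.min_block_size > 0 &&
           cfg.min_block_size <= cfg.max_block_size);
    std::vector<int> sizes;
    for (int b = cfg.min_block_size; b <= cfg.max_block_size; b *= 2) {
        sizes.push_back(b);
    }
    return sizes;
}
```

The defaults here start at 32 threads (one warp) and stop at 1024 threads, the per-block limit on current CUDA devices, so the default sweep visits 32, 64, 128, 256, 512 and 1024.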
I have many other projects to create and maintain, so expect delayed bug fixes, features and responses.
- NUMA & NVLink bandwidth
- GPU power efficiency (GFLOPS/W under light and heavy sustained loads, to measure clock drop-off)
This project is licensed under the Apache License 2.0. See LICENSE for details.
Stevenson Parker
Created: 24 July 2025