This repository contains information on the BabelStream benchmark for the UK NNSS procurement.
Important
Please do not contact the benchmark or code maintainers directly with any questions. All questions must be submitted via the procurement response mechanism.
The BabelStream benchmark was developed at the University of Bristol to measure the achievable main memory bandwidth across a variety of CPUs and GPUs using simple kernels. These kernels process data that is larger than the largest level of cache, so that data must always be transferred from main memory. Dynamically allocated arrays are used to prevent any compile-time optimisations. BabelStream provides implementations in multiple programming models for CPUs and GPUs. When used for GPUs, this benchmark does not include the time taken for CPU-GPU data transfers.
Git repository: BabelStream
Note
This benchmark/repository is closely based on the one used for the NERSC-10 benchmarks.
Compiling the code involves the following steps:
-
Configure the build
cmake -B build -S . -DMODEL=<model> <CMAKE_OPTIONS>
where
<model> should be substituted with one of the programming models implemented in the current version of BabelStream. Current options for <model> are: omp; ocl; std; std20; hip; cuda; kokkos; sycl; sycl2020; acc; raja; tbb; thrust.
Additional CMake variables may be needed for some programming models. For example,
| Configuration | Flags |
|---|---|
| OpenMP | -DMODEL=omp |
| OpenMP-offload | -DMODEL=omp -DCMAKE_CXX_COMPILER=nvc++ -DOFFLOAD=ON -DOFFLOAD_FLAGS="-mp=gpu -gpu=cc90 -Minfo" |
| CUDA | -DMODEL=cuda -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc -DCUDA_ARCH=sm_90 |

-
Perform the build
cmake --build build
Bidders are permitted to modify the benchmark in the following ways.
Programming Pragmas
- The bidder may choose any of the programming models implemented in BabelStream.
- The bidder may modify the programming (e.g. OpenMP, OpenACC) pragmas in the benchmark as required to permit execution on the proposed system, provided:
- All modified sources and build scripts must be made available under the same licence as the BabelStream software.
- Any modified code used for the response must continue to be a valid program (compliant with the standard being proposed in the bidder's response).
Memory Allocation
- For accelerators, arrays should be allocated only in the device's global memory; pre-staging of data or use of user-controlled cache is not allowed.
- The sizes of the allocated arrays must be 4x larger than the largest level of cache. Array sizes can be modified by changing the variable ARRAY_SIZE on line 55 of ./src/main.cpp in the BabelStream benchmark source code.
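As a worked illustration of the sizing rule, the following shell sketch converts a cache size in bytes into an element count. It assumes that the 4x rule applies to each array individually and that elements are 8-byte doubles; the function name `min_array_elements` and the 96 MiB cache figure are hypothetical and are not part of BabelStream.

```shell
#!/bin/sh
# Hypothetical helper: smallest element count whose array occupies
# at least 4x the given cache size (assumes the rule is per array).
# $1 = largest cache size in bytes
# $2 = bytes per element (defaults to 8, i.e. double precision)
min_array_elements() {
    cache_bytes=$1
    elem_bytes=${2:-8}
    # Round up so the array never falls short of 4x the cache.
    echo $(( (4 * cache_bytes + elem_bytes - 1) / elem_bytes ))
}

# Example: a hypothetical 96 MiB last-level cache, double precision.
min_array_elements $(( 96 * 1024 * 1024 ))   # prints 50331648
```

The printed value is a lower bound on the element count; a larger value also satisfies the rule.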
Concurrency & Affinity
- The bidder may change the kernel launch configurations and the type of memory management (e.g. CUDA managed memory or separate host and device pointers).
Any modifications must be fully documented (e.g., as a pull request, diff or patch file) and reported with the benchmark results.
The BabelStream executable, <model>-stream, can be found in the build directory.
The following arguments will typically be used to modify its runtime behaviour:
- --arraysize SIZE - the size of the arrays to use for the tests. The sizes of the allocated arrays in BabelStream must be 4x larger than the largest level of cache.
- --device INDEX - the index of the accelerator device to use (for accelerator memory tests). This option can be used to ensure all accelerator devices on a node are tested.
The benchmark can be used to test both CPU and GPU memory bandwidth.
-
CPU memory bandwidth:
- All CPU cores must be running BabelStream in parallel via OpenMP threads or another parallel model implemented in BabelStream.
- The size of the allocated arrays in BabelStream must be 4x larger than the largest level of cache. This can be set at run time using the --arraysize option to BabelStream.
- A minimum of 100 iterations (BabelStream default) must be used for the test.
-
GPU memory bandwidth:
- Arrays should be allocated only in the device's global memory; pre-staging of data or use of user-controlled cache is not allowed.
- The performance of every GPU/GCD on each node should be tested. The --device option to BabelStream may be used to target a specific GPU/GCD on a node.
- A minimum of 100 iterations (BabelStream default) must be used for the test.
- The size of the allocated arrays in BabelStream must be 4x larger than the largest level of cache. This can be set at run time using the --arraysize option to BabelStream.
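The CPU and GPU run rules above could be scripted roughly as follows. This is a dry-run sketch that only prints the commands rather than executing them. The executable names follow the <model>-stream pattern for the omp and cuda models; the core count, device count and array size are hypothetical placeholders. The OpenMP environment variables are standard OpenMP settings, not something this benchmark mandates, and --numtimes is assumed to be the iteration-count option of the built executable (confirm with its --help output).

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
ARRAY_SIZE=50331648   # hypothetical; must satisfy the 4x cache rule
NUM_TIMES=100         # minimum iteration count required for the test

# CPU run: $1 = number of OpenMP threads (one per CPU core).
cpu_command() {
    echo "OMP_NUM_THREADS=$1 OMP_PROC_BIND=true ./build/omp-stream --arraysize $ARRAY_SIZE --numtimes $NUM_TIMES"
}

# GPU runs: $1 = number of devices on the node; one command per index 0..N-1.
gpu_commands() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "./build/cuda-stream --device $i --arraysize $ARRAY_SIZE --numtimes $NUM_TIMES"
        i=$(( i + 1 ))
    done
}

cpu_command 72    # hypothetical 72-core CPU
gpu_commands 4    # hypothetical node with 4 GPUs
```

To turn the sketch into a real run, the printed lines can be executed directly or pasted into a job submission script for the target system.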
Example job submission scripts from testing on the IsambardAI system are available below:
| Target | Function | Performance (MBytes/sec) |
|---|---|---|
| CPU | Triad | 1730906.365 |
| GPU | Triad | 3534593.282 |
Full output for the above runs is available here:
| Target | Function | Performance (MBytes/sec) |
|---|---|---|
| CPU | Triad | 814440.071 |
| GPU | Triad | 3338518.799 |
This benchmark description and associated files are released under the MIT license.
The following changes to this document have been made since initial release:
| Date | Change |
|---|---|
| 2026-04-29 | Updates to Hunter reference data |