UKNNSS-Benchmarks/uknnss-benchmark-babelstream

UK-NNSS BabelStream benchmark

This repository contains information on the BabelStream benchmark for the UK NNSS procurement.

Important

Please do not contact the benchmark or code maintainers directly with any questions. All questions must be submitted via the procurement response mechanism.

Benchmark Overview

The BabelStream benchmark was developed at the University of Bristol to measure the achievable main memory bandwidth across a variety of CPUs and GPUs using simple kernels. These kernels process data that is larger than the largest level of cache, so that data must always be transferred from main memory. Dynamically allocated arrays are used to prevent any compile-time optimisations. BabelStream provides implementations in multiple programming models for CPUs and GPUs. When used for GPUs, this benchmark does not include the time taken for CPU-GPU data transfers.

Software

Git repository: BabelStream

Caution

All results submitted should be based on the following tag:

Note

This benchmark/repository is closely based on the one used for the NERSC-10 benchmarks.

Building the benchmark

Compiling the code involves the following steps:

  1. Configure the build

    cmake -B build -S . -DMODEL=<model> <CMAKE_OPTIONS>

    where <model> should be substituted with one of the programming models implemented in the current version of BabelStream. Current options for <model> are:

    omp; ocl; std; std20; hip; cuda; kokkos; sycl; sycl2020; acc; raja; tbb; thrust

    Additional CMake variables may be needed for some programming models. For example:

    Configuration     Flags
    OpenMP            -DMODEL=omp
    OpenMP-offload    -DMODEL=omp -DCMAKE_CXX_COMPILER=nvc++ -DOFFLOAD=ON -DOFFLOAD_FLAGS="-mp=gpu -gpu=cc90 -Minfo"
    CUDA              -DMODEL=cuda -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc -DCUDA_ARCH=sm_90

  2. Perform the build

    cmake --build build
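
As a concrete illustration of the two steps above, the following sketch prints the configure and build commands for a chosen programming model. The helper name `babelstream_build_cmds` is hypothetical and not part of BabelStream; the commands are only echoed, so they can be reviewed before being run from the root of a BabelStream checkout.

```shell
#!/bin/sh
# Print (do not execute) the configure and build commands for one model.
# babelstream_build_cmds is a hypothetical helper, not part of BabelStream.
babelstream_build_cmds() {
    model="$1"          # one of the <model> values, e.g. omp or cuda
    extra_opts="${2:-}" # optional additional CMake options
    echo "cmake -B build -S . -DMODEL=${model}${extra_opts:+ ${extra_opts}}"
    echo "cmake --build build"
}

# Plain OpenMP build, as in the first row of the table.
babelstream_build_cmds omp
```

Passing a second argument, for example `"-DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc -DCUDA_ARCH=sm_90"` with the `cuda` model, appends those options to the configure line.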

Pre-approved code modifications

Bidders are permitted to modify the benchmark in the following ways.

Programming Pragmas

  • The bidder may choose any of the programming models implemented in BabelStream.
  • The bidder may modify the programming-model pragmas (e.g. OpenMP, OpenACC) in the benchmark as required to permit execution on the proposed system, provided:
    • All modified sources and build scripts are made available under the same licence as the BabelStream software.
    • Any modified code used for the response continues to be a valid program (compliant with the standard being proposed in the bidder's response).

Memory Allocation

  • For accelerators, arrays should only be allocated in the device's global memory. Pre-staging of data or use of user-controlled cache is not allowed.
  • The sizes of the allocated arrays must be 4x larger than the largest level of cache. Array sizes can be modified by changing the variable ARRAY_SIZE on line 55 of ./src/main.cpp in the BabelStream benchmark source code.
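
The array-size rule can be turned into an element count with a little arithmetic. The sketch below assumes 8-byte double-precision elements and that the array size is expressed as a number of elements; the helper name `min_array_elements` and the 32 MiB cache size are illustrative only and should be replaced with the values for the proposed system.

```shell
#!/bin/sh
# Smallest element count for which one array is 4x the largest cache.
# Assumption: elements are 8-byte doubles unless a second argument is given.
min_array_elements() {
    cache_bytes="$1"
    elem_bytes="${2:-8}"
    echo $(( 4 * cache_bytes / elem_bytes ))
}

# Hypothetical example: a 32 MiB last-level cache.
min_array_elements $(( 32 * 1024 * 1024 ))   # prints 16777216
```

The result is a lower bound for each array; a larger value also satisfies the rule.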

Concurrency & Affinity

  • The bidder may change the kernel launch configuration and the type of memory management (e.g. CUDA managed memory, or separate host and device pointers).

Any modifications must be fully documented (e.g., as a pull request, diff or patch file) and reported with the benchmark results.

Running the benchmark

The BabelStream executable, <model>-stream, can be found in the build directory. The following arguments will typically be used to modify its runtime behaviour:

  • --arraysize SIZE - the size of the arrays to use for the tests. The sizes of the allocated arrays in BabelStream must be 4x larger than the largest level of cache.
  • --device INDEX - the index of the accelerator device to use (for accelerator memory tests). This option can be used to ensure all accelerator devices on a node are tested.

Benchmark execution

The benchmark can be used to test both CPU and GPU memory bandwidth.

  • CPU memory bandwidth:

    • All CPU cores must be running BabelStream in parallel via OpenMP threads or another parallel model implemented in BabelStream.
    • The size of the allocated arrays in BabelStream must be 4x larger than the largest level of cache. This can be set at run time using the --arraysize option to BabelStream.
    • A minimum of 100 iterations (BabelStream default) must be used for the test.
  • GPU memory bandwidth:

    • Arrays should only be allocated in the device's global memory. Pre-staging of data or use of user-controlled cache is not allowed.
    • Performance of all GPUs/GCDs on each node should be tested. The --device option to BabelStream may be used to target a specific GPU/GCD on a node.
    • A minimum of 100 iterations (BabelStream default) must be used for the test.
    • The size of the allocated arrays in BabelStream must be 4x larger than the largest level of cache. This can be set at run time using the --arraysize option to BabelStream.
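
To test every GPU/GCD on a node, the executable can be invoked once per device index. The sketch below only prints the commands. The helper name `babelstream_run_cmds`, the executable path, the device count and the array size are placeholders, and the flag spellings `--device` and `--numtimes` are assumptions that should be checked against the executable's `--help` output.

```shell
#!/bin/sh
# Print (do not execute) one BabelStream command per device index.
# babelstream_run_cmds is a hypothetical helper, not part of BabelStream.
babelstream_run_cmds() {
    exe="$1"          # e.g. ./build/cuda-stream
    num_devices="$2"  # number of GPUs/GCDs on the node
    array_size="$3"   # elements per array, 4x larger than the largest cache
    i=0
    while [ "$i" -lt "$num_devices" ]; do
        echo "${exe} --device ${i} --arraysize ${array_size} --numtimes 100"
        i=$(( i + 1 ))
    done
}

# Hypothetical node with 4 devices and 2^28 elements per array.
babelstream_run_cmds ./build/cuda-stream 4 268435456
```

Printing the commands first makes it easy to paste them into a job script or to record them alongside the submitted results.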

Example job submission scripts from testing on the IsambardAI system are available below:

Results

Reference data

IsambardAI (GH200)

Target    Function    Performance (MBytes/sec)
CPU       Triad       1730906.365
GPU       Triad       3534593.282

Full output for the above runs is available here:

Hunter (MI300A)

Target    Function    Performance (MBytes/sec)
CPU       Triad       814440.071
GPU       Triad       3338518.799

License

This benchmark description and associated files are released under the MIT license.

Changelog

The following changes to this document have been made since initial release:

Date          Change
2026-04-29    Updates to Hunter reference data
