UK NNSS DOLFINx Benchmark

This repository contains information on the DOLFINx benchmark for the UK NNSS procurement.

Important

Please do not contact the benchmark or code maintainers directly with any questions. All questions must be submitted via the procurement response mechanism.

Benchmark Overview

The DOLFINx Benchmark is a performance benchmark for testing matrix-free Finite Element operator evaluation on unstructured hexahedral grids.

For a given set of parameters, the DOLFINx Benchmark constructs a mesh with a fixed number of degrees-of-freedom (DoFs) per MPI rank. The DoFs are initialised from data on the CPU and transferred to the GPU as a vector b. There is one GPU per MPI rank. On the GPU, one of the following two operations is performed for a set number of repetitions:

  • Operator Action: computes the matrix-free operation y=A.b
  • Conjugate Gradient iteration: Operator Action plus axpy and global reduce operations.
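As a rough illustration of how the two operations relate, the sketch below runs a Conjugate Gradient loop around a matrix-free operator action in plain Python. It is not the benchmark's implementation: the 1D stencil (-1, 2, -1) is a stand-in for the finite element operator, and the names apply_operator, dot, axpy and cg are hypothetical. In a distributed run each dot product would be followed by a global reduction across MPI ranks.

```python
def apply_operator(b):
    """Matrix-free action y = A.b for a stand-in 1D stencil (-1, 2, -1)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        left = b[i - 1] if i > 0 else 0.0
        right = b[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * b[i] - left - right
    return y


def dot(u, v):
    # Local dot product; a distributed code would globally reduce this value.
    return sum(ui * vi for ui, vi in zip(u, v))


def axpy(alpha, x, y):
    # Returns alpha * x + y as a new list.
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def cg(rhs, max_iterations, tol=1e-12):
    """Unpreconditioned CG for A.x = rhs; returns (x, residual norm)."""
    x = [0.0] * len(rhs)
    r = list(rhs)          # residual for the zero initial guess
    p = list(r)            # search direction
    rr = dot(r, r)
    for _ in range(max_iterations):
        if rr ** 0.5 <= tol:
            break
        Ap = apply_operator(p)        # the Operator Action
        alpha = rr / dot(p, Ap)       # dot product (global reduce)
        x = axpy(alpha, p, x)         # axpy
        r = axpy(-alpha, Ap, r)       # axpy
        rr_new = dot(r, r)            # dot product (global reduce)
        p = axpy(rr_new / rr, p, r)   # axpy
        rr = rr_new
    return x, rr ** 0.5
```

Each CG iteration therefore costs one operator action, three axpy updates and two dot products, which is why the CG mode also exercises MPI collectives.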

Each operator iteration involves an overlapped computation-communication round as follows:

  • Scatter halo data to neighbors
  • Compute GPU kernel on local cells
  • Unpack received halo data
  • Compute GPU kernel on halo cells

The main parameters are: Number of DoFs per GPU (ndofs) (range 1000-100000000+), Polynomial degree (degree) (range 2-7), Floating-point precision (float) (32/64). The maximum ndofs will be determined by the GPU memory size, but should be at least 100 million.

Software

Git repository: https://github.com/ukri-bench/benchmark-dolfinx

Caution

All results submitted should be based on the following repository commits:

  • benchmark-dolfinx repository: 7339a7a
  • dolfinx repository: tag v0.10.0.post4 ac47595
  • ffcx repository: tag v0.10.1.post0 009c0e7
  • basix repository: tag v0.10.0.post0 433fb7f
  • ufl repository: tag 2025.2.0 a53832a

The following diagram illustrates the dependencies of the DOLFINx benchmark and of DOLFINx itself. A more detailed list can be generated by using Spack (see below).

[Figure: dependency diagram]

Building the benchmark

It is recommended to use Spack to install the DOLFINx benchmark to ensure appropriate versioning of dependencies.

  1. Add the UKRI benchmark and FEniCS Spack repositories

    spack repo add --name bench_pkgs https://github.com/ukri-bench/spack-packages.git bench_pkgs
    spack repo add --name fenics https://github.com/FEniCS/spack-fenics.git fenics
    
  2. Create a Spack environment and install the benchmark

    spack env create /path/to/spack-env
    spack env activate /path/to/spack-env
    spack add bench-dolfinx
    spack install
    

Requirements

  • The host-code compiler must support C++20 including std::format. This limits the choice of host-code compilers to reasonably recent versions (gcc-13 or later).

  • For NVIDIA GPUs, CUDA version 12.x is recommended.

  • For AMD GPUs, ROCm version 6.x is recommended.

  • A graph partitioner is a required dependency, which must be built with 64-bit integer support. PT-SCOTCH or ParMETIS are supported, with PT-SCOTCH the recommended choice.

benchmark-dolfinx has been written with standard C++20 and tested with ROCm 6.3.4 and CUDA 12.9. Modifications for later versions of ROCm and CUDA are permitted, if required to resolve unavoidable compilation or runtime errors. Modifications are allowed to installation scripts such as CMakeLists.txt for specific systems.

Example build configurations

Spack

The following configurations have been tested using the Spack installation method described in the repository:

  • LUMI-G: ROCm 6.3.4, GCC 14.3.0, HPE Cray MPICH 8.1.32
  • CSD3: CUDA 12.9.1, GCC 13.4.0, OpenMPI 4.1.1+CUDA
  • Isambard: CUDA 12.9.0, GCC 14.3.0, HPE Cray MPICH 8.1.32

Manual

Alternatively, manual installation instructions for Ubuntu 24.04 are also provided.

Running the benchmark

We use two flavours of DOLFINx: application of a stencil-based operator (mathematically a matrix-vector multiplication), and a full conjugate gradient solver, which includes MPI collective operations. We also use DOLFINx with and without GPU support.

For GPU-based runs, the benchmark runs with one MPI process per GPU device, and it does not automatically bind MPI processes to GPU devices. A description of how to bind devices and cores is given in the benchmark repository.
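One common way to bind one process to one device is to derive a device index from the node-local rank and expose only that device to the process. The sketch below shows the idea only; it is not the method described in the benchmark repository. It assumes the launcher sets OMPI_COMM_WORLD_LOCAL_RANK (Open MPI) or SLURM_LOCALID (Slurm), and the function names local_rank and device_binding are hypothetical.

```python
# Environment variables that commonly hold the node-local rank.
LOCAL_RANK_VARS = ("OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID")


def local_rank(env):
    """Return the node-local rank from the first known variable in env."""
    for name in LOCAL_RANK_VARS:
        if name in env:
            return int(env[name])
    raise KeyError("none of " + ", ".join(LOCAL_RANK_VARS) + " is set")


def device_binding(env, gpus_per_node, visible_var="CUDA_VISIBLE_DEVICES"):
    """Map the local rank to one device and return the variable to export.

    Assumption: one MPI process per device, devices numbered from 0.
    Use visible_var="ROCR_VISIBLE_DEVICES" on AMD systems.
    """
    device = local_rank(env) % gpus_per_node
    return {visible_var: str(device)}
```

In practice a small wrapper script would apply the returned setting to the environment before launching the benchmark executable, since the variable has to be in place before the GPU runtime initialises.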

The full list of command line arguments can be shown with the -h option.

The benchmark executable can use either the CPU or available GPUs. The --platform parameter controls where the benchmark runs:

  • For GPU runs, use
    --platform=gpu
    
  • For CPU runs, use
    --platform=cpu
    

The --qmode parameter controls whether or not the quadrature points are collocated with the degrees of freedom:

  • For collocation, use
    --qmode=0
    
  • For no collocation, use
    --qmode=1
    

For CPUs, only --qmode=0 is supported.

Benchmark execution

For benchmarking purposes, problem configurations similar to the following are needed:

  • Stencil throughput at Q3, 200M degrees-of-freedom:
    bench_dolfinx --degree=3 --ndofs=200000000 \
        --json MAT-Q3-200M.json | tee MAT-Q3-200M.out
    
  • Stencil throughput at Q6, 350M degrees-of-freedom:
    bench_dolfinx --degree=6 --ndofs=350000000 \
        --json MAT-Q6-350M.json | tee MAT-Q6-350M.out
    
  • CG throughput at Q3, 200M degrees-of-freedom:
    bench_dolfinx --cg --degree=3 --ndofs=200000000 \
        --json CG-Q3-200M.json | tee CG-Q3-200M.out
    
    

The precise run configurations should be taken from the data spreadsheets that list the assessment configurations.
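When many configurations have to be run, it can help to generate the command lines programmatically. The following is a minimal sketch that assembles argument lists from the options shown in this document (--cg, --degree, --ndofs, --platform, --qmode, --json) and follows the MAT-/CG- naming used in the examples above. The function name bench_command and its defaults (GPU platform, no collocation) are assumptions made for illustration.

```python
def bench_command(degree, ndofs, cg=False, platform="gpu", qmode=1):
    """Build the argument list for one bench_dolfinx run."""
    if not 2 <= degree <= 7:
        raise ValueError("degree must be in the range 2-7")
    if platform not in ("cpu", "gpu"):
        raise ValueError("platform must be 'cpu' or 'gpu'")
    if qmode not in (0, 1):
        raise ValueError("qmode must be 0 or 1")
    if platform == "cpu" and qmode != 0:
        raise ValueError("for CPUs, only --qmode=0 is supported")

    # Output name in the style of the examples, e.g. MAT-Q3-200M.json
    prefix = "CG" if cg else "MAT"
    if ndofs % 1_000_000 == 0:
        size = f"{ndofs // 1_000_000}M"
    else:
        size = str(ndofs)
    json_name = f"{prefix}-Q{degree}-{size}.json"

    args = ["bench_dolfinx"]
    if cg:
        args.append("--cg")
    args += [
        f"--degree={degree}",
        f"--ndofs={ndofs}",
        f"--platform={platform}",
        f"--qmode={qmode}",
        "--json",
        json_name,
    ]
    return args
```

The returned list can be joined with spaces for a job script or passed to subprocess.run behind the site's MPI launcher.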

Results

Correctness testing

Correctness can be verified using the validate.py script.

The validation script should be run as follows, and should produce output similar to the following:

./validate output.json output.out
 
# DOLFINx benchmark validation
 
                   P : 3
                ndof : 10000
               nreps : 1000
         scalar size : 64
 
  MAT COMP performance: 0.2957402083152624 Gdofs/s
 
  Validation: PASSED
 

Sanity check: The matrix comparison must be run on 1 GPU and 8 GPUs with no collocation (qmode=1), 10000 total dofs (ndofs_global), and in both cases should produce the same output ynorm and znorm (within numerical roundoff precision). For a problem with 10000 dofs, the numerical value of the ynorm and znorm should be 1.141577508 to 9 decimal places. The console output and the JSON file should be reported.
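The comparison described above can be sketched as a small check on the two reported norms. This is an illustration rather than part of validate.py: the function name sanity_check is hypothetical, and the relative tolerance used to decide that ynorm and znorm agree "within roundoff" is an assumption that should be loosened for 32-bit runs.

```python
import math


def sanity_check(ynorm, znorm, reference=1.141577508, decimals=9, rel_tol=1e-9):
    """Compare ynorm and znorm with each other and with the reference value.

    A value matches the reference "to 9 decimal places" here if it differs
    from it by at most half a unit in the ninth decimal place.
    """
    half_unit = 0.5 * 10.0 ** (-decimals)
    return {
        "ynorm_matches_reference": abs(ynorm - reference) <= half_unit,
        "znorm_matches_reference": abs(znorm - reference) <= half_unit,
        "norms_agree": math.isclose(ynorm, znorm, rel_tol=rel_tol),
    }
```

The same function can be reused for the CG case by passing the CG reference value as the reference argument.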

For the acceptance tests, with --qmode=0, all GPU-based computations must yield the same answer as a CPU-based variant, subject to numerical roundoffs.

The same correctness test should be performed with the CG operator on 1 and 8 GPUs:

  • Correctness comparison with matrix result: bench_dolfinx --mat_comp --cg --ndofs_global=10000 --degree=3 --json mat_comp_cg.json

In this case, ynorm and znorm should be 167.5924472. Console output and JSON should be reported.

Performance results

In addition to testing for correctness, validate.py also prints the Computation Rate, which is the sole figure of merit (FoM) for the benchmark. The Computation Rate printed by validate.py is the total throughput in billions of degrees of freedom per second (Gdofs/s).

Reference data

LUMI-G (MI250X): Throughput in GDoFs/s for 2-64 nodes (8-256 GPUs)

Problem sizes of Q3 200M and Q6 350M were chosen to fit in the 64GB memory constraint of the devices. No collocation was used (--qmode=1).

8 MPI processes per node (2 MPI processes per GPU, 1 MPI process per GCD).

--cg  --degree  --ndofs  #GPU  GDoF/s   GDoF/s/device
Yes   Q3        200M     8     32.4847  4.061
Yes   Q6        350M     8     45.5109  5.689
Yes   Q3        200M     16    63.9487  3.997
Yes   Q6        350M     16    89.2596  5.579
Yes   Q3        200M     32    126.518  3.954
Yes   Q6        350M     32    177.345  5.542
Yes   Q3        200M     64    245.983  3.843
Yes   Q6        350M     64    349.948  5.468
Yes   Q3        200M     128   499.028  3.899
Yes   Q6        350M     128   695.995  5.437
Yes   Q3        200M     256   997.509  3.897
Yes   Q6        350M     256   1327.46  5.185
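
The per-device column is the total throughput divided by the number of GPUs. Because the number of DoFs per rank is fixed, the ratio of per-device throughput to that of the smallest run gives a weak-scaling efficiency. The sketch below reproduces this arithmetic for the Q3 200M rows of the LUMI-G table above; the names per_device and weak_scaling_efficiency are hypothetical helpers and not part of validate.py.

```python
# (number of GPUs, total GDoF/s) for the Q3 200M rows of the LUMI-G table
LUMI_Q3 = [
    (8, 32.4847),
    (16, 63.9487),
    (32, 126.518),
    (64, 245.983),
    (128, 499.028),
    (256, 997.509),
]


def per_device(total_gdofs, ngpus):
    """Throughput per device in GDoF/s."""
    return total_gdofs / ngpus


def weak_scaling_efficiency(rows):
    """Per-device throughput of each run relative to the first (smallest) run."""
    base_gpus, base_total = rows[0]
    base = per_device(base_total, base_gpus)
    return [(n, per_device(total, n) / base) for n, total in rows]


for ngpus, efficiency in weak_scaling_efficiency(LUMI_Q3):
    print(f"{ngpus:4d} GPUs: {efficiency:.3f}")
```

An efficiency close to 1 at the largest GPU count indicates that the halo exchange and global reductions are well hidden or cheap relative to the kernel work.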

Isambard-AI (GH200): Throughput in GDoFs/s for 4-256 nodes (16-1024 GPUs)

Problem sizes of Q3 300M and Q6 500M were chosen to fit in the 96GB memory constraint of the devices. No collocation was used (--qmode=1).

4 MPI processes per node (1 MPI process per GPU).

--cg  --degree  --ndofs  #GPU  GDoF/s   GDoF/s/device
Yes   Q3        300M     16    64.2997  4.019
Yes   Q6        500M     16    100.003  6.250
Yes   Q3        300M     32    126.956  3.967
Yes   Q6        500M     32    169.147  5.286
Yes   Q3        300M     64    257.855  4.029
Yes   Q6        500M     64    276.335  4.318
Yes   Q3        300M     128   512.411  4.003
Yes   Q6        500M     128   505.667  3.951
Yes   Q3        300M     256   1002.64  3.916
Yes   Q6        500M     256   973.149  3.801
Yes   Q3        300M     512   1980.19  3.867
Yes   Q6        500M     512   1801.89  3.519
Yes   Q3        300M     1024  3781.04  3.692
Yes   Q6        500M     1024  3410.36  3.330

Hunter (MI300A)

Problem sizes of Q3 200M and Q6 350M were chosen. No collocation was used (--qmode=1).

--cg  --degree  --ndofs  #GPU  GDoF/s  GDoF/s/device
No    Q3        200M     1     6.15    6.15
No    Q3        200M     4     23.7    5.93
No    Q3        200M     32    179     5.59
No    Q3        200M     64    346     5.41
No    Q6        350M     1     9.49    9.49
Yes   Q3        200M     1     5.19    5.19
Yes   Q3        200M     4     18.2    4.55
Yes   Q3        200M     64    155     2.42

License

This benchmark description and associated files are released under the MIT license.

Changelog

The following changes to this document have been made since initial release:

Date        Change
2026-06-05  Removes incorrect --mat-comp option from suggested configurations
2026-05-29  Correct validation script to support CG correctness test
