This repository contains information on the DOLFINx benchmark for the UK NNSS procurement.
Important
Please do not contact the benchmark or code maintainers directly with any questions. All questions must be submitted via the procurement response mechanism.
The DOLFINx Benchmark is a performance benchmark for testing matrix-free Finite Element operator evaluation on unstructured hexahedral grids.
For a given set of parameters, the DOLFINx Benchmark constructs a mesh
with a fixed number of degrees-of-freedom (DoFs) per MPI rank. The
DoFs are initialised from data on the CPU and transferred to the
GPU as a vector b. There is one GPU per MPI rank. On the GPU, one of the following
two operations is repeated a set number of times:
- Operator Action: computes the matrix-free operation y = A.b
- Conjugate Gradient iteration: Operator Action plus axpy and global reduce operations.
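The relationship between the two operations can be illustrated with a minimal pure-Python sketch. It is not the benchmark's implementation: the operator is a hypothetical 1D stencil (`apply_operator`), chosen only because it is symmetric positive definite and needs no stored matrix, and `dot` stands in for the global reduce that a distributed run would perform across MPI ranks. All function names are assumptions made for illustration.

```python
def apply_operator(x):
    """Matrix-free action y = A x for a 1D stencil (2 on the diagonal, -1 beside it).

    Illustrative stand-in for the finite element operator; no matrix is stored.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def dot(u, v):
    """Inner product; across MPI ranks this would end with a global reduce."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y as a new list."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(b, nreps):
    """Run up to nreps CG iterations on A x = b, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)   # residual b - A x, which equals b for x = 0
    p = list(r)   # search direction
    rr = dot(r, r)
    for _ in range(nreps):
        if rr == 0.0:
            break  # already solved exactly
        Ap = apply_operator(p)        # operator action
        alpha = rr / dot(p, Ap)       # reduce
        x = axpy(alpha, p, x)         # axpy
        r = axpy(-alpha, Ap, r)       # axpy
        rr_new = dot(r, r)            # reduce
        p = axpy(rr_new / rr, p, r)   # axpy
        rr = rr_new
    return x
```

Each pass through the loop performs one operator action, three axpy updates and two reductions, which is why the CG mode exercises collective communication in addition to the operator kernel.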
Each operator iteration involves an overlapped computation-communication round as follows:
- Scatter halo data to neighbors
- Compute GPU kernel on local cells
- Unpack received halo data
- Compute GPU kernel on halo cells
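The ordering of these four steps can be sketched in a single process. The example below is a simplified model, not the benchmark code: it uses a hypothetical 3-point stencil on vector entries instead of a cell-based finite element kernel, and a plain list of "in flight" messages instead of MPI. It only shows that work needing no neighbour data can run while the exchange is outstanding, and that the result matches a non-overlapped evaluation.

```python
def start_scatter(parts):
    """Step 1: post the halo exchange.

    Returns, for each rank, the (left, right) ghost values that its
    neighbours send. Ranks at the ends of the domain receive 0.0.
    """
    msgs = []
    for r in range(len(parts)):
        left = parts[r - 1][-1] if r > 0 else 0.0
        right = parts[r + 1][0] if r < len(parts) - 1 else 0.0
        msgs.append((left, right))
    return msgs


def compute_local(x):
    """Step 2: apply the stencil to entries that need no halo data."""
    y = [0.0] * len(x)
    for i in range(1, len(x) - 1):
        y[i] = 2.0 * x[i] - x[i - 1] - x[i + 1]
    return y


def compute_halo(x, y, ghosts):
    """Step 4: finish the first and last entries using the unpacked ghosts."""
    left, right = ghosts
    if len(x) == 1:
        y[0] = 2.0 * x[0] - left - right
    else:
        y[0] = 2.0 * x[0] - left - x[1]
        y[-1] = 2.0 * x[-1] - x[-2] - right
    return y


def overlapped_round(parts):
    """One operator evaluation over all ranks, in the order listed above."""
    in_flight = start_scatter(parts)             # 1. scatter halo data
    ys = [compute_local(x) for x in parts]       # 2. compute on local entries
    ghosts = list(in_flight)                     # 3. unpack received halo data
    return [compute_halo(x, y, g)                # 4. compute on halo entries
            for x, y, g in zip(parts, ys, ghosts)]
```

For example, `overlapped_round([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])` evaluates the same stencil as a single rank holding all seven entries.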
The main parameters are: Number of DoFs per GPU (ndofs) (range 1000-100000000+), Polynomial
degree (degree) (range 2-7), Floating-point precision (float)
(32/64). The maximum ndofs will be determined by the GPU memory
size, but should be at least 100 million.
Git repository: https://github.com/ukri-bench/benchmark-dolfinx
Caution
All results submitted should be based on the following repository commits:
The following diagram illustrates the dependencies of the dolfinx
benchmark, and dolfinx. A more detailed list can be generated by using
spack (see below).
It is recommended to use Spack to install the DOLFINx benchmark to ensure appropriate versioning of dependencies.
- Add the UKRI benchmark spack repository

      spack repo add --name bench_pkgs https://github.com/ukri-bench/spack-packages.git bench_pkgs
      spack repo add --name fenics https://github.com/FEniCS/spack-fenics.git fenics

- Create a Spack environment and install the benchmark

      spack env create /path/to/spack-env
      spack env activate /path/to/spack-env
      spack add bench-dolfinx
      spack install
- The host-code compiler must support C++20 including std::format. This limits the choice of host-code compilers to reasonably recent versions (gcc-13 or later).
- For NVIDIA GPUs, CUDA version 12.x is recommended.
- For AMD GPUs, ROCm version 6.x is recommended.
- A graph partitioner is a required dependency, which must be built with 64-bit integer support. PT-SCOTCH or ParMETIS are supported, with PT-SCOTCH the recommended choice.
benchmark-dolfinx has been written with standard C++20 and tested with ROCm 6.3.4 and CUDA
12.9. Modifications for later versions of ROCm and CUDA are permitted,
if required to resolve unavoidable compilation or runtime
errors. Modifications are allowed to installation scripts such as
CMakeLists.txt for specific systems.
The following configurations have been tested using the Spack installation method described in the repository:
- LUMI-G: ROCm 6.3.4, GCC 14.3.0, HPE Cray MPICH 8.1.32
- CSD3: CUDA 12.9.1, GCC 13.4.0, OpenMPI 4.1.1+CUDA
- Isambard: CUDA 12.9.0, GCC 14.3.0, HPE Cray MPICH 8.1.32
Alternatively, manual installation instructions on Ubuntu 24.04 are also provided.
We use two flavours of DOLFINx: application of a stencil-based operator (mathematically a matrix-vector multiplication), and a full conjugate gradient solver, which includes MPI collective operations. We also use DOLFINx with and without GPU support.
For GPU-based runs, the benchmark runs with one MPI process per GPU device, and it does not automatically bind MPI processes to GPU devices. A description of how to bind devices and cores is given in the benchmark repository.
The full list of command line arguments can be shown with the -h option.
The benchmark executable can use either the CPU or available GPUs. The --platform parameter controls where the benchmark runs:
- GPU runs, use --platform=gpu
- CPU runs, use --platform=cpu
The --qmode parameter controls whether the quadrature points are collocated with the degrees of freedom:
- Collocation, use --qmode=0
- No collocation, use --qmode=1
For CPUs, only --qmode=0 is supported.
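The constraints on these two options can be checked before a job is launched. The helper below is a hypothetical convenience, not part of the benchmark: the function name `check_options` is an assumption, and only the `--platform` and `--qmode` values described above are used.

```python
def check_options(platform, qmode):
    """Return the two command line arguments, or raise ValueError.

    Hypothetical helper: it encodes only the rules stated in the text
    (platform is cpu or gpu, qmode is 0 or 1, CPU runs need qmode 0).
    """
    if platform not in ("cpu", "gpu"):
        raise ValueError(f"unknown platform: {platform!r}")
    if qmode not in (0, 1):
        raise ValueError(f"unknown qmode: {qmode!r}")
    if platform == "cpu" and qmode != 0:
        raise ValueError("CPU runs support only --qmode=0")
    return [f"--platform={platform}", f"--qmode={qmode}"]
```

Calling `check_options("gpu", 1)` returns the arguments for a GPU run without collocation, while `check_options("cpu", 1)` raises an error.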
For benchmarking purposes, problem configurations similar to the following are needed:
- Stencil throughput at Q3, 200M degrees-of-freedom:

      bench_dolfinx --degree=3 --ndofs=200000000 \
          --json MAT-Q3-200M.json | tee MAT-Q3-200M.out

- Stencil throughput at Q6, 350M degrees-of-freedom:

      bench_dolfinx --degree=6 --ndofs=350000000 \
          --json MAT-Q6-350M.json | tee MAT-Q6-350M.out

- CG throughput at Q3, 200M degrees-of-freedom:

      bench_dolfinx --cg --degree=3 --ndofs=200000000 \
          --json CG-Q3-200M.json | tee CG-Q3-200M.out
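When several configurations are run, the command lines and output file names above can be generated from the parameters. The sketch below is one possible way to do this in Python. The helper names `run_tag` and `bench_command` are hypothetical. The options used (`--cg`, `--degree`, `--ndofs`, `--json`) and the file naming pattern are taken from the examples above, and the MPI launcher is deliberately left out because it is system specific.

```python
import shlex


def run_tag(degree, ndofs, cg=False):
    """Build a label such as MAT-Q3-200M, following the examples above."""
    prefix = "CG" if cg else "MAT"
    return f"{prefix}-Q{degree}-{ndofs // 1_000_000}M"


def bench_command(degree, ndofs, cg=False):
    """Return the argument list for one run (without the MPI launcher)."""
    args = ["bench_dolfinx"]
    if cg:
        args.append("--cg")
    args += [f"--degree={degree}", f"--ndofs={ndofs}",
             "--json", run_tag(degree, ndofs, cg) + ".json"]
    return args


if __name__ == "__main__":
    # Print one shell-quoted command per configuration listed above.
    for degree, ndofs, cg in [(3, 200_000_000, False),
                              (6, 350_000_000, False),
                              (3, 200_000_000, True)]:
        print(shlex.join(bench_command(degree, ndofs, cg)))
```

The argument lists can be passed to `subprocess.run` after prepending whichever launcher the target system requires.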
The precise run configurations should be taken from the data spreadsheets that list the assessment configurations.
Correctness can be verified using the validate.py script.
The validation script should be run as follows and produce output similar to the following:
./validate output.json output.out
# DOLFINx benchmark validation
P : 3
ndof : 10000
nreps : 1000
scalar size : 64
MAT COMP performance: 0.2957402083152624 Gdofs/s
Validation: PASSED
Sanity check: The matrix comparison must be run on 1 GPU and 8 GPUs with no collocation (qmode=1), 10000
total dofs (ndofs_global), and in both cases should produce the same
output ynorm and znorm (within numerical roundoff precision).
For a problem with 10000 dofs, the numerical value of the ynorm and
znorm should be 1.141577508 to 9 decimal places. The console output
and the JSON file should be reported.
For the acceptance tests, with --qmode=0, all GPU-based computations must
yield the same answer as a CPU-based variant, subject to numerical
roundoffs.
The same correctness test should be performed with the CG operator on 1 and 8 GPUs:
- Correctness comparison with matrix result:
bench_dolfinx --mat_comp --cg --ndofs_global=10000 --degree=3 --json mat_comp_cg.json
In this case, ynorm and znorm should be 167.5924472. Console output
and JSON should be reported.
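A comparison against the two reference values quoted above might look like the following sketch. It assumes the ynorm and znorm values have already been read from the console output or the JSON file. The layout of that file is not described here, so no key names are assumed. The function name `norms_match` and the half-unit-in-the-last-place tolerance are illustrative choices, not part of validate.py.

```python
# Reference norms quoted in the text, with the number of decimal places given.
OPERATOR_REFERENCE = (1.141577508, 9)
CG_REFERENCE = (167.5924472, 7)


def norms_match(ynorm, znorm, reference, decimals):
    """True if both norms agree with the reference to `decimals` places.

    The tolerance is half a unit in the last quoted decimal place,
    which is an assumption about how "to N decimal places" is read.
    """
    tol = 0.5 * 10.0 ** (-decimals)
    return abs(ynorm - reference) <= tol and abs(znorm - reference) <= tol
```

For example, `norms_match(y, z, *OPERATOR_REFERENCE)` checks the operator test and `norms_match(y, z, *CG_REFERENCE)` checks the CG test, where `y` and `z` are the reported norms.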
In addition to testing for correctness, validate.py will also print the Computation Rate, which is the sole figure of merit (FoM) for the benchmark.
The Computation Rate printed by validate.py corresponds to the
total throughput in billion degrees of freedom per second (Gdofs/s).
Problem sizes of Q3 200M and Q6 350M were chosen to fit within the 64GB
memory constraint of the devices. No collocation was used (--qmode=1).
8 MPI processes per node (2 MPI processes per GPU, 1 MPI process per GCD).
| --cg | --degree | --ndofs | #GPU | GDoF/s | GDoF/s/device |
|---|---|---|---|---|---|
| Yes | Q3 | 200M | 8 | 32.4847 | 4.061 |
| Yes | Q6 | 350M | 8 | 45.5109 | 5.689 |
| Yes | Q3 | 200M | 16 | 63.9487 | 3.997 |
| Yes | Q6 | 350M | 16 | 89.2596 | 5.579 |
| Yes | Q3 | 200M | 32 | 126.518 | 3.954 |
| Yes | Q6 | 350M | 32 | 177.345 | 5.542 |
| Yes | Q3 | 200M | 64 | 245.983 | 3.843 |
| Yes | Q6 | 350M | 64 | 349.948 | 5.468 |
| Yes | Q3 | 200M | 128 | 499.028 | 3.899 |
| Yes | Q6 | 350M | 128 | 695.995 | 5.437 |
| Yes | Q3 | 200M | 256 | 997.509 | 3.897 |
| Yes | Q6 | 350M | 256 | 1327.46 | 5.185 |
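The last column of the table is the total rate divided by the #GPU column. Because ndofs is fixed per GPU, the rows form a weak-scaling series, so the per-device rate can also be compared against the smallest run. The sketch below reproduces that arithmetic for the Q3 rows. The efficiency ratio is an illustrative derived quantity and is not part of the benchmark's reported figures.

```python
# (number of GPUs, total GDoF/s) for the Q3, 200M rows of the table above.
Q3_ROWS = [(8, 32.4847), (16, 63.9487), (32, 126.518),
           (64, 245.983), (128, 499.028), (256, 997.509)]


def per_device(total_gdofs, ngpus):
    """GDoF/s/device: total throughput divided by the device count."""
    return total_gdofs / ngpus


def weak_scaling_efficiency(rows):
    """Per-device rate of each row relative to the first (smallest) row."""
    base = per_device(rows[0][1], rows[0][0])
    return {n: per_device(total, n) / base for n, total in rows}


if __name__ == "__main__":
    for n, eff in weak_scaling_efficiency(Q3_ROWS).items():
        print(f"{n:4d} GPUs: {eff:.3f}")
```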
Problem sizes of Q3 300M and Q6 500M were chosen to fit within the 96GB
memory constraint of the devices. No collocation was used (--qmode=1).
4 MPI processes per node (1 MPI process per GPU).
| --cg | --degree | --ndofs | #GPU | GDoF/s | GDoF/s/device |
|---|---|---|---|---|---|
| Yes | Q3 | 300M | 16 | 64.2997 | 4.019 |
| Yes | Q6 | 500M | 16 | 100.003 | 6.250 |
| Yes | Q3 | 300M | 32 | 126.956 | 3.967 |
| Yes | Q6 | 500M | 32 | 169.147 | 5.286 |
| Yes | Q3 | 300M | 64 | 257.855 | 4.029 |
| Yes | Q6 | 500M | 64 | 276.335 | 4.318 |
| Yes | Q3 | 300M | 128 | 512.411 | 4.003 |
| Yes | Q6 | 500M | 128 | 505.667 | 3.951 |
| Yes | Q3 | 300M | 256 | 1002.64 | 3.916 |
| Yes | Q6 | 500M | 256 | 973.149 | 3.801 |
| Yes | Q3 | 300M | 512 | 1980.19 | 3.867 |
| Yes | Q6 | 500M | 512 | 1801.89 | 3.519 |
| Yes | Q3 | 300M | 1024 | 3781.04 | 3.692 |
| Yes | Q6 | 500M | 1024 | 3410.36 | 3.330 |
Problem sizes of Q3 200M and Q6 350M were chosen. No collocation was used (--qmode=1).
| --cg | --degree | --ndofs | #GPU | GDoF/s | GDoF/s/device |
|---|---|---|---|---|---|
| No | Q3 | 200M | 1 | 6.15 | 6.15 |
| No | Q3 | 200M | 4 | 23.7 | 5.93 |
| No | Q3 | 200M | 32 | 179 | 5.59 |
| No | Q3 | 200M | 64 | 346 | 5.41 |
| No | Q6 | 350M | 1 | 9.49 | 9.49 |
| Yes | Q3 | 200M | 1 | 5.19 | 5.19 |
| Yes | Q3 | 200M | 4 | 18.2 | 4.55 |
| Yes | Q3 | 200M | 64 | 155 | 2.42 |
This benchmark description and associated files are released under the MIT license.
The following changes to this document have been made since initial release:
| Date | Change |
|---|---|
| 2026-06-05 | Removes incorrect --mat-comp option from suggested configurations |
| 2026-05-29 | Correct validation script to support CG correctness test |
