---
author: [Richard]
date: 2025-11-14
slug: EESSI-on-Cray-Slingshot
---

# MPI at Warp Speed: EESSI Meets Slingshot-11

High-performance computing environments are constantly evolving, and keeping pace with the latest interconnect technologies is crucial for maximising application performance. However, we cannot rebuild all the software in EESSI that depends on improvements to communication libraries. So how do we take advantage of new technological developments?

Specifically, we look at taking advantage of the HPE/Cray Slingshot-11 interconnect.
Slingshot-11 promises a significant advancement in HPC networking, offering improved bandwidth, lower latency, and better scalability for exascale computing workloads ... so this should be worth the effort!

In this blog post, we present the requirements for building OpenMPI 5.x with Slingshot-11 support on HPE/Cray systems, and its integration with EESSI using the [host_injections](../../../../site_specific_config/host_injections.md) mechanism to inject custom-built OpenMPI libraries. This approach enables overriding EESSI's default MPI library with an ABI-compatible, Slingshot-optimized version, which should give us optimal performance.

<!-- more -->

## The Challenge

EESSI provides a comprehensive software stack, but specialized interconnect support like Slingshot-11 can sometimes require custom-built libraries that aren't yet available in the standard EESSI distribution. Our goals are to:

1. Build OpenMPI 5.x with native Slingshot-11 support
2. Create ABI-compatible replacements for EESSI's OpenMPI libraries
3. Place the libraries where EESSI automatically picks them up
4. Support both the x86_64 AMD CPU partition and the NVIDIA Grace CPU partition with Hopper accelerators

The main task is to build the required dependencies on top of EESSI, since many of the libraries needed for libfabric with CXI support are not yet available in the current EESSI stack.

### System Architecture

Our target system is [Olivia](https://documentation.sigma2.no/hpc_machines/olivia.html#olivia), which is based on HPE Cray EX platforms for compute and accelerator nodes, and HPE Cray ClusterStor for global storage, all connected via the HPE Slingshot high-speed interconnect.
It consists of two distinct partitions:

- **Partition 1**: x86_64 AMD CPUs without accelerators
- **Partition 2**: NVIDIA Grace CPUs with Hopper accelerators

For the Grace/Hopper partition, we also need to enable CUDA support in libfabric.

## Building the Dependency Chain

### Building Strategy

Rather than relying on Cray-provided system packages, we opted to build all dependencies from source [on top of EESSI](../../../../using_eessi/building_on_eessi.md). This approach provides several advantages:

- **Consistency**: All libraries built with the same compiler toolchain
- **Compatibility**: Ensures ABI compatibility with EESSI libraries
- **Control**: Full control over build configurations and optimizations

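To give an idea of what this looks like in practice, preparing the build environment could be sketched as follows. The init script path is the standard EESSI one; the guard around it is our own addition so the snippet degrades gracefully on machines where the EESSI CernVM-FS repository is not mounted, and the exact module to load may differ per toolchain:

```shell
# Sketch: enter the EESSI (2023.06) environment before building the
# dependencies. The guard lets the snippet degrade gracefully where the
# EESSI CernVM-FS repository is not mounted.
EESSI_INIT=/cvmfs/software.eessi.io/versions/2023.06/init/bash
if [ -r "$EESSI_INIT" ]; then
    source "$EESSI_INIT"       # puts the EESSI modules on the MODULEPATH
    module load GCC/12.3.0     # toolchain matching our OpenMPI 5.0.7 build
else
    echo "EESSI repository not mounted; skipping environment setup" >&2
fi
```
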
### Required Dependencies

To build OpenMPI 5.x with libfabric and CXI support, we needed the following missing dependencies:

1. **libuv** - Asynchronous I/O library
2. **libnl** - Netlink library for network configuration
3. **libconfig** - Library for processing structured configuration files
4. **libfuse** - Filesystem in Userspace library
5. **libpdap** - Performance Data Access Protocol library
6. **shs-libcxi** - Slingshot CXI library
7. **lm-sensors** - Monitoring tools and drivers
8. **libfabric 2.x** - OpenFabrics Interfaces library with the CXI provider
9. **OpenMPI 5.x** - The final MPI implementation

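With the environment in place, the builds themselves follow the usual configure/make cycle, one package at a time in the order listed above. The sketch below is purely illustrative: the install prefix is hypothetical, source unpacking is omitted, and with `DRY_RUN=1` (the default here) each step is only printed rather than executed:

```shell
# Illustrative build loop for the dependency chain; the package order is
# the one listed above. PREFIX is hypothetical, and with DRY_RUN=1 (the
# default here) every step is printed instead of executed.
PREFIX="${PREFIX:-$HOME/eessi-deps}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

deps=(libuv libnl libconfig libfuse libpdap shs-libcxi lm-sensors libfabric openmpi)
for pkg in "${deps[@]}"; do
    # In a real build we would cd into each package's source tree first.
    run ./configure --prefix="$PREFIX/$pkg"
    run make -j"$(nproc)"
    run make install
done
```
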
## EESSI Integration via `host_injections`

EESSI's `host_injections` mechanism allows us to override EESSI's MPI library with an ABI-compatible host MPI while maintaining compatibility with the rest of the software stack. We just need to make sure that the libraries are in the right location to be automatically picked up by the software shipped with EESSI. This location is EESSI-version specific; for `2023.06` on the NVIDIA Grace architecture, it is:

```
/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib
```
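Since these override paths are long and easy to mistype, a small helper that composes the location from the EESSI version and CPU target can help; this function is our own illustration, not part of EESSI:

```shell
# Hypothetical helper: compose the OpenMPI rpath_overrides location from
# an EESSI version and a CPU target string.
eessi_openmpi_override_dir() {
    local version="$1" cpu_target="$2"
    echo "/cvmfs/software.eessi.io/host_injections/${version}/software/linux/${cpu_target}/rpath_overrides/OpenMPI/system/lib"
}

# EESSI 2023.06 on NVIDIA Grace reproduces the path shown above:
eessi_openmpi_override_dir 2023.06 aarch64/nvidia/grace
```
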

**OpenMPI/5.0.7 on the ARM nodes was built with:**

```
./configure --prefix=/cluster/installations/eessi/default/aarch64/software/OpenMPI/5.0.7-GCC-12.3.0 \
    --with-cuda=${EBROOTCUDA} --with-cuda-libdir=${EBROOTCUDA}/lib64 \
    --with-slurm --enable-mpi-ext=cuda \
    --with-libfabric=${EBROOTLIBFABRIC} --with-ucx=${EBROOTUCX} \
    --enable-mpirun-prefix-by-default --enable-shared \
    --with-hwloc=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/hwloc/2.9.1-GCCcore-12.3.0 \
    --with-libevent=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/libevent/2.1.12-GCCcore-12.3.0 \
    --with-pmix=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/PMIx/4.2.4-GCCcore-12.3.0 \
    --with-ucc=/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/UCC/1.2.0-GCCcore-12.3.0 \
    --with-prrte=internal
```

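Before testing, it is worth sanity-checking that the override directory actually contains the freshly built libraries. A minimal check might look like the following (the script itself is our own illustration; only the path comes from this post):

```shell
# Illustrative sanity check: list the MPI sonames in the override
# directory used in this post. On a machine without the EESSI repository
# mounted, this simply reports the directory as missing.
OVERRIDE_DIR=/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/aarch64/nvidia/grace/rpath_overrides/OpenMPI/system/lib
if [ -d "$OVERRIDE_DIR" ]; then
    ls "$OVERRIDE_DIR"/libmpi.so* 2>/dev/null \
        || echo "no libmpi.so found in $OVERRIDE_DIR" >&2
else
    echo "override directory not present: $OVERRIDE_DIR" >&2
fi
```
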
### Testing

We plan to provide more comprehensive test results in the future. In this blog post we want to report that the approach works in principle: the EESSI stack can pick up and use the custom OpenMPI build and extract performance from the host interconnect **without the need to rebuild any software packages**.

The following tests were conducted on the Olivia accelerator partition (Grace nodes with Hopper GPUs), using a two-node, two-GPU configuration with one MPI task per node.

We evaluated two OSU Micro-Benchmark builds:

1. OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 from EESSI
2. OSU-Micro-Benchmarks/7.5 compiled with PrgEnv-cray

The following commands were used to run the benchmarks:

`mpirun -np 2 osu_bibw D D`

`mpirun -np 2 osu_latency D D`

![OSU CUDA Bandwidth](osu_cuda_bibandwidth.png) ![OSU CUDA Latency](osu_cuda_latency.png)

<details>
<summary>See details</summary>

<b>Test using OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 from EESSI</b>:
```
Environment set up to use EESSI (2023.06), have fun!

hostname:
x1000c4s4b1n0
x1000c5s3b0n0

CPU info:
Vendor ID: ARM

Currently Loaded Modules:
  1) GCCcore/13.2.0
  2) GCC/13.2.0
  3) numactl/2.0.16-GCCcore-13.2.0
  4) libxml2/2.11.5-GCCcore-13.2.0
  5) libpciaccess/0.17-GCCcore-13.2.0
  6) hwloc/2.9.2-GCCcore-13.2.0
  7) OpenSSL/1.1
  8) libevent/2.1.12-GCCcore-13.2.0
  9) UCX/1.15.0-GCCcore-13.2.0
 10) libfabric/1.19.0-GCCcore-13.2.0
 11) PMIx/4.2.6-GCCcore-13.2.0
 12) UCC/1.2.0-GCCcore-13.2.0
 13) OpenMPI/4.1.6-GCC-13.2.0
 14) gompi/2023b
 15) GDRCopy/2.4-GCCcore-13.2.0
 16) UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0   (g)
 17) NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0       (g)
 18) UCC-CUDA/1.2.0-GCCcore-13.2.0-CUDA-12.4.0    (g)
 19) OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 (g)

Where:
 g: built for GPU

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       0.18
2                       0.37
4                       0.75
8                       1.49
16                      2.99
32                      5.93
64                     11.88
128                    23.76
256                    72.78
512                   145.45
1024                  282.03
2048                  535.46
4096                 1020.24
8192                16477.70
16384               25982.96
32768               30728.30
65536               37637.46
131072              41808.92
262144              44832.61
524288              45602.20
1048576             45873.58
2097152             45995.32
4194304             46061.86

# OSU MPI-CUDA Latency Test v7.5
# Datatype: MPI_CHAR.
# Size      Avg Latency(us)
1                      11.71
2                      11.66
4                      11.66
8                      11.71
16                     11.67
32                     11.68
64                     11.66
128                    12.45
256                     3.76
512                     3.82
1024                    3.91
2048                    4.08
4096                    4.25
8192                    4.49
16384                   5.09
32768                   8.02
65536                   9.56
131072                 13.52
262144                 17.96
524288                 28.94
1048576                50.50
2097152                93.98
4194304               180.14
```

<b>Test using OSU-Micro-Benchmarks/7.5 with PrgEnv-cray</b>:
```
hostname:
x1000c4s4b1n0
x1000c5s3b0n0

CPU info:
Vendor ID: ARM

Currently Loaded Modules:
  1) craype-arm-grace                  8) craype/2.7.34
  2) libfabric/1.22.0                  9) cray-dsmml/0.3.1
  3) craype-network-ofi               10) cray-mpich/8.1.32
  4) perftools-base/25.03.0           11) cray-libsci/25.03.0
  5) xpmem/2.11.3-1.3_gdbda01a1eb3d   12) PrgEnv-cray/8.6.0
  6) cce/19.0.0                       13) cudatoolkit/24.11_12.6

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       1.06
2                       2.17
4                       4.40
8                       8.80
16                     17.64
32                     35.17
64                     70.55
128                   140.91
256                   281.22
512                   559.04
1024                 1114.45
2048                 2081.25
4096                 4068.64
8192                 1852.11
16384               18564.47
32768               22647.40
65536               33108.03
131072              39553.95
262144              43140.01
524288              44853.40
1048576             45761.69
2097152             46228.10
4194304             46470.29

# OSU MPI-CUDA Latency Test v7.5
# Datatype: MPI_CHAR.
# Size      Avg Latency(us)
1                       2.76
2                       2.72
4                       2.90
8                       2.86
16                      2.85
32                      2.73
64                      2.60
128                     3.41
256                     4.17
512                     4.19
1024                    4.29
2048                    4.44
4096                    4.66
8192                    7.59
16384                   8.17
32768                   8.44
65536                   9.92
131072                 12.59
262144                 18.07
524288                 29.00
1048576                50.64
2097152                94.06
4194304               180.44
```
</details>

## Conclusion

The approach demonstrates EESSI's flexibility in accommodating specialized hardware requirements while preserving the benefits of a standardized software stack! There is plenty more testing to do, but the signs at this stage are very good!