Commit 7eccf0e

Merge pull request EESSI#758 from TopRichard/MPI-at-Warp-Speed-part2
EESSI Meets Slingshot-11 (part 2)
2 parents 15b504c + a37a0c1 commit 7eccf0e

4 files changed

Lines changed: 223 additions & 1 deletion

docs/blog/posts/2025/09/eessi-cray-slingshot11.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-author: [Richard]
+authors: [TopRichard]
 date: 2025-11-14
 slug: EESSI-on-Cray-Slingshot
 ---
Two binary files (162 KB and 191 KB), contents not shown.
Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
---
authors: [TopRichard]
date: 2026-05-11
slug: EESSI-on-Cray-Slingshot-part2
---

# MPI at Warp Speed: EESSI Meets Slingshot-11<sub><sup>(bis)</sup></sub>

Building on our initial HPE/Cray Slingshot‑11 results, we further refined the MPI tuning and validated the setup using EESSI 2025.06.

The outcome is a significant performance improvement that brings MPI support in EESSI much closer to vendor-tuned Cray MPI environments.

<!-- more -->

In our previous blog post, [MPI at Warp Speed: EESSI Meets Slingshot‑11](../../2025/09/eessi-cray-slingshot11.md),
we demonstrated that EESSI could successfully leverage the HPE Cray Slingshot‑11 interconnect via the
[host_injections](../../../../site_specific_config/host_injections.md) mechanism.

Even as a proof of concept, the results were promising, especially for GPU-aware MPI communication on NVIDIA Grace Hopper systems.

We have since continued to tune and refine MPI communication using the EESSI 2025.06 software stack. By updating several core components
and improving the library configuration, we significantly reduced latency overheads and improved bandwidth utilization across Slingshot‑11.

In this follow-up post we present results obtained with OSU Micro-Benchmarks 7.5 and show how close EESSI can now get to native,
vendor-optimized MPI performance on Slingshot‑11 systems.

### System Architecture

Our target system is [Olivia](https://documentation.sigma2.no/hpc_machines/olivia.html#olivia),
which is based on HPE Cray EX platforms for compute and accelerator nodes and HPE Cray ClusterStor for global storage,
all connected via the HPE Slingshot high-speed interconnect. It consists of two main partitions:

- **Partition 1**: x86_64 AMD CPUs without accelerators
- **Partition 2**: NVIDIA Grace CPUs with Hopper accelerators

### Testing

The following tests were conducted on the `accel` partition of Olivia (Grace nodes with Hopper GPUs),
using a two-node, two-GPU configuration with one MPI task per node.

We evaluated two OSU Micro-Benchmarks builds:

- `OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0` from EESSI;
- `OSU-Micro-Benchmarks/7.5` compiled with `PrgEnv-cray`.

The following commands were used to run the benchmarks. The `D D` arguments place both the send and the receive buffer
in GPU (device) memory, and `-i 10` sets the number of iterations per message size:

```{ .bash .copy }
srun -N 2 --ntasks-per-node=1 osu_bibw -i 10 D D
```

```{ .bash .copy }
srun -N 2 --ntasks-per-node=1 osu_latency -i 10 D D
```
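
Each benchmark prints `#`-prefixed header lines followed by rows of message size and measured value. To pull the peak value out of an `osu_bibw` run, a small filter along the following lines could be used. This is only a minimal sketch: `osu_peak` is a hypothetical helper name (it is not part of OSU Micro-Benchmarks or EESSI), and it assumes the two-column output format shown in this post.

```bash
# Hypothetical helper (not part of OSU Micro-Benchmarks or EESSI):
# reads benchmark output on stdin and prints the row with the largest value.
osu_peak() {
    # Skip '#' header lines, keep only two-column rows, track the maximum.
    awk '!/^#/ && NF == 2 && ($2 + 0) > max { max = $2 + 0; size = $1; val = $2 }
         END { print size, val }'
}

# Example with two rows copied from the EESSI bandwidth listing in this post.
osu_peak <<'EOF'
# Size      Bandwidth (MB/s)
2097152     45895.85
4194304     46035.77
EOF
# prints: 4194304 46035.77
```

On a live system the same filter could be attached to the benchmark directly, for example `srun -N 2 --ntasks-per-node=1 osu_bibw -i 10 D D | osu_peak`. It is meant for the bandwidth test; applied to `osu_latency` output it would report the largest (worst) latency instead.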

![OSU CUDA Bi-bandwidth](OSU‑7.5-CUDA-bibw.png) ![OSU CUDA Latency](OSU‑7.5-CUDA-Latency.png)

<details>
<summary>See details</summary>

Test using `OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0` from EESSI:
```
Environment set up to use EESSI (2025.06), have fun!

hostname:
gpu-1-111
gpu-1-102

CPU info:
Vendor ID: ARM

Currently Loaded Modules:
  1) EESSI/2025.06                        12) PMIx/5.0.2-GCCcore-13.3.0
  2) GCCcore/13.3.0                       13) PRRTE/3.0.5-GCCcore-13.3.0
  3) GCC/13.3.0                           14) UCC/1.3.0-GCCcore-13.3.0
  4) numactl/2.0.18-GCCcore-13.3.0        15) OpenMPI/5.0.3-GCC-13.3.0
  5) libxml2/2.12.7-GCCcore-13.3.0        16) gompi/2024a
  6) libpciaccess/0.18.1-GCCcore-13.3.0   17) GDRCopy/2.4.1-GCCcore-13.3.0
  7) hwloc/2.10.0-GCCcore-13.3.0          18) UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0 (g)
  8) OpenSSL/3                            19) NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0 (g)
  9) libevent/2.1.12-GCCcore-13.3.0       20) UCC-CUDA/1.3.0-GCCcore-13.3.0-CUDA-12.6.0 (g)
 10) UCX/1.16.0-GCCcore-13.3.0            21) OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 (g)
 11) libfabric/1.21.0-GCCcore-13.3.0

Where:
g: built for GPU

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1           2.57
2           5.11
4           10.22
8           20.66
16          40.44
32          80.95
64          165.02
128         329.14
256         650.10
512         1301.93
1024        2608.66
2048        5189.90
4096        10332.67
8192        19474.04
16384       28342.00
32768       33507.82
65536       37659.55
131072      41730.65
262144      44740.60
524288      45448.67
1048576     45700.68
2097152     45895.85
4194304     46035.77

# OSU MPI-CUDA Latency Test v7.5
# Datatype: MPI_CHAR.
# Size      Avg Latency(us)
1           2.38
2           2.34
4           2.34
8           2.32
16          2.34
32          2.34
64          2.34
128         3.16
256         3.31
512         3.35
1024        3.46
2048        3.60
4096        3.80
8192        4.08
16384       4.63
32768       7.55
65536       10.07
131072      12.15
262144      17.37
524288      28.50
1048576     50.04
2097152     93.27
4194304     179.65
```

Test using `OSU-Micro-Benchmarks/7.5` with `PrgEnv-cray`:
```

hostname:
gpu-1-111
gpu-1-102

CPU info:
Vendor ID: ARM

Currently Loaded Modules:
  1) craype-arm-grace                  7) cray-dsmml/0.3.0
  2) libfabric/2.3.1                   8) cray-mpich/9.1.0
  3) craype-network-ofi                9) cray-libsci/26.03.0
  4) perftools-base/26.03.0           10) PrgEnv-cray/8.7.0
  5) xpmem/2.11.3-1.3_gdbda01a1eb3d   11) cuda/13.0
  6) cce/21.0.0                       12) CrayEnv
  7) craype/2.7.36

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1           1.14
2           2.23
4           4.56
8           9.18
16          18.41
32          36.77
64          74.20
128         147.12
256         275.37
512         569.29
1024        1161.92
2048        2339.97
4096        4640.06
8192        9350.01
16384       18583.90
32768       23840.66
65536       34521.83
131072      39704.04
262144      41814.18
524288      44072.94
1048576     44682.92
2097152     45122.15
4194304     45029.99

# OSU MPI-CUDA Latency Test v7.5
# Datatype: MPI_CHAR.
# Size      Avg Latency(us)
1           3.31
2           3.30
4           3.24
8           3.36
16          3.21
32          3.36
64          3.24
128         4.45
256         4.43
512         4.56
1024        4.62
2048        4.81
4096        4.92
8192        5.36
16384       6.46
32768       10.14
65536       11.58
131072      14.56
262144      19.77
524288      31.93
1048576     56.43
2097152     102.16
4194304     181.70
```
</details>
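
To put the two listings side by side, the relative difference between an EESSI value and the corresponding `PrgEnv-cray` value can be computed directly. The following is a minimal sketch: `pct_diff` is a hypothetical helper introduced here for illustration, and the inputs are copied from the listings in this post.

```bash
# Hypothetical helper: percentage by which value $1 differs from reference $2.
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

# Bidirectional bandwidth at 4 MiB (MB/s): EESSI vs PrgEnv-cray
pct_diff 46035.77 45029.99   # prints 2.2
# Latency at 1 byte (us): EESSI vs PrgEnv-cray
pct_diff 2.38 3.31           # prints -28.1
```

For these single runs, the EESSI build is therefore about 2% ahead in bidirectional bandwidth at the largest message size and shows roughly 28% lower latency for 1-byte messages.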
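
## Conclusion

Performance has improved notably compared to the [previous blog post](../../2025/09/eessi-cray-slingshot11.md).
In the runs shown above, the EESSI build peaks at 46035.77 MB/s bidirectional bandwidth (at 4 MiB), against 45122.15 MB/s (at 2 MiB) for the `PrgEnv-cray` build,
and its latency for messages of up to 64 bytes is 2.32 to 2.38 µs, against 3.21 to 3.36 µs.

While additional testing is still required, the current results are highly satisfactory: in this two-node test, EESSI performs on par with the vendor-optimized Cray MPI build.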
