Commit eb2a6a3

author
Richard Top
committed
Added pngs and conclusion
1 parent 2eb7040 commit eb2a6a3

3 files changed

Lines changed: 112 additions & 104 deletions

@@ -1,6 +1,6 @@
11
---
22
author: [Richard]
3-
date: 2026-05-08
3+
date: 2026-05-11
44
slug: EESSI-on-Cray-Slingshot-part2
55
---
66

@@ -9,7 +9,7 @@ slug: EESSI-on-Cray-Slingshot-part2
99
Building on our initial HPE/Cray Slingshot‑11 results, we further refined MPI tuning and validated the setup using EESSI/2025.06. The outcome is a significant performance improvement, bringing EESSI MPI behavior much closer to vendor-tuned Cray MPI environments.
1010
In our previous blog post, [MPI at Warp Speed: EESSI Meets Slingshot‑11](https://www.eessi.io/docs/blog/2025/11/14/EESSI-on-Cray-Slingshot/), we demonstrated that EESSI could successfully leverage the HPE Cray Slingshot‑11 interconnect via the [host_injections](https://www.eessi.io/docs/site_specific_config/host_injections/) mechanism. Even as a proof‑of‑concept, the results were promising, especially for GPU-aware MPI communication on NVIDIA Grace Hopper systems.
1111
We have continued to tune and refine MPI communication while using the EESSI/2025.06 software stack. Through updates to several core components and improvements to library configuration, we significantly reduced latency overheads and improved bandwidth utilization across Slingshot‑11.
12-
In this followup post, we present the results using OSU-Micro-Benchmarks/7.5 and discuss show how close EESSI can now get to native, vendor‑optimized MPI performance on Slingshot‑11 systems.
12+
In this follow-up blog post, we present the results using OSU-Micro-Benchmarks/7.5 and show how close EESSI can now get to native, vendor‑optimized MPI performance on Slingshot‑11 systems.
1313

1414
### System Architecture
1515

@@ -32,9 +32,11 @@ We evaluated two OSU Micro-Benchmark builds:
3232

3333
The following commands were used to run the benchmarks:
3434

35-
`mpirun -np 2 osu_bibw D D`
35+
`srun -N 2 --ntasks-per-node=1 osu_bibw -i 10 D D`
3636

37-
`mpirun -np 2 osu_latency D D`
37+
`srun -N 2 --ntasks-per-node=1 osu_latency -i 10 D D`
38+
39+
![OSU CUDA Bi-bandwidth](OSU‑7.5-CUDA-bibw.png) ![OSU CUDA Latency](OSU‑7.5-CUDA-Latency.png)
3840

3941
<details>
4042
<summary>See details</summary>
@@ -60,63 +62,65 @@ Currently Loaded Modules:
6062
7) hwloc/2.10.0-GCCcore-13.3.0 18) UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0 (g)
6163
8) OpenSSL/3 19) NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0 (g)
6264
9) libevent/2.1.12-GCCcore-13.3.0 20) UCC-CUDA/1.3.0-GCCcore-13.3.0-CUDA-12.6.0 (g)
63-
10) UCX/1.16.0-GCCcore-13.3.0 21) OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 (g)
65+
10) UCX/1.16.0-GCCcore-13.3.0 21) OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 (g)
66+
11) libfabric/1.21.0-GCCcore-13.3.0
67+
6468
Where:
6569
g: built for GPU
6670
6771
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
6872
# Datatype: MPI_CHAR.
6973
# Size Bandwidth (MB/s)
70-
1 1.24
71-
2 2.53
72-
4 5.09
73-
8 10.21
74-
16 20.56
75-
32 41.06
76-
64 82.56
77-
128 164.61
78-
256 328.11
79-
512 652.27
80-
1024 1295.71
81-
2048 2568.34
82-
4096 3161.87
83-
8192 10383.73
84-
16384 19679.28
85-
32768 26194.74
86-
65536 34068.25
87-
131072 38747.45
88-
262144 38515.90
89-
524288 37048.28
90-
1048576 44631.12
91-
2097152 44871.95
92-
4194304 45065.66
74+
1 2.57
75+
2 5.11
76+
4 10.22
77+
8 20.66
78+
16 40.44
79+
32 80.95
80+
64 165.02
81+
128 329.14
82+
256 650.10
83+
512 1301.93
84+
1024 2608.66
85+
2048 5189.90
86+
4096 10332.67
87+
8192 19474.04
88+
16384 28342.00
89+
32768 33507.82
90+
65536 37659.55
91+
131072 41730.65
92+
262144 44740.60
93+
524288 45448.67
94+
1048576 45700.68
95+
2097152 45895.85
96+
4194304 46035.77
9397
9498
# OSU MPI-CUDA Latency Test v7.5
9599
# Datatype: MPI_CHAR.
96100
# Size Avg Latency(us)
97-
1 2.79
98-
2 2.82
99-
4 2.91
100-
8 2.76
101-
16 2.82
102-
32 2.89
103-
64 2.80
104-
128 3.71
105-
256 4.14
106-
512 4.21
107-
1024 4.31
108-
2048 4.44
109-
4096 4.85
110-
8192 8.40
111-
16384 9.31
112-
32768 15.94
113-
65536 12.02
114-
131072 13.51
115-
262144 18.55
116-
524288 29.56
117-
1048576 51.48
118-
2097152 94.93
119-
4194304 180.92
101+
1 2.38
102+
2 2.34
103+
4 2.34
104+
8 2.32
105+
16 2.34
106+
32 2.34
107+
64 2.34
108+
128 3.16
109+
256 3.31
110+
512 3.35
111+
1024 3.46
112+
2048 3.60
113+
4096 3.80
114+
8192 4.08
115+
16384 4.63
116+
32768 7.55
117+
65536 10.07
118+
131072 12.15
119+
262144 17.37
120+
524288 28.50
121+
1048576 50.04
122+
2097152 93.27
123+
4194304 179.65
120124
```
121125

122126
<b>Test using OSU-Micro-Benchmarks/7.5 with PrgEnv-cray</b>:
@@ -130,65 +134,69 @@ CPU info:
130134
Vendor ID: ARM
131135
132136
Currently Loaded Modules:
133-
1) craype-arm-grace 8) craype/2.7.34
134-
2) libfabric/1.22.0 9) cray-dsmml/0.3.1
135-
3) craype-network-ofi 10) cray-mpich/8.1.32
136-
4) perftools-base/25.03.0 11) cray-libsci/25.03.0
137-
5) xpmem/2.11.3-1.3_gdbda01a1eb3d 12) PrgEnv-cray/8.6.0
138-
6) cce/19.0.0 13) cudatoolkit/24.11_12.6
139-
137+
1) craype-arm-grace 7) cray-dsmml/0.3.0
138+
2) libfabric/2.3.1 8) cray-mpich/9.1.0
139+
3) craype-network-ofi 9) cray-libsci/26.03.0
140+
4) perftools-base/26.03.0 10) PrgEnv-cray/8.7.0
141+
5) xpmem/2.11.3-1.3_gdbda01a1eb3d 11) cuda/13.0
142+
6) cce/21.0.0 12) CrayEnv
143+
7) craype/2.7.36
144+
140145
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
141146
# Datatype: MPI_CHAR.
142147
# Size Bandwidth (MB/s)
143-
1 1.06
144-
2 2.17
145-
4 4.40
146-
8 8.80
147-
16 17.64
148-
32 35.17
149-
64 70.55
150-
128 140.91
151-
256 281.22
152-
512 559.04
153-
1024 1114.45
154-
2048 2081.25
155-
4096 4068.64
156-
8192 1852.11
157-
16384 18564.47
158-
32768 22647.40
159-
65536 33108.03
160-
131072 39553.95
161-
262144 43140.01
162-
524288 44853.40
163-
1048576 45761.69
164-
2097152 46228.10
165-
4194304 46470.29
148+
1 1.14
149+
2 2.23
150+
4 4.56
151+
8 9.18
152+
16 18.41
153+
32 36.77
154+
64 74.20
155+
128 147.12
156+
256 275.37
157+
512 569.29
158+
1024 1161.92
159+
2048 2339.97
160+
4096 4640.06
161+
8192 9350.01
162+
16384 18583.90
163+
32768 23840.66
164+
65536 34521.83
165+
131072 39704.04
166+
262144 41814.18
167+
524288 44072.94
168+
1048576 44682.92
169+
2097152 45122.15
170+
4194304 45029.99
166171
167172
# OSU MPI-CUDA Latency Test v7.5
168173
# Datatype: MPI_CHAR.
169174
# Size Avg Latency(us)
170-
1 2.76
171-
2 2.72
172-
4 2.90
173-
8 2.86
174-
16 2.85
175-
32 2.73
176-
64 2.60
177-
128 3.41
178-
256 4.17
179-
512 4.19
180-
1024 4.29
181-
2048 4.44
182-
4096 4.66
183-
8192 7.59
184-
16384 8.17
185-
32768 8.44
186-
65536 9.92
187-
131072 12.59
188-
262144 18.07
189-
524288 29.00
190-
1048576 50.64
191-
2097152 94.06
192-
4194304 180.44
175+
1 3.31
176+
2 3.30
177+
4 3.24
178+
8 3.36
179+
16 3.21
180+
32 3.36
181+
64 3.24
182+
128 4.45
183+
256 4.43
184+
512 4.56
185+
1024 4.62
186+
2048 4.81
187+
4096 4.92
188+
8192 5.36
189+
16384 6.46
190+
32768 10.14
191+
65536 11.58
192+
131072 14.56
193+
262144 19.77
194+
524288 31.93
195+
1048576 56.43
196+
2097152 102.16
197+
4194304 181.70
193198
```
194199
</details>
200+
201+
## Conclusion
202+
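There is a notable improvement in performance compared with our earlier EESSI run: small-message latency dropped from about 2.8 us to about 2.3 us, the bandwidth dip at 4096-byte messages (3161.87 MB/s) is gone (now 10332.67 MB/s), and bi-directional bandwidth reaches 46035.77 MB/s at 4 MiB. In the same round of tests, the PrgEnv-cray build measured 45029.99 MB/s at 4 MiB and about 3.3 us small-message latency. While additional testing is still required, the current results are highly satisfactory.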
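
For readers who want to repeat the comparison on their own systems, here is a minimal sketch of how two OSU result tables could be compared size by size. It is not part of the original benchmark workflow: the helper names `parse_osu` and `relative_difference` are hypothetical, and the sample rows are a small subset copied from the bi-directional bandwidth tables above.

```python
# Sketch: compare two OSU Micro-Benchmarks result tables.
# parse_osu and relative_difference are hypothetical helper names;
# the sample rows are copied from the bi-bandwidth results in this post.


def parse_osu(text):
    """Return {message_size: value} from OSU output, skipping '#' header lines."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly two columns: size and measured value.
        if len(fields) != 2 or line.lstrip().startswith("#"):
            continue
        results[int(fields[0])] = float(fields[1])
    return results


def relative_difference(candidate, baseline):
    """Percentage difference of candidate vs. baseline for sizes present in both."""
    return {
        size: 100.0 * (candidate[size] - baseline[size]) / baseline[size]
        for size in sorted(candidate.keys() & baseline.keys())
    }


EESSI_BIBW = """\
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 2.57
4096 10332.67
4194304 46035.77
"""

CRAY_BIBW = """\
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 1.14
4096 4640.06
4194304 45029.99
"""

if __name__ == "__main__":
    eessi = parse_osu(EESSI_BIBW)
    cray = parse_osu(CRAY_BIBW)
    for size, pct in relative_difference(eessi, cray).items():
        print(f"{size:>8} B: EESSI {pct:+.1f}% vs PrgEnv-cray")
```

The same functions work for the latency tables, where a negative percentage means the candidate build is faster. Feeding in the full output files instead of the three sample rows gives a per-size view of where the two stacks differ.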
