|
| 1 | +--- |
| 2 | +authors: [TopRichard] |
| 3 | +date: 2026-05-11 |
| 4 | +slug: EESSI-on-Cray-Slingshot-part2 |
| 5 | +--- |
| 6 | + |
| 7 | +# MPI at Warp Speed: EESSI Meets Slingshot-11<sub><sup>(bis)</sup></sub> |
| 8 | + |
| 9 | +Building on our initial HPE/Cray Slingshot‑11 results, we further refined MPI tuning and validated the setup using EESSI 2025.06. |
| 10 | + |
| 11 | +The outcome is a significant performance improvement, bringing MPI support in EESSI much closer to vendor-tuned Cray MPI environments. |
| 12 | + |
| 13 | +<!-- more --> |
| 14 | + |
| 15 | +In our previous blog post, [MPI at Warp Speed: EESSI Meets Slingshot‑11](../../2025/09/eessi-cray-slingshot11.md), |
| 16 | +we demonstrated that EESSI could successfully leverage the HPE Cray Slingshot‑11 interconnect via the |
| 17 | +[host_injections](../../../../site_specific_config/host_injections.md) mechanism. |
| 18 | + |
| 19 | +Even as a proof‑of‑concept, the results were promising, especially for GPU-aware MPI communication on NVIDIA Grace Hopper systems. |
| 20 | + |
| 21 | +We have continued to tune and refine MPI communication while using the EESSI 2025.06 software stack. Through updates to several core components |
| 22 | +and improvements to library configuration, we significantly reduced latency overheads and improved bandwidth utilization across Slingshot‑11. |
| 23 | + |
| 24 | +In this follow-up blog post, we present results obtained with OSU Micro-Benchmarks 7.5 and show how close EESSI can now get to native, |
| 25 | +vendor-optimized MPI performance on Slingshot‑11 systems. |
| 26 | + |
| 27 | +### System Architecture |
| 28 | + |
| 29 | +Our target system is [Olivia](https://documentation.sigma2.no/hpc_machines/olivia.html#olivia), |
| 30 | +which is based on HPE Cray EX platforms for compute and accelerator nodes, and HPE Cray ClusterStor for global storage, |
| 31 | +all connected via the HPE Slingshot high-speed interconnect. It consists of two main partitions: |
| 32 | + |
| 33 | +- **Partition 1**: x86_64 AMD CPUs without accelerators |
| 34 | +- **Partition 2**: NVIDIA Grace CPUs with Hopper accelerators |
| 35 | + |
| 36 | +### Testing |
| 37 | + |
| 38 | +The following tests were conducted on the `accel` partition of Olivia (Grace nodes with Hopper GPUs), |
| 39 | +using a 2-node 2-GPU configuration with one MPI task per node. |
| 40 | + |
| 41 | +We evaluated two OSU Micro-Benchmark builds: |
| 42 | + |
| 43 | +- `OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0` from EESSI; |
| 44 | +- `OSU-Micro-Benchmarks/7.5` compiled with `PrgEnv-cray`. |
| 45 | + |
| 46 | +The following commands were used to run the benchmarks: |
| 47 | + |
| 48 | +```{ .bash .copy } |
| 49 | +srun -N 2 --ntasks-per-node=1 osu_bibw -i 10 D D |
| 50 | +``` |
| 51 | + |
| 52 | +```{ .bash .copy } |
| 53 | +srun -N 2 --ntasks-per-node=1 osu_latency -i 10 D D |
| 54 | +``` |
| 55 | + |
| 56 | +  |
| 57 | + |
| 58 | +<details> |
| 59 | +<summary>See details</summary> |
| 60 | + |
| 61 | +Test using `OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0` from EESSI: |
| 62 | +``` |
| 63 | +Environment set up to use EESSI (2025.06), have fun! |
| 64 | +
|
| 65 | +hostname: |
| 66 | +gpu-1-111 |
| 67 | +gpu-1-102 |
| 68 | +
|
| 69 | +CPU info: |
| 70 | +Vendor ID: ARM |
| 71 | +
|
| 72 | +Currently Loaded Modules: |
| 73 | + 1) EESSI/2025.06 12) PMIx/5.0.2-GCCcore-13.3.0 |
| 74 | + 2) GCCcore/13.3.0 13) PRRTE/3.0.5-GCCcore-13.3.0 |
| 75 | + 3) GCC/13.3.0 14) UCC/1.3.0-GCCcore-13.3.0 |
| 76 | + 4) numactl/2.0.18-GCCcore-13.3.0 15) OpenMPI/5.0.3-GCC-13.3.0 |
| 77 | + 5) libxml2/2.12.7-GCCcore-13.3.0 16) gompi/2024a |
| 78 | + 6) libpciaccess/0.18.1-GCCcore-13.3.0 17) GDRCopy/2.4.1-GCCcore-13.3.0 |
| 79 | + 7) hwloc/2.10.0-GCCcore-13.3.0 18) UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0 (g) |
| 80 | + 8) OpenSSL/3 19) NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0 (g) |
| 81 | + 9) libevent/2.1.12-GCCcore-13.3.0 20) UCC-CUDA/1.3.0-GCCcore-13.3.0-CUDA-12.6.0 (g) |
| 82 | + 10) UCX/1.16.0-GCCcore-13.3.0 21) OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 (g) |
| 83 | + 11) libfabric/1.21.0-GCCcore-13.3.0 |
| 84 | +
|
| 85 | + Where: |
| 86 | + g: built for GPU |
| 87 | +
|
| 88 | +# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5 |
| 89 | +# Datatype: MPI_CHAR. |
| 90 | +# Size Bandwidth (MB/s) |
| 91 | +1 2.57 |
| 92 | +2 5.11 |
| 93 | +4 10.22 |
| 94 | +8 20.66 |
| 95 | +16 40.44 |
| 96 | +32 80.95 |
| 97 | +64 165.02 |
| 98 | +128 329.14 |
| 99 | +256 650.10 |
| 100 | +512 1301.93 |
| 101 | +1024 2608.66 |
| 102 | +2048 5189.90 |
| 103 | +4096 10332.67 |
| 104 | +8192 19474.04 |
| 105 | +16384 28342.00 |
| 106 | +32768 33507.82 |
| 107 | +65536 37659.55 |
| 108 | +131072 41730.65 |
| 109 | +262144 44740.60 |
| 110 | +524288 45448.67 |
| 111 | +1048576 45700.68 |
| 112 | +2097152 45895.85 |
| 113 | +4194304 46035.77 |
| 114 | +
|
| 115 | +# OSU MPI-CUDA Latency Test v7.5 |
| 116 | +# Datatype: MPI_CHAR. |
| 117 | +# Size Avg Latency(us) |
| 118 | +1 2.38 |
| 119 | +2 2.34 |
| 120 | +4 2.34 |
| 121 | +8 2.32 |
| 122 | +16 2.34 |
| 123 | +32 2.34 |
| 124 | +64 2.34 |
| 125 | +128 3.16 |
| 126 | +256 3.31 |
| 127 | +512 3.35 |
| 128 | +1024 3.46 |
| 129 | +2048 3.60 |
| 130 | +4096 3.80 |
| 131 | +8192 4.08 |
| 132 | +16384 4.63 |
| 133 | +32768 7.55 |
| 134 | +65536 10.07 |
| 135 | +131072 12.15 |
| 136 | +262144 17.37 |
| 137 | +524288 28.50 |
| 138 | +1048576 50.04 |
| 139 | +2097152 93.27 |
| 140 | +4194304 179.65 |
| 141 | +``` |
| 142 | + |
| 143 | +Test using `OSU-Micro-Benchmarks/7.5` with `PrgEnv-cray`: |
| 144 | +``` |
| 145 | +
|
| 146 | +hostname: |
| 147 | +gpu-1-111 |
| 148 | +gpu-1-102 |
| 149 | +
|
| 150 | +CPU info: |
| 151 | +Vendor ID: ARM |
| 152 | +
|
| 153 | +Currently Loaded Modules: |
| 154 | + 1) craype-arm-grace 7) cray-dsmml/0.3.0 |
| 155 | + 2) libfabric/2.3.1 8) cray-mpich/9.1.0 |
| 156 | + 3) craype-network-ofi 9) cray-libsci/26.03.0 |
| 157 | + 4) perftools-base/26.03.0 10) PrgEnv-cray/8.7.0 |
| 158 | + 5) xpmem/2.11.3-1.3_gdbda01a1eb3d 11) cuda/13.0 |
| 159 | + 6) cce/21.0.0 12) CrayEnv |
| 160 | + 7) craype/2.7.36 |
| 161 | + |
| 162 | +# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5 |
| 163 | +# Datatype: MPI_CHAR. |
| 164 | +# Size Bandwidth (MB/s) |
| 165 | +1 1.14 |
| 166 | +2 2.23 |
| 167 | +4 4.56 |
| 168 | +8 9.18 |
| 169 | +16 18.41 |
| 170 | +32 36.77 |
| 171 | +64 74.20 |
| 172 | +128 147.12 |
| 173 | +256 275.37 |
| 174 | +512 569.29 |
| 175 | +1024 1161.92 |
| 176 | +2048 2339.97 |
| 177 | +4096 4640.06 |
| 178 | +8192 9350.01 |
| 179 | +16384 18583.90 |
| 180 | +32768 23840.66 |
| 181 | +65536 34521.83 |
| 182 | +131072 39704.04 |
| 183 | +262144 41814.18 |
| 184 | +524288 44072.94 |
| 185 | +1048576 44682.92 |
| 186 | +2097152 45122.15 |
| 187 | +4194304 45029.99 |
| 188 | +
|
| 189 | +# OSU MPI-CUDA Latency Test v7.5 |
| 190 | +# Datatype: MPI_CHAR. |
| 191 | +# Size Avg Latency(us) |
| 192 | +1 3.31 |
| 193 | +2 3.30 |
| 194 | +4 3.24 |
| 195 | +8 3.36 |
| 196 | +16 3.21 |
| 197 | +32 3.36 |
| 198 | +64 3.24 |
| 199 | +128 4.45 |
| 200 | +256 4.43 |
| 201 | +512 4.56 |
| 202 | +1024 4.62 |
| 203 | +2048 4.81 |
| 204 | +4096 4.92 |
| 205 | +8192 5.36 |
| 206 | +16384 6.46 |
| 207 | +32768 10.14 |
| 208 | +65536 11.58 |
| 209 | +131072 14.56 |
| 210 | +262144 19.77 |
| 211 | +524288 31.93 |
| 212 | +1048576 56.43 |
| 213 | +2097152 102.16 |
| 214 | +4194304 181.70 |
| 215 | +``` |
| 216 | +</details> |
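
To make the two result tables above easier to compare, the following is a minimal sketch (not part of the original test procedure) that joins two OSU output files on message size and prints the ratio of the first value to the second. The function name `compare_osu` and the file names are hypothetical; the sample rows are copied from the bi-directional bandwidth results above.

```bash
#!/usr/bin/env bash
# Sketch: compare two OSU result tables (columns: message size, value).
# compare_osu and the file names below are illustrative, not from the post.
set -euo pipefail

# compare_osu FILE_A FILE_B
# Prints: size, value from A, value from B, ratio A/B (one row per shared size).
compare_osu() {
    awk '
        /^#/ || NF < 2 { next }              # skip OSU header lines and blank lines
        NR == FNR { a[$1] = $2; next }       # first file: remember value per size
        ($1 in a) && $2 > 0 {                # second file: emit joined row
            printf "%d %.2f %.2f %.2f\n", $1, a[$1], $2, a[$1] / $2
        }
    ' "$1" "$2"
}

tmp="$(mktemp -d)"

# A few bandwidth rows (MB/s) taken from the EESSI run above.
cat > "$tmp/eessi_bibw.txt" <<'EOF'
# Size      Bandwidth (MB/s)
1           2.57
65536       37659.55
4194304     46035.77
EOF

# The same message sizes from the PrgEnv-cray run above.
cat > "$tmp/cray_bibw.txt" <<'EOF'
# Size      Bandwidth (MB/s)
1           1.14
65536       34521.83
4194304     45029.99
EOF

compare_osu "$tmp/eessi_bibw.txt" "$tmp/cray_bibw.txt"
# prints:
# 1 2.57 1.14 2.25
# 65536 37659.55 34521.83 1.09
# 4194304 46035.77 45029.99 1.02

rm -rf "$tmp"
```

For bandwidth, a ratio above 1 means the first file is faster; for latency files the reading is reversed, since lower values are better.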
| 217 | + |
| 218 | +### Conclusion |
| 219 | + |
| 220 | +There is a notable improvement in performance compared to the [previous blog post](../../2025/09/eessi-cray-slingshot11.md). |
| 221 | + |
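| 222 | +While additional testing is still required, the current results are encouraging: in the runs shown above, the EESSI build delivers higher bi-directional bandwidth and lower latency than the `PrgEnv-cray` build at every message size, reaching about 46.0 GB/s versus 45.0 GB/s at 4 MiB, and about 2.3 µs versus 3.2 to 3.4 µs latency for messages of up to 64 bytes. |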
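
As a rough sanity check on the 4 MiB bandwidth figures, the sketch below converts them into a fraction of link rate. It rests on an assumption that the post does not state: that each MPI task drives a single Slingshot-11 NIC at 200 Gb/s per direction, i.e. 400 Gb/s bi-directional. OSU reports bandwidth in units of 10^6 bytes per second. The helper name `line_rate_pct` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: express an OSU bandwidth figure as a percentage of link rate.
set -euo pipefail

# line_rate_pct MBPS LINK_GBPS
# MBPS is in 10^6 bytes/s (as printed by OSU); LINK_GBPS is in Gb/s.
line_rate_pct() {
    awk -v mbps="$1" -v link="$2" \
        'BEGIN { printf "%.1f\n", mbps * 8 / 1000 / link * 100 }'
}

# ASSUMPTION: one 200 Gb/s NIC per task, so 400 Gb/s bi-directional.
line_rate_pct 46035.77 400   # EESSI, 4 MiB messages       -> 92.1
line_rate_pct 45029.99 400   # PrgEnv-cray, 4 MiB messages -> 90.1
```

If the nodes use a different NIC count or link speed, only the second argument needs to change.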