| 1 | +--- |
| 2 | +author: [Richard] |
| 3 | +date: 2026-05-08 |
| 4 | +slug: EESSI-on-Cray-Slingshot-part2 |
| 5 | +--- |
| 6 | + |
| 7 | +# MPI at Warp Speed: EESSI Meets Slingshot-11<sub><sup>(part2)</sup></sub> |
| 8 | + |
| 9 | +Building on our initial HPE/Cray Slingshot‑11 results, we further refined MPI tuning and validated the setup using EESSI/2025.06. The outcome is a significant performance improvement, bringing EESSI MPI behavior much closer to vendor-tuned Cray MPI environments. |
| 10 | +In our previous blog post, [MPI at Warp Speed: EESSI Meets Slingshot‑11](https://www.eessi.io/docs/blog/2025/11/14/EESSI-on-Cray-Slingshot/), we demonstrated that EESSI could successfully leverage the HPE Cray Slingshot‑11 interconnect via the [host_injections](https://www.eessi.io/docs/site_specific_config/host_injections/) mechanism. Even as a proof‑of‑concept, the results were promising, especially for GPU-aware MPI communication on NVIDIA Grace Hopper systems. |
| 11 | +Since then, we have continued to tune and refine MPI communication using the EESSI/2025.06 software stack. Through updates to several core components and improvements to library configuration, we significantly reduced latency overheads and improved bandwidth utilization across Slingshot‑11. |
| 12 | +In this follow‑up post, we present results obtained with OSU-Micro-Benchmarks/7.5 and show how close EESSI can now get to native, vendor‑optimized MPI performance on Slingshot‑11 systems. |
| 13 | + |
| 14 | +### System Architecture |
| 15 | + |
| 16 | +Our target system is [Olivia](https://documentation.sigma2.no/hpc_machines/olivia.html#olivia), which is based on HPE Cray EX platforms for compute and accelerator nodes and HPE Cray ClusterStor for global storage, all |
| 17 | +connected via the HPE Slingshot high-speed interconnect. |
| 18 | +It consists of two main partitions: |
| 19 | + |
| 20 | +- **Partition 1**: x86_64 AMD CPUs without accelerators |
| 21 | +- **Partition 2**: NVIDIA Grace CPUs with Hopper accelerators |
| 22 | + |
| 23 | +### Testing |
| 24 | + |
| 25 | +The following tests were conducted on Olivia's accel partition (Grace nodes with Hopper GPUs), using a two-node, two-GPU configuration with one MPI task per node. |
| 26 | + |
| 27 | +We evaluated two OSU Micro-Benchmark builds: |
| 28 | + |
| 29 | +1. OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 from EESSI |
| 30 | + |
| 31 | +2. OSU-Micro-Benchmarks/7.5 compiled with PrgEnv-cray |
| 32 | + |
| 33 | +The following commands were used to run the benchmarks: |
| 34 | + |
| 35 | +`mpirun -np 2 osu_bibw D D` |
| 36 | + |
| 37 | +`mpirun -np 2 osu_latency D D` |
| 38 | + |
| 39 | +<details> |
| 40 | +<summary>See details</summary> |
| 41 | + |
| 42 | +<b>Test using OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 from EESSI</b>: |
| 43 | +``` |
| 44 | +Environment set up to use EESSI (2025.06), have fun! |
| 45 | +
|
| 46 | +hostname: |
| 47 | +gpu-1-111 |
| 48 | +gpu-1-102 |
| 49 | +
|
| 50 | +CPU info: |
| 51 | +Vendor ID: ARM |
| 52 | +
|
| 53 | +Currently Loaded Modules: |
| 54 | + 1) EESSI/2025.06 12) PMIx/5.0.2-GCCcore-13.3.0 |
| 55 | + 2) GCCcore/13.3.0 13) PRRTE/3.0.5-GCCcore-13.3.0 |
| 56 | + 3) GCC/13.3.0 14) UCC/1.3.0-GCCcore-13.3.0 |
| 57 | + 4) numactl/2.0.18-GCCcore-13.3.0 15) OpenMPI/5.0.3-GCC-13.3.0 |
| 58 | + 5) libxml2/2.12.7-GCCcore-13.3.0 16) gompi/2024a |
| 59 | + 6) libpciaccess/0.18.1-GCCcore-13.3.0 17) GDRCopy/2.4.1-GCCcore-13.3.0 |
| 60 | + 7) hwloc/2.10.0-GCCcore-13.3.0 18) UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0 (g) |
| 61 | + 8) OpenSSL/3 19) NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0 (g) |
| 62 | + 9) libevent/2.1.12-GCCcore-13.3.0 20) UCC-CUDA/1.3.0-GCCcore-13.3.0-CUDA-12.6.0 (g) |
| 63 | + 10) UCX/1.16.0-GCCcore-13.3.0 21) OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 (g) |
| 64 | + Where: |
| 65 | + g: built for GPU |
| 66 | +
|
| 67 | +# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5 |
| 68 | +# Datatype: MPI_CHAR. |
| 69 | +# Size Bandwidth (MB/s) |
| 70 | +1 1.24 |
| 71 | +2 2.53 |
| 72 | +4 5.09 |
| 73 | +8 10.21 |
| 74 | +16 20.56 |
| 75 | +32 41.06 |
| 76 | +64 82.56 |
| 77 | +128 164.61 |
| 78 | +256 328.11 |
| 79 | +512 652.27 |
| 80 | +1024 1295.71 |
| 81 | +2048 2568.34 |
| 82 | +4096 3161.87 |
| 83 | +8192 10383.73 |
| 84 | +16384 19679.28 |
| 85 | +32768 26194.74 |
| 86 | +65536 34068.25 |
| 87 | +131072 38747.45 |
| 88 | +262144 38515.90 |
| 89 | +524288 37048.28 |
| 90 | +1048576 44631.12 |
| 91 | +2097152 44871.95 |
| 92 | +4194304 45065.66 |
| 93 | +
|
| 94 | +# OSU MPI-CUDA Latency Test v7.5 |
| 95 | +# Datatype: MPI_CHAR. |
| 96 | +# Size Avg Latency(us) |
| 97 | +1 2.79 |
| 98 | +2 2.82 |
| 99 | +4 2.91 |
| 100 | +8 2.76 |
| 101 | +16 2.82 |
| 102 | +32 2.89 |
| 103 | +64 2.80 |
| 104 | +128 3.71 |
| 105 | +256 4.14 |
| 106 | +512 4.21 |
| 107 | +1024 4.31 |
| 108 | +2048 4.44 |
| 109 | +4096 4.85 |
| 110 | +8192 8.40 |
| 111 | +16384 9.31 |
| 112 | +32768 15.94 |
| 113 | +65536 12.02 |
| 114 | +131072 13.51 |
| 115 | +262144 18.55 |
| 116 | +524288 29.56 |
| 117 | +1048576 51.48 |
| 118 | +2097152 94.93 |
| 119 | +4194304 180.92 |
| 120 | +``` |
| 121 | + |
| 122 | +<b>Test using OSU-Micro-Benchmarks/7.5 with PrgEnv-cray</b>: |
| 123 | +``` |
| 124 | +
|
| 125 | +hostname: |
| 126 | +gpu-1-111 |
| 127 | +gpu-1-102 |
| 128 | +
|
| 129 | +CPU info: |
| 130 | +Vendor ID: ARM |
| 131 | +
|
| 132 | +Currently Loaded Modules: |
| 133 | + 1) craype-arm-grace 8) craype/2.7.34 |
| 134 | + 2) libfabric/1.22.0 9) cray-dsmml/0.3.1 |
| 135 | + 3) craype-network-ofi 10) cray-mpich/8.1.32 |
| 136 | + 4) perftools-base/25.03.0 11) cray-libsci/25.03.0 |
| 137 | + 5) xpmem/2.11.3-1.3_gdbda01a1eb3d 12) PrgEnv-cray/8.6.0 |
| 138 | + 6) cce/19.0.0 13) cudatoolkit/24.11_12.6 |
| 139 | +
|
| 140 | +# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5 |
| 141 | +# Datatype: MPI_CHAR. |
| 142 | +# Size Bandwidth (MB/s) |
| 143 | +1 1.06 |
| 144 | +2 2.17 |
| 145 | +4 4.40 |
| 146 | +8 8.80 |
| 147 | +16 17.64 |
| 148 | +32 35.17 |
| 149 | +64 70.55 |
| 150 | +128 140.91 |
| 151 | +256 281.22 |
| 152 | +512 559.04 |
| 153 | +1024 1114.45 |
| 154 | +2048 2081.25 |
| 155 | +4096 4068.64 |
| 156 | +8192 1852.11 |
| 157 | +16384 18564.47 |
| 158 | +32768 22647.40 |
| 159 | +65536 33108.03 |
| 160 | +131072 39553.95 |
| 161 | +262144 43140.01 |
| 162 | +524288 44853.40 |
| 163 | +1048576 45761.69 |
| 164 | +2097152 46228.10 |
| 165 | +4194304 46470.29 |
| 166 | +
|
| 167 | +# OSU MPI-CUDA Latency Test v7.5 |
| 168 | +# Datatype: MPI_CHAR. |
| 169 | +# Size Avg Latency(us) |
| 170 | +1 2.76 |
| 171 | +2 2.72 |
| 172 | +4 2.90 |
| 173 | +8 2.86 |
| 174 | +16 2.85 |
| 175 | +32 2.73 |
| 176 | +64 2.60 |
| 177 | +128 3.41 |
| 178 | +256 4.17 |
| 179 | +512 4.19 |
| 180 | +1024 4.29 |
| 181 | +2048 4.44 |
| 182 | +4096 4.66 |
| 183 | +8192 7.59 |
| 184 | +16384 8.17 |
| 185 | +32768 8.44 |
| 186 | +65536 9.92 |
| 187 | +131072 12.59 |
| 188 | +262144 18.07 |
| 189 | +524288 29.00 |
| 190 | +1048576 50.64 |
| 191 | +2097152 94.06 |
| 192 | +4194304 180.44 |
| 193 | +``` |
| 194 | +</details> |
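
To put the two runs side by side, the listings above can be parsed and compared per message size. The following is a minimal sketch rather than part of the original test setup: the helper names `parse_osu` and `relative` are hypothetical, and the embedded strings are a short excerpt of the `osu_bibw` numbers shown above (largest message sizes only).

```python
def parse_osu(text):
    """Parse OSU benchmark output into {message_size_in_bytes: value}."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only "size value" rows; skip blank lines and '#' header lines.
        if len(fields) != 2 or fields[0].startswith("#"):
            continue
        try:
            results[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # not a numeric data row
    return results


def relative(eessi, cray):
    """Return {size: eessi_value / cray_value} for sizes present in both runs."""
    common = sorted(eessi.keys() & cray.keys())
    return {size: eessi[size] / cray[size] for size in common}


# Excerpt of the osu_bibw output listed above (bandwidth in MB/s).
EESSI_BIBW = """\
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1048576 44631.12
2097152 44871.95
4194304 45065.66
"""

CRAY_BIBW = """\
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1048576 45761.69
2097152 46228.10
4194304 46470.29
"""

ratios = relative(parse_osu(EESSI_BIBW), parse_osu(CRAY_BIBW))
for size, ratio in ratios.items():
    print(f"{size:>8} bytes: EESSI reaches {ratio:.1%} of the PrgEnv-cray bandwidth")
```

For the largest message size (4 MiB), this gives roughly 97% of the PrgEnv-cray bi-directional bandwidth. The same helpers work for the `osu_latency` listings, where a ratio above 1 means higher latency with the EESSI build.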