Commit 3ee2292

[Blog] Supporting MPI and NCCL/RCCL tests (#2465)
1 parent ebbef3c commit 3ee2292

File tree

3 files changed: +126 −9 lines

docs/blog/posts/mpi.md

Lines changed: 105 additions & 0 deletions
---
title: "Supporting MPI and NCCL/RCCL tests"
date: 2025-04-02
description: "TBA"
slug: mpi
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-mpi-v2.png?raw=true
categories:
- SSH fleets
- Cloud fleets
---

# Supporting MPI and NCCL/RCCL tests

As AI models grow in complexity, efficient orchestration tools become increasingly important.
[Fleets](../../docs/concepts/fleets.md), introduced by `dstack` last year, streamline
[task execution](../../docs/concepts/tasks.md) on both cloud and
on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.

The strength of `dstack` lies in its flexibility. Users can leverage distributed frameworks like
`torchrun`, `accelerate`, or others. `dstack` handles node provisioning and job execution, and automatically propagates
system environment variables, such as `DSTACK_NODE_RANK`, `DSTACK_MASTER_NODE_IP`,
`DSTACK_GPUS_PER_NODE`, and [others](../../docs/concepts/tasks.md#system-environment-variables), to containers.

<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-mpi-v2.png?raw=true" width="630"/>
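
To make the propagation concrete, here is a small sketch with hypothetical values for a 2-node run with 4 GPUs per node. The values below are illustrative; `dstack` injects the real ones into each container at run time.

```shell
# Illustrative values only; dstack sets these automatically in each container.
DSTACK_NODE_RANK=0        # this node's index within the fleet
DSTACK_NODES_NUM=2        # total number of nodes
DSTACK_GPUS_PER_NODE=4    # GPUs available on each node

# The cluster-wide GPU count is the product of the two totals.
DSTACK_GPUS_NUM=$((DSTACK_NODES_NUM * DSTACK_GPUS_PER_NODE))

echo "node rank: ${DSTACK_NODE_RANK}, total GPUs: ${DSTACK_GPUS_NUM}"
# node rank: 0, total GPUs: 8
```

These are the same variables the NCCL-tests task passes to `mpirun` to size the run.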

One use case `dstack` hasn't supported until now is MPI, as it requires a scheduled environment or
direct SSH connections between containers. Since `mpirun` is essential for running NCCL/RCCL tests, which are
crucial for large-scale cluster usage, we've added support for it.

<!-- more -->

Below is an example of a task that runs the AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).

<div editor-title="examples/misc/nccl-tests/.dstack.yml">

```yaml
type: task
name: nccl-tests

nodes: 2

image: dstackai/efa
env:
  - NCCL_DEBUG=INFO
commands:
  - |
    # We use a FIFO for inter-node communication
    FIFO=/tmp/dstack_job
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      cd /root/nccl-tests/build
      # Generate a hostfile for mpirun
      : > hostfile
      for ip in ${DSTACK_NODES_IPS}; do
        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
      done
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Wait for other nodes
      while true; do
        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
          break
        fi
        echo 'Waiting for nodes...'
        sleep 5
      done
      # Run NCCL tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca pml ^cm \
        --mca btl tcp,self \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
      # Notify the other nodes that the job is done
      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
    else
      mkfifo ${FIFO}
      # Wait for a message from the first node
      cat ${FIFO}
    fi

resources:
  gpu: nvidia:4:16GB
  shm_size: 16GB
```

</div>
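
The "wait for other nodes" step is a generic poll-until-success loop. As a rough local sketch, the stand-in `probe` function below plays the role of `${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true`, pretending the nodes only become reachable on the third attempt:

```shell
attempts=0

# Stand-in for the mpirun reachability check; succeeds on the third try.
probe() {
  attempts=$((attempts + 1))
  [ "${attempts}" -ge 3 ]
}

# Same loop shape as in the task: retry until the probe succeeds.
while true; do
  if probe >/dev/null 2>&1; then
    break
  fi
  echo 'Waiting for nodes...'
  sleep 1
done

echo "ready after ${attempts} attempts"
# ready after 3 attempts
```

Running `true` once per node (`-N 1`) is a cheap way to confirm that `mpirun` can reach every host in the hostfile before launching the real benchmark.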

The master node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
reachable via MPI. Once confirmed, it launches the `/root/nccl-tests/build/all_reduce_perf` benchmark across all available GPUs in the cluster.
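
Taken on its own, the hostfile generation step can be reproduced with placeholder values in place of the injected `DSTACK_*` variables (the IPs below are made up):

```shell
# Placeholders for the variables dstack injects into the container.
DSTACK_NODES_IPS="10.0.0.1 10.0.0.2"
DSTACK_GPUS_PER_NODE=4

# Truncate (or create) the hostfile, then append one line per node so
# mpirun knows how many MPI slots each node offers.
: > hostfile
for ip in ${DSTACK_NODES_IPS}; do
  echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
done

cat hostfile
# 10.0.0.1 slots=4
# 10.0.0.2 slots=4
```

Setting `slots` to the per-node GPU count lets `mpirun -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE}` place exactly one process per GPU.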

The other worker nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.
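
The blocking itself is plain FIFO semantics: `cat` on a named pipe blocks until another process writes to it. A minimal local sketch of the handshake (the pipe path and output file name are made up):

```shell
FIFO=/tmp/dstack_fifo_demo
rm -f "${FIFO}"
mkfifo "${FIFO}"

# "Worker": blocks here until something is written to the pipe.
cat "${FIFO}" > /tmp/dstack_fifo_demo_out &

# "Master": opening the pipe for writing unblocks the worker.
echo done > "${FIFO}"
wait

cat /tmp/dstack_fifo_demo_out
# done
```

In the task, the master sends the message to every node at once by running `sh -c "echo done > ${FIFO}"` through `mpirun`.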

With this, you can now use such a task to run both NCCL and RCCL tests on cloud and SSH fleets,
as well as use MPI for other tasks.

> The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
> [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also
> be used with regular TCP/IP network adapters and InfiniBand.
> See the [source code :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/docker/efa) for the image.

!!! info "What's next?"
    1. Learn more about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
    2. Check the [NCCL tests](../../examples/misc/nccl-tests/index.md) example
    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}

docs/examples.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -167,11 +167,11 @@ hide:
   <a href="/examples/misc/nccl-tests"
      class="feature-cell sky">
     <h3>
-      NCCL Tests
+      NCCL tests
     </h3>

     <p>
-      Run multi-node NCCL Tests with MPI
+      Run multi-node NCCL tests with MPI
     </p>
   </a>
 </div>
```

examples/misc/nccl-tests/README.md

Lines changed: 19 additions & 7 deletions

````diff
@@ -1,6 +1,6 @@
-# NCCL Tests
+# NCCL tests

-This example shows how to run distributed [NCCL Tests :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/nccl-tests){:target="_blank"} with MPI using `dstack`.
+This example shows how to run distributed [NCCL tests :material-arrow-top-right-thin:{ .external }](https://github.com/NVIDIA/nccl-tests){:target="_blank"} with MPI using `dstack`.

 ## Running as a task

@@ -37,7 +37,7 @@ commands:
       echo 'Waiting for nodes...'
       sleep 5
     done
-    # Run NCCL Tests
+    # Run NCCL tests
     ${MPIRUN} \
       -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
       --mca pml ^cm \
@@ -62,15 +62,17 @@ resources:
 </div>

 The script orchestrates distributed execution across multiple nodes using MPI. The master node (identified by
-`DSTACK_NODE_RANK=0`) generates a hostfile listing all node IPs and continuously checks until all worker nodes are
-accessible via MPI. Once confirmed, it executes the `all_reduce_perf` benchmark across all available GPUs.
+`DSTACK_NODE_RANK=0`) generates `hostfile` listing all node IPs and continuously checks until all worker nodes are
+accessible via MPI. Once confirmed, it executes the `/root/nccl-tests/build/all_reduce_perf` benchmark script across all available GPUs.

 Worker nodes use a FIFO pipe to block execution until they receive a termination signal from the master
 node. This ensures worker nodes remain active during the test and only exit once the master node completes the
 benchmark.

-> The `dstackai/efa` image is optimized for [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}
-> but also works with regular TCP/IP network adapters as well as InfiniBand.
+> The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
+> [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also
+> be used with regular TCP/IP network adapters and InfiniBand.
+> See the [source code :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/docker/efa) for the image.

 ### Apply a configuration

@@ -90,3 +92,13 @@ Submit the run nccl-tests? [y/n]: y
 ```

 </div>
+
+## Source code
+
+The source-code of this example can be found in
+[`examples/misc/nccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/nccl-tests).
+
+## What's next?
+
+1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
+   [services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/concepts/fleets).
````
