Commit d7d41f4

Bihan Rana and peterschmidt85 authored
Add rccl test (#2613)
* Add rccl test
* [Docs] Minor refactoring and cleanup of RCCL tests examples

---------

Co-authored-by: Bihan Rana <bihan@Bihans-MacBook-Pro.local>
Co-authored-by: peterschmidt85 <andrey.cheptsov@gmail.com>
1 parent 2d733b0 commit d7d41f4

6 files changed: +217 -13 lines changed

docs/examples.md

Lines changed: 10 additions & 0 deletions
@@ -183,6 +183,16 @@ hide:
 Run multi-node NCCL tests with MPI
 </p>
 </a>
+<a href="/examples/misc/rccl-tests"
+class="feature-cell sky">
+<h3>
+RCCL tests
+</h3>
+
+<p>
+Run multi-node RCCL tests with MPI
+</p>
+</a>
 <a href="/examples/misc/a3mega-clusters"
 class="feature-cell sky">
 <h3>

docs/examples/misc/rccl-tests/index.md

Whitespace-only changes.

examples/misc/nccl-tests/README.md

Lines changed: 15 additions & 12 deletions
@@ -61,18 +61,21 @@ resources:
 
 </div>
 
-The script orchestrates distributed execution across multiple nodes using MPI. The master node (identified by
-`DSTACK_NODE_RANK=0`) generates `hostfile` listing all node IPs and continuously checks until all worker nodes are
-accessible via MPI. Once confirmed, it executes the `/root/nccl-tests/build/all_reduce_perf` benchmark script across all available GPUs.
-
-Worker nodes use a FIFO pipe to block execution until they receive a termination signal from the master
-node. This ensures worker nodes remain active during the test and only exit once the master node completes the
-benchmark.
-
-> The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
-> [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also
-> be used with regular TCP/IP network adapters and InfiniBand.
-> See the [source code :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/docker/efa) for the image.
+!!! info "MPI"
+    NCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
+    and waits until worker nodes are accessible via MPI.
+    Then, it executes `/nccl-tests/build/all_reduce_perf` across all GPUs.
+
+    Worker nodes use a `FIFO` pipe to wait until the MPI run is finished.
+
+    There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
+
+!!! info "Docker image"
+    The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
+    [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also
+    be used with regular TCP/IP network adapters and InfiniBand.
+
+    See the [source code :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/docker/efa) for the image.
 
 ### Apply a configuration
 

examples/misc/rccl-tests/.dstack.yml

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+type: task
+name: rccl-tests
+
+nodes: 2
+
+# Uncomment to mount the system libraries folder from the host
+#volumes:
+#  - /usr/local/lib:/mnt/lib
+
+image: rocm/dev-ubuntu-22.04:6.4-complete
+env:
+  - NCCL_DEBUG=INFO
+  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
+commands:
+  # Setup MPI and build RCCL tests
+  - apt-get install -y git libopenmpi-dev openmpi-bin
+  - git clone https://github.com/ROCm/rccl-tests.git
+  - cd rccl-tests
+  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME}
+
+  # Uncomment to preload the RoCE driver library from the host (for Broadcom driver compatibility)
+  #- export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
+
+  # Run RCCL tests via MPI
+  - |
+    FIFO=/tmp/${DSTACK_RUN_NAME}
+    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
+      sleep 10
+      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
+      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
+      # Wait for other nodes
+      while true; do
+        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
+          break
+        fi
+        echo 'Waiting for worker nodes...'
+        sleep 5
+      done
+      # Run RCCL tests
+      ${MPIRUN} \
+        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
+        --mca btl_tcp_if_include ens41np0 \
+        -x LD_PRELOAD \
+        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
+        -x NCCL_IB_GID_INDEX=3 \
+        -x NCCL_IB_DISABLE=0 \
+        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
+      # Notify worker nodes the MPI run is finished
+      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
+    else
+      mkfifo ${FIFO}
+      # Wait for a message from the master node
+      cat ${FIFO}
+    fi
+
+resources:
+  gpu: MI300X:8

examples/misc/rccl-tests/README.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
+# RCCL tests
+
+This example shows how to run distributed [RCCL tests :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rccl-tests){:target="_blank"} with MPI using `dstack`.
+
+## Running as a task
+
+Here's an example of a task that runs an AllReduce test on 2 nodes, each with 8 `MI300X` GPUs (16 processes in total).
+
+<div editor-title="examples/misc/rccl-tests/.dstack.yml">
+
+```yaml
+type: task
+name: rccl-tests
+
+nodes: 2
+
+# Mount the system libraries folder from the host
+volumes:
+  - /usr/local/lib:/mnt/lib
+
+image: rocm/dev-ubuntu-22.04:6.4-complete
+env:
+  - NCCL_DEBUG=INFO
+  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
+commands:
+  # Setup MPI and build RCCL tests
+  - apt-get install -y git libopenmpi-dev openmpi-bin
+  - git clone https://github.com/ROCm/rccl-tests.git
+  - cd rccl-tests
+  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME}
+
+  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
+  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so
+
+  # Run RCCL tests via MPI
+  - |
+    FIFO=/tmp/${DSTACK_RUN_NAME}
+    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
+      sleep 10
+      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
+      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
+      # Wait for other nodes
+      while true; do
+        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
+          break
+        fi
+        echo 'Waiting for worker nodes...'
+        sleep 5
+      done
+      # Run RCCL tests
+      ${MPIRUN} \
+        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
+        --mca btl_tcp_if_include ens41np0 \
+        -x LD_PRELOAD \
+        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
+        -x NCCL_IB_GID_INDEX=3 \
+        -x NCCL_IB_DISABLE=0 \
+        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
+      # Notify worker nodes the MPI run is finished
+      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
+    else
+      mkfifo ${FIFO}
+      # Wait for a message from the master node
+      cat ${FIFO}
+    fi
+
+resources:
+  gpu: MI300X:8
+```
+
+</div>
+
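The launch arithmetic in the task above can be checked locally. This is only a sketch: the `DSTACK_*` values are hard-coded stand-ins for what `dstack` would set on a 2-node run with 8 GPUs per node, and reading `-b`, `-e`, and `-f` as minimum size, maximum size, and multiplication factor (with `M` and `G` as binary units) is an assumption based on the usual nccl-tests/rccl-tests conventions.

```shell
# Stand-in values (assumptions); on a real run dstack sets these variables.
DSTACK_NODES_NUM=2
DSTACK_GPUS_PER_NODE=8
DSTACK_GPUS_NUM=$((DSTACK_NODES_NUM * DSTACK_GPUS_PER_NODE))

# mpirun -n takes the total number of ranks, -N the number of ranks per node.
echo "total ranks: ${DSTACK_GPUS_NUM}"

# Sizes swept by `-b 8M -e 8G -f 2`: start at 8 MiB and double up to 8 GiB.
size=$((8 * 1024 * 1024))
max=$((8 * 1024 * 1024 * 1024))
steps=0
while [ "$size" -le "$max" ]; do
  steps=$((steps + 1))
  size=$((size * 2))
done
echo "message sizes: ${steps}"
```

With one rank per GPU (`-g 1`), the total rank count matches the 16 processes mentioned in the README text.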
+!!! info "MPI"
+    RCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
+    and waits until worker nodes are accessible via MPI.
+    Then, it executes `/rccl-tests/build/all_reduce_perf` across all GPUs.
+
+    Worker nodes use a `FIFO` pipe to wait until the MPI run is finished.
+
+    There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
+
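The hostfile and FIFO handshake described in the MPI note can be tried on a single machine without MPI. The sketch below is illustrative only: the IP addresses are placeholders from a documentation range, the space-separated form of `DSTACK_NODES_IPS` is an assumption taken from the `tr ' ' '\n'` call in the task, and the "worker" is a background process rather than a second node.

```shell
# Stand-in values (assumptions); dstack provides DSTACK_NODES_IPS on a real run.
DSTACK_NODES_IPS="192.0.2.10 192.0.2.11"
WORKDIR=$(mktemp -d)
FIFO="${WORKDIR}/run-fifo"

# Master side: write one IP per line, the form passed to mpirun --hostfile.
echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > "${WORKDIR}/hostfile"
HOSTS=$(wc -l < "${WORKDIR}/hostfile" | tr -d ' ')

# Worker side: create the pipe and block reading from it (here in the background).
mkfifo "${FIFO}"
cat "${FIFO}" > "${WORKDIR}/worker.log" &
WORKER_PID=$!

# Master side: once the benchmark is done, writing to the pipe releases the worker.
echo done > "${FIFO}"
wait "${WORKER_PID}"
MESSAGE=$(cat "${WORKDIR}/worker.log")

echo "hosts: ${HOSTS}, worker received: ${MESSAGE}"
rm -rf "${WORKDIR}"
```

In the task itself the write happens through `mpirun ... sh -c "echo done > ${FIFO}"`, so each worker's `cat` returns only after the benchmark has finished on the master.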
+!!! info "RoCE library"
+    Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom
+    kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it
+    using `LD_PRELOAD` when running MPI.
+
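On hosts without the Broadcom library, the unconditional `export LD_PRELOAD=...` points at a file that does not exist. One possible guard, which is not part of the example, is to preload the library only when the mount provides it. `BNXT_LIB` and `PRELOAD_STATUS` are hypothetical names introduced for this sketch.

```shell
# Hypothetical guard (not part of the example): preload the host library only
# if the mounted folder actually contains it.
BNXT_LIB="${BNXT_LIB:-/mnt/lib/libbnxt_re-rdmav34.so}"

if [ -f "${BNXT_LIB}" ]; then
  export LD_PRELOAD="${BNXT_LIB}"
  PRELOAD_STATUS="enabled"
else
  PRELOAD_STATUS="skipped"
fi
echo "RoCE library preload ${PRELOAD_STATUS}: ${BNXT_LIB}"
```

Because the task passes `-x LD_PRELOAD` to `mpirun`, whatever value is exported at this point is what the MPI ranks would see.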
+### Creating a fleet
+
+Define an SSH fleet configuration by listing the IP addresses of each node in the cluster, along with the SSH user and SSH key configured for each host.
+
+```yaml
+type: fleet
+# The name is optional; if not specified, it is generated randomly
+name: mi300x-fleet
+
+# SSH credentials for the on-prem servers
+ssh_config:
+  user: root
+  identity_file: ~/.ssh/id_rsa
+  hosts:
+    - 144.202.58.28
+    - 137.220.58.52
+```
+
+### Apply a configuration
+
+To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply/) command.
+
+<div class="termy">
+
+```shell
+$ dstack apply -f examples/misc/rccl-tests/.dstack.yml
+
+ #  BACKEND       RESOURCES                      INSTANCE TYPE  PRICE
+ 1  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance       $0     idle
+                  MI300X:192GB:8
+ 2  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance       $0     idle
+                  MI300X:192GB:8
+
+Submit the run rccl-tests? [y/n]: y
+```
+
+</div>
+
+## Source code
+
+The source code of this example can be found in
+[`examples/misc/rccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/rccl-tests).
+
+## What's next?
+
+1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
+   [services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/concepts/fleets).

mkdocs.yml

Lines changed: 2 additions & 1 deletion
@@ -262,7 +262,8 @@ nav:
       - Tenstorrent: examples/accelerators/tenstorrent/index.md
     - Misc:
       - Docker Compose: examples/misc/docker-compose/index.md
-      - NCCL Tests: examples/misc/nccl-tests/index.md
+      - NCCL tests: examples/misc/nccl-tests/index.md
+      - RCCL tests: examples/misc/rccl-tests/index.md
       - A3 Mega: examples/misc/a3mega-clusters/index.md
       - A3 High: examples/misc/a3high-clusters/index.md
 # - Community: community.md
