
Commit de0bf5f

Simplify NCCL tests example (#2723)

1 parent: a444b84

2 files changed: 12 additions & 59 deletions
examples/clusters/nccl-tests/.dstack.yml (4 additions & 23 deletions)
````diff
@@ -2,44 +2,25 @@ type: task
 name: nccl-tests
 
 nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
 
 image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
   - |
-    # We use FIFO for inter-node communication
-    FIFO=/tmp/dstack_job
     if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
       cd /root/nccl-tests/build
-      # Generate hostfile for mpirun
-      : > hostfile
-      for ip in ${DSTACK_NODES_IPS}; do
-        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
-      done
-      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
-      # Wait for other nodes
-      while true; do
-        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
-          break
-        fi
-        echo 'Waiting for nodes...'
-        sleep 5
-      done
+      MPIRUN="mpirun --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE"
       # Run NCCL Tests
       ${MPIRUN} \
         -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
-        --mca pml ^cm \
-        --mca btl tcp,self \
         --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
         ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
-      # Notify nodes the job is done
-      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
     else
-      mkfifo ${FIFO}
-      # Wait for a message from the first node
-      cat ${FIFO}
+      sleep infinity
     fi
 
 resources:
````
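Before this commit, the task assembled the `mpirun` hostfile by hand from `DSTACK_NODES_IPS`; the commit replaces that with dstack's prebuilt `$DSTACK_MPI_HOSTFILE`. A minimal, runnable sketch of the format being generated (the IPs and the temp path below are placeholders, not values from a real cluster):

```shell
# Placeholder node IPs and a temp file stand in for a real cluster;
# dstack now provides a ready-made hostfile via $DSTACK_MPI_HOSTFILE.
DSTACK_NODES_IPS="192.168.0.1 192.168.0.2"
DSTACK_GPUS_PER_NODE=4
HOSTFILE=$(mktemp)
: > "${HOSTFILE}"                      # truncate, as the old task did
for ip in ${DSTACK_NODES_IPS}; do
  echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> "${HOSTFILE}"
done
HOSTFILE_CONTENT=$(cat "${HOSTFILE}")
echo "${HOSTFILE_CONTENT}"
rm -f "${HOSTFILE}"
```

Each line names a node and how many MPI slots (here, one per GPU) it offers, which is the shape `--hostfile` expects.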

examples/clusters/nccl-tests/README.md (8 additions & 36 deletions)
````diff
@@ -6,51 +6,32 @@ This example shows how to run distributed [NCCL tests :material-arrow-top-right-
 
 Here's an example of a task that runs AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).
 
-<div editor-title="examples/distributed-training/nccl-tests/.dstack.yml">
+<div editor-title="examples/clusters/nccl-tests/.dstack.yml">
 
 ```yaml
 type: task
 name: nccl-tests
 
 nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
 
 image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
   - |
-    # We use FIFO for inter-node communication
-    FIFO=/tmp/dstack_job
     if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
       cd /root/nccl-tests/build
-      # Generate hostfile for mpirun
-      : > hostfile
-      for ip in ${DSTACK_NODES_IPS}; do
-        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
-      done
-      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
-      # Wait for other nodes
-      while true; do
-        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
-          break
-        fi
-        echo 'Waiting for nodes...'
-        sleep 5
-      done
-      # Run NCCL tests
+      MPIRUN="mpirun --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE"
+      # Run NCCL Tests
       ${MPIRUN} \
         -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
-        --mca pml ^cm \
-        --mca btl tcp,self \
         --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
         ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
-      # Notify nodes the job is done
-      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
     else
-      mkfifo ${FIFO}
-      # Wait for a message from the first node
-      cat ${FIFO}
+      sleep infinity
     fi
 
 resources:
````
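The `-n`/`-N` pair in the `mpirun` invocation follows directly from the cluster shape the README describes (2 nodes, 4 GPUs each, 8 processes in total). A small sketch with illustrative values; in a real run dstack injects these variables:

```shell
# Illustrative values; in a real task dstack sets these automatically.
DSTACK_NODES_NUM=2
DSTACK_GPUS_PER_NODE=4
DSTACK_GPUS_NUM=$((DSTACK_NODES_NUM * DSTACK_GPUS_PER_NODE))
MPIRUN_ARGS="-n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE}"
echo "mpirun ${MPIRUN_ARGS}"   # 8 ranks total, 4 ranks per node
```

So `-n` is the total rank count across the cluster and `-N` caps ranks per node, matching one rank per GPU.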
````diff
@@ -61,15 +42,6 @@ resources:
 
 </div>
 
-!!! info "MPI"
-    NCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-    and waits until other nodes are accessible via MPI.
-    Then, it executes `/nccl-tests/build/all_reduce_perf` across all GPUs.
-
-    Non-master nodes use a `FIFO` pipe to wait for until the MPI run is finished.
-
-    There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
-
 !!! info "Docker image"
     The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
     [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also
````
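The removed `FIFO` coordination (now superseded by `stop_criteria: master-done`) was a blocking handshake: workers sat in `cat` on a named pipe until the master wrote to it. A self-contained sketch of that pattern, using a demo path and a background subshell in place of the master's final `mpirun` notification:

```shell
# Demo-only path and background writer; the real task used /tmp/dstack_job
# and the master's closing mpirun to deliver the message to every node.
FIFO=$(mktemp -u)
mkfifo "${FIFO}"
( echo done > "${FIFO}" ) &    # stands in for the master's "job done" notify
MSG=$(cat "${FIFO}")           # worker side: blocks until the message arrives
echo "${MSG}"
wait
rm -f "${FIFO}"
```

Because opening a FIFO blocks until both ends are attached, the worker cannot race past the master: `cat` simply waits as long as the job runs.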
````diff
@@ -84,7 +56,7 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
 <div class="termy">
 
 ```shell
-$ dstack apply -f examples/distributed-training/nccl-tests/.dstack.yml
+$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml
 
 #  BACKEND  REGION     INSTANCE       RESOURCES                                   SPOT  PRICE
 1  aws      us-east-1  g4dn.12xlarge  48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk)  no    $3.912
````
````diff
@@ -99,7 +71,7 @@ Submit the run nccl-tests? [y/n]: y
 ## Source code
 
 The source-code of this example can be found in
-[`examples/distributed-training/nccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/nccl-tests).
+[`examples/clusters/nccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-tests).
 
 ## What's next?
````

0 commit comments