| 1 | +# RCCL tests |
| 2 | + |
| 3 | +This example shows how to run distributed [RCCL tests :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rccl-tests){:target="_blank"} with MPI using `dstack`. |
| 4 | + |
| 5 | +## Running as a task |
| 6 | + |
| 7 | +Here's an example of a task that runs the AllReduce test on 2 nodes, each with 8 `MI300X` GPUs (16 processes in total). |
| 8 | + |
| 9 | +<div editor-title="examples/misc/rccl-tests/.dstack.yml"> |
| 10 | + |
| 11 | +```yaml |
| 12 | +type: task |
| 13 | +name: rccl-tests |
| 14 | + |
| 15 | +nodes: 2 |
| 16 | + |
| 17 | +# Mount the system libraries folder from the host |
| 18 | +volumes: |
| 19 | +  - /usr/local/lib:/mnt/lib |
| 20 | + |
| 21 | +image: rocm/dev-ubuntu-22.04:6.4-complete |
| 22 | +env: |
| 23 | +  - NCCL_DEBUG=INFO |
| 24 | +  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi |
| 25 | +commands: |
| 26 | +  # Set up MPI and build RCCL tests |
| 27 | +  - apt-get install -y git libopenmpi-dev openmpi-bin |
| 28 | +  - git clone https://github.com/ROCm/rccl-tests.git |
| 29 | +  - cd rccl-tests |
| 30 | +  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME} |
| 31 | + |
| 32 | +  # Preload the RoCE driver library from the host (for Broadcom driver compatibility) |
| 33 | +  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so |
| 34 | + |
| 35 | +  # Run RCCL tests via MPI |
| 36 | +  - | |
| 37 | +    FIFO=/tmp/${DSTACK_RUN_NAME} |
| 38 | +    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then |
| 39 | +      sleep 10 |
| 40 | +      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile |
| 41 | +      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile' |
| 42 | +      # Wait for other nodes |
| 43 | +      while true; do |
| 44 | +        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then |
| 45 | +          break |
| 46 | +        fi |
| 47 | +        echo 'Waiting for worker nodes...' |
| 48 | +        sleep 5 |
| 49 | +      done |
| 50 | +      # Run RCCL tests |
| 51 | +      ${MPIRUN} \ |
| 52 | +        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \ |
| 53 | +        --mca btl_tcp_if_include ens41np0 \ |
| 54 | +        -x LD_PRELOAD \ |
| 55 | +        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \ |
| 56 | +        -x NCCL_IB_GID_INDEX=3 \ |
| 57 | +        -x NCCL_IB_DISABLE=0 \ |
| 58 | +        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0; |
| 59 | +      # Notify worker nodes the MPI run is finished |
| 60 | +      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}" |
| 61 | +    else |
| 62 | +      mkfifo ${FIFO} |
| 63 | +      # Wait for a message from the master node |
| 64 | +      cat ${FIFO} |
| 65 | +    fi |
| 66 | + |
| 67 | +resources: |
| 68 | +  gpu: MI300X:8 |
| 69 | +``` |
| 70 | + |
| 71 | +</div> |
| 72 | + |
| 73 | +!!! info "MPI" |
| 74 | +    RCCL tests rely on MPI to run across multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`) |
| 75 | +    and waits until the worker nodes are accessible via MPI. |
| 76 | +    Then, it executes `./build/all_reduce_perf` (from the `rccl-tests` directory) across all GPUs. |
| 77 | + |
| 78 | +    Worker nodes use a `FIFO` pipe to wait until the MPI run is finished. |
| 79 | + |
| 80 | +    There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks. |
| 81 | + |
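The two mechanisms described above, hostfile generation and the FIFO handshake, can be tried locally without MPI. The following is a minimal sketch, not part of the example itself: the IP addresses are placeholders, the temporary paths come from `mktemp`, and a background subshell stands in for a worker node.

```shell
# Sketch only: sample IPs and temp paths are placeholders, not values from a real run.

# 1. Hostfile: one IP per line, built the same way as in the task
#    (spaces are turned into newlines; newline-separated input passes through unchanged).
SAMPLE_IPS="10.0.0.1 10.0.0.2"
HOSTFILE="$(mktemp)"
echo "$SAMPLE_IPS" | tr ' ' '\n' > "$HOSTFILE"
cat "$HOSTFILE"

# 2. FIFO handshake: the "worker" blocks on the pipe until the "master" writes to it.
FIFO="$(mktemp -u)"
mkfifo "$FIFO"
( cat "$FIFO" > "$FIFO.out" ) &   # stands in for a worker node waiting on the pipe
WORKER_PID=$!
echo done > "$FIFO"               # stands in for the master's final notification
wait "$WORKER_PID"
RESULT="$(cat "$FIFO.out")"
echo "worker received: $RESULT"
rm -f "$FIFO" "$FIFO.out"
```

In the task itself, the write to the pipe is issued through `mpirun` so that it runs once on every node, which is what releases each waiting worker.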
| 82 | +!!! info "RoCE library" |
| 83 | +    Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom |
| 84 | +    kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it |
| 85 | +    using `LD_PRELOAD` when running MPI. |
| 86 | + |
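If the host library is missing from the mounted folder, the preload fails only at run time. One possible hardening, shown here as a sketch rather than as part of the example, is to export `LD_PRELOAD` only when the file is present. The helper name `preload_if_present` is hypothetical; the path is the one used in the task above.

```shell
# Hypothetical helper: export LD_PRELOAD only if the library file exists.
preload_if_present() {
  lib="$1"
  if [ -f "$lib" ]; then
    export LD_PRELOAD="$lib"
    return 0
  fi
  echo "warning: $lib not found; skipping LD_PRELOAD" >&2
  return 1
}

# Path taken from the task configuration; '|| true' keeps the script going if it is absent.
preload_if_present /mnt/lib/libbnxt_re-rdmav34.so || true
```

Whether to continue or abort when the library is absent depends on the cluster; on hosts with Broadcom NICs, failing fast may be the safer choice.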
| 87 | +### Creating a fleet |
| 88 | + |
| 89 | +Define an SSH fleet configuration by listing the IP addresses of each node in the cluster, along with the SSH user and SSH key configured for each host. |
| 90 | + |
| 91 | +```yaml |
| 92 | +type: fleet |
| 93 | +# The name is optional; if not specified, it is generated randomly |
| 94 | +name: mi300x-fleet |
| 95 | + |
| 96 | +# SSH credentials for the on-prem servers |
| 97 | +ssh_config: |
| 98 | +  user: root |
| 99 | +  identity_file: ~/.ssh/id_rsa |
| 100 | +  hosts: |
| 101 | +    - 144.202.58.28 |
| 102 | +    - 137.220.58.52 |
| 103 | +``` |
| 104 | + |
| 105 | +### Applying a configuration |
| 106 | + |
| 107 | +To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply/) command. |
| 108 | + |
| 109 | +<div class="termy"> |
| 110 | + |
| 111 | +```shell |
| 112 | +$ dstack apply -f examples/misc/rccl-tests/.dstack.yml |
| 113 | + |
| 114 | + #  BACKEND       RESOURCES                      INSTANCE TYPE  PRICE |
| 115 | + 1  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance       $0     idle |
| 116 | +                  MI300X:192GB:8 |
| 117 | + 2  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance       $0     idle |
| 118 | +                  MI300X:192GB:8 |
| 119 | + |
| 120 | +Submit the run rccl-tests? [y/n]: y |
| 121 | +``` |
| 122 | + |
| 123 | +</div> |
| 124 | + |
| 125 | +## Source code |
| 126 | + |
| 127 | +The source code for this example can be found in |
| 128 | +[`examples/misc/rccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/misc/rccl-tests). |
| 129 | + |
| 130 | +## What's next? |
| 131 | + |
| 132 | +1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks), |
| 133 | + [services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/concepts/fleets). |