
# RCCL tests

This example shows how to run distributed [RCCL tests :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rccl-tests){:target="_blank"} with MPI using `dstack`.

## Running as a task

Here's an example of a task that runs an AllReduce test on two nodes, each with eight MI300X GPUs (16 processes in total).

```yaml
type: task
name: rccl-tests

nodes: 2

# Mount the system libraries folder from the host
volumes:
  - /usr/local/lib:/mnt/lib

image: rocm/dev-ubuntu-22.04:6.4-complete
env:
  - NCCL_DEBUG=INFO
  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
commands:
  # Set up MPI and build RCCL tests
  - apt-get update && apt-get install -y git libopenmpi-dev openmpi-bin
  - git clone https://github.com/ROCm/rccl-tests.git
  - cd rccl-tests
  - make MPI=1 MPI_HOME=${OPEN_MPI_HOME}

  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so

  # Run RCCL tests via MPI
  - |
    FIFO=/tmp/${DSTACK_RUN_NAME}
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      sleep 10
      echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Wait for the other nodes to become reachable
      while true; do
        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
          break
        fi
        echo 'Waiting for other nodes...'
        sleep 5
      done
      # Run RCCL tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_include ens41np0 \
        -x LD_PRELOAD \
        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
        -x NCCL_IB_GID_INDEX=3 \
        -x NCCL_IB_DISABLE=0 \
        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0
      # Notify the other nodes that the MPI run is finished
      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
    else
      mkfifo ${FIFO}
      # Wait for a message from the master node
      cat ${FIFO}
    fi

resources:
  gpu: MI300X:8
```
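The `all_reduce_perf` flags `-b 8M -e 8G -f 2` sweep message sizes from 8 MB up to 8 GB, doubling at each step. A quick sketch of the resulting sweep (it only enumerates sizes; it doesn't run the benchmark):

```shell
# Enumerate the message sizes implied by `-b 8M -e 8G -f 2`:
# start at 8 MB and double (-f 2) until reaching 8 GB.
size=$((8 * 1024 * 1024))
end=$((8 * 1024 * 1024 * 1024))
n=0
while [ "$size" -le "$end" ]; do
  n=$((n + 1))
  size=$((size * 2))
done
echo "$n message sizes from 8M to 8G"   # 8M, 16M, ..., 4G, 8G
```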

!!! info "MPI"
    RCCL tests rely on MPI to launch multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates a `hostfile` (from `DSTACK_NODES_IPS`) and waits until the other nodes are reachable via MPI. It then executes `build/all_reduce_perf` across all GPUs.
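The hostfile generation step can be reproduced in isolation (the IPs below are placeholders, not real cluster addresses):

```shell
# Simulate what the master node does: turn the space-separated
# list of node IPs into a one-IP-per-line hostfile for mpirun.
DSTACK_NODES_IPS="10.0.0.1 10.0.0.2"
echo "$DSTACK_NODES_IPS" | tr ' ' '\n' > hostfile
cat hostfile
```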

The other nodes use a `FIFO` pipe to wait until the MPI run is finished.
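The same handshake can be sketched standalone (paths below are placeholders): a reader blocks on the FIFO until the writer sends a message, just as the worker nodes block until the master signals completion.

```shell
FIFO=/tmp/demo-fifo
rm -f "$FIFO" /tmp/demo-msg
mkfifo "$FIFO"
cat "$FIFO" > /tmp/demo-msg &   # "worker": blocks until data arrives
echo done > "$FIFO"             # "master": signals completion
wait                            # let the reader finish
cat /tmp/demo-msg
```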

There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.

!!! info "RoCE library"
    Broadcom RoCE support requires the `libbnxt_re` userspace library inside the container to be compatible with the host's Broadcom kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it using `LD_PRELOAD` when running MPI.
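A more defensive variant of the preload step, a sketch assuming the same mount path as above, avoids exporting `LD_PRELOAD` when the host library isn't actually mounted:

```shell
# Only preload the host's RoCE library if the volume mount exists;
# otherwise continue without setting LD_PRELOAD.
LIB=/mnt/lib/libbnxt_re-rdmav34.so
if [ -f "$LIB" ]; then
  export LD_PRELOAD="$LIB"
  status=preloaded
else
  status=missing
fi
echo "libbnxt_re: $status"
```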

## Creating a fleet

Define an SSH fleet configuration by listing the IP addresses of each node in the cluster, along with the SSH user and SSH key configured for each host.

```yaml
type: fleet
# The name is optional; if not specified, it's generated randomly
name: mi300x-fleet

# SSH credentials for the on-prem servers
ssh_config:
  user: root
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 144.202.58.28
    - 137.220.58.52
```

## Apply a configuration

To run a configuration, use the `dstack apply` command.

```shell
$ dstack apply -f examples/distributed-training/rccl-tests/.dstack.yml

 #  BACKEND       RESOURCES                      INSTANCE TYPE   PRICE
 1  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance        $0      idle
                  MI300X:192GB:8
 2  ssh (remote)  cpu=256 mem=2268GB disk=752GB  instance        $0      idle
                  MI300X:192GB:8

Submit the run rccl-tests? [y/n]: y
```

## Source code

The source code of this example can be found in [`examples/distributed-training/rccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/tree/master/examples/distributed-training/rccl-tests){:target="_blank"}.

## What's next?

  1. Check dev environments, tasks, services, and fleets.