| title | Crusoe |
|---|---|
| description | Using Crusoe clusters with InfiniBand support via VMs or Kubernetes |
`dstack` supports Crusoe clusters with fast interconnect in two ways:

- VMs – If you configure a `crusoe` backend in `dstack` by providing your Crusoe credentials, `dstack` lets you fully provision and use clusters through `dstack`.
- Kubernetes – If you create a Kubernetes cluster on Crusoe, configure a `kubernetes` backend, and create a backend fleet in `dstack`, `dstack` lets you fully use this cluster through `dstack`.
`dstack` offers a VM-based backend that natively integrates with Crusoe, so you only need to provide your Crusoe credentials. `dstack` can then provision and use Crusoe clusters for you.
Log into your Crusoe console, create an API key under your account settings, and note your project ID.
```yaml
projects:
- name: main
  backends:
    - type: crusoe
      project_id: your-project-id
      creds:
        type: access_key
        access_key: your-access-key
        secret_key: your-secret-key
```

Once the backend is configured, you can create a fleet:
```yaml
type: fleet
name: crusoe-fleet

nodes: 2
placement: cluster

backends: [crusoe]

resources:
  gpu: A100:80GB:8
```

Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```

This automatically creates an IB partition and provisions instances with InfiniBand networking.
Once the fleet is created, you can run dev environments, tasks, and services.
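
As a minimal sketch of a workload on this fleet, a dev environment could be defined as follows. The name `crusoe-dev` and the choice of `vscode` as the IDE are illustrative assumptions; the backend and GPU spec simply mirror the fleet configuration above.

```yaml
type: dev-environment
# Hypothetical name, chosen for this example
name: crusoe-dev

# Assumed IDE; pick the one you actually use
ide: vscode

# Same backend and GPU spec as the fleet above
backends: [crusoe]

resources:
  gpu: A100:80GB:8
```

Like the fleet, this configuration is passed to `dstack apply -f` with the path to the file.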
If you want instances to be provisioned on demand, you can set `nodes` to `0..2`. In this case, `dstack` will create instances only when you run workloads.
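
For illustration, here is the fleet configuration from above adjusted for on-demand provisioning. Only `nodes` differs; the remaining values are copied from the earlier example and should be adapted to your setup.

```yaml
type: fleet
name: crusoe-fleet

# A range instead of a fixed count: no instances up front, at most 2 when workloads need them
nodes: 0..2
placement: cluster

backends: [crusoe]

resources:
  gpu: A100:80GB:8
```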
To use Crusoe Managed Kubernetes instead, create a cluster in the Crusoe console:

- Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. The `dstack` server uses this port to access the jump host.
- Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
- Go to the cluster and click `Create Node Pool`. Select the instance type and set `Desired Number of Nodes`.
- Wait until the nodes are provisioned.
Even if you enable `autoscaling`, `dstack` can use only the nodes that are already provisioned.
Follow the standard instructions for setting up a `kubernetes` backend:

```yaml
projects:
- name: main
  backends:
    - type: kubernetes
      kubeconfig:
        filename: <kubeconfig path>
      proxy_jump:
        port: 30022
```

Once the Crusoe Managed Kubernetes cluster and the `dstack` server are running, you can create a fleet:
```yaml
type: fleet
name: crusoe-fleet

placement: cluster
nodes: 0..

backends: [kubernetes]

resources:
  # Specify requirements to filter nodes
  gpu: 8
```

Pass the fleet configuration to `dstack apply`:
```shell
$ dstack apply -f crusoe-fleet.dstack.yml
```

Once the fleet is created, you can run dev environments, tasks, and services.
Use a distributed task that runs NCCL tests to validate cluster network bandwidth.
=== "VMs"
With the Crusoe backend, HPC-X and NCCL topology files are pre-installed on the host VM image. Mount them into the container via [instance volumes](../../concepts/volumes.md#instance-volumes).
<div editor-title="crusoe-nccl-tests.dstack.yml">
```yaml
type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo

commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi

backends: [crusoe]

resources:
  gpu: A100:80GB:8
  shm_size: 16GB
```
</div>
> Update `NCCL_TOPO_FILE` to match your instance type. Topology files for all supported types are available at `/etc/crusoe/nccl_topo/` on the host.
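
To see which topology files your host image provides, one option is a small single-node task that mounts the same directory and lists it. This is only a sketch: the task name `list-nccl-topo` is hypothetical, and the backend and resources are copied from the example above, so adjust them to your instance type.

```yaml
type: task
# Hypothetical name, chosen for this example
name: list-nccl-topo

# Same instance volume as in the NCCL tests task above
volumes:
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo

commands:
  # Print the available topology files; pick the one matching your instance type
  - ls /etc/crusoe/nccl_topo/

backends: [crusoe]

resources:
  gpu: A100:80GB:8
```

The file name you pick from the listing is the value to use in `NCCL_TOPO_FILE`.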
=== "Kubernetes"
If you're running on Crusoe Managed Kubernetes, make sure to install HPC-X and provide an up-to-date topology file.
<div editor-title="crusoe-nccl-tests.dstack.yml">
```yaml
type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

commands:
  # Install NCCL topology files
  - curl -sSL https://gist.github.com/un-def/48df8eea222fa9547ad4441986eb15af/archive/df51d56285c5396a0e82bb42f4f970e7bb0a9b65.tar.gz -o nccl_topo.tar.gz
  - mkdir -p /etc/crusoe/nccl_topo
  - tar -C /etc/crusoe/nccl_topo -xf nccl_topo.tar.gz --strip-components=1
  # Install and initialize HPC-X
  - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
  - mkdir -p /opt/hpcx
  - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  # Run NCCL Tests
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_AR_THRESHOLD=0 \
        -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
        -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
        -x NCCL_IB_QPS_PER_CONNECTION=2 \
        -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
        -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi

# Required for IB
privileged: true

resources:
  gpu: A100:8
  shm_size: 16GB
```
</div>
> The task above downloads an A100 topology file from a Gist. The most reliable way to obtain the latest topology is to copy it from a Crusoe-provisioned VM (see [VMs](#vms)).
??? info "Privileged"
When running on Crusoe Managed Kubernetes, set `privileged` to `true` to ensure access to InfiniBand.
Pass the configuration to dstack apply:
```shell
$ dstack apply -f crusoe-nccl-tests.dstack.yml
```

- Learn about dev environments, tasks, and services
- Check out backends and fleets
- Check the docs on Crusoe's networking and "Crusoe Managed" Kubernetes