| 1 | +# Clusters |
| 2 | + |
| 3 | +A cluster is a fleet with its `placement` set to `cluster`. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication, which is crucial for tasks such as distributed training. |
| 4 | + |
| 5 | +## Fleets |
| 6 | + |
| 7 | +Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet. |
| 8 | + |
| 9 | +### SSH fleets |
| 10 | + |
| 11 | +SSH fleets let you create a fleet from existing bare-metal servers or VMs, e.g. if they are already pre-provisioned or set up on-premises. |
| 12 | + |
| 13 | +> For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers. |
| 14 | +
|
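As a rough illustration, an SSH fleet that forms a cluster might be declared as below. This is a sketch rather than a definitive configuration: the fleet name, user, key path, and host addresses are placeholders, and the `ssh_config` layout should be checked against the [fleets](../concepts/fleets.md) reference.

```yaml
type: fleet
# Hypothetical fleet name
name: my-ssh-fleet

# Treat the listed hosts as one interconnected cluster
placement: cluster

ssh_config:
  # Placeholder credentials and addresses; replace with your own
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.10
    - 192.168.100.11
```

The interconnect drivers on these hosts still have to be installed beforehand, as noted above.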
| 15 | +### Cloud fleets |
| 16 | + |
| 17 | +Cloud fleets let you provision interconnected clusters across supported backends. |
| 18 | +For cloud fleets, fast interconnect is currently supported only on the `aws`, `gcp`, and `nebius` backends. |
| 19 | + |
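A cloud fleet that requests cluster placement could look roughly like the following. The fleet name, node count, backend choice, and GPU spec are illustrative assumptions, not recommendations.

```yaml
type: fleet
# Hypothetical fleet name
name: my-cluster-fleet

# Number of instances to provision (example value)
nodes: 2

# Ask for interconnected instances
placement: cluster

# Restrict provisioning to a backend with fast interconnect support (example choice)
backends: [aws]

resources:
  # Example GPU spec; pick one whose instance types support the interconnect
  gpu: H100:8
```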
| 20 | +=== "AWS" |
| 21 | + When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type. |
| 22 | + |
| 23 | + !!! info "Backend configuration" |
| 24 | + Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration. |
| 25 | + Refer to the [EFA](../../blog/posts/efa.md) example for more details. |
| 26 | + |
| 27 | +=== "GCP" |
| 28 | + When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured. |
| 29 | + |
| 30 | + !!! info "Backend configuration" |
| 31 | + Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration. |
| 32 | + Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and |
| 33 | + [A3 High](../../examples/clusters/a3high/index.md) examples for more details. |
| 34 | + |
| 35 | +=== "Nebius" |
| 36 | + When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type. |
| 37 | + |
| 38 | +> To request fast interconnect support for other backends, |
| 39 | +file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_blank"}. |
| 40 | + |
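The `public_ips` requirement mentioned for AWS might be expressed in the server's backend configuration along these lines. This is a hedged sketch: it assumes the configuration lives in `~/.dstack/server/config.yml`, and the project name and credentials type are placeholders.

```yaml
projects:
  - name: main            # hypothetical project name
    backends:
      - type: aws
        creds:
          type: default   # assumed credentials type; use whatever your setup requires
        # EFA requires instances to be provisioned without public IPs
        public_ips: false
```

The `extra_vpcs` setting for `gcp` goes in the same backend entry; the linked A3 examples describe the networks it expects.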
| 41 | +## NCCL/RCCL tests |
| 42 | + |
| 43 | +To test the interconnect of a created fleet, run [NCCL](../../examples/clusters/nccl-tests/index.md) |
| 44 | +(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests. |
| 45 | + |
| 46 | +## Distributed tasks |
| 47 | + |
| 48 | +A distributed task is a task with `nodes` set to `2` or more. In this case, `dstack` first ensures a |
| 49 | +suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up, |
| 50 | +`dstack` starts worker nodes and runs the task container on each worker node. |
| 51 | + |
| 52 | +Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other |
| 53 | +[system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication. |
| 54 | + |
| 55 | +Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example. |
| 56 | + |
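To make the use of these variables concrete, here is a minimal sketch of a two-node task that passes them to `torchrun`. It assumes a container image with PyTorch installed and a hypothetical `train.py` script. `DSTACK_NODES_NUM` and `DSTACK_GPUS_PER_NODE` are assumed to be among the other system environment variables, so confirm the names in the linked reference.

```yaml
type: task
# Hypothetical task name
name: train-distrib

# Two or more nodes makes the task distributed
nodes: 2

commands:
  # train.py is a placeholder for your own training script
  - |
    torchrun \
      --nnodes=$DSTACK_NODES_NUM \
      --node_rank=$DSTACK_NODE_RANK \
      --nproc_per_node=$DSTACK_GPUS_PER_NODE \
      --master_addr=$DSTACK_MASTER_NODE_IP \
      --master_port=29500 \
      train.py

resources:
  # Example GPU spec
  gpu: H100:8
```

The same commands run on every node; only the values of the variables differ, which is how each process learns its rank and where the master is.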
| 57 | +!!! info "Retry policy" |
| 58 | + By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to restart the run if any node fails. |
| 59 | + |
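A retry policy is set in the task configuration. The fragment below is a sketch with example values; the event names and the `duration` semantics should be checked against the [retry policy](../concepts/tasks.md#retry-policy) reference.

```yaml
retry:
  # Events that should trigger a restart of the run (example selection)
  on_events: [error, interruption]
  # Maximum period to keep retrying (example value)
  duration: 1h
```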
| 60 | +## Volumes |
| 61 | + |
| 62 | +### Network volumes |
| 63 | + |
| 64 | +Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name [interpolation syntax](../concepts/volumes.md#distributed-tasks). This approach mounts a separate single-attach volume to each node. |
| 65 | + |
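The interpolation approach might look like the fragment below. The volume name prefix and mount path are placeholders. It assumes that one volume per node rank (for example `data-volume-0` and `data-volume-1` for a two-node task) has already been created.

```yaml
type: task
nodes: 2

volumes:
  # Resolves to data-volume-0 on the first node, data-volume-1 on the second, and so on
  - name: data-volume-${{ dstack.node_rank }}
    path: /volume_data
```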
| 66 | +### Instance volumes |
| 67 | + |
| 68 | +Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks. |
| 69 | + |
| 70 | +Instance volumes can be used to mount: |
| 71 | + |
| 72 | +* Regular folders (data persists only while the fleet exists) |
| 73 | +* Folders that are mounts of shared filesystems (e.g., manually mounted shared filesystems) |
| 74 | + |
| 75 | +Refer to [instance volumes](../concepts/volumes.md#instance) for an example. |
| 76 | + |
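For instance, mounting a host folder backed by a shared filesystem could be sketched as follows. Both paths are hypothetical, and the shared filesystem is assumed to be mounted at the same location on every host in the fleet.

```yaml
type: task
nodes: 2

volumes:
  # <path on the host>:<path inside the container>
  - /mnt/shared/data:/data
```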
| 77 | +!!! info "What's next?" |
| 78 | + 1. Read about [distributed tasks](../concepts/tasks.md#distributed-tasks), [fleets](../concepts/fleets.md), and [volumes](../concepts/volumes.md) |
| 79 | + 2. Browse the [Clusters](../../examples.md#clusters) examples |
| 80 | + |