[Docs] Added Clusters guide

peterschmidt85 · peterschmidt85 · commit 332611f2891e · 2025-05-15T23:46:43.000+02:00
diff --git a/docs/blog/posts/nebius.md b/docs/blog/posts/nebius.md
@@ -104,7 +104,7 @@ $ dstack apply -f .dstack.yml
 The new `nebius` backend supports CPU and GPU instances, [fleets](../../docs/concepts/fleets.md), 
 [distributed tasks](../../docs/concepts/tasks.md#distributed-tasks), and more. 
  
-> Support for [network volumes](../../docs/concepts/volumes.md#network-volumes) and accelerated cluster 
+> Support for [network volumes](../../docs/concepts/volumes.md#network) and accelerated cluster 
 interconnects is coming soon.
 
 !!! info "What's next?"
diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md
@@ -66,14 +66,22 @@ To ensure instances are interconnected (e.g., for
 This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity
 
 ??? info "AWS"
-    `dstack` automatically enables the Elastic Fabric Adapter for all
-    [EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
-    If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
-    Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.
+    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
+    Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration.
+    Otherwise, instances are only connected by the default VPC subnet.
+
+    Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+
+??? info "GCP"
+    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+
+    !!! info "Backend configuration"    
+        Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured  in the `gcp` backend configuration.
+        Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and 
+        [A3 Mega](../../examples/clusters/a3high/index.md) examples for more details.
 
 ??? info "Nebius"
-    `dstack` automatically creates an [InfiniBand cluster](https://docs.nebius.com/compute/clusters/gpu)
-    if all instances in the fleet support it.
+    When you create a cloud fleet with `nebius`, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
     Otherwise, instances are only connected by the default VPC subnet.
 
     An InfiniBand fabric for the cluster is selected automatically.
diff --git a/docs/docs/concepts/tasks.md b/docs/docs/concepts/tasks.md
@@ -137,8 +137,7 @@ resources:
 
 Nodes can communicate using their private IP addresses.
 Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
-[System environment variables](#system-environment-variables)
-to discover IP addresses and other details.
+[System environment variables](#system-environment-variables) for inter-node communication.
 
 ??? info "Network interface"
     Distributed frameworks usually detect the correct network interface automatically,
diff --git a/docs/docs/concepts/volumes.md b/docs/docs/concepts/volumes.md
@@ -4,12 +4,12 @@ Volumes enable data persistence between runs of dev environments, tasks, and ser
 
 `dstack` supports two kinds of volumes: 
 
-* [Network volumes](#network-volumes) &mdash; provisioned via backends and mounted to specific container directories.
+* [Network volumes](#network) &mdash; provisioned via backends and mounted to specific container directories.
   Ideal for persistent storage.
-* [Instance volumes](#instance-volumes) &mdash; bind directories on the host instance to container directories.
+* [Instance volumes](#instance) &mdash; bind directories on the host instance to container directories.
 Useful as a cache for cloud fleets or for persistent storage with SSH fleets.
 
-## Network volumes
+## Network volumes { #network }
 
 Network volumes are currently supported for the `aws`, `gcp`, and `runpod` backends.
 
@@ -130,6 +130,7 @@ and its contents will persist across runs.
 
     `dstack` will attach one of the volumes based on the region and backend of the run.  
 
+<span id="distributed-tasks"></span>
 ??? info "Distributed tasks"
     When using single-attach volumes such as AWS EBS with distributed tasks,
     you can attach different volumes to different nodes using `dstack` variable interpolation:
@@ -221,7 +222,7 @@ If you've registered an existing volume, it will be de-registered with `dstack`
 ??? info "Can I attach network volumes to multiple runs or instances?"
     You can mount a volume in multiple runs. This feature is currently supported only by the `runpod` backend.
 
-## Instance volumes
+## Instance volumes { #instance }
 
 Instance volumes allow mapping any directory on the instance where the run is executed to any path inside the container.
 This means that the data in instance volumes is persisted only if the run is executed on the same instance.
diff --git a/docs/docs/guides/clusters.md b/docs/docs/guides/clusters.md
@@ -0,0 +1,80 @@
+# Clusters
+
+A cluster is a fleet with its `placement` set to `cluster`. This configuration ensures that the instances within the fleet are interconnected, enabling fast inter-node communication—crucial for tasks such as efficient distributed training.
+
+## Fleets
+
+Ensure a fleet is created before you run any distributed task. This can be either an SSH fleet or a cloud fleet.
+
+### SSH fleets
+
+SSH fleets can be used to create a fleet out of existing baremetals or VMs, e.g. if they are already pre-provisioned, or set up on-premises.
+
+> For SSH fleets, fast interconnect is supported provided that the hosts are pre-configured with the appropriate interconnect drivers.
+
+### Cloud fleets
+
+Cloud fleets allow to provision interconnected clusters across supported backends.
+For cloud fleets, fast interconnect is currently supported only on the `aws`, `gcp`, and `nebius` backends.
+
+=== "AWS"
+    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+    
+    !!! info "Backend configuration"    
+        Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration.
+        Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+
+=== "GCP"
+    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+
+    !!! info "Backend configuration"    
+        Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured  in the `gcp` backend configuration.
+        Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and 
+        [A3 Mega](../../examples/clusters/a3high/index.md) examples for more details.
+
+=== "Nebius"
+    When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+
+> To request fast interconnect support for a other backends,
+file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_ blank"}. 
+
+## NCCL/RCCL tests
+
+To test the interconnect of a created fleet, ensure you run [NCCL](../../examples/clusters/nccl-tests/index.md) 
+(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests.
+
+## Distributed tasks
+
+A distributed task is a task with `nodes` set to a value greater than `2`. In this case, `dstack` first ensures a 
+suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
+`dstack` starts worker nodes and runs the task container on each worker node.
+
+Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
+[system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication.
+
+Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an example.
+
+!!! info "Retry policy"
+    By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to  restart the run if any node fails.
+
+## Volumes
+
+### Network volumes
+
+Currently, no backend supports multi-attach network volumes for distributed tasks. However, single-attach volumes can be used by leveraging volume name [interpolation syntax](../concepts/volumes.md#distributed-tasks). This approach mounts a separate single-attach volume to each node.
+
+### Instance volumes
+
+Instance volumes enable mounting any folder from the host into the container, allowing data persistence during distributed tasks.
+
+Instance volumes can be used to mount:
+
+* Regular folders (data persists only while the fleet exists)
+* Folders that are mounts of shared filesystems (e.g., manually mounted shared filesystems).
+
+Refer to [instance volumes](../concepts/volumes.md#instance) for an example.
+
+!!! info "What's next?"
+    1. Read about [distributed tasks](../concepts/tasks.md#distributed-tasks), [fleets](../concepts/fleets.md), and [volumes](../concepts/volumes.md)
+    2. Browse the [Clusters](../../examples.md#clusters) examples
+    
diff --git a/docs/docs/guides/protips.md b/docs/docs/guides/protips.md
@@ -36,9 +36,9 @@ unlimited).
 ## Volumes
 
 To persist data across runs, it is recommended to use volumes.
-`dstack` supports two types of volumes: [network](../concepts/volumes.md#network-volumes) 
+`dstack` supports two types of volumes: [network](../concepts/volumes.md#network) 
 (for persisting data even if the instance is interrupted)
-and [instance](../concepts/volumes.md#instance-volumes) (useful for persisting cached data across runs while the instance remains active).
+and [instance](../concepts/volumes.md#instance) (useful for persisting cached data across runs while the instance remains active).
 
 > If you use [SSH fleets](../concepts/fleets.md#ssh), you can mount network storage (e.g., NFS or SMB) to the hosts and access it in runs via instance volumes.
 
diff --git a/docs/docs/guides/troubleshooting.md b/docs/docs/guides/troubleshooting.md
@@ -87,7 +87,7 @@ Examples: `gpu: amd` (one AMD GPU), `gpu: A10:4..8` (4 to 8 A10 GPUs),
 
 #### Cause 6: Network volumes
 
-If your run configuration uses [network volumes](../concepts/volumes.md#network-volumes),
+If your run configuration uses [network volumes](../concepts/volumes.md#network),
 `dstack` will only select instances from the same backend and region as the volumes.
 For AWS, the availability zone of the volume and the instance should also match.
 
@@ -97,7 +97,7 @@ Some `dstack` features are not supported by all backends. If your configuration
 one of these features, `dstack` will only select offers from the backends that support it.
 
 - [Cloud fleet](../concepts/fleets.md#cloud) configurations,
-  [Instance volumes](../concepts/volumes.md#instance-volumes),
+  [Instance volumes](../concepts/volumes.md#instance),
   and [Privileged containers](../reference/dstack.yml/dev-environment.md#privileged)
   are supported by all backends except `runpod`, `vastai`, and `kubernetes`.
 - [Clusters](../concepts/fleets.md#cloud-placement)
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -222,6 +222,7 @@ nav:
       - Guides:
           - Protips: docs/guides/protips.md
           - Metrics: docs/guides/metrics.md
+          - Clusters: docs/guides/clusters.md
           - Server deployment: docs/guides/server-deployment.md
           - Plugins: docs/guides/plugins.md
           - Troubleshooting: docs/guides/troubleshooting.md