docs/blog/posts/mpi.md (3 additions & 3 deletions)
@@ -86,10 +86,10 @@ resources:

</div>

-The first worker node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
+The master node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
reachable via MPI. Once confirmed, it launches the `/root/nccl-tests/build/all_reduce_perf` benchmark across all available GPUs in the cluster.

-The other worker nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.
+Non-master nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.

With this, now you can use such a task to run both NCCL or RCCL tests on both cloud and SSH fleets,
as well as use MPI for other tasks.
@@ -102,4 +102,4 @@ as well as use MPI for other tasks.

!!! info "What's next?"
1. Learn more about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
2. Check the [NCCL tests](../../examples/clusters/nccl-tests/index.md) example
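For orientation while reading this hunk, the flow it describes can be sketched as a dstack task like the one below. This is a minimal sketch rather than the post's exact configuration: the task name, process count, FIFO path, and `mpirun` flags are illustrative, and it assumes `DSTACK_NODES_IPS` holds one IP address per line.

```yaml
type: task
name: nccl-tests-sketch       # illustrative name
nodes: 2

commands:
  - |
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      # Master node: turn the node IPs injected by dstack into an MPI hostfile.
      echo "$DSTACK_NODES_IPS" > hostfile
      # Retry a no-op MPI run until every node in the hostfile is reachable.
      until mpirun --allow-run-as-root --hostfile hostfile -N 1 true; do sleep 5; done
      # Run the benchmark across the cluster (-np 16 assumes 2 nodes x 8 GPUs).
      mpirun --allow-run-as-root --hostfile hostfile -np 16 \
        /root/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
      # Unblock the other nodes by writing into their FIFO pipes.
      mpirun --allow-run-as-root --hostfile hostfile -N 1 sh -c 'echo done > /tmp/mpirun_done'
    else
      # Non-master nodes: block on a FIFO pipe until the master signals completion.
      mkfifo /tmp/mpirun_done
      cat /tmp/mpirun_done
    fi
```

Applied with `dstack apply`, the same commands run on every node, and `DSTACK_NODE_RANK` is what separates the master branch from the waiting branch.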
docs/docs/concepts/fleets.md (18 additions & 15 deletions)
@@ -63,33 +63,34 @@ Once the status of instances changes to `idle`, they can be used by dev environm

To ensure instances are interconnected (e.g., for
[distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
-This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity
+This ensures all instances are provisioned with optimal inter-node connectivity.

??? info "AWS"
-When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
+When you create a cloud fleet with AWS, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
Otherwise, instances are only connected by the default VPC subnet.

Refer to the [EFA](../../blog/posts/efa.md) example for more details.

??? info "GCP"
-When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

!!! info "Backend configuration"
Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
[A3 High](../../examples/clusters/a3high/index.md) examples for more details.

??? info "Nebius"
-When you create a Nebius cloud fleet with `placement: cluster`, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
+When you create a cloud fleet with Nebius, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
Otherwise, instances are only connected by the default VPC subnet.

-An InfiniBand fabric for the cluster is selected automatically.
-If you prefer to use some specific fabrics, configure them in the
+An InfiniBand fabric for the cluster is selected automatically. If you prefer to use some specific fabrics, configure them in the
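As a point of reference for the `placement` option this hunk touches, a cluster fleet configuration might look like the sketch below; the fleet name, node count, backend, and GPU spec are placeholders rather than values from the docs.

```yaml
type: fleet
name: my-cluster-fleet    # placeholder name
nodes: 2
placement: cluster        # provision all instances with optimal inter-node connectivity

backends: [aws]           # on AWS, GCP, and Nebius, fast interconnect (EFA, GPUDirect-TCPXO/TCPX, InfiniBand)
                          # is configured automatically where the instance type supports it

resources:
  gpu: H100:8             # placeholder GPU spec
```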
docs/docs/guides/clusters.md (6 additions & 6 deletions)
@@ -18,22 +18,22 @@ Cloud fleets allow to provision interconnected clusters across supported backend

For cloud fleets, fast interconnect is currently supported only on the `aws`, `gcp`, and `nebius` backends.

=== "AWS"
-When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+When you create a cloud fleet with AWS, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.

!!! info "Backend configuration"
-Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration.
+Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
Refer to the [EFA](../../blog/posts/efa.md) example for more details.

=== "GCP"
-When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

!!! info "Backend configuration"
Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
-[A3 Mega](../../examples/clusters/a3high/index.md) examples for more details.
+[A3 High](../../examples/clusters/a3high/index.md) examples for more details.

=== "Nebius"
-When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+When you create a cloud fleet with Nebius, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.

> To request fast interconnect support for a other backends,
file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_ blank"}.
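To make the backend requirements in these tabs concrete, the relevant options might appear in the `dstack` server's `config.yml` roughly as sketched below. Only `public_ips` and `extra_vpcs` come from the text above; the project name, credentials blocks, project ID, and VPC names are placeholders, so check the backend reference for the actual schema.

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default
        public_ips: false                                # required for EFA
      - type: gcp
        project_id: my-gcp-project                       # placeholder
        creds:
          type: default
        extra_vpcs: ["dstack-vpc-1", "dstack-vpc-2"]     # required for GPUDirect-TCPXO/TCPX
```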
@@ -47,7 +47,7 @@ To test the interconnect of a created fleet, ensure you run [NCCL](../../example

A distributed task is a task with `nodes` set to a value greater than `2`. In this case, `dstack` first ensures a
suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
-`dstack` starts worker nodes and runs the task container on each worker node.
+`dstack` starts the rest of the nodes and runs the task container on each of them.

Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
[system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication.
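The variables named in this hunk are what a per-node launcher would consume. A hedged sketch with a torchrun-style entry point follows; the script name, port, and GPU counts are assumptions, not values from the guide.

```yaml
type: task
name: train-distrib            # placeholder name
nodes: 2

commands:
  - |
    # dstack injects DSTACK_NODE_RANK and DSTACK_MASTER_NODE_IP on every node,
    # so all nodes can join the same rendezvous without extra wiring.
    torchrun \
      --nnodes=2 --nproc-per-node=8 \
      --node-rank="$DSTACK_NODE_RANK" \
      --master-addr="$DSTACK_MASTER_NODE_IP" \
      --master-port=29500 \
      train.py                 # placeholder training script

resources:
  gpu: H100:8                  # placeholder GPU spec
```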
examples/clusters/nccl-tests/README.md (2 additions & 2 deletions)
@@ -63,10 +63,10 @@ resources:

!!! info "MPI"
NCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-and waits until worker nodes are accessible via MPI.
+and waits until other nodes are accessible via MPI.
Then, it executes `/nccl-tests/build/all_reduce_perf` across all GPUs.

-Worker nodes use a `FIFO` pipe to wait for until the MPI run is finished.
+Non-master nodes use a `FIFO` pipe to wait for until the MPI run is finished.

There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
RCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-and waits until worker nodes are accessible via MPI.
+and waits until other nodes are accessible via MPI.
Then, it executes `/rccl-tests/build/all_reduce_perf` across all GPUs.

-Worker nodes use a `FIFO` pipe to wait for until the MPI run is finished.
+Other nodes use a `FIFO` pipe to wait for until the MPI run is finished.

There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.