
Commit 8632ded

[Docs] Minor PR review feedback fixes
Parent: be0bec6

6 files changed

Lines changed: 35 additions & 32 deletions


docs/blog/posts/mpi.md

Lines changed: 3 additions & 3 deletions
@@ -86,10 +86,10 @@ resources:
 
 </div>
 
-The first worker node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
+The master node (`DSTACK_NODE_RANK=0`) generates a `hostfile` listing all node IPs and waits until all nodes are
 reachable via MPI. Once confirmed, it launches the `/root/nccl-tests/build/all_reduce_perf` benchmark across all available GPUs in the cluster.
 
-The other worker nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.
+Non-master nodes remain blocked until they receive a termination signal from the master node via a FIFO pipe.
 
 With this, you can now use such a task to run both NCCL and RCCL tests on cloud and SSH fleets,
 as well as use MPI for other tasks.
@@ -102,4 +102,4 @@ as well as use MPI for other tasks.
 !!! info "What's next?"
     1. Learn more about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
     2. Check the [NCCL tests](../../examples/clusters/nccl-tests/index.md) example
-    2. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
+    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
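
For reference, the coordination pattern these edits describe condenses to roughly the following task sketch (the task name and FIFO path are illustrative, not from the commit; `DSTACK_NODE_RANK`, `DSTACK_NODES_IPS`, and `DSTACK_NODES_NUM` are dstack's documented system environment variables):

```yaml
type: task
name: mpi-coordination-sketch  # illustrative name
nodes: 2
commands:
  - |
    # Illustrative pipe path; dstack sets the DSTACK_* variables on every node
    FIFO=/tmp/dstack_job_done
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      # Master: list every node IP in a hostfile for mpirun
      for ip in $DSTACK_NODES_IPS; do echo "$ip" >> /root/hostfile; done
      # ... wait until all nodes answer mpirun, then run the benchmark ...
      # Finally, notify the other nodes that the MPI run is finished
      mpirun --hostfile /root/hostfile -n "$DSTACK_NODES_NUM" -N 1 sh -c "echo done > $FIFO"
    else
      # Non-master: block until the master signals completion
      mkfifo "$FIFO"
      cat "$FIFO"
    fi
```

The `mkfifo`/`cat` pair is what keeps the non-master containers alive until the master's run completes, matching the rccl-tests configuration shown further down.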

docs/docs/concepts/fleets.md

Lines changed: 18 additions & 15 deletions
@@ -63,33 +63,34 @@ Once the status of instances changes to `idle`, they can be used by dev environm
 
 To ensure instances are interconnected (e.g., for
 [distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
-This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity
+This ensures all instances are provisioned with optimal inter-node connectivity.
 
 ??? info "AWS"
-    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
+    When you create a cloud fleet with AWS, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
     Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
     Otherwise, instances are only connected by the default VPC subnet.
 
     Refer to the [EFA](../../blog/posts/efa.md) example for more details.
 
 ??? info "GCP"
-    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+    When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
 
     !!! info "Backend configuration"
         Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
         Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
         [A3 High](../../examples/clusters/a3high/index.md) examples for more details.
 
 ??? info "Nebius"
-    When you create a Nebius cloud fleet with `placement: cluster`, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
+    When you create a cloud fleet with Nebius, [InfiniBand networking :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
     Otherwise, instances are only connected by the default VPC subnet.
 
-    An InfiniBand fabric for the cluster is selected automatically.
-    If you prefer to use some specific fabrics, configure them in the
+    An InfiniBand fabric for the cluster is selected automatically. If you prefer to use some specific fabrics, configure them in the
     [backend settings](../reference/server/config.yml.md#nebius).
 
-> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr`
-> backends.
+The `cluster` placement is supported for `aws`, `azure`, `gcp`, `nebius`, `oci`, and `vultr`
+backends.
+
+> For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide.
 
 #### Resources
 
@@ -312,13 +313,14 @@ Once the status of instances changes to `idle`, they can be used by dev environm
 If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`.
 This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks).
 
-##### Network
-
-By default, `dstack` automatically detects the network shared by the hosts.
-However, it's possible to configure it explicitly via
-the [`network`](../reference/dstack.yml/fleet.md#network) property.
+??? info "Network"
+    By default, `dstack` automatically detects the network shared by the hosts.
+    However, it's possible to configure it explicitly via
+    the [`network`](../reference/dstack.yml/fleet.md#network) property.
+
+    [//]: # (TODO: Provide an example and more detail)
 
-[//]: # (TODO: Provide an example and more detail)
+> For more details on optimal inter-node connectivity, read the [Clusters](../guides/clusters.md) guide.
 
 #### Blocks { #ssh-blocks }
 
@@ -471,5 +473,6 @@ Alternatively, you can delete a fleet by passing the fleet name to `dstack flee
 To terminate and delete specific instances from a fleet, pass `-i INSTANCE_NUM`.
 
 !!! info "What's next?"
-    1. Read about [dev environments](dev-environments.md), [tasks](tasks.md), and
+    1. Check [dev environments](dev-environments.md), [tasks](tasks.md), and
        [services](services.md)
+    2. Read the [Clusters](../guides/clusters.md) guide
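
For context, a minimal cloud fleet configuration using the `placement` setting these docs describe might look like the sketch below (the fleet name, node count, and resource spec are arbitrary examples, not from the commit):

```yaml
type: fleet
name: my-cluster-fleet  # arbitrary example name
nodes: 4
placement: cluster  # provision all nodes together for optimal inter-node connectivity
resources:
  gpu: H100:8  # example: 8x H100 per node
```

With `placement: cluster`, all instances are provisioned in the same backend and region, so the backend's fast interconnect (EFA, GPUDirect, or InfiniBand) can be configured automatically.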

docs/docs/guides/clusters.md

Lines changed: 6 additions & 6 deletions
@@ -18,22 +18,22 @@ Cloud fleets allow to provision interconnected clusters across supported backend
 For cloud fleets, fast interconnect is currently supported only on the `aws`, `gcp`, and `nebius` backends.
 
 === "AWS"
-    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+    When you create a cloud fleet with AWS, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
 
     !!! info "Backend configuration"
-        Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration.
+        Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
         Refer to the [EFA](../../blog/posts/efa.md) example for more details.
 
 === "GCP"
-    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+    When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
 
     !!! info "Backend configuration"
         Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
         Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
-        [A3 Mega](../../examples/clusters/a3high/index.md) examples for more details.
+        [A3 High](../../examples/clusters/a3high/index.md) examples for more details.
 
 === "Nebius"
-    When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+    When you create a cloud fleet with Nebius, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
 
 > To request fast interconnect support for other backends,
 > file an [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues){:target="_blank"}.
@@ -47,7 +47,7 @@ To test the interconnect of a created fleet, ensure you run [NCCL](../../example
 
 A distributed task is a task with `nodes` set to a value greater than `1`. In this case, `dstack` first ensures a
 suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
-`dstack` starts worker nodes and runs the task container on each worker node.
+`dstack` starts the rest of the nodes and runs the task container on each of them.
 
 Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other
 [system environment variables](../concepts/tasks.md#system-environment-variables) for inter-node communication.
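
As a sketch of how those variables are typically consumed (assuming a PyTorch job; `train.py`, the port, and the per-node process count are placeholders, not part of the guide):

```yaml
type: task
name: train-distributed  # placeholder name
nodes: 2
commands:
  # dstack injects DSTACK_NODES_NUM, DSTACK_NODE_RANK, and DSTACK_MASTER_NODE_IP on every node
  - >
    torchrun --nnodes=$DSTACK_NODES_NUM
    --node-rank=$DSTACK_NODE_RANK
    --master-addr=$DSTACK_MASTER_NODE_IP
    --master-port=12345
    --nproc-per-node=8
    train.py
resources:
  gpu: H100:8  # example: 8 GPUs per node, matching --nproc-per-node
```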

examples/clusters/nccl-tests/README.md

Lines changed: 2 additions & 2 deletions
@@ -63,10 +63,10 @@ resources:
 
 !!! info "MPI"
     NCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-    and waits until worker nodes are accessible via MPI.
+    and waits until other nodes are accessible via MPI.
     Then, it executes `/nccl-tests/build/all_reduce_perf` across all GPUs.
 
-    Worker nodes use a `FIFO` pipe to wait for until the MPI run is finished.
+    Non-master nodes use a `FIFO` pipe to wait until the MPI run is finished.
 
     There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
 
examples/clusters/rccl-tests/.dstack.yml

Lines changed: 2 additions & 2 deletions
@@ -33,7 +33,7 @@ commands:
       if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
         break
       fi
-      echo 'Waiting for worker nodes...'
+      echo 'Waiting for other nodes...'
       sleep 5
     done
     # Run NCCL Tests
@@ -45,7 +45,7 @@ commands:
       -x NCCL_IB_GID_INDEX=3 \
       -x NCCL_IB_DISABLE=0 \
       ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
-    # Notify worker nodes the MPI run is finished
+    # Notify other nodes the MPI run is finished
     ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
 else
     mkfifo ${FIFO}

examples/clusters/rccl-tests/README.md

Lines changed: 4 additions & 4 deletions
@@ -44,7 +44,7 @@ commands:
       if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
         break
       fi
-      echo 'Waiting for worker nodes...'
+      echo 'Waiting for other nodes...'
       sleep 5
     done
     # Run NCCL Tests
@@ -56,7 +56,7 @@ commands:
      -x NCCL_IB_GID_INDEX=3 \
      -x NCCL_IB_DISABLE=0 \
      ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
-    # Notify worker nodes the MPI run is finished
+    # Notify other nodes the MPI run is finished
     ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
 else
     mkfifo ${FIFO}
@@ -72,10 +72,10 @@ resources:
 
 !!! info "MPI"
     RCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-    and waits until worker nodes are accessible via MPI.
+    and waits until other nodes are accessible via MPI.
     Then, it executes `/rccl-tests/build/all_reduce_perf` across all GPUs.
 
-    Worker nodes use a `FIFO` pipe to wait for until the MPI run is finished.
+    Other nodes use a `FIFO` pipe to wait until the MPI run is finished.
 
     There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
 
