[Docs] Added Clusters guide #2646

Merged
peterschmidt85 merged 5 commits into master from clusters-guide
May 16, 2025

@peterschmidt85
Contributor

No description provided.

peterschmidt85 requested review from jvstme and r4victor on May 15, 2025 21:47
Comment thread docs/docs/concepts/fleets.md Outdated
If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.
When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
Collaborator

Not quite. According to the previous version of this section, EFA is also used with `public_ips: true`, except that only one EFA interface is attached.

Contributor Author

There is no point in using EFA without multiple interfaces, I guess.
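
The interface rule discussed in this thread can be summarized in a short sketch. This is an illustrative Python function, not `dstack`'s actual implementation; the name `efa_interface_count` and its parameters are hypothetical. It assumes the behavior described in the quoted docs text: without public IPs, an instance gets the maximum number of EFA interfaces its type supports, and with public IPs it gets exactly one.

```python
def efa_interface_count(efa_supported: bool, public_ips: bool, max_interfaces: int) -> int:
    """Return how many EFA interfaces an instance would get (hypothetical helper).

    efa_supported:  whether the instance type supports EFA at all
    public_ips:     value of `public_ips` in the `aws` backend configuration
    max_interfaces: maximum number of EFA interfaces the instance type supports
    """
    if not efa_supported:
        return 0  # EFA is not configured for unsupported instance types
    if public_ips:
        return 1  # with public IPs, only one EFA interface is enabled
    return max_interfaces  # without public IPs, all supported interfaces are enabled


# Example with a made-up instance type that supports up to 4 interfaces.
print(efa_interface_count(efa_supported=True, public_ips=True, max_interfaces=4))
print(efa_interface_count(efa_supported=True, public_ips=False, max_interfaces=4))
```

Under this reading, EFA is still used when `public_ips` is `true`, just with a single interface, which is the distinction the reviewer points out.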

Comment thread docs/docs/guides/clusters.md Outdated
Comment on lines +49 to +50
suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
`dstack` starts worker nodes and runs the task container on each worker node.
Collaborator

Is this actually the case? I don't think we guarantee the order in which containers start.

Contributor Author

@r4victor please comment on this

Collaborator

I'm pretty sure we don't provide any guarantees.

For example, here the master node started 40 seconds after a non-master node. To demonstrate this, I pre-created a fleet and removed the Docker image from one of the nodes so that the job assigned to that node takes longer to start.

```yaml
type: task
nodes: 2
commands:
- "echo Node rank: $DSTACK_NODE_RANK"
- date --iso-8601=ns
```

```shell
> dstack logs chatty-swan-1 --job 0
Node rank: 0
2025-05-16T10:27:35,731006010-04:00
> dstack logs chatty-swan-1 --job 1
Node rank: 1
2025-05-16T10:26:55,453938960-04:00
```
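
The gap between the two start times can be checked with a few lines of Python. This is a minimal sketch: `parse_ns_timestamp` is a hypothetical helper written for this example, and it assumes the timestamps have the shape shown in the logs above (comma as the decimal separator, nine fractional digits, then a UTC offset). Nanoseconds are truncated to microseconds because `datetime` does not store finer precision.

```python
from datetime import datetime


def parse_ns_timestamp(text: str) -> datetime:
    """Parse `date --iso-8601=ns` output, truncating nanoseconds to microseconds."""
    head, tail = text.split(",")            # date and time / fraction plus offset
    fraction, offset = tail[:9], tail[9:]   # nine nanosecond digits, then e.g. "-04:00"
    return datetime.strptime(
        f"{head}.{fraction[:6]}{offset}", "%Y-%m-%dT%H:%M:%S.%f%z"
    )


# Timestamps copied from the logs above.
master_start = parse_ns_timestamp("2025-05-16T10:27:35,731006010-04:00")  # rank 0
worker_start = parse_ns_timestamp("2025-05-16T10:26:55,453938960-04:00")  # rank 1

delay = (master_start - worker_start).total_seconds()
print(f"master started {delay:.3f} s after the worker")
```

The result is a little over 40 seconds, matching the figure quoted in the comment.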

We've thought about making the order configurable, but currently it is expected to be random. And this is a good default, as it prevents any GPU time from being wasted.
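
Because the start order is not guaranteed, a task cannot assume that the master is already listening when a worker starts. One common way to handle this is for each worker to poll the master's address until it accepts connections. The following is a minimal sketch of that idea, not part of `dstack`: `wait_for_master` and `MASTER_PORT` are hypothetical names, the port number is arbitrary, and `DSTACK_MASTER_NODE_IP` is assumed to be available in the job environment alongside the `DSTACK_NODE_RANK` variable used above. Distributed launchers often perform this kind of rendezvous themselves, in which case no extra code is needed.

```python
import os
import socket
import time

MASTER_PORT = 29500  # arbitrary port chosen for this example


def wait_for_master(host: str, port: int, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # the master is accepting connections
        except OSError:
            time.sleep(interval)  # not reachable yet, try again shortly
    return False


def main() -> None:
    rank = int(os.environ.get("DSTACK_NODE_RANK", "0"))
    if rank == 0:
        print("master node: nothing to wait for")
        return
    # Assumption: the master's IP is exposed through this environment variable.
    master_ip = os.environ["DSTACK_MASTER_NODE_IP"]
    if not wait_for_master(master_ip, MASTER_PORT):
        raise SystemExit(f"master {master_ip}:{MASTER_PORT} did not come up in time")
    print("master is reachable, continuing")


if __name__ == "__main__":
    main()
```

With this approach a worker that starts early simply waits, and a worker that starts late connects immediately, so the task behaves the same regardless of which node comes up first.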

Comment thread docs/docs/guides/clusters.md Outdated
Comment on lines +20 to +36
=== "AWS"
When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.

!!! info "Backend configuration"
Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
Refer to the [EFA](../../blog/posts/efa.md) example for more details.

=== "GCP"
When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

!!! info "Backend configuration"
Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
[A3 High](../../examples/clusters/a3high/index.md) examples for more details.

=== "Nebius"
When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
Collaborator

My comments from fleets.md are also relevant here.

Maybe leave a link to fleets.md instead of duplicating the details for each backend?

peterschmidt85 and others added 4 commits May 16, 2025 11:00
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
peterschmidt85 merged commit 20de4c7 into master on May 16, 2025
23 checks passed
peterschmidt85 deleted the clusters-guide branch on May 16, 2025 12:50