| If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance. | ||
| Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations. | ||
| When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type. | ||
| Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration. |
Not quite. According to the previous version of this section, EFA is also used with `public_ips: true`, except that only one EFA interface is attached.
There is no point in using EFA without multiple interfaces, I guess.
| suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up, | ||
| `dstack` starts worker nodes and runs the task container on each worker node. |
Is this actually the case? I don't think we guarantee the order in which containers start
I'm pretty sure we don't provide any guarantees.
For example, here the master node started 40 seconds after a non-master node. To demonstrate this, I pre-created a fleet and removed the Docker image from one of the nodes so that the job assigned to that node takes longer to start.
type: task
nodes: 2
commands:
- "echo Node rank: $DSTACK_NODE_RANK"
- date --iso-8601=ns
> dstack logs chatty-swan-1 --job 0
Node rank: 0
2025-05-16T10:27:35,731006010-04:00
> dstack logs chatty-swan-1 --job 1
Node rank: 1
2025-05-16T10:26:55,453938960-04:00
We've thought about making the order configurable, but currently it is expected to be random. This is a good default, as it prevents any GPU time from being wasted.
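As a quick check of the roughly 40-second gap in the logs above, the two timestamps can be subtracted directly. This is a minimal Python sketch: `parse_gnu_iso_ns` is a hypothetical helper written for this illustration (it is not part of `dstack`), and it assumes the comma-separated nanosecond format that `date --iso-8601=ns` printed in this run.

```python
import re
from datetime import datetime


def parse_gnu_iso_ns(stamp: str) -> datetime:
    """Parse a timestamp such as 2025-05-16T10:27:35,731006010-04:00.

    Hypothetical helper: assumes a comma decimal separator, exactly nine
    fractional digits, and a +HH:MM / -HH:MM UTC offset.
    """
    match = re.fullmatch(r"(.+?),(\d{9})([+-]\d{2}:\d{2})", stamp)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    base, nanos, offset = match.groups()
    # datetime only stores microseconds, so keep the first six digits.
    return datetime.fromisoformat(f"{base}.{nanos[:6]}{offset}")


# Values copied from the two `dstack logs` outputs in the comment above.
rank0_started = parse_gnu_iso_ns("2025-05-16T10:27:35,731006010-04:00")
rank1_started = parse_gnu_iso_ns("2025-05-16T10:26:55,453938960-04:00")

# Positive value: the rank 0 (master) job printed its timestamp later.
delay_seconds = (rank0_started - rank1_started).total_seconds()
print(f"rank 0 started {delay_seconds:.3f} s after rank 1")
```

A positive result means the job with `DSTACK_NODE_RANK` 0 logged its timestamp after the rank 1 job, which matches the observation that start order is not guaranteed.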
| === "AWS" | ||
| When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type. | ||
|
|
||
| !!! info "Backend configuration" | ||
| Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration. | ||
| Refer to the [EFA](../../blog/posts/efa.md) example for more details. | ||
|
|
||
| === "GCP" | ||
| When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured. | ||
|
|
||
| !!! info "Backend configuration" | ||
| Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration. | ||
| Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and | ||
| [A3 Mega](../../examples/clusters/a3high/index.md) examples for more details. | ||
|
|
||
| === "Nebius" | ||
| When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type. |
My comments from fleets.md are also relevant here.
Maybe leave a link to fleets.md instead of duplicating the details for each backend?
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>