Commit 6748fd3

[Docs] Added EFA example (#2820)
1 parent 7648a6c commit 6748fd3

File tree

8 files changed: +220 −12 lines
File renamed without changes.

docs/docs/concepts/fleets.md

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ This ensures all instances are provisioned with optimal inter-node connectivity.
     Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
     Otherwise, instances are only connected by the default VPC subnet.
 
-    Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+    Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.
 
 ??? info "GCP"
     When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

docs/docs/guides/clusters.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ For cloud fleets, fast interconnect is currently supported only on the `aws`, `g
 
     !!! info "Backend configuration"
         Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
-        Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+        Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.
 
 === "GCP"
     When you create a cloud fleet with GCP, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

docs/examples.md

Lines changed: 12 additions & 2 deletions
@@ -103,7 +103,7 @@ hide:
     <a href="/examples/clusters/a3mega"
        class="feature-cell sky">
         <h3>
-            A3 Mega
+            GCP A3 Mega
         </h3>
 
         <p>
@@ -113,13 +113,23 @@ hide:
     <a href="/examples/clusters/a3high"
        class="feature-cell sky">
         <h3>
-            A3 High
+            GCP A3 High
         </h3>
 
         <p>
             Set up GCP A3 High clusters with optimized networking
         </p>
     </a>
+    <a href="/examples/clusters/efa"
+       class="feature-cell sky">
+        <h3>
+            AWS EFA
+        </h3>
+
+        <p>
+            Set up AWS EFA clusters with optimized networking
+        </p>
+    </a>
 </div>
 
 ## Inference
## Inference

docs/examples/clusters/efa/index.md

Whitespace-only changes.

examples/clusters/efa/README.md

Lines changed: 198 additions & 0 deletions
@@ -0,0 +1,198 @@
# AWS EFA

In this guide, we’ll walk through how to run high-performance distributed training on AWS using [Amazon Elastic Fabric Adapter (EFA) :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"} with `dstack`.

## Overview

EFA is a network interface for Amazon EC2 that enables low-latency, high-bandwidth inter-node communication, which is essential for scaling distributed deep learning. With `dstack`, EFA is automatically enabled when you create fleets with supported instance types.

## Prerequisite

Before you start, make sure the `aws` backend is properly configured.

<div editor-title="~/.dstack/server/config.yml">

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default
        regions: ["us-west-2"]

        public_ips: false
        vpc_name: my-custom-vpc
```

</div>

!!! info "Multiple network interfaces"
    To use P4, P5, or P6 instances, set `public_ips` to `false`; this allows AWS to attach multiple network interfaces for EFA. In this case, make sure the `dstack` server can reach your VPC’s private subnets.

!!! info "VPC"
    If you use a custom VPC, verify that it permits all internal traffic between nodes for EFA to function properly.
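For illustration, the self-referencing security group that EFA traffic requires can be sketched in CloudFormation. This is a hedged sketch, not part of the `dstack` example: the resource names are hypothetical, and `VpcId` is a placeholder for your own VPC.

```yaml
Resources:
  EfaNodeSecurityGroup:               # hypothetical name, for illustration only
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow all intra-group traffic for EFA
      VpcId: vpc-0123456789abcdef0    # placeholder: your custom VPC
  EfaSelfIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref EfaNodeSecurityGroup
      IpProtocol: "-1"                # all protocols, all ports
      SourceSecurityGroupId: !Ref EfaNodeSecurityGroup
  EfaSelfEgress:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref EfaNodeSecurityGroup
      IpProtocol: "-1"
      DestinationSecurityGroupId: !Ref EfaNodeSecurityGroup
```

AWS's EFA documentation requires exactly this shape: a security group that allows all inbound and outbound traffic to and from itself.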
## Create a fleet

Once your backend is ready, define a fleet configuration.

<div editor-title="examples/clusters/efa/fleet.dstack.yml">

```yaml
type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

resources:
  gpu: H100:8
```

</div>
55+
Provision the fleet with `dstack apply`:
56+
57+
<div class="termy">
58+
59+
```shell
60+
$ dstack apply -f examples/clusters/efa/fleet.dstack.yml
61+
62+
Provisioning...
63+
---> 100%
64+
65+
FLEET INSTANCE BACKEND INSTANCE TYPE GPU PRICE STATUS CREATED
66+
my-efa-fleet 0 aws (us-west-2) p4d.24xlarge H100:8:80GB $98.32 idle 3 mins ago
67+
1 aws (us-west-2) p4d.24xlarge $98.32 idle 3 mins ago
68+
```
69+
70+
</div>
71+
72+
??? info "Instance types"
    `dstack` selects suitable instances automatically, but not
    [all types support EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}.
    To enforce EFA, you can specify `instance_types` explicitly:

    ```yaml
    type: fleet
    name: my-efa-fleet

    nodes: 2
    placement: cluster

    resources:
      gpu: L4

    instance_types: ["g6.8xlarge"] # If not specified, g6.xlarge may be used, which doesn't support EFA
    ```
## Run NCCL tests

To confirm that EFA is working, run NCCL tests:

<div editor-title="examples/clusters/nccl-tests/.dstack.yml">

```yaml
type: task
name: nccl-tests

nodes: 2

startup_order: workers-first
stop_criteria: master-done

env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

resources:
  gpu: 1..8
  shm_size: 16GB
```

</div>

Run it with `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml

Provisioning...
---> 100%
```

</div>
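The run logs tell you whether EFA was actually picked up. A minimal sketch for checking them, assuming the usual aws-ofi-nccl `Selected Provider is efa` log line and the standard `all_reduce_perf` column layout (both may vary across plugin and nccl-tests versions):

```python
import re


def efa_selected(log_text: str) -> bool:
    """With NCCL_DEBUG=INFO and the aws-ofi-nccl plugin, a line like
    'NET/OFI Selected Provider is efa' typically appears when EFA is used."""
    return bool(re.search(r"Selected Provider is efa", log_text, re.IGNORECASE))


def peak_busbw(log_text: str) -> float:
    """Scan all_reduce_perf result rows and return the peak out-of-place
    bus bandwidth (GB/s). Data rows start with the message size in bytes;
    column 8 is assumed to hold the out-of-place busbw."""
    peak = 0.0
    for line in log_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 9 and tokens[0].isdigit():
            try:
                peak = max(peak, float(tokens[7]))
            except ValueError:
                continue  # header or non-numeric row
    return peak
```

Feed it the output of `dstack logs <run-name>`; a busbw far below the instance's advertised interconnect bandwidth usually means NCCL fell back to TCP.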
!!! info "Docker image"
    You can use your own container by setting `image`. If omitted, `dstack` uses its default image with drivers, NCCL tests, and tools pre-installed.
## Run distributed training

Here’s an example using `torchrun` for a simple multi-node PyTorch job:

<div editor-title="examples/distributed-training/torchrun/.dstack.yml">

```yaml
type: task
name: train-distrib

nodes: 2

python: 3.12
env:
  - NCCL_DEBUG=INFO
commands:
  - git clone https://github.com/pytorch/examples.git pytorch-examples
  - cd pytorch-examples/distributed/ddp-tutorial-series
  - uv pip install -r requirements.txt
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      --master-port=12345 \
      multinode.py 50 10

resources:
  gpu: 1..8
  shm_size: 16GB
```

</div>
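The shell snippet above just stitches together the environment variables that `dstack` injects into every node. As an illustrative sketch of that mapping (the `DSTACK_*` names are the ones used above; port 12345 is the arbitrary choice from the config, not anything `dstack` mandates):

```python
def torchrun_args(env, script, *script_args):
    """Build the torchrun argument list from the DSTACK_* variables
    dstack sets on each node of a distributed task."""
    return [
        "torchrun",
        f"--nproc-per-node={env['DSTACK_GPUS_PER_NODE']}",
        f"--node-rank={env['DSTACK_NODE_RANK']}",
        f"--nnodes={env['DSTACK_NODES_NUM']}",
        f"--master-addr={env['DSTACK_MASTER_NODE_IP']}",
        "--master-port=12345",  # arbitrary free port, identical on all nodes
        script,
        *script_args,
    ]


# Example: values dstack would set on the first of two 8-GPU nodes
# (the master IP is a placeholder).
example_env = {
    "DSTACK_GPUS_PER_NODE": "8",
    "DSTACK_NODE_RANK": "0",
    "DSTACK_NODES_NUM": "2",
    "DSTACK_MASTER_NODE_IP": "10.0.0.5",
}
print(" ".join(torchrun_args(example_env, "multinode.py", "50", "10")))
```

Only `--node-rank` differs between nodes; every other flag is identical cluster-wide, which is why the same YAML runs unmodified on all nodes.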
Provision and launch it via `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/torchrun/.dstack.yml

Provisioning...
---> 100%
```

</div>

Instead of setting `python`, you can specify your own Docker image using `image`. Make sure that the image is properly configured for EFA.

!!! info "What's next"
    1. Learn more about [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks)
    2. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments),
       [services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets)
    3. Read the [Clusters](https://dstack.ai/docs/guides/clusters) guide

examples/clusters/nccl-tests/README.md

Lines changed: 3 additions & 5 deletions
@@ -13,30 +13,28 @@ type: task
 name: nccl-tests
 
 nodes: 2
+
 startup_order: workers-first
 stop_criteria: master-done
 
-image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
-  - cd /root/nccl-tests/build
   - |
     if [ $DSTACK_NODE_RANK -eq 0 ]; then
       mpirun \
         --allow-run-as-root \
         --hostfile $DSTACK_MPI_HOSTFILE \
         -n $DSTACK_GPUS_NUM \
         -N $DSTACK_GPUS_PER_NODE \
-        --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
-        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
+        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
     else
       sleep infinity
     fi
 
 resources:
-  gpu: nvidia:4:16GB
+  gpu: nvidia:1..8
   shm_size: 16GB
 ```

mkdocs.yml

Lines changed: 5 additions & 3 deletions
@@ -105,7 +105,7 @@ plugins:
       'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-metrics.md'
       'blog/inactive-dev-environments-auto-shutdown.md': 'blog/posts/inactivity-duration.md'
       'blog/data-centers-and-private-clouds.md': 'blog/posts/gpu-blocks-and-proxy-jump.md'
-      'blog/distributed-training-with-aws-efa.md': 'blog/posts/efa.md'
+      'blog/distributed-training-with-aws-efa.md': 'examples/clusters/efa/index.md'
       'blog/dstack-stats.md': 'blog/posts/dstack-metrics.md'
       'docs/concepts/metrics.md': 'docs/guides/metrics.md'
       'docs/guides/monitoring.md': 'docs/guides/metrics.md'
@@ -122,6 +122,7 @@ plugins:
       'examples/deployment/trtllm/index.md': 'examples/inference/trtllm/index.md'
       'examples/fine-tuning/trl/index.md': 'examples/single-node-training/trl/index.md'
       'examples/fine-tuning/axolotl/index.md': 'examples/single-node-training/axolotl/index.md'
+      'blog/efa.md': 'examples/clusters/efa/index.md'
   - typeset
   - gen-files:
       scripts: # always relative to mkdocs.yml
@@ -271,8 +272,9 @@ nav:
     - Clusters:
       - NCCL tests: examples/clusters/nccl-tests/index.md
       - RCCL tests: examples/clusters/rccl-tests/index.md
-      - A3 Mega: examples/clusters/a3mega/index.md
-      - A3 High: examples/clusters/a3high/index.md
+      - GCP A3 Mega: examples/clusters/a3mega/index.md
+      - GCP A3 High: examples/clusters/a3high/index.md
+      - AWS EFA: examples/clusters/efa/index.md
     - Inference:
       - SGLang: examples/inference/sglang/index.md
       - vLLM: examples/inference/vllm/index.md
