Commit fdad550: Merge branch 'master' into 2655-ux-improve-the-output-of-dstack-ps
2 parents: 00bb4b3 + c1126ba

20 files changed: 302 additions & 102 deletions

docs/docs/concepts/tasks.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -378,6 +378,7 @@ If you don't assign a value to an environment variable (see `HF_TOKEN` above),
 | `DSTACK_NODE_RANK` | The rank of the node |
 | `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |
 | `DSTACK_NODES_IPS` | The list of internal IP addresses of all nodes delimited by "\n" |
+| `DSTACK_MPI_HOSTFILE` | The path to a pre-populated MPI hostfile |
 
 ### Spot policy
```
docs/docs/reference/environment-variables.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -77,6 +77,7 @@ tasks, and services:
 ```
 
 - `DSTACK_NODES_IPS`{ #DSTACK_NODES_IPS } – The list of internal IP addresses of all nodes delimited by `"\n"`.
+- `DSTACK_MPI_HOSTFILE`{ #DSTACK_MPI_HOSTFILE } – The path to a pre-populated MPI hostfile that can be used directly as `mpirun --hostfile $DSTACK_MPI_HOSTFILE`.
 
 ## Server
````
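The hostfile that `DSTACK_MPI_HOSTFILE` points to uses the standard Open MPI hostfile format: one line per node, an address followed by a `slots=` count. As a rough sketch of that format (the IPs and slot count here are illustrative, not taken from any real run), building and parsing such a file might look like:

```python
# Sketch: build and parse an Open MPI-style hostfile ("<ip> slots=<n>" per line).
# IPs and slot counts below are made up for illustration.

def build_hostfile(ips, gpus_per_node):
    """Return hostfile text, one '<ip> slots=<n>' line per node."""
    return "".join(f"{ip} slots={gpus_per_node}\n" for ip in ips)

def total_slots(hostfile_text):
    """Sum the slots= values, i.e. the total number of MPI ranks available."""
    total = 0
    for line in hostfile_text.splitlines():
        host, _, slots = line.partition(" slots=")
        if host and slots.isdigit():
            total += int(slots)
    return total

content = build_hostfile(["10.0.0.1", "10.0.0.2"], gpus_per_node=4)
print(content, end="")       # 10.0.0.1 slots=4 / 10.0.0.2 slots=4
print(total_slots(content))  # 8
```

With this format, `mpirun --hostfile $DSTACK_MPI_HOSTFILE -n $DSTACK_GPUS_NUM` can place one rank per GPU slot without any per-job hostfile generation.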

examples/clusters/nccl-tests/.dstack.yml

Lines changed: 4 additions & 23 deletions
```diff
@@ -2,44 +2,25 @@ type: task
 name: nccl-tests
 
 nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
 
 image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
   - |
-    # We use FIFO for inter-node communication
-    FIFO=/tmp/dstack_job
     if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
       cd /root/nccl-tests/build
-      # Generate hostfile for mpirun
-      : > hostfile
-      for ip in ${DSTACK_NODES_IPS}; do
-        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
-      done
-      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
-      # Wait for other nodes
-      while true; do
-        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
-          break
-        fi
-        echo 'Waiting for nodes...'
-        sleep 5
-      done
+      MPIRUN="mpirun --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE"
       # Run NCCL Tests
       ${MPIRUN} \
         -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
-        --mca pml ^cm \
-        --mca btl tcp,self \
         --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
         ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
-      # Notify nodes the job is done
-      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
     else
-      mkfifo ${FIFO}
-      # Wait for a message from the first node
-      cat ${FIFO}
+      sleep infinity
     fi
 
 resources:
```

examples/clusters/nccl-tests/README.md

Lines changed: 8 additions & 36 deletions
````diff
@@ -6,51 +6,32 @@ This example shows how to run distributed [NCCL tests :material-arrow-top-right-
 
 Here's an example of a task that runs AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).
 
-<div editor-title="examples/distributed-training/nccl-tests/.dstack.yml">
+<div editor-title="examples/clusters/nccl-tests/.dstack.yml">
 
 ```yaml
 type: task
 name: nccl-tests
 
 nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
 
 image: dstackai/efa
 env:
   - NCCL_DEBUG=INFO
 commands:
   - |
-    # We use FIFO for inter-node communication
-    FIFO=/tmp/dstack_job
     if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
       cd /root/nccl-tests/build
-      # Generate hostfile for mpirun
-      : > hostfile
-      for ip in ${DSTACK_NODES_IPS}; do
-        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
-      done
-      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
-      # Wait for other nodes
-      while true; do
-        if ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 true >/dev/null 2>&1; then
-          break
-        fi
-        echo 'Waiting for nodes...'
-        sleep 5
-      done
-      # Run NCCL tests
+      MPIRUN="mpirun --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE"
+      # Run NCCL Tests
       ${MPIRUN} \
         -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
-        --mca pml ^cm \
-        --mca btl tcp,self \
         --mca btl_tcp_if_exclude lo,docker0 \
         --bind-to none \
         ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
-      # Notify nodes the job is done
-      ${MPIRUN} -n ${DSTACK_NODES_NUM} -N 1 sh -c "echo done > ${FIFO}"
     else
-      mkfifo ${FIFO}
-      # Wait for a message from the first node
-      cat ${FIFO}
+      sleep infinity
     fi
 
 resources:
@@ -61,15 +42,6 @@ resources:
 
 </div>
 
-!!! info "MPI"
-    NCCL tests rely on MPI to run on multiple processes. The master node (`DSTACK_NODE_RANK=0`) generates `hostfile` (using `DSTACK_NODES_IPS`)
-    and waits until other nodes are accessible via MPI.
-    Then, it executes `/nccl-tests/build/all_reduce_perf` across all GPUs.
-
-    Non-master nodes use a `FIFO` pipe to wait until the MPI run is finished.
-
-    There is an open [issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/2467){:target="_blank"} to simplify the use of MPI with distributed tasks.
-
 !!! info "Docker image"
     The `dstackai/efa` image used in the example comes with MPI and NCCL tests pre-installed. While it is optimized for
     [AWS EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}, it can also
@@ -84,7 +56,7 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
 <div class="termy">
 
 ```shell
-$ dstack apply -f examples/distributed-training/nccl-tests/.dstack.yml
+$ dstack apply -f examples/clusters/nccl-tests/.dstack.yml
 
 # BACKEND  REGION     INSTANCE       RESOURCES                                   SPOT  PRICE
 1  aws     us-east-1  g4dn.12xlarge  48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk)  no    $3.912
@@ -99,7 +71,7 @@ Submit the run nccl-tests? [y/n]: y
 ## Source code
 
 The source code of this example can be found in
-[`examples/distributed-training/nccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/nccl-tests).
+[`examples/clusters/nccl-tests` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-tests).
 
 ## What's next?
````

examples/single-node-training/trl/README.md

Lines changed: 11 additions & 10 deletions
```diff
@@ -21,19 +21,19 @@ env:
   - WANDB_API_KEY
   - HUB_MODEL_ID
 commands:
-  - pip install "transformers>=4.43.2"
-  - pip install bitsandbytes
-  - pip install flash-attn --no-build-isolation
-  - pip install peft
-  - pip install wandb
+  # Pin torch==2.6.0 to avoid building Flash Attention from source.
+  # Prebuilt Flash Attention wheels are not available for the latest torch==2.7.0.
+  - uv pip install torch==2.6.0
+  - uv pip install transformers bitsandbytes peft wandb
+  - uv pip install flash_attn --no-build-isolation
   - git clone https://github.com/huggingface/trl
   - cd trl
-  - pip install .
-  - |
+  - uv pip install .
+  - |
     accelerate launch \
       --config_file=examples/accelerate_configs/multi_gpu.yaml \
       --num_processes $DSTACK_GPUS_PER_NODE \
-      examples/scripts/sft.py \
+      trl/scripts/sft.py \
       --model_name meta-llama/Meta-Llama-3.1-8B \
       --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
       --dataset_text_field="text" \
@@ -44,14 +44,15 @@ commands:
       --report_to wandb \
      --bf16 \
       --max_seq_length 1024 \
-      --lora_r 16 --lora_alpha 32 \
+      --lora_r 16 \
+      --lora_alpha 32 \
       --lora_target_modules q_proj k_proj v_proj o_proj \
       --load_in_4bit \
       --use_peft \
       --attn_implementation "flash_attention_2" \
       --logging_steps=10 \
       --output_dir models/llama31 \
-      --hub_model_id $HUB_MODEL_ID
+      --hub_model_id peterschmidt85/FineLlama-3.1-8B
 
 resources:
   gpu:
```

examples/single-node-training/trl/train.dstack.yml

Lines changed: 10 additions & 11 deletions
```diff
@@ -12,20 +12,19 @@ env:
   - ACCELERATE_LOG_LEVEL=info
 # Commands of the task
 commands:
-  - conda install cuda
-  - pip install git+https://github.com/huggingface/transformers.git
-  - pip install bitsandbytes
-  - pip install flash-attn --no-build-isolation
-  - pip install peft
-  - pip install wandb
+  # Pin torch==2.6.0 to avoid building Flash Attention from source.
+  # Prebuilt Flash Attention wheels are not available for the latest torch==2.7.0.
+  - uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
+  - uv pip install transformers bitsandbytes peft wandb
+  - uv pip install flash_attn --no-build-isolation
   - git clone https://github.com/huggingface/trl
   - cd trl
-  - pip install .
+  - uv pip install .
   - |
     accelerate launch \
       --config_file=examples/accelerate_configs/multi_gpu.yaml \
       --num_processes $DSTACK_GPUS_PER_NODE \
-      examples/scripts/sft.py \
+      trl/scripts/sft.py \
       --model_name meta-llama/Meta-Llama-3.1-8B \
       --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
       --dataset_text_field="text" \
@@ -36,15 +35,15 @@ commands:
       --report_to wandb \
       --bf16 \
       --max_seq_length 1024 \
-      --lora_r 16 --lora_alpha 32 \
+      --lora_r 16 \
+      --lora_alpha 32 \
       --lora_target_modules q_proj k_proj v_proj o_proj \
       --load_in_4bit \
       --use_peft \
       --attn_implementation "flash_attention_2" \
       --logging_steps=10 \
       --output_dir models/llama31 \
-      --hub_model_id $HUB_MODEL_ID
-
+      --hub_model_id peterschmidt85/FineLlama-3.1-8B
 resources:
   gpu:
     # 24GB or more VRAM
```

runner/internal/executor/executor.go

Lines changed: 36 additions & 0 deletions
```diff
@@ -257,6 +257,8 @@ func (ex *RunExecutor) execJob(ctx context.Context, jobLogFile io.Writer) error
 	gpus_per_node_num := ex.clusterInfo.GPUSPerJob
 	gpus_num := nodes_num * gpus_per_node_num
 
+	mpiHostfilePath := filepath.Join(ex.homeDir, ".dstack/mpi/hostfile")
+
 	jobEnvs := map[string]string{
 		"DSTACK_RUN_ID": ex.run.Id,
 		"DSTACK_JOB_ID": ex.jobSubmission.Id,
@@ -268,6 +270,7 @@ func (ex *RunExecutor) execJob(ctx context.Context, jobLogFile io.Writer) error
 		"DSTACK_NODES_NUM":     strconv.Itoa(nodes_num),
 		"DSTACK_GPUS_PER_NODE": strconv.Itoa(gpus_per_node_num),
 		"DSTACK_GPUS_NUM":      strconv.Itoa(gpus_num),
+		"DSTACK_MPI_HOSTFILE":  mpiHostfilePath,
 	}
 
 	// Call buildLDLibraryPathEnv and update jobEnvs if no error occurs
@@ -390,6 +393,11 @@ func (ex *RunExecutor) execJob(ctx context.Context, jobLogFile io.Writer) error
 		}
 	}
 
+	err = writeMpiHostfile(ctx, ex.clusterInfo.JobIPs, gpus_per_node_num, mpiHostfilePath)
+	if err != nil {
+		return err
+	}
+
 	cmd.Env = envMap.Render()
 
 	log.Trace(ctx, "Starting exec", "cmd", cmd.String(), "working_dir", cmd.Dir, "env", cmd.Env)
@@ -696,6 +704,34 @@ func prepareSSHDir(uid int, gid int, homeDir string) (string, error) {
 	return sshDir, nil
 }
 
+func writeMpiHostfile(ctx context.Context, ips []string, gpus_per_node int, path string) error {
+	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
+		return err
+	}
+	file, err := os.OpenFile(path, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o644)
+	if err != nil {
+		return err
+	}
+	defer file.Close()
+	nonEmptyIps := []string{}
+	for _, ip := range ips {
+		if ip != "" {
+			nonEmptyIps = append(nonEmptyIps, ip)
+		}
+	}
+	if len(nonEmptyIps) == len(ips) {
+		for _, ip := range nonEmptyIps {
+			line := fmt.Sprintf("%s slots=%d\n", ip, gpus_per_node)
+			if _, err = file.WriteString(line); err != nil {
+				return err
+			}
+		}
+	} else {
+		log.Info(ctx, "creating empty MPI hostfile: no internal IPs assigned")
+	}
+	return nil
+}
+
 func writeDstackProfile(env map[string]string, path string) error {
 	file, err := os.OpenFile(path, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o644)
 	if err != nil {
```
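Note that `writeMpiHostfile` is all-or-nothing: it writes host lines only when every node has reported an internal IP, and otherwise leaves the file empty rather than emitting a partial host list. A minimal Python mirror of that rule (a sketch for illustration, not the shipped Go code):

```python
def mpi_hostfile_lines(ips, gpus_per_node):
    """Mirror of writeMpiHostfile's logic: emit '<ip> slots=<n>' lines only
    when every node reported an internal IP; otherwise return no lines at
    all, so consumers never see a partial host list."""
    non_empty = [ip for ip in ips if ip]
    if len(non_empty) != len(ips):
        return []  # at least one node has no IP yet: hostfile stays empty
    return [f"{ip} slots={gpus_per_node}" for ip in non_empty]

print(mpi_hostfile_lines(["10.0.0.1", "10.0.0.2"], 8))
print(mpi_hostfile_lines(["10.0.0.1", ""], 8))  # -> []
```

An empty hostfile makes the failure obvious at `mpirun` time instead of silently running on a subset of nodes.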

src/dstack/_internal/cli/services/configurators/run.py

Lines changed: 15 additions & 1 deletion
```diff
@@ -41,7 +41,7 @@
 )
 from dstack._internal.core.models.repos.base import Repo
 from dstack._internal.core.models.resources import CPUSpec
-from dstack._internal.core.models.runs import JobSubmission, RunStatus
+from dstack._internal.core.models.runs import JobStatus, JobSubmission, RunStatus
 from dstack._internal.core.services.configs import ConfigManager
 from dstack._internal.core.services.diff import diff_models
 from dstack._internal.utils.common import local_time
@@ -593,6 +593,20 @@ def get_run_exit_code(run: Run) -> int:
     return 1
 
 
+def _is_ready_to_attach(run: Run) -> bool:
+    return not (
+        run.status
+        in [
+            RunStatus.SUBMITTED,
+            RunStatus.PENDING,
+            RunStatus.PROVISIONING,
+            RunStatus.TERMINATING,
+        ]
+        or run._run.jobs[0].job_submissions[-1].status
+        in [JobStatus.SUBMITTED, JobStatus.PROVISIONING, JobStatus.PULLING]
+    )
+
+
 def _run_resubmitted(run: Run, current_job_submission: Optional[JobSubmission]) -> bool:
     if current_job_submission is None or run._run.latest_job_submission is None:
         return False
```
return False

src/dstack/_internal/core/models/configurations.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -440,7 +440,7 @@ def convert_replicas(cls, v: Any) -> Range[int]:
         raise ValueError("The minimum number of replicas must be greater than or equal to 0")
     if v.max < v.min:
         raise ValueError(
-            "The maximum number of replicas must be greater than or equal to the minium number of replicas"
+            "The maximum number of replicas must be greater than or equal to the minimum number of replicas"
         )
     return v
```
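The corrected message guards a simple invariant on the replicas range: `0 <= min <= max`. A standalone sketch of that validation, using a hypothetical `Range` dataclass rather than dstack's Pydantic model:

```python
from dataclasses import dataclass

@dataclass
class Range:
    # Illustrative stand-in for dstack's Range[int]; not the real model.
    min: int
    max: int

def validate_replicas(v: Range) -> Range:
    # Same two checks as convert_replicas: non-negative minimum, max >= min.
    if v.min < 0:
        raise ValueError("The minimum number of replicas must be greater than or equal to 0")
    if v.max < v.min:
        raise ValueError(
            "The maximum number of replicas must be greater than or equal to the minimum number of replicas"
        )
    return v

validate_replicas(Range(min=1, max=3))  # ok
try:
    validate_replicas(Range(min=3, max=1))
except ValueError as e:
    print(e)
```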

src/dstack/_internal/core/models/fleets.py

Lines changed: 6 additions & 1 deletion
```diff
@@ -20,6 +20,7 @@
     parse_idle_duration,
 )
 from dstack._internal.core.models.resources import Range, ResourcesSpec
+from dstack._internal.utils.common import list_enum_values_for_annotation
 from dstack._internal.utils.json_schema import add_extra_schema_types
 from dstack._internal.utils.tags import tags_validator
 
@@ -207,7 +208,11 @@ class InstanceGroupParams(CoreModel):
     spot_policy: Annotated[
         Optional[SpotPolicy],
         Field(
-            description="The policy for provisioning spot or on-demand instances: `spot`, `on-demand`, or `auto`"
+            description=(
+                "The policy for provisioning spot or on-demand instances:"
+                f" {list_enum_values_for_annotation(SpotPolicy)}."
+                f" Defaults to `{SpotPolicy.ONDEMAND.value}`"
+            )
         ),
     ] = None
     retry: Annotated[
```
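`list_enum_values_for_annotation` is used here so the documented choices stay in sync with the `SpotPolicy` enum instead of being hard-coded in the description string. A plausible implementation of such a helper (an assumption for illustration; the real one lives in `dstack._internal.utils.common`) would render enum values as a backticked, comma-separated list with a final "or":

```python
from enum import Enum

class SpotPolicy(str, Enum):
    # Values mirror the ones the old description listed; purely illustrative.
    SPOT = "spot"
    ONDEMAND = "on-demand"
    AUTO = "auto"

def list_enum_values_for_annotation(enum_cls) -> str:
    """Format enum values as `a`, `b`, or `c` for use in field descriptions."""
    values = [f"`{m.value}`" for m in enum_cls]
    if len(values) == 1:
        return values[0]
    return ", ".join(values[:-1]) + f", or {values[-1]}"

print(list_enum_values_for_annotation(SpotPolicy))  # `spot`, `on-demand`, or `auto`
```

Deriving the list from the enum means adding a new policy value updates the generated docs and JSON schema automatically.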
