0.19.12-v1
Clusters
Simplified use of MPI
startup_order and stop_criteria
New run configuration properties are introduced:
- startup_order: any/master-first/workers-first specifies the order in which master and worker jobs are started.
- stop_criteria: all-done/master-done specifies the criteria for when a multi-node run should be considered finished.
These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done, and dstack won't wait for workers to exit.
DSTACK_MPI_HOSTFILE
dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
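Before calling mpirun, it can help to confirm that the hostfile is actually readable. The following is a minimal shell sketch, not part of dstack: the helper name count_mpi_hosts is hypothetical, and it assumes only that the hostfile lists one host per non-empty, non-comment line (the usual Open MPI convention). The exact contents that dstack writes to the file are not described in these notes.

```shell
#!/bin/sh
# Hypothetical helper (not part of dstack): count host entries in an MPI hostfile.
# Assumption: one host per non-empty line, with '#' starting a comment line.
count_mpi_hosts() {
  hostfile="$1"
  if [ -z "$hostfile" ] || [ ! -r "$hostfile" ]; then
    echo "hostfile not readable: ${hostfile:-<unset>}" >&2
    return 1
  fi
  # Drop blank lines and comment lines, then count what is left.
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$hostfile" | wc -l | tr -d ' '
}
```

Inside a run it could be called as count_mpi_hosts "$DSTACK_MPI_HOSTFILE", for example to log the number of hosts or to fail early with a clear message instead of a less obvious mpirun error.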
CLI
We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status that makes it easy to understand why a run or job was terminated.
dstack ps -n 10
NAME               BACKEND             RESOURCES                            PRICE    STATUS        SUBMITTED
oom-task                                                                             no offers     yesterday
oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
heavy-wolverine-1                                                                    done          yesterday
  replica=0 job=0  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
  replica=0 job=1  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  stopped       yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  error         yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  interrupted   yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  aborted       yesterday
Examples
Simplified NCCL tests
With this release's improvements, it is now much easier to run MPI workloads with dstack. This includes NCCL tests, which can now be run using the following configuration:
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
image: dstackai/efa
env:
  - NCCL_DEBUG=INFO
commands:
  - cd /root/nccl-tests/build
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      mpirun \
        --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE \
        -n ${DSTACK_GPUS_NUM} \
        -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: nvidia:4:16GB
  shm_size: 16GB
See the updated NCCL tests example for more details.
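To make the mpirun flags concrete: in Open MPI, -n is the total number of processes and -N is the number of processes per node, so with one process per GPU the run launches nodes × GPUs-per-node ranks. The sketch below is illustrative only. The helper total_ranks is hypothetical, and the literals 2 and 4 are read off nodes: 2 and gpu: nvidia:4:16GB in the configuration above, on the assumption that dstack sets DSTACK_GPUS_PER_NODE to 4 and DSTACK_GPUS_NUM to 8 for such a run.

```shell
#!/bin/sh
# Hypothetical helper: total MPI ranks for a run with one rank per GPU.
total_ranks() {
  nodes="$1"
  gpus_per_node="$2"
  echo $((nodes * gpus_per_node))
}

# Assumed values mirroring the configuration above (2 nodes, 4 GPUs each).
# Inside a real run, the DSTACK_* environment variables would be used instead.
NODES=2
GPUS_PER_NODE=4
echo "mpirun -n $(total_ranks "$NODES" "$GPUS_PER_NODE") -N $GPUS_PER_NODE ..."
```

If the two values ever disagree with the hostfile (for example, fewer hosts than expected), mpirun will typically refuse to start, which is one reason to pass the dstack-provided variables rather than hardcoding numbers.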
Distributed training
TRL
The new TRL example walks you through how to run distributed fine-tuning using TRL, Accelerate, and DeepSpeed.
Axolotl
The new Axolotl example walks you through how to run distributed fine-tuning using Axolotl with dstack.
What's changed
- [Feature] Update .gitignore logic to catch more cases by @colinjc in dstackai/dstack#2695
- [Bug] Increase upload_code client timeout by @r4victor in dstackai/dstack#2709
- [Bug] Fix missing apt-get update by @r4victor in dstackai/dstack#2710
- [Internal]: Update git hooks and package.json by @olgenn in dstackai/dstack#2706
- [Examples] Add distributed Axolotl and TRL example by @Bihan in dstackai/dstack#2703
- [Docs] Update dstack-proxy contributing guide by @jvstme in dstackai/dstack#2683
- [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in dstackai/dstack#2718
- [Feature] Implement startup_order and stop_criteria by @r4victor in dstackai/dstack#2714
- [Bug] Fix CLI exiting while master starting by @r4victor in dstackai/dstack#2720
- [Examples] Simplify NCCL tests example by @r4victor in dstackai/dstack#2723
- [Examples] Update TRL Single Node example to uv by @Bihan in dstackai/dstack#2715
- [Bug] Fix backward compatibility when creating fleets by @jvstme in dstackai/dstack#2727
- [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in dstackai/dstack#2716
- [Bug] Fix relative paths in dstack apply --repo by @jvstme in dstackai/dstack#2733
- [Internal]: Drop hardcoded regions from the backend template by @jvstme in dstackai/dstack#2734
- [Internal]: Update backend template to match ruff formatting by @jvstme in dstackai/dstack#2735
Full changelog: dstackai/dstack@0.19.11...0.19.12