Releases: dstackai/dstack-enterprise
0.19.16-v1
Docker
Docker in Docker
Using Docker in a run configuration is now much easier. Just set docker to true:
type: task
name: docker-nvidia-smi
docker: true
commands:
  - docker run --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
resources:
  gpu: 1
This works with all run configuration types and supports both AMD and NVIDIA GPUs. It’s especially useful if you want to use the docker CLI in your commands, for example to build Docker images.
The docker property is supported on all backends except vastai, runpod, and kubernetes, and is fully supported on SSH fleets as well.
Backends
CloudRift
The CloudRift team has added support for their GPU cloud, which can now be used with dstack.
To configure it, use a CloudRift API key in the backend configuration:
projects:
- name: main
  backends:
  - type: cloudrift
    creds:
      type: api_key
      api_key: rift_2prgY1d0laOrf2BblTwx2B2d1zcf1zIp4tZYpj5j88qmNgz38pxNlpX3vAo
CloudRift offers competitive on-demand GPU pricing, with more GPUs and regions coming soon.
dstack apply -f examples/.dstack.yml -b cloudrift
#  BACKEND                      RESOURCES                                    INSTANCE TYPE   PRICE
1  cloudrift (us-east-nc-nr-1)  cpu=16 mem=100GB disk=1000GB RTX5090:32GB:1  rtx59-16c-nr.1  $0.65
If you encounter any issues with this backend, please report them.
Server
Public projects
You can now create public projects that any user on the server can join or leave without approval. Previously, all projects were private, and adding new members required manual action by an admin or manager—a step that’s redundant in high-trust environments.
Admins can change a project’s visibility at any time in the project settings.
Metrics
The server exports new Prometheus metrics:
- dstack_submit_to_provision_duration_seconds: Time from when a run is submitted until its first job is provisioned
- dstack_pending_runs_total: Total number of pending runs
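As a rough illustration of consuming these metrics, the Python sketch below parses a Prometheus text-format payload and reads the pending-run count. The sample payload, its value, the `gauge` type line, and the helper name `parse_metrics` are assumptions made for this example; the server's real output may carry labels and additional series.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse samples from Prometheus text format into {series: value}.

    Any label set stays part of the series key. Assumes samples carry
    no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical payload: the value and the TYPE line are invented for illustration.
SAMPLE = """\
# HELP dstack_pending_runs_total Total number of pending runs
# TYPE dstack_pending_runs_total gauge
dstack_pending_runs_total 3
"""

metrics = parse_metrics(SAMPLE)
print(metrics["dstack_pending_runs_total"])
```

In practice a Prometheus server would scrape and store these series; a hand-rolled parser like this is only useful for quick checks.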
What's changed
- [Feature]: Property filter on Fleets, Models, Volumes pages by @olgenn in dstackai/dstack#2824
- [Bug]: Run/job status in UI/CLI is shown as provisioning instead of pulling by @peterschmidt85 in dstackai/dstack#2834
- [chore]: Fix annotation in update_service_desired_replica_count by @jvstme in dstackai/dstack#2840
- Add CloudRift backend by @6erun in dstackai/dstack#2771
- Fix Postgres deadlocks by @r4victor in dstackai/dstack#2843
- [UX] Simplify the use of Docker inside containers #2468 by @peterschmidt85 in dstackai/dstack#2828
- [Docs] Update docs and examples to reflect the docker property by @peterschmidt85 in dstackai/dstack#2831
- Add support for Tenstorrent n300 GPUs by @peterschmidt85 in dstackai/dstack#2827
- [Feature]: Property filter on Instances page by @olgenn in dstackai/dstack#2826
- [UI] Allow to hide the Tour panel by @olgenn in dstackai/dstack#2816
- Pr3 add join leave UI buttons by @haydnli-shopify in dstackai/dstack#2795
- Health metrics (Part 2) by @Nadine-H in dstackai/dstack#2796
- [Bug]: Use a unique token for log pagination instead of a timestamp by @peterschmidt85 in dstackai/dstack#2845
- Fix update project required permissions by @r4victor in dstackai/dstack#2846
New contributors
- @6erun made their first contribution in dstackai/dstack#2771
Full changelog: dstackai/dstack@0.19.15...0.19.16
0.19.15-v1
Services
Rolling deployments
This update introduces rolling deployments, which help avoid downtime when deploying new versions of your services.
When you apply an updated service configuration, dstack will gradually replace old service replicas with new ones. You can track the progress in the dstack apply output — the deployment number will be lower for old replicas and higher for new ones.
> dstack apply -f my-service.dstack.yml
Active run my-service already exists. Detected configuration changes that can be updated in-place: ['image', 'env', 'commands']
Update the run? [y/n]: y
⠋ Launching my-service...
NAME                            BACKEND          RESOURCES                        PRICE    STATUS       SUBMITTED
my-service deployment=1                                                                    running      11 mins ago
  replica=0 job=0 deployment=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0026  terminating  11 mins ago
  replica=1 job=0 deployment=1  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0026  running      1 min ago
Currently, the following service configuration properties can be updated using rolling deployments: resources, volumes, image, user, privileged, entrypoint, python, nvcc, single_branch, env, shell, and commands.
Future releases will allow updating more properties and deploying new git repo commits.
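To make the idea of gradual replacement concrete, here is a small Python simulation. It is not dstack's actual algorithm: the one-at-a-time "start new, then terminate old" policy and the function name `rolling_update` are assumptions chosen to illustrate why a rolling deployment avoids downtime.

```python
def rolling_update(replicas: list[int], new_deployment: int) -> list[list[int]]:
    """Replace old replicas one at a time; return the replica set after each step.

    Each replica is represented by its deployment number. A new replica is
    added before an old one is removed, so the number of replicas never
    drops below the starting count.
    """
    current = list(replicas)
    history = []
    while any(d < new_deployment for d in current):
        current.append(new_deployment)  # start a replica on the new deployment
        history.append(list(current))
        oldest = min(range(len(current)), key=current.__getitem__)
        current.pop(oldest)             # terminate one old replica
        history.append(list(current))
    return history


for step in rolling_update([0, 0], new_deployment=1):
    print(step)
```

Each printed list is the set of live replicas at that step, identified by deployment number, which mirrors how the deployment column changes in the dstack apply output.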
Clusters
Updated default Docker images
If you don't specify a custom image in the run configuration, dstack uses its default images. These images have been improved for cluster environments and now include mpirun and NCCL tests. Additionally, if you are running on AWS EFA-capable instances, dstack will now automatically select an image with the appropriate EFA drivers. See our new AWS EFA guide for more details.
Server
Health metrics
The dstack server now exports operational Prometheus metrics that let you monitor its health. If you run your own production-grade dstack server installation, refer to the metrics docs for details.
What's changed
- Set logsWaitDuration to 5m by @r4victor in dstackai/dstack#2794
- Add health metrics (Part 1) by @Nadine-H in dstackai/dstack#2760
- Add public projects by @haydnli-shopify in dstackai/dstack#2759
- Fix is_public allowing null by @r4victor in dstackai/dstack#2798
- Retry on VOLUME_ERROR and INSTANCE_UNREACHABLE by @jvstme in dstackai/dstack#2805
- Rework default Docker images by @peterschmidt85 in dstackai/dstack#2799
- Fix volume error status message by @jvstme in dstackai/dstack#2806
- [Docs] Added EFA example by @peterschmidt85 in dstackai/dstack#2820
- [Bug]: Empty spaces on User Details page by @olgenn in dstackai/dstack#2815
- Rolling deployment for services by @jvstme in dstackai/dstack#2821
- Fix building dstack package by @jvstme in dstackai/dstack#2823
New Contributors
- @haydnli-shopify made their first contribution in dstackai/dstack#2759
Full Changelog: dstackai/dstack@0.19.13...0.19.15
0.19.13-v1
Clusters
Built-in InfiniBand support in dstack Docker images
The dstack default Docker images now come with built-in InfiniBand support, which includes the necessary libibverbs library and InfiniBand utilities from rdma-core. This means you can run torch distributed and other workloads utilizing NCCL, and they'll take full advantage of InfiniBand without custom Docker images.
You can try InfiniBand clusters with dstack on Nebius.
Built-in EFA support in dstack VM images
dstack now uses DLAMI as the default AWS GPU VM image instead of a custom one. DLAMI supports EFA out of the box, so you no longer need a custom VM image to take advantage of EFA.
Server
GCS support for code uploads
It's now possible to configure the dstack server to use GCP Cloud Storage for code uploads. Previously, only DB and S3 storages were supported. Learn more in the Server deployment guide.
What's Changed
- Support file upload to gcs bucket by @colinjc in dstackai/dstack#2737
- Document File storage by @r4victor in dstackai/dstack#2755
- [Docs] Minor update of Clusters and Distributed tasks sections by @peterschmidt85 in dstackai/dstack#2741
- Fix CLI exiting while master starting by @r4victor in dstackai/dstack#2757
- [UI] Implement property filter on Run list page by @olgenn in dstackai/dstack#2762
- [Bug]: Text is unavailable for selection on run logs page by @olgenn in dstackai/dstack#2763
- Preinstall rdma-core packages into dstack Docker image by @r4victor in dstackai/dstack#2764
- [UX] Show status message as retrying in case a run or job is being retired by @peterschmidt85 in dstackai/dstack#2758
- [Docs] Minor improvements by @peterschmidt85 in dstackai/dstack#2766
- [Feature]: Include priority to the list of runs and sort runs by priority by @olgenn in dstackai/dstack#2768
- [Feature]: The Run details page should display the same fields as the Run list page by @olgenn in dstackai/dstack#2769
- [Feature]: Show Quickstart button if user don't have any runs by @olgenn in dstackai/dstack#2770
- [Feature]: Implement links for elements that have details page by @olgenn in dstackai/dstack#2772
- [Feature]: Add Refresh button on Run details page by @olgenn in dstackai/dstack#2773
- [Bug]: Tab Billing changes to Settings after top up balance by @olgenn in dstackai/dstack#2774
- Exclude backward incompatible fields from rest plugin calls by @colinjc in dstackai/dstack#2767
- [UI] Minor fixes by @peterschmidt85 in dstackai/dstack#2775
- Pin dkms by @r4victor in dstackai/dstack#2776
- Use DLAMI on AWS by @r4victor in dstackai/dstack#2782
- 2674 prop filter by @olgenn in dstackai/dstack#2778
- Fixed defect #2752 by @olgenn in dstackai/dstack#2784
- Update base image to 0.9 by @r4victor in dstackai/dstack#2786
- Fix status_message with missing on_events by @r4victor in dstackai/dstack#2788
- [Bug]: UI doesn't show Resources for instances of SSH fleets by @peterschmidt85 in dstackai/dstack#2785
- Ignore AWS quotas when hitting rate limits by @r4victor in dstackai/dstack#2791
Full Changelog: dstackai/dstack@0.19.12...0.19.13
0.19.12-v1
Clusters
Simplified use of MPI
startup_order and stop_criteria
New run configuration properties are introduced:
- startup_order: any/master-first/workers-first specifies the order in which master and worker jobs are started.
- stop_criteria: all-done/master-done specifies when a multi-node run should be considered finished.
These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done and dstack won't wait for workers to exit.
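The two stop_criteria values can be read as a simple predicate over per-node job state. The Python sketch below is an illustration only, not dstack's implementation; the function name `run_finished` and the convention that node rank 0 is the master are assumptions for this example.

```python
def run_finished(stop_criteria: str, done: dict[int, bool]) -> bool:
    """Decide whether a multi-node run counts as finished.

    `done` maps node rank to whether that node's job has exited.
    Rank 0 is treated as the master (an assumption of this sketch).
    """
    if stop_criteria == "master-done":
        return done[0]
    if stop_criteria == "all-done":
        return all(done.values())
    raise ValueError(f"unknown stop_criteria: {stop_criteria}")


# The master has exited while a worker is still idling.
state = {0: True, 1: False}
print(run_finished("master-done", state))  # True
print(run_finished("all-done", state))     # False
```

With master-done, workers that only idle while the master drives the workload no longer keep the run alive.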
DSTACK_MPI_HOSTFILE
dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
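If a script needs the host list itself rather than passing the file to mpirun, the hostfile can be read directly. The sketch below assumes an Open MPI-style layout (one host per line with an optional `slots=N`); the exact contents dstack writes are not specified here, and the temporary file with invented addresses only stands in for the real hostfile so the example runs outside a dstack job.

```python
import os
import tempfile


def read_hostfile(path: str) -> dict[str, int]:
    """Map each host to its slot count; hosts without `slots=` default to 1."""
    hosts = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            host, *options = line.split()
            slots = 1
            for opt in options:
                if opt.startswith("slots="):
                    slots = int(opt.split("=", 1)[1])
            hosts[host] = slots
    return hosts


# Stand-in for the file dstack provides inside a run (contents are invented).
with tempfile.NamedTemporaryFile("w", suffix=".hostfile", delete=False) as tmp:
    tmp.write("10.0.0.1 slots=4\n10.0.0.2 slots=4\n")
os.environ["DSTACK_MPI_HOSTFILE"] = tmp.name

hosts = read_hostfile(os.environ["DSTACK_MPI_HOSTFILE"])
print(hosts, sum(hosts.values()))
```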
CLI
We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply shows a status that makes it easy to understand why a run or job was terminated.
dstack ps -n 10
NAME               BACKEND             RESOURCES                            PRICE    STATUS        SUBMITTED
oom-task                                                                             no offers     yesterday
oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
heavy-wolverine-1                                                                    done          yesterday
  replica=0 job=0  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
  replica=0 job=1  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  stopped       yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  error         yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  interrupted   yesterday
cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  aborted       yesterday
Examples
Simplified NCCL tests
With the improvements in this release, it is now much easier to run MPI workloads with dstack. This includes NCCL tests, which can now be run using the following configuration:
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
image: dstackai/efa
env:
  - NCCL_DEBUG=INFO
commands:
  - cd /root/nccl-tests/build
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      mpirun \
        --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE \
        -n ${DSTACK_GPUS_NUM} \
        -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
resources:
  gpu: nvidia:4:16GB
  shm_size: 16GB
See the updated NCCL tests example for more details.
Distributed training
TRL
The new TRL example walks you through running distributed fine-tuning with TRL, Accelerate, and DeepSpeed.
Axolotl
The new Axolotl example walks you through running distributed fine-tuning using Axolotl with dstack.
What's changed
- [Feature] Update .gitignore logic to catch more cases by @colinjc in dstackai/dstack#2695
- [Bug] Increase upload_code client timeout by @r4victor in dstackai/dstack#2709
- [Bug] Fix missing apt-get update by @r4victor in dstackai/dstack#2710
- [Internal]: Update git hooks and package.json by @olgenn in dstackai/dstack#2706
- [Examples] Add distributed Axolotl and TRL example by @Bihan in dstackai/dstack#2703
- [Docs] Update dstack-proxy contributing guide by @jvstme in dstackai/dstack#2683
- [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in dstackai/dstack#2718
- [Feature] Implement startup_order and stop_criteria by @r4victor in dstackai/dstack#2714
- [Bug] Fix CLI exiting while master starting by @r4victor in dstackai/dstack#2720
- [Examples] Simplify NCCL tests example by @r4victor in dstackai/dstack#2723
- [Examples] Update TRL Single Node example to uv by @Bihan in dstackai/dstack#2715
- [Bug] Fix backward compatibility when creating fleets by @jvstme in dstackai/dstack#2727
- [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in dstackai/dstack#2716
- [Bug] Fix relative paths in dstack apply --repo by @jvstme in dstackai/dstack#2733
- [Internal]: Drop hardcoded regions from the backend template by @jvstme in dstackai/dstack#2734
- [Internal]: Update backend template to match ruff formatting by @jvstme in dstackai/dstack#2735
Full changelog: dstackai/dstack@0.19.11...0.19.12
0.19.11-v1
Runs
Replacing conda with uv
dstack's default Docker images now come with uv installed. Installing Python packages with uv can be significantly faster than with pip or conda. For example, here are uv and pip times for installing torch on GCP VMs:
# time uv pip install torch
...
real 0m32.771s
user 0m29.070s
sys 0m8.300s
# time pip install torch
...
real 2m26.338s
user 1m37.514s
sys 0m16.711s
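For a single figure from the timings above, the wall-clock ("real") values can be converted to seconds and compared. This is only arithmetic on the two numbers shown; the helper name `to_seconds` is introduced for the example.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` duration such as '2m26.338s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


uv_real = to_seconds("0m32.771s")
pip_real = to_seconds("2m26.338s")
print(f"pip took {pip_real / uv_real:.1f}x as long as uv")
```

This prints a ratio of about 4.5 for this one measurement; results will vary with network speed and caching.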
To continue supporting pip, dstack now automatically activates a virtual environment with pip available.
conda is no longer included in dstack's default Docker images. If you need to use conda, it should be installed manually:
commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
Plugins
Built-in rest_plugin
dstack gets support for a built-in rest_plugin that allows writing custom plugins as API servers, so you don't need to install plugins as Python packages.
Plugins implemented as API servers have advantages over plugins implemented as Python packages in some cases:
- No dependency conflicts with dstack.
- You can use any programming language.
- If you run the dstack server via Docker, you don't need to extend the dstack server image with plugins or map them via volumes.
To get started, check out the plugin server example. The rest_plugin server API is documented here.
AWS
New CPU series
dstack now supports the most recent AWS CPU VMs based on Intel Xeon Sapphire Rapids: M7i, C7i, and R7i. It also adds support for the burstable T3 family. Previously, only M5, C5, and t2.small CPU instances were supported.
Azure
New CPU series
dstack now supports the most recent Azure CPU VMs based on Intel Xeon Sapphire Rapids: the general-purpose Dsv6 and memory-optimized Esv6 series. Previously, only the Dsv3, Esv4, and Fsv2 series were supported.
GCP
New CPU series
dstack now supports the most recent GCP CPU VMs: C4, M4, H3, N4, N2. Previously, only E2 and M1 were supported.
Note that C4, M4, H3, N4 instances do not currently support Volumes since they require Hyperdisk support.
Examples
Ray+RAGEN
The new Ray+RAGEN example shows how to use dstack and RAGEN to fine-tune an agent on multiple nodes.
Breaking changes
- conda is no longer included in dstack's default Docker images.
Deprecations
- Azure VM series Dsv3 and Esv4 are deprecated.
What's Changed
- [Examples] Ray+RAGEN by @Bihan in dstackai/dstack#2665
- [UX] Minor improvements of dstack metrics by @peterschmidt85 in dstackai/dstack#2667
- Fix request filtering for service stats by @jvstme in dstackai/dstack#2678
- Auto activate uv venv with pip installed by @r4victor in dstackai/dstack#2666
- Support new Azure CPU series by @r4victor in dstackai/dstack#2668
- [Blog] Case study: how EA uses dstack to fast-track AI development by @peterschmidt85 in dstackai/dstack#2682
- Add REST plugin for user-defined policies by @Nadine-H in dstackai/dstack#2631
- [UI] Minor update of help messages by @peterschmidt85 in dstackai/dstack#2690
- Fix wrong env var name in error message by @colinjc in dstackai/dstack#2686
- Fix upload_code limit message by @r4victor in dstackai/dstack#2691
- Support new GCP CPU series by @r4victor in dstackai/dstack#2685
- Drop humanize by @r4victor in dstackai/dstack#2692
- Support new AWS CPU series by @r4victor in dstackai/dstack#2693
- Disable max code upload limit in runner by @colinjc in dstackai/dstack#2694
- Generate REST plugin API docs by @Nadine-H in dstackai/dstack#2696
- Fix docs-build by @r4victor in dstackai/dstack#2700
- [UX]: Only show update notices for stable releases #2697 by @peterschmidt85 in dstackai/dstack#2698
- Run plugins in executor by @r4victor in dstackai/dstack#2701
- Fix phantom priority changes detected by @r4victor in dstackai/dstack#2702
- Update GRID drivers in Azure VM image by @jvstme in dstackai/dstack#2704
New Contributors
- @Nadine-H made their first contribution in dstackai/dstack#2631
Full Changelog: dstackai/dstack@0.19.10...0.19.11
0.19.11rc2-v1
What's Changed
- [Examples] Ray+RAGEN by @Bihan in dstackai/dstack#2665
- [UX] Minor improvements of dstack metrics by @peterschmidt85 in dstackai/dstack#2667
- Fix request filtering for service stats by @jvstme in dstackai/dstack#2678
- Auto activate uv venv with pip installed by @r4victor in dstackai/dstack#2666
- Support new Azure CPU series by @r4victor in dstackai/dstack#2668
- [Blog] Case study: how EA uses dstack to fast-track AI development by @peterschmidt85 in dstackai/dstack#2682
- Add REST plugin for user-defined policies by @Nadine-H in dstackai/dstack#2631
- [UI] Minor update of help messages by @peterschmidt85 in dstackai/dstack#2690
- Fix wrong env var name in error message by @colinjc in dstackai/dstack#2686
- Fix upload_code limit message by @r4victor in dstackai/dstack#2691
- Support new GCP CPU series by @r4victor in dstackai/dstack#2685
- Drop humanize by @r4victor in dstackai/dstack#2692
- Support new AWS CPU series by @r4victor in dstackai/dstack#2693
- Disable max code upload limit in runner by @colinjc in dstackai/dstack#2694
- Generate REST plugin API docs by @Nadine-H in dstackai/dstack#2696
- Fix docs-build by @r4victor in dstackai/dstack#2700
- [UX]: Only show update notices for stable releases #2697 by @peterschmidt85 in dstackai/dstack#2698
New Contributors
- @Nadine-H made their first contribution in dstackai/dstack#2631
Full Changelog: dstackai/dstack@0.19.10...0.19.11rc2
0.19.10-v2
Linking Okta accounts
Now dstack automatically links Okta accounts to existing dstack users on first login instead of creating new users if their emails match.
[14:27:08] INFO dstack_enterprise.services.auth.okta:70 Linked existing
dstack user r4victor to Okta account pqefub12345
(victor@dstack.ai)
0.19.10-v1
Runs
Priorities
Run configurations now support a new priority property that allows controlling the order in which the runs are provisioned:
type: task
nodes: 1
priority: 50
commands:
  - ...
Runs with higher priorities take precedence over runs with lower priorities. Previously, submitted jobs were processed in FIFO order, with older jobs processed first. Now, jobs are first sorted by descending priority. Note that if a high-priority run cannot be scheduled, it does not block lower-priority runs from being scheduled (a.k.a. best-effort FIFO).
The priority property is updatable, so it can be changed for already submitted runs and will take effect.
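The ordering rule can be pictured with a short Python sketch. It is a simplified model, not dstack's scheduler: the `Run` fields, the single GPU counter used as the capacity check, and the function name `schedule` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    priority: int
    submitted_at: int  # smaller means submitted earlier
    gpus: int          # GPUs the run needs (hypothetical resource model)


def schedule(runs: list[Run], free_gpus: int) -> list[str]:
    """Return names of started runs: highest priority first, oldest first within a priority."""
    started = []
    ordered = sorted(runs, key=lambda r: (-r.priority, r.submitted_at))
    for run in ordered:
        if run.gpus <= free_gpus:
            free_gpus -= run.gpus
            started.append(run.name)
        # A run that does not fit is skipped rather than waited on,
        # so lower-priority runs behind it can still start.
    return started


runs = [
    Run("low-old", priority=10, submitted_at=1, gpus=1),
    Run("high-big", priority=90, submitted_at=2, gpus=8),
    Run("mid", priority=50, submitted_at=3, gpus=2),
]
print(schedule(runs, free_gpus=4))
```

Here the highest-priority run needs more GPUs than are free, so it is passed over and the two smaller runs start in priority order.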
CLI
dstack project command
The new dstack project command replaces the existing dstack config command.
- dstack project (same as dstack project list)
$ dstack project
PROJECT         URL                    USER            DEFAULT
peterschmidt85  https://sky.dstack.ai  peterschmidt85
main            http://127.0.0.1:3000  admin           ✓
- dstack project set-default
$ dstack project set-default peterschmidt85
OK
- dstack project add (similar to the old dstack config, but --project is changed to --name)
$ dstack project add --name peterschmidt85 --url https://sky.dstack.ai --token 76d8dd51-0470-74a7-24ed9ec18-fb7d341
OK
dstack ps -n/--last
The dstack ps command now supports a new -n/--last parameter to show last N runs:
✗ dstack ps -n 3
NAME             BACKEND             RESOURCES                                    PRICE    STATUS      SUBMITTED
good-panther-2   gcp (europe-west4)  cpu=2 mem=8GB disk=100GB                     $0.0738  terminated  49 mins ago
new-chipmunk-1   azure (westeurope)  cpu=2 mem=8GB disk=100GB (spot)              $0.0158  terminated  23 hours ago
fuzzy-panther-1  runpod (EU-RO-1)    cpu=6 mem=31GB disk=100GB RTX2000Ada:16GB:1  $0.28    terminated  yesterday
Azure
Fsv2 series
The Azure backend now supports compute-optimized Fsv2 series:
✗ dstack apply -b azure
Project main
User admin
Configuration .dstack.yml
Type dev-environment
Resources cpu=4.. mem=8GB.. disk=100GB..
Spot policy auto
Max price -
Retry policy -
Creation policy reuse-or-create
Idle duration 5m
Max duration -
Inactivity duration -
Reservation -
#  BACKEND             RESOURCES                         INSTANCE TYPE      PRICE
1  azure (westeurope)  cpu=4 mem=8GB disk=100GB (spot)   Standard_F4s_v2    $0.0278
2  azure (westeurope)  cpu=4 mem=16GB disk=100GB (spot)  Standard_D4s_v3    $0.0312
3  azure (westeurope)  cpu=4 mem=32GB disk=100GB (spot)  Standard_E4-2s_v4  $0.0416
...
Shown 3 of 98 offers, $40.962 max
Major bugfixes
- [Bug]: Instances with blocks feature cannot be used for multi-node runs #2650
Deprecations
- The dstack config CLI command is deprecated in favor of dstack project add.
What's changed
- [Bug] Allow multi-node tasks on idle instances with blocks by @un-def in dstackai/dstack#2651
- [UX] Make local code upload size limit configurable by @colinjc in dstackai/dstack#2673
- [Feature] Implement run priorities by @r4victor in dstackai/dstack#2635
- [Bug] Fix IllegalStateChangeError in delete_metrics task by @un-def in dstackai/dstack#2639
- [Examples] Renamed some example groups for better extensibility by @peterschmidt85 in dstackai/dstack#2641
- [Azure] Support Azure Fsv2-series by @r4victor in dstackai/dstack#2647
- [UX]: Add dstack project CLI to configure, list and switching between projects by @peterschmidt85 in dstackai/dstack#2653
- [UI] Dark/light theme toggler state is reset after page reload #289 by @olgenn in dstackai/dstack#2675
- [UX] Support dstack ps -n NUM by @peterschmidt85 in dstackai/dstack#2654
- [Docs] Added Clusters guide by @peterschmidt85 in dstackai/dstack#2646
- [UX] Replace conda with uv in dstackai/base images by @un-def in dstackai/dstack#2649
- [Docs]: Mention SSH fleet networking requirements by @jvstme in dstackai/dstack#2643
- [Bug] Put lower bounds on oci deps by @r4victor in dstackai/dstack#2658
- [UX]: Replace conda with uv in dstack's default Docker image #2625 by @peterschmidt85 in dstackai/dstack#2652
- [UX]: Replace conda with uv in dstack's default Docker image #2625 by @peterschmidt85 in dstackai/dstack#2659
- [Internal] Support building staging Docker images by @r4victor in dstackai/dstack#2664
- [Bug] Forbid scaling.target <= 0 by @jvstme in dstackai/dstack#2672
Full changelog: dstackai/dstack@0.19.9...0.19.10
0.19.9-v1
CLI
Container exit status
The CLI now displays the container exit status of each failed run or job.
Monitoring
Metrics
Previously, dstack stored and displayed only metrics from the last hour, so the metrics of a finished run or job eventually disappeared.
Now, dstack keeps the last hour of metrics for all finished runs.
AMD
On AMD, a wider range of ROCm/AMD SMI versions is now supported. Previously, for certain versions, metrics were not shown properly.
Server
Robust handling of networking issues
It sometimes happens that the dstack server cannot establish connections to running instances due to networking problems or because instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server puts a graceful timeout of 2 minutes before considering jobs failed if instances are unreachable.
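The grace-period behaviour can be summarised as a small predicate. The sketch below is illustrative only and not the server's code; the constant and function names are invented for this example, and only the 2-minute window comes from the description above.

```python
from typing import Optional

DISCONNECT_GRACE_SECONDS = 120  # the 2-minute window described above


def should_fail(disconnected_at: Optional[float], now: float) -> bool:
    """Fail a job only after its instance has been unreachable for the whole grace window."""
    if disconnected_at is None:  # instance is currently reachable
        return False
    return now - disconnected_at >= DISCONNECT_GRACE_SECONDS


print(should_fail(None, now=1000.0))                   # False: reachable
print(should_fail(disconnected_at=950.0, now=1000.0))  # False: unreachable for 50 s
print(should_fail(disconnected_at=870.0, now=1000.0))  # True: unreachable for 130 s
```

A short network blip resets nothing in this model; the job simply keeps running as long as the instance comes back before the window elapses.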
Runs
DSTACK_RUN_ID and DSTACK_JOB_ID
Two new environment variables are now available within runs:
- DSTACK_RUN_ID stores the UUID of the run. Unlike DSTACK_RUN_NAME, it is unique for each run.
- DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
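One possible use of these variables is to give every job submission its own output location. The Python sketch below assumes a hypothetical `/checkpoints` root and fallback names for running outside dstack; only the two environment variable names come from the release notes.

```python
import os
from pathlib import PurePosixPath


def checkpoint_dir(env=os.environ, root="/checkpoints"):
    """Build a per-job directory path from the IDs dstack sets inside a run."""
    run_id = env.get("DSTACK_RUN_ID", "local-run")
    job_id = env.get("DSTACK_JOB_ID", "local-job")
    return str(PurePosixPath(root) / run_id / job_id)


# Invented UUIDs, passed explicitly so the sketch also runs outside dstack.
fake_env = {
    "DSTACK_RUN_ID": "11111111-1111-1111-1111-111111111111",
    "DSTACK_JOB_ID": "22222222-2222-2222-2222-222222222222",
}
print(checkpoint_dir(fake_env))
```

Because the job ID changes on every retry, a retried job would write to a fresh directory rather than overwrite the previous attempt.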
What's Changed
- Add rccl test by @Bihan in dstackai/dstack#2613
- [Docs] Extracted Distributed training examples by @peterschmidt85 in dstackai/dstack#2614
- [Docs] fix YAML indent on trl example by @aaroniscode in dstackai/dstack#2617
- Add example of including plugins into the dstack-server Docker image by @r4victor in dstackai/dstack#2620
- Pull and store process exit status from jobs by @un-def in dstackai/dstack#2615
- [.github] Fix python-test by @peterschmidt85 in dstackai/dstack#2619
- [Docker] Add dstackai/amd-smi image by @un-def in dstackai/dstack#2611
- [runner] Improve GPU metrics collector by @un-def in dstackai/dstack#2612
- Set DSTACK_RUN_ID and DSTACK_JOB_ID by @r4victor in dstackai/dstack#2622
- Drop override message when overriding finished runs by @r4victor in dstackai/dstack#2623
- Change default gpu count to 1.. by @r4victor in dstackai/dstack#2624
- Add Nebius InfiniBand fabric for us-central1 by @jvstme in dstackai/dstack#2629
- Introduce JOB_DISCONNECTED_RETRY_TIMEOUT by @r4victor in dstackai/dstack#2627
- Keep last metrics for finished jobs by @un-def in dstackai/dstack#2628
- Update Nebius default project detection by @jvstme in dstackai/dstack#2633
- [Docs]: Nebius InfiniBand clusters by @jvstme in dstackai/dstack#2634
- Update cudo image by @r4victor in dstackai/dstack#2636
New Contributors
- @aaroniscode made their first contribution in dstackai/dstack#2617
Full Changelog: dstackai/dstack@0.19.8...0.19.9
0.19.8-v1
ARM
dstack now supports compute instances with ARM CPUs. To request ARM CPUs in a run or fleet configuration, specify the arm architecture in the resources.cpu property:
resources:
  cpu: arm:4..  # 4 or more ARM cores
If the hosts in an SSH fleet have ARM CPUs, dstack will automatically detect them and enable their use.
To see available offers with ARM CPUs, pass --cpu arm to the dstack offer command.
Lambda
GH200
With the lambda backend, it's now possible to use GH200 instances that come with an ARM-based 72-core NVIDIA Grace CPU and an NVIDIA H200 Tensor Core GPU, connected with a high-bandwidth, memory-coherent NVIDIA NVLink-C2C interconnect.
type: dev-environment
name: my-env
ide: vscode
resources:
  gpu: GH200:1
If Lambda has GH200 on-demand instances available at the time, you'll see them when you run dstack apply:
$ dstack apply -f .dstack.yml
#  BACKEND             RESOURCES                                      INSTANCE TYPE  PRICE
1  lambda (us-east-3)  cpu=arm:64 mem=464GB disk=4399GB GH200:96GB:1  gpu_1x_gh200   $1.49
Note: if no GH200 is available at the moment, you can specify a retry policy in your run configuration so that dstack runs the configuration once the GPU becomes available.
Nebius
InfiniBand clusters
The nebius backend now supports InfiniBand clusters. A cluster is automatically created when you apply a fleet configuration with placement: cluster and supported GPUs: e.g. 8xH100 or 8xH200.
type: fleet
name: my-fleet
nodes: 2
placement: cluster
resources:
  gpu: H100,H200:8
A suitable InfiniBand fabric for the cluster is selected automatically. You can also limit the allowed fabrics in the backend settings.
Once the cluster is provisioned, you can benefit from its high-speed networking when running distributed tasks, such as NCCL tests or Hugging Face TRL.
Azure
Managed identities
The new vm_managed_identity backend setting allows you to configure the managed identity that is assigned to VMs created in the azure backend.
projects:
- name: main
  backends:
  - type: azure
    subscription_id: 06c82ce3-28ff-4285-a146-c5e981a9d808
    tenant_id: f84a7584-88e4-4fd2-8e97-623f0a715ee1
    creds:
      type: default
    vm_managed_identity: dstack-rg/my-managed-identity
Make sure that dstack has the required permissions for managed identities to work.
What's changed
- Fix: handle OSError from os.get_terminal_size() in CLI table rendering for non-TTY environments by @vuyelwadr in dstackai/dstack#2599
- Clarify how retry works for tasks and services by @r4victor in dstackai/dstack#2600
- [Docs] Added Tenstorrent example by @peterschmidt85 in dstackai/dstack#2596
- Lambda: Docker: use cgroupfs driver by @un-def in dstackai/dstack#2603
- Don't collect Prometheus metrics on container-based backends by @un-def in dstackai/dstack#2605
- Support Nebius InfiniBand clusters by @jvstme in dstackai/dstack#2604
- Add ARM64 support by @un-def in dstackai/dstack#2595
- Allow to configure Nebius InfiniBand fabrics by @jvstme in dstackai/dstack#2607
- Support vm_managed_identity for Azure by @r4victor in dstackai/dstack#2608
- Fix API quota hitting when provisioning many A3 instances by @r4victor in dstackai/dstack#2610
New contributors
- @vuyelwadr made their first contribution in dstackai/dstack#2599
Full changelog: dstackai/dstack@0.19.7...0.19.8

