
Commit d7431cf

Deploying to gh-pages from @ dstackai/dstack@3a8cc3f 🚀
1 parent 192dc07 commit d7431cf


12 files changed

+4478
-356
lines changed


blog/changelog/index.html

Lines changed: 77 additions & 81 deletions
@@ -3278,6 +3278,17 @@
 </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
 
+<li class="md-nav__item">
+<a href="#introducing-passive-gpu-health-checks" class="md-nav__link">
+<span class="md-ellipsis">
+
+Introducing passive GPU health checks
+
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#supporting-hot-aisle-amd-ai-developer-cloud" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3375,17 +3386,6 @@
 </span>
 </a>
 
-</li>
-
-<li class="md-nav__item">
-<a href="#auto-shutdown-for-inactive-dev-environmentsno-idle-gpus" class="md-nav__link">
-<span class="md-ellipsis">
-
-Auto-shutdown for inactive dev environments—no idle GPUs
-
-</span>
-</a>
-
 </li>
 
 </ul>
@@ -3678,6 +3678,17 @@
 </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
 
+<li class="md-nav__item">
+<a href="#introducing-passive-gpu-health-checks" class="md-nav__link">
+<span class="md-ellipsis">
+
+Introducing passive GPU health checks
+
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#supporting-hot-aisle-amd-ai-developer-cloud" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3775,17 +3786,6 @@
 </span>
 </a>
 
-</li>
-
-<li class="md-nav__item">
-<a href="#auto-shutdown-for-inactive-dev-environmentsno-idle-gpus" class="md-nav__link">
-<span class="md-ellipsis">
-
-Auto-shutdown for inactive dev environments—no idle GPUs
-
-</span>
-</a>
-
 </li>
 
 </ul>
@@ -3902,6 +3902,17 @@
 </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
 
+<li class="md-nav__item">
+<a href="#introducing-passive-gpu-health-checks" class="md-nav__link">
+<span class="md-ellipsis">
+
+Introducing passive GPU health checks
+
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#supporting-hot-aisle-amd-ai-developer-cloud" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3999,17 +4010,6 @@
 </span>
 </a>
 
-</li>
-
-<li class="md-nav__item">
-<a href="#auto-shutdown-for-inactive-dev-environmentsno-idle-gpus" class="md-nav__link">
-<span class="md-ellipsis">
-
-Auto-shutdown for inactive dev environments—no idle GPUs
-
-</span>
-</a>
-
 </li>
 
 </ul>
@@ -4030,6 +4030,50 @@ <h1 id="changelog">Changelog<a class="headerlink" href="#changelog" title="Perma
 <article class="md-post md-post--excerpt">
 <header class="md-post__header">
 
+<div class="md-post__meta md-meta">
+<ul class="md-meta__list">
+<li class="md-meta__item">
+<time datetime="2025-08-12 00:00:00+00:00">August 12, 2025</time></li>
+
+<li class="md-meta__item">
+in
+
+<a href="./" class="md-meta__link">Changelog</a></li>
+
+
+
+<li class="md-meta__item">
+
+3 min read
+
+</li>
+
+
+</ul>
+
+</div>
+</header>
+<div class="md-post__content md-typeset">
+<h2 id="introducing-passive-gpu-health-checks"><a class="toclink" href="../gpu-helth-checks/">Introducing passive GPU health checks</a></h2>
+<p>In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.</p>
+<p><code>dstack</code> already supports GPU telemetry monitoring through NVIDIA DCGM <a href="../../docs/guides/metrics/">metrics</a>, covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM <a href="https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks">background health checks</a>. With these, <code>dstack</code> continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.</p>
+<p><img src="https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png" width="630"/></p>
+
+
+<nav class="md-post__action">
+<a href="../gpu-helth-checks/">
+<span>Continue reading</span>
+<span class="icon"><svg viewBox="0 0 13 10" xmlns="http://www.w3.org/2000/svg"><path d="M12.823 4.164L8.954.182a.592.592 0 0 0-.854 0 .635.635 0 0 0 0 .88l2.836 2.92H.604A.614.614 0 0 0 0 4.604c0 .344.27.622.604.622h10.332L8.1 8.146a.635.635 0 0 0 0 .88.594.594 0 0 0 .854 0l3.869-3.982a.635.635 0 0 0 0-.88z" fill-rule="nonzero" fill="currentColor" class="fill-main"></path></svg></span>
+</a>
+</nav>
+
+
+</div>
+</article>
+
+<article class="md-post md-post--excerpt">
+<header class="md-post__header">
+
 <div class="md-post__meta md-meta">
 <ul class="md-meta__list">
 <li class="md-meta__item">
@@ -4473,54 +4517,6 @@ <h2 id="supporting-intel-gaudi-ai-accelerators-with-ssh-fleets"><a class="toclin
 </div>
 </article>
 
-<article class="md-post md-post--excerpt">
-<header class="md-post__header">
-
-<div class="md-post__meta md-meta">
-<ul class="md-meta__list">
-<li class="md-meta__item">
-<time datetime="2025-02-19 00:00:00+00:00">February 19, 2025</time></li>
-
-<li class="md-meta__item">
-in
-
-<a href="./" class="md-meta__link">Changelog</a></li>
-
-
-
-<li class="md-meta__item">
-
-2 min read
-
-</li>
-
-
-</ul>
-
-</div>
-</header>
-<div class="md-post__content md-typeset">
-<h2 id="auto-shutdown-for-inactive-dev-environmentsno-idle-gpus"><a class="toclink" href="../inactivity-duration/">Auto-shutdown for inactive dev environments—no idle GPUs</a></h2>
-<p>Whether you’re using cloud or on-prem compute, you may want to test your code before launching a
-training task or deploying a service. <code>dstack</code>’s <a href="../../docs/concepts/dev-environments/">dev environments</a>
-make this easy by setting up a remote machine, cloning your repository, and configuring your IDE —all within
-a container that has GPU access.</p>
-<p>One issue with dev environments is forgetting to stop them or closing your laptop, leaving the GPU idle and costly. With
-our latest update, <code>dstack</code> now detects inactive environments and automatically shuts them down, saving you money.</p>
-<p><img src="https://dstack.ai/static-assets/static-assets/images/inactive-dev-environments-auto-shutdown.png" width="630"/></p>
-
-
-<nav class="md-post__action">
-<a href="../inactivity-duration/">
-<span>Continue reading</span>
-<span class="icon"><svg viewBox="0 0 13 10" xmlns="http://www.w3.org/2000/svg"><path d="M12.823 4.164L8.954.182a.592.592 0 0 0-.854 0 .635.635 0 0 0 0 .88l2.836 2.92H.604A.614.614 0 0 0 0 4.604c0 .344.27.622.604.622h10.332L8.1 8.146a.635.635 0 0 0 0 .88.594.594 0 0 0 .854 0l3.869-3.982a.635.635 0 0 0 0-.88z" fill-rule="nonzero" fill="currentColor" class="fill-main"></path></svg></span>
-</a>
-</nav>
-
-
-</div>
-</article>
-
 
 
 

blog/changelog/page/2/index.html

Lines changed: 81 additions & 0 deletions
@@ -3274,6 +3274,17 @@
 </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
 
+<li class="md-nav__item">
+<a href="#auto-shutdown-for-inactive-dev-environmentsno-idle-gpus" class="md-nav__link">
+<span class="md-ellipsis">
+
+Auto-shutdown for inactive dev environments—no idle GPUs
+
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#introducing-gpu-blocks-and-proxy-jump-for-ssh-fleets" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3661,6 +3672,17 @@
 </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
 
+<li class="md-nav__item">
+<a href="#auto-shutdown-for-inactive-dev-environmentsno-idle-gpus" class="md-nav__link">
+<span class="md-ellipsis">
+
+Auto-shutdown for inactive dev environments—no idle GPUs
+
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#introducing-gpu-blocks-and-proxy-jump-for-ssh-fleets" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3874,6 +3896,17 @@
 </label>
 <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
 
+<li class="md-nav__item">
+<a href="#auto-shutdown-for-inactive-dev-environmentsno-idle-gpus" class="md-nav__link">
+<span class="md-ellipsis">
+
+Auto-shutdown for inactive dev environments—no idle GPUs
+
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#introducing-gpu-blocks-and-proxy-jump-for-ssh-fleets" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3991,6 +4024,54 @@ <h1 id="changelog">Changelog<a class="headerlink" href="#changelog" title="Perma
 <article class="md-post md-post--excerpt">
 <header class="md-post__header">
 
+<div class="md-post__meta md-meta">
+<ul class="md-meta__list">
+<li class="md-meta__item">
+<time datetime="2025-02-19 00:00:00+00:00">February 19, 2025</time></li>
+
+<li class="md-meta__item">
+in
+
+<a href="../../" class="md-meta__link">Changelog</a></li>
+
+
+
+<li class="md-meta__item">
+
+2 min read
+
+</li>
+
+
+</ul>
+
+</div>
+</header>
+<div class="md-post__content md-typeset">
+<h2 id="auto-shutdown-for-inactive-dev-environmentsno-idle-gpus"><a class="toclink" href="../../../inactivity-duration/">Auto-shutdown for inactive dev environments—no idle GPUs</a></h2>
+<p>Whether you’re using cloud or on-prem compute, you may want to test your code before launching a
+training task or deploying a service. <code>dstack</code>’s <a href="../../../../docs/concepts/dev-environments/">dev environments</a>
+make this easy by setting up a remote machine, cloning your repository, and configuring your IDE —all within
+a container that has GPU access.</p>
+<p>One issue with dev environments is forgetting to stop them or closing your laptop, leaving the GPU idle and costly. With
+our latest update, <code>dstack</code> now detects inactive environments and automatically shuts them down, saving you money.</p>
+<p><img src="https://dstack.ai/static-assets/static-assets/images/inactive-dev-environments-auto-shutdown.png" width="630"/></p>
+
+
+<nav class="md-post__action">
+<a href="../../../inactivity-duration/">
+<span>Continue reading</span>
+<span class="icon"><svg viewBox="0 0 13 10" xmlns="http://www.w3.org/2000/svg"><path d="M12.823 4.164L8.954.182a.592.592 0 0 0-.854 0 .635.635 0 0 0 0 .88l2.836 2.92H.604A.614.614 0 0 0 0 4.604c0 .344.27.622.604.622h10.332L8.1 8.146a.635.635 0 0 0 0 .88.594.594 0 0 0 .854 0l3.869-3.982a.635.635 0 0 0 0-.88z" fill-rule="nonzero" fill="currentColor" class="fill-main"></path></svg></span>
+</a>
+</nav>
+
+
+</div>
+</article>
+
+<article class="md-post md-post--excerpt">
+<header class="md-post__header">
+
 <div class="md-post__meta md-meta">
 <ul class="md-meta__list">
 <li class="md-meta__item">
