<h2 id="introducing-passive-gpu-health-checks"><a class="toclink" href="../gpu-health-checks/">Introducing passive GPU health checks</a></h2>
+<p>In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.</p>
+<p><code>dstack</code> already supports GPU telemetry monitoring through NVIDIA DCGM <a href="../../docs/guides/metrics/">metrics</a>, covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM <a href="https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks">background health checks</a>. With these, <code>dstack</code> continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.</p>
<h2 id="auto-shutdown-for-inactive-dev-environmentsno-idle-gpus"><a class="toclink" href="../inactivity-duration/">Auto-shutdown for inactive dev environments—no idle GPUs</a></h2>
-<p>Whether you’re using cloud or on-prem compute, you may want to test your code before launching a
-training task or deploying a service. <code>dstack</code>’s <a href="../../docs/concepts/dev-environments/">dev environments</a>
-make this easy by setting up a remote machine, cloning your repository, and configuring your IDE —all within
-a container that has GPU access.</p>
-<p>One issue with dev environments is forgetting to stop them or closing your laptop, leaving the GPU idle and costly. With
-our latest update, <code>dstack</code> now detects inactive environments and automatically shuts them down, saving you money.</p>
<h2 id="auto-shutdown-for-inactive-dev-environmentsno-idle-gpus"><a class="toclink" href="../../../inactivity-duration/">Auto-shutdown for inactive dev environments—no idle GPUs</a></h2>
+<p>Whether you’re using cloud or on-prem compute, you may want to test your code before launching a
+training task or deploying a service. <code>dstack</code>’s <a href="../../../../docs/concepts/dev-environments/">dev environments</a>
+make this easy by setting up a remote machine, cloning your repository, and configuring your IDE —all within
+a container that has GPU access.</p>
+<p>One issue with dev environments is forgetting to stop them or closing your laptop, leaving the GPU idle and costly. With
+our latest update, <code>dstack</code> now detects inactive environments and automatically shuts them down, saving you money.</p>