---
title: Introducing passive GPU health checks
date: 2025-08-12
description: "TBA"
slug: gpu-health-checks
image: https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png
categories:
- Changelog
---

# Introducing passive GPU health checks

In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.

`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/guides/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.

<img src="https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png" width="630"/>

<!-- more -->

## Why this matters

Multi-GPU and multi-node workloads are only as strong as their weakest component. GPU cloud providers increasingly rely on automated health checks to prevent degraded hardware from reaching customers. Problems can stem from ECC memory errors, faulty PCIe links, overheating, or other hardware-level issues. Some are fatal; others allow the GPU to run but at reduced performance or with higher failure risk.

Passive checks like these run in the background. They continuously monitor hardware telemetry and system events, evaluating them against NVIDIA’s known failure patterns — all without pausing workloads.

## How it works in dstack

`dstack` automatically queries DCGM on each fleet instance and reports a health status:

* An `idle` status means no issues have been detected.
* An `idle (warning)` status indicates a non-fatal issue, such as a correctable ECC error. The instance remains usable but should be monitored.
* An `idle (failure)` status points to a fatal issue; the instance is automatically excluded from scheduling.

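The status semantics described above can be sketched in a few lines. This is an illustrative model only, not dstack's actual implementation — the `Incident` type and `instance_status` helper are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A single health issue reported by DCGM (hypothetical model)."""
    fatal: bool       # e.g. uncorrectable ECC error vs. a correctable one
    description: str


def instance_status(incidents: list[Incident]) -> str:
    """Collapse a list of health incidents into a single fleet status."""
    if any(i.fatal for i in incidents):
        return "idle (failure)"   # excluded from scheduling
    if incidents:
        return "idle (warning)"   # still usable, but worth monitoring
    return "idle"                 # no issues detected
```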
<div class="termy">

```shell
$ dstack fleet

 FLEET     INSTANCE  BACKEND          RESOURCES  STATUS          PRICE   CREATED
 my-fleet  0         aws (us-east-1)  T4:16GB:1  idle            $0.526  11 mins ago
           1         aws (us-east-1)  T4:16GB:1  idle (warning)  $0.526  11 mins ago
           2         aws (us-east-1)  T4:16GB:1  idle (failure)  $0.526  11 mins ago
```

</div>

A healthy instance is ready for workloads. A warning means you should monitor it closely. A failure removes it from scheduling entirely.

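For scripting against the CLI, the STATUS column can be pulled out of `dstack fleet` output with a small parser. A minimal sketch, assuming the column format shown above — `fleet_statuses` and `needs_attention` are hypothetical helpers, not part of dstack:

```python
import re

# Matches "idle", "idle (warning)", or "idle (failure)" in a table row.
STATUS_RE = re.compile(r"idle(?: \((?:warning|failure)\))?")


def fleet_statuses(output: str) -> list[str]:
    """Extract the STATUS value from each instance row of `dstack fleet` output."""
    return [m.group(0) for line in output.splitlines()
            if (m := STATUS_RE.search(line))]


def needs_attention(statuses: list[str]) -> list[str]:
    """Keep only statuses that carry a warning or failure."""
    return [s for s in statuses if s != "idle"]
```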
## Passive vs active checks

This release focuses on passive checks using DCGM background health checks. These run continuously and do not interrupt workloads.

For active checks today, you can run [NCCL tests](../../examples/clusters/nccl-tests/index.md) as a [distributed task](../../docs/concepts/tasks.md#distributed-tasks) to verify GPU-to-GPU communication and bandwidth across a fleet. Active tests like these can reveal network or interconnect issues that passive monitoring might miss. More built-in support for active diagnostics is planned.

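As a rough illustration of how active results might be consumed, one could compare the bus bandwidth reported by NCCL tests against an expected figure for the cluster. The helper, threshold, and numbers below are illustrative only, not anything dstack enforces:

```python
def interconnect_ok(measured_busbw_gbps: float,
                    expected_busbw_gbps: float,
                    tolerance: float = 0.8) -> bool:
    """Pass when measured NCCL bus bandwidth reaches a fraction of the expected value.

    A result well below expectation can point at a degraded link or NIC,
    even when every GPU passes its passive health checks.
    """
    return measured_busbw_gbps >= tolerance * expected_busbw_gbps
```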
## Supported backends

Passive GPU health checks work on AWS (except with custom `os_images`), Azure (except A10 GPUs), GCP, OCI, and [SSH fleets](../../docs/concepts/fleets.md#ssh) where DCGM is installed and configured for background checks.

> Fleets created before version 0.19.22 need to be recreated to enable this feature.

## Looking ahead

This update is about visibility: giving engineers real-time insight into GPU health before jobs run. Next comes automation — policies to skip GPUs with warnings, and self-healing workflows that replace unhealthy instances without manual steps.

If you have experience with GPU reliability or ideas for automated recovery, join the conversation on [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}.

!!! info "What's next?"
    1. Check [Quickstart](../../docs/quickstart.md)
    2. Explore the [clusters](../../docs/guides/clusters.md) guide
    3. Learn more about [metrics](../../docs/guides/metrics.md)
    4. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
