Skip to content

Commit 3a8cc3f

Browse files
[Changelog] Introducing GPU passive health checks
Review feedback
1 parent 9ff1274 commit 3a8cc3f

1 file changed

Lines changed: 1 addition & 1 deletion

File tree

docs/blog/posts/gpu-health-checks.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ In large-scale training, a single bad GPU can derail progress. Sometimes the fai
2222

2323
Multi-GPU and multi-node workloads are only as strong as their weakest component. GPU cloud providers increasingly rely on automated health checks to prevent degraded hardware from reaching customers. Problems can stem from ECC memory errors, faulty PCIe links, overheating, or other hardware-level issues. Some are fatal, others allow the GPU to run but at reduced performance or with higher failure risk.
2424

25-
Passive checks like these run in the background. They collect telemetry, run lightweight internal tests, and compare results to NVIDIA’s known failure patterns — without pausing workloads.
25+
Passive checks like these run in the background. They continuously monitor hardware telemetry and system events, evaluating them against NVIDIA’s known failure patterns — all without pausing workloads.
2626

2727
## How it works in dstack
2828

0 commit comments

Comments
 (0)