Commit ebe738a

[Blog] Exporting GPU, cost, and other metrics to Prometheus (#2458)
1 parent 969f728 commit ebe738a

1 file changed

+67
-0
lines changed

docs/blog/posts/prometheus.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
---
2+
title: "Exporting fleet and run metrics to Prometheus"
3+
date: 2025-04-01
4+
description: "TBA"
5+
slug: prometheus
6+
image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-prometheus-v3.png?raw=true
7+
categories:
8+
- Monitoring
9+
- NVIDIA
10+
---
11+
12+
# Exporting GPU, cost, and other metrics to Prometheus
13+
14+
## Why Prometheus { style="display:none" }
15+
16+
Effective AI infrastructure management requires full visibility into compute performance and costs. AI researchers need
17+
detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage
18+
across projects.
19+
20+
While `dstack` provides key metrics through its UI and [`dstack stats`](dstack-stats.md) CLI, teams often need more granular data and prefer
21+
to use their own monitoring tools. To support this, we’ve introduced a new endpoint that exposes all collected
22+
metrics, covering both fleets and runs, for Prometheus to scrape in real time.
23+
24+
<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-prometheus-v3.png?raw=true" width="630"/>
25+
26+
<!-- more -->
27+
28+
## How to set it up
29+
30+
To collect and export fleet and run metrics to Prometheus, set the
31+
`DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable when starting the `dstack` server. Once the server is running, configure Prometheus to pull
32+
metrics from `<dstack server URL>/metrics`.
33+
34+
Once Prometheus is set up, it will automatically pull metrics from the `dstack` server at the defined interval.
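
Before pointing Prometheus at the endpoint, it can help to check what the server returns. The following is a minimal sketch, not part of `dstack`: it fetches `<dstack server URL>/metrics` and parses the Prometheus text exposition format using only the standard library. The helper names, the sample payload, and the `example_*` metric and label names are hypothetical and for illustration only. It also assumes the endpoint is reachable without extra authentication headers, and it does not unescape label values.

```python
import re
import urllib.request

# One sample line of the Prometheus text format: name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(server_url):
    """Download the raw metrics text (assumes no auth is required)."""
    url = server_url.rstrip("/") + "/metrics"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload; real metric and label names are listed in the
# Monitoring guide.
SAMPLE = """\
# HELP example_instance_price Hypothetical metric, for illustration only
# TYPE example_instance_price gauge
example_instance_price{project="main",gpu="H100"} 2.5
example_run_count{project="main",user="alice"} 3
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

To try it against a live server, pass the output of `fetch_metrics("<dstack server URL>")` to `parse_metrics` instead of `SAMPLE`.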
35+
36+
With metrics now in Prometheus, you can use Grafana to create dashboards, whether to monitor all projects at once or
37+
drill down into specific projects or users.
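
Drilling down by project or user amounts to grouping samples by a label. In Grafana you would normally express this in PromQL with a `sum by (...)` aggregation; the sketch below shows the same idea in plain Python on already-parsed samples. The metric name `example_job_cost` and the `project` and `user` labels are hypothetical stand-ins, not the names `dstack` actually exports.

```python
from collections import defaultdict


def sum_by_label(samples, metric_name, label):
    """Sum the values of one metric, grouped by the given label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric_name and label in labels:
            totals[labels[label]] += value
    return dict(totals)


# Hypothetical samples as (name, labels, value) tuples.
samples = [
    ("example_job_cost", {"project": "main", "user": "alice"}, 1.5),
    ("example_job_cost", {"project": "main", "user": "bob"}, 0.5),
    ("example_job_cost", {"project": "research", "user": "alice"}, 2.0),
    ("example_run_count", {"project": "main", "user": "alice"}, 3.0),
]

per_project = sum_by_label(samples, "example_job_cost", "project")
per_user = sum_by_label(samples, "example_job_cost", "user")
print(per_project)
print(per_user)
```

The same function serves both views: group by the project label for an all-projects dashboard, or by the user label to see who is driving the cost.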
38+
39+
<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-prometheus-grafana-dark.png?raw=true" width="800"/>
40+
41+
Overall, `dstack` collects three groups of metrics:
42+
43+
| Group | Description |
44+
|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
45+
| **Fleets** | Fleet metrics include details for each instance, such as running time, price, GPU name, and more. |
46+
| **Runs** | Run metrics include run counters for each user in each project. |
47+
| **Jobs** | A run consists of one or more jobs, each mapped to a container. Job metrics offer insights into execution time, cost, GPU model, NVIDIA DCGM telemetry, and more. |
48+
49+
For a full list of available metrics and labels, check out the [Monitoring](../../docs/guides/monitoring.md) guide.
50+
51+
??? info "NVIDIA"
52+
NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
53+
as well as for [SSH fleets](../../docs/concepts/fleets.md#ssh).
54+
55+
To collect NVIDIA DCGM metrics from SSH fleets, make sure the `datacenter-gpu-manager-4-core`,
56+
`datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts.
57+
58+
??? info "AMD"
59+
AMD device metrics are not yet collected for any backend. Support is coming soon. For now, AMD metrics are
60+
only accessible through the UI and the [`dstack stats`](dstack-stats.md) CLI.
61+
62+
!!! info "What's next?"
63+
1. See the [Monitoring](../../docs/guides/monitoring.md) guide
64+
2. Check [dev environments](../../docs/concepts/dev-environments.md),
65+
[tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
66+
and [fleets](../../docs/concepts/fleets.md)
67+
3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
