| 1 | +--- |
| 2 | +title: "Exporting fleet and run metrics to Prometheus" |
| 3 | +date: 2025-04-01 |
| 4 | +description: "TBA" |
| 5 | +slug: prometheus |
| 6 | +image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-prometheus-v3.png?raw=true |
| 7 | +categories: |
| 8 | + - Monitoring |
| 9 | + - NVIDIA |
| 10 | +--- |
| 11 | + |
| 12 | +# Exporting GPU, cost, and other metrics to Prometheus |
| 13 | + |
| 14 | +## Why Prometheus { style="display:none" } |
| 15 | + |
| 16 | +Effective AI infrastructure management requires full visibility into compute performance and costs. AI researchers need |
| 17 | +detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage |
| 18 | +across projects. |
| 19 | + |
| 20 | +While `dstack` provides key metrics through its UI and [`dstack stats`](dstack-stats.md) CLI, teams often need more granular data and prefer |
| 21 | +using their own monitoring tools. To support this, we’ve introduced a new endpoint that exports all collected |
| 22 | +metrics, covering fleets and runs, directly to Prometheus in real time. |
| 23 | + |
| 24 | +<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-prometheus-v3.png?raw=true" width="630"/> |
| 25 | + |
| 26 | +<!-- more --> |
| 27 | + |
| 28 | +## How to set it up |
| 29 | + |
| 30 | +To collect and export fleet and run metrics to Prometheus, set the |
| 31 | +`DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable when starting the `dstack` server. Once the server is running, configure Prometheus to scrape |
| 32 | +metrics from `<dstack server URL>/metrics`. |
| 33 | + |
| 34 | +Once Prometheus is set up, it will automatically pull metrics from the `dstack` server at the configured scrape interval. |
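
Before pointing Prometheus at the server, it can help to confirm that the endpoint returns data. The sketch below is one possible sanity check, not part of `dstack` itself: it downloads `<dstack server URL>/metrics` and parses the Prometheus text exposition format with a deliberately simple regex. The helper names `parse_metrics` and `fetch_metrics` and the address `http://localhost:3000` are assumptions made for illustration, and the request does not send any authentication headers, so adjust it if your deployment requires them.

```python
import re
import urllib.request

# Matches one sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 12.5 [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'
)
# Matches a single label pair; escaped characters are kept as-is, not unescaped.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples for every sample line in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def fetch_metrics(server_url):
    """Download the raw exposition text from <server_url>/metrics."""
    url = f"{server_url.rstrip('/')}/metrics"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


# Usage (assumed address; replace with your own dstack server URL):
#   for name, labels, value in parse_metrics(fetch_metrics("http://localhost:3000")):
#       print(name, labels, value)
```

If the loop prints samples, the endpoint is live and Prometheus should be able to scrape the same URL. For the actual metric names and labels, rely on the Monitoring guide rather than on this sketch.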
| 35 | + |
| 36 | +With metrics now in Prometheus, you can use Grafana to create dashboards, whether to monitor all projects at once or |
| 37 | +drill down into specific projects or users. |
| 38 | + |
| 39 | +<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-prometheus-grafana-dark.png?raw=true" width="800"/> |
| 40 | + |
| 41 | +Overall, `dstack` collects three groups of metrics: |
| 42 | + |
| 43 | +| Group | Description | |
| 44 | +|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 45 | +| **Fleets** | Fleet metrics include details for each instance, such as running time, price, GPU name, and more. | |
| 46 | +| **Runs** | Run metrics include run counters for each user in each project. | |
| 47 | +| **Jobs** | A run consists of one or more jobs, each mapped to a container. Job metrics offer insights into execution time, cost, GPU model, NVIDIA DCGM telemetry, and more. | |
| 48 | + |
| 49 | +For a full list of available metrics and labels, check out the [Monitoring](../../docs/guides/monitoring.md) guide. |
| 50 | + |
| 51 | +??? info "NVIDIA" |
| 52 | + NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends, |
| 53 | + as well as for [SSH fleets](../../docs/concepts/fleets.md#ssh). |
| 54 | + |
| 55 | +    To collect NVIDIA DCGM metrics from SSH fleets, make sure the `datacenter-gpu-manager-4-core`, |
| 56 | +    `datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts. |
| 57 | + |
| 58 | +??? info "AMD" |
| 59 | +    AMD device metrics are not yet exported to Prometheus for any backend. Support will be available soon. For now, AMD metrics are |
| 60 | +    only accessible through the UI and the [`dstack stats`](dstack-stats.md) CLI. |
| 61 | + |
| 62 | +!!! info "What's next?" |
| 63 | + 1. See the [Monitoring](../../docs/guides/monitoring.md) guide |
| 64 | +    2. Check [dev environments](../../docs/concepts/dev-environments.md), |
| 65 | +        [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), |
| 66 | +        and [fleets](../../docs/concepts/fleets.md) |
| 67 | +    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"} |