Commit bbae827

[Docs]: Update the Metrics guide (#2441)
1 parent a2c76cb commit bbae827

2 files changed

+107
-59
lines changed

docs/assets/stylesheets/extra.css

Lines changed: 2 additions & 0 deletions
@@ -201,6 +201,8 @@
201201
.md-typeset__scrollwrap {
202202
margin-top: 0;
203203
margin-bottom: 0;
204+
margin-block-start: 1em;
205+
margin-block-end: 1em;
204206
}
205207

206208
.md-typeset__table {

docs/docs/guides/metrics.md

Lines changed: 105 additions & 59 deletions
@@ -1,73 +1,119 @@
1-
# Prometheus metrics
1+
# Metrics
22

3-
If enabled, `dstack` collects and exports Prometheus metrics. Metrics are available at the `/metrics` path.
3+
## Prometheus
44

5-
By default, metrics are disabled. To enable, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` variable.
5+
When enabled, `dstack` collects metrics from fleets and runs and exports them
6+
to Prometheus.
67

7-
!!! info "Convention"
8-
*type?* denotes an optional type. If a type is optional, an empty string is a valid value.
8+
### Setup
99

10-
## Instance metrics
10+
To enable collecting and exporting metrics to Prometheus,
11+
set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable, and point Prometheus to collect metrics
12+
from the `<dstack server URL>/metrics` endpoint.
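
To sanity-check the endpoint before wiring it into Prometheus, the response can be fetched and parsed by hand. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format and needs no extra authentication. The `fetch_metrics` and `parse_metrics` helpers and the `SAMPLE` payload are hypothetical illustrations, not part of `dstack`.

```python
import re
import urllib.request

# Hypothetical sample of a /metrics response; the label values are
# borrowed from the examples in this guide, not from a real server.
SAMPLE = '''\
# TYPE dstack_instance_price_dollars_per_hour gauge
dstack_instance_price_dollars_per_hour{dstack_project_name="main",dstack_instance_name="my-fleet-0"} 16.0
# TYPE dstack_instance_gpu_count gauge
dstack_instance_gpu_count{dstack_project_name="main",dstack_instance_name="my-fleet-0"} 4.0
'''

# Simplified patterns: a sample line is "name{labels} value".
# Lines carrying an optional trailing timestamp are not handled here.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url):
    """Download the raw metrics text (assumes no authentication is needed)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (metric name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("dstack_instance_name"), value)
```

Against a live server, `parse_metrics(fetch_metrics("<dstack server URL>/metrics"))` would be the equivalent call, with the placeholder replaced by the actual server URL.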
1113

12-
| Metric | Type | Description | Examples |
13-
|---|---|---|---|
14-
| `dstack_instance_duration_seconds_total` | *counter* | Total seconds the instance is running | `1123763.22` |
15-
| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0`|
16-
| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` |
14+
??? info "NVIDIA DCGM"
15+
NVIDIA DCGM metrics are automatically collected for AWS, Azure, GCP, and OCI backends, as well as for SSH fleets.
16+
17+
To collect NVIDIA DCGM metrics from SSH fleets, ensure the `datacenter-gpu-manager-4-core`,
18+
`datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts.
1719

18-
| Label | Type | Examples |
19-
|---|---|---|
20-
| `dstack_project_name` | *string* | `main` |
21-
| `dstack_fleet_name` | *string?* | `my-fleet` |
22-
| `dstack_fleet_id` | *string?* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
23-
| `dstack_instance_name` | *string* | `my-fleet-0` |
24-
| `dstack_instance_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
25-
| `dstack_instance_type` | *string?* | `g4dn.xlarge` |
26-
| `dstack_backend` | *string?* | `aws`, `runpod` |
27-
| `dstack_gpu` | *string?* | `T4` |
20+
### Fleets
2821

29-
## Job metrics
22+
Fleet metrics include metrics for each instance within a fleet. This includes information such as the instance's running
23+
time, price, GPU name, and more.
3024

31-
| Metric | Type | Description | Examples |
32-
|---|---|---|---|
33-
| `dstack_job_duration_seconds_total` | *counter* | Total seconds the job is running | `520.37` |
34-
| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0`|
35-
| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` |
25+
=== "Metrics"
26+
| Name | Type | Description | Examples |
27+
|------------------------------------------|-----------|-----------------------------------|--------------|
28+
| `dstack_instance_duration_seconds_total` | *counter* | Total instance runtime in seconds | `1123763.22` |
29+
| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0` |
30+
| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` |
3631

37-
| Label | Type | Examples |
38-
|---|---|---|
39-
| `dstack_project_name` | *string* | `main` |
40-
| `dstack_user_name` | *string* | `alice` |
41-
| `dstack_run_name` | *string* | `nccl-tests` |
42-
| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
43-
| `dstack_job_name` | *string* | `nccl-tests-0-0` |
44-
| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
45-
| `dstack_job_num` | *integer* | `0` |
46-
| `dstack_replica_num` | *integer* | `0` |
47-
| `dstack_run_type` | *string* | `task`, `dev-environment` |
48-
| `dstack_backend` | *string* | `aws`, `runpod` |
49-
| `dstack_gpu` | *string?* | `T4` |
32+
=== "Labels"
33+
| Name | Type | Description | Examples |
34+
|------------------------|-----------|:--------------|----------------------------------------|
35+
| `dstack_project_name` | *string* | Project name | `main` |
36+
| `dstack_fleet_name` | *string?* | Fleet name | `my-fleet` |
37+
| `dstack_fleet_id` | *string?* | Fleet ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
38+
| `dstack_instance_name` | *string* | Instance name | `my-fleet-0` |
39+
| `dstack_instance_id` | *string* | Instance ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
40+
| `dstack_instance_type` | *string?* | Instance type | `g4dn.xlarge` |
41+
| `dstack_backend` | *string?* | Backend | `aws`, `runpod` |
42+
| `dstack_gpu` | *string?* | GPU name | `H100` |
5043

51-
## NVIDIA DCGM job metrics
44+
### Runs
5245

53-
A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} on supported cloud backends — AWS, Azure, GCP, OCI — and SSH fleets.
46+
Run metrics include metrics for each job within a run.
47+
This includes information such as job runtime, price, GPU name, DCGM metrics, and more.
5448

55-
??? info "SSH fleets"
56-
In order for DCGM metrics to work, the following packages must be installed on the instances:
49+
=== "Metrics"
5750

58-
* `datacenter-gpu-manager-4-core`
59-
* `datacenter-gpu-manager-4-proprietary`
60-
* `datacenter-gpu-manager-exporter`
51+
| Name | Type | Description | Examples |
52+
|-------------------------------------------------|-----------|--------------------------------------------------------------------------------------------|--------------|
53+
| `dstack_job_duration_seconds_total` | *counter* | Total job runtime in seconds | `520.37` |
54+
| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0` |
55+
| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` |
56+
| `DCGM_FI_DEV_GPU_UTIL` | gauge | GPU utilization (in %). | |
57+
| `DCGM_FI_DEV_MEM_COPY_UTIL` | gauge | Memory utilization (in %). | |
58+
| `DCGM_FI_DEV_ENC_UTIL` | gauge | Encoder utilization (in %). | |
59+
| `DCGM_FI_DEV_DEC_UTIL` | gauge | Decoder utilization (in %). | |
60+
| `DCGM_FI_DEV_FB_FREE` | gauge | Framebuffer memory free (in MiB). | |
61+
| `DCGM_FI_DEV_FB_USED` | gauge | Framebuffer memory used (in MiB). | |
62+
| `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | gauge | The ratio of cycles during which a graphics engine or compute engine remains active. | |
63+
| `DCGM_FI_PROF_SM_ACTIVE` | gauge | The ratio of cycles an SM has at least 1 warp assigned. | |
64+
| `DCGM_FI_PROF_SM_OCCUPANCY` | gauge | The ratio of number of warps resident on an SM. | |
65+
| `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` | gauge | Ratio of cycles the tensor (HMMA) pipe is active. | |
66+
| `DCGM_FI_PROF_PIPE_FP64_ACTIVE` | gauge | Ratio of cycles the fp64 pipes are active. | |
67+
| `DCGM_FI_PROF_PIPE_FP32_ACTIVE` | gauge | Ratio of cycles the fp32 pipes are active. | |
68+
| `DCGM_FI_PROF_PIPE_FP16_ACTIVE` | gauge | Ratio of cycles the fp16 pipes are active. | |
69+
| `DCGM_FI_PROF_PIPE_INT_ACTIVE` | gauge | Ratio of cycles the integer pipe is active. | |
70+
| `DCGM_FI_PROF_DRAM_ACTIVE` | gauge | Ratio of cycles the device memory interface is active sending or receiving data. | |
71+
| `DCGM_FI_PROF_PCIE_TX_BYTES` | counter | The number of bytes of active PCIe tx (transmit) data including both header and payload. | |
72+
| `DCGM_FI_PROF_PCIE_RX_BYTES` | counter | The number of bytes of active PCIe rx (read) data including both header and payload. | |
73+
| `DCGM_FI_DEV_SM_CLOCK` | gauge | SM clock frequency (in MHz). | |
74+
| `DCGM_FI_DEV_MEM_CLOCK` | gauge | Memory clock frequency (in MHz). | |
75+
| `DCGM_FI_DEV_MEMORY_TEMP` | gauge | Memory temperature (in C). | |
76+
| `DCGM_FI_DEV_GPU_TEMP` | gauge | GPU temperature (in C). | |
77+
| `DCGM_FI_DEV_POWER_USAGE` | gauge | Power draw (in W). | |
78+
| `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` | counter | Total energy consumption since boot (in mJ). | |
79+
| `DCGM_FI_DEV_PCIE_REPLAY_COUNTER` | counter | Total number of PCIe retries. | |
80+
| `DCGM_FI_DEV_XID_ERRORS` | gauge | Value of the last XID error encountered. | |
81+
| `DCGM_FI_DEV_POWER_VIOLATION` | counter | Throttling duration due to power constraints (in us). | |
82+
| `DCGM_FI_DEV_THERMAL_VIOLATION` | counter | Throttling duration due to thermal constraints (in us). | |
83+
| `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | counter | Throttling duration due to sync-boost constraints (in us). | |
84+
| `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION` | counter | Throttling duration due to board limit constraints (in us). | |
85+
| `DCGM_FI_DEV_LOW_UTIL_VIOLATION` | counter | Throttling duration due to low utilization (in us). | |
86+
| `DCGM_FI_DEV_RELIABILITY_VIOLATION` | counter | Throttling duration due to reliability constraints (in us). | |
87+
| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | counter | Total number of single-bit volatile ECC errors. | |
88+
| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | counter | Total number of double-bit volatile ECC errors. | |
89+
| `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | counter | Total number of single-bit persistent ECC errors. | |
90+
| `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | counter | Total number of double-bit persistent ECC errors. | |
91+
| `DCGM_FI_DEV_RETIRED_SBE` | counter | Total number of retired pages due to single-bit errors. | |
92+
| `DCGM_FI_DEV_RETIRED_DBE` | counter | Total number of retired pages due to double-bit errors. | |
93+
| `DCGM_FI_DEV_RETIRED_PENDING` | counter | Total number of pages pending retirement. | |
94+
| `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS` | counter | Number of remapped rows for uncorrectable errors | |
95+
| `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS` | counter | Number of remapped rows for correctable errors | |
96+
| `DCGM_FI_DEV_ROW_REMAP_FAILURE` | gauge | Whether remapping of rows has failed | |
97+
| `DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL` | counter | Total number of NVLink flow-control CRC errors. | |
98+
| `DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL` | counter | Total number of NVLink data CRC errors. | |
99+
| `DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink retries. | |
100+
| `DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink recovery errors. | |
101+
| `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` | counter | Total number of NVLink bandwidth counters for all lanes. | |
102+
| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` | counter | The number of bytes of active NVLink rx or tx data including both header and payload. | |
103+
| `DCGM_FI_PROF_NVLINK_RX_BYTES` | counter | The number of bytes of active NvLink rx (read) data including both header and payload. | |
104+
| `DCGM_FI_PROF_NVLINK_TX_BYTES` | counter | The number of bytes of active NvLink tx (transmit) data including both header and payload. | |
61105

62-
Check [`dcgm/exporter.go`](https://github.com/dstackai/dstack/blob/master/runner/internal/shim/dcgm/exporter.go) for the list of metrics.
63-
64-
| Label | Type | Examples |
65-
|---|---|---|
66-
| `dstack_project_name` | *string* | `main` |
67-
| `dstack_user_name` | *string* | `alice` |
68-
| `dstack_run_name` | *string* | `nccl-tests` |
69-
| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
70-
| `dstack_job_name` | *string* | `nccl-tests-0-0` |
71-
| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
72-
| `dstack_job_num` | *integer* | `0` |
73-
| `dstack_replica_num` | *integer* | `0` |
106+
=== "Labels"
107+
| Name | Type | Description | Examples |
108+
|-----------------------|-----------|:-----------------------|----------------------------------------|
109+
| `dstack_project_name` | *string* | Project name | `main` |
110+
| `dstack_user_name` | *string* | User name | `alice` |
111+
| `dstack_run_name` | *string* | Run name | `nccl-tests` |
112+
| `dstack_run_id` | *string* | Run ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
113+
| `dstack_job_name` | *string* | Job name | `nccl-tests-0-0` |
114+
| `dstack_job_id` | *string* | Job ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
115+
| `dstack_job_num` | *integer* | Job number | `0` |
116+
| `dstack_replica_num` | *integer* | Replica number | `0` |
117+
| `dstack_run_type` | *string* | Run configuration type | `task`, `dev-environment` |
118+
| `dstack_backend` | *string* | Backend | `aws`, `runpod` |
119+
| `dstack_gpu` | *string?* | GPU name | `H100` |
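
As a worked example of combining two of the job metrics above, the accrued cost of a job can be estimated from its runtime and hourly price. This is a rough sketch that assumes the price stays constant over the job's lifetime. The `estimated_cost_usd` helper is hypothetical, and the inputs are the example values from the metrics table.

```python
def estimated_cost_usd(duration_seconds, price_per_hour):
    """Estimate cost as runtime in hours times the hourly price.

    Assumes the hourly price did not change while the job was running.
    """
    return duration_seconds / 3600.0 * price_per_hour


# Example values for dstack_job_duration_seconds_total and
# dstack_job_price_dollars_per_hour taken from the table above.
cost = estimated_cost_usd(520.37, 8.0)
print(f"Estimated job cost: ${cost:.2f}")
```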
