|
1 | | -# Prometheus metrics |
| 1 | +# Metrics |
2 | 2 |
|
3 | | -If enabled, `dstack` collects and exports Prometheus metrics. Metrics are available at the `/metrics` path. |
| 3 | +## Prometheus |
4 | 4 |
|
5 | | -By default, metrics are disabled. To enable, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` variable. |
| 5 | +When enabled, `dstack` collects metrics from fleets and runs and exports them |
| 6 | +to Prometheus. |
6 | 7 |
|
7 | | -!!! info "Convention" |
8 | | - *type?* denotes an optional type. If a type is optional, an empty string is a valid value. |
| 8 | +### Setup |
9 | 9 |
|
10 | | -## Instance metrics |
| 10 | +To enable metrics collection and export, |
| 11 | +set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable and configure Prometheus to scrape |
| 12 | +the `<dstack server URL>/metrics` endpoint. |
11 | 13 |
|
12 | | -| Metric | Type | Description | Examples | |
13 | | -|---|---|---|---| |
14 | | -| `dstack_instance_duration_seconds_total` | *counter* | Total seconds the instance is running | `1123763.22` | |
15 | | -| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0`| |
16 | | -| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` | |
| 14 | +??? info "NVIDIA DCGM" |
| 15 | + NVIDIA DCGM metrics are automatically collected for AWS, Azure, GCP, and OCI backends, as well as for SSH fleets. |
| 16 | + |
| 17 | + To collect NVIDIA DCGM metrics from SSH fleets, make sure the `datacenter-gpu-manager-4-core`, |
| 18 | + `datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts. |
17 | 19 |
|
18 | | -| Label | Type | Examples | |
19 | | -|---|---|---| |
20 | | -| `dstack_project_name` | *string* | `main` | |
21 | | -| `dstack_fleet_name` | *string?* | `my-fleet` | |
22 | | -| `dstack_fleet_id` | *string?* | `51e837bf-fae9-4a37-ac9c-85c005606c22` | |
23 | | -| `dstack_instance_name` | *string* | `my-fleet-0` | |
24 | | -| `dstack_instance_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | |
25 | | -| `dstack_instance_type` | *string?* | `g4dn.xlarge` | |
26 | | -| `dstack_backend` | *string?* | `aws`, `runpod` | |
27 | | -| `dstack_gpu` | *string?* | `T4` | |
| 20 | +### Fleets |
28 | 21 |
|
29 | | -## Job metrics |
| 22 | +Fleet metrics are reported for each instance within a fleet. They include information such as the instance's running |
| 23 | +time, price, GPU name, and more. |
30 | 24 |
|
31 | | -| Metric | Type | Description | Examples | |
32 | | -|---|---|---|---| |
33 | | -| `dstack_job_duration_seconds_total` | *counter* | Total seconds the job is running | `520.37` | |
34 | | -| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0`| |
35 | | -| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` | |
| 25 | +=== "Metrics" |
| 26 | + | Name | Type | Description | Examples | |
| 27 | + |------------------------------------------|-----------|-----------------------------------|--------------| |
| 28 | + | `dstack_instance_duration_seconds_total` | *counter* | Total instance runtime in seconds | `1123763.22` | |
| 29 | + | `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0` | |
| 30 | + | `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` | |
36 | 31 |
|
37 | | -| Label | Type | Examples | |
38 | | -|---|---|---| |
39 | | -| `dstack_project_name` | *string* | `main` | |
40 | | -| `dstack_user_name` | *string* | `alice` | |
41 | | -| `dstack_run_name` | *string* | `nccl-tests` | |
42 | | -| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` | |
43 | | -| `dstack_job_name` | *string* | `nccl-tests-0-0` | |
44 | | -| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | |
45 | | -| `dstack_job_num` | *integer* | `0` | |
46 | | -| `dstack_replica_num` | *integer* | `0` | |
47 | | -| `dstack_run_type` | *string* | `task`, `dev-environment` | |
48 | | -| `dstack_backend` | *string* | `aws`, `runpod` | |
49 | | -| `dstack_gpu` | *string?* | `T4` | |
| 32 | +=== "Labels" |
| 33 | + | Name | Type | Description | Examples | |
| 34 | + |------------------------|-----------|:--------------|----------------------------------------| |
| 35 | + | `dstack_project_name` | *string* | Project name | `main` | |
| 36 | + | `dstack_fleet_name` | *string?* | Fleet name | `my-fleet` | |
| 37 | + | `dstack_fleet_id` | *string?* | Fleet ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` | |
| 38 | + | `dstack_instance_name` | *string* | Instance name | `my-fleet-0` | |
| 39 | + | `dstack_instance_id` | *string* | Instance ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | |
| 40 | + | `dstack_instance_type` | *string?* | Instance type | `g4dn.xlarge` | |
| 41 | + | `dstack_backend` | *string?* | Backend | `aws`, `runpod` | |
| 42 | + | `dstack_gpu` | *string?* | GPU name | `H100` | |
50 | 43 |
|
51 | | -## NVIDIA DCGM job metrics |
| 44 | +### Runs |
52 | 45 |
|
53 | | -A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} on supported cloud backends — AWS, Azure, GCP, OCI — and SSH fleets. |
| 46 | +Run metrics are reported for each job within a run. |
| 47 | +They include information such as job runtime, price, GPU name, DCGM metrics, and more. |
54 | 48 |
|
55 | | -??? info "SSH fleets" |
56 | | - In order for DCGM metrics to work, the following packages must be installed on the instances: |
| 49 | +=== "Metrics" |
57 | 50 |
|
58 | | - * `datacenter-gpu-manager-4-core` |
59 | | - * `datacenter-gpu-manager-4-proprietary` |
60 | | - * `datacenter-gpu-manager-exporter` |
| 51 | + | Name | Type | Description | Examples | |
| 52 | + |-------------------------------------------------|-----------|--------------------------------------------------------------------------------------------|--------------| |
| 53 | + | `dstack_job_duration_seconds_total` | *counter* | Total job runtime in seconds | `520.37` | |
| 54 | + | `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0` | |
| 55 | + | `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` | |
| 56 | + | `DCGM_FI_DEV_GPU_UTIL` | gauge | GPU utilization (in %). | | |
| 57 | + | `DCGM_FI_DEV_MEM_COPY_UTIL` | gauge | Memory utilization (in %). | | |
| 58 | + | `DCGM_FI_DEV_ENC_UTIL` | gauge | Encoder utilization (in %). | | |
| 59 | + | `DCGM_FI_DEV_DEC_UTIL` | gauge | Decoder utilization (in %). | | |
| 60 | + | `DCGM_FI_DEV_FB_FREE` | gauge | Framebuffer memory free (in MiB). | | |
| 61 | + | `DCGM_FI_DEV_FB_USED` | gauge | Framebuffer memory used (in MiB). | | |
| 62 | + | `DCGM_FI_PROF_GR_ENGINE_ACTIVE` | gauge | The ratio of cycles during which a graphics engine or compute engine remains active. | | |
| 63 | + | `DCGM_FI_PROF_SM_ACTIVE` | gauge | The ratio of cycles an SM has at least 1 warp assigned. | | |
| 64 | + | `DCGM_FI_PROF_SM_OCCUPANCY` | gauge | The ratio of number of warps resident on an SM. | | |
| 65 | + | `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` | gauge | Ratio of cycles the tensor (HMMA) pipe is active. | | |
| 66 | + | `DCGM_FI_PROF_PIPE_FP64_ACTIVE` | gauge | Ratio of cycles the fp64 pipes are active. | | |
| 67 | + | `DCGM_FI_PROF_PIPE_FP32_ACTIVE` | gauge | Ratio of cycles the fp32 pipes are active. | | |
| 68 | + | `DCGM_FI_PROF_PIPE_FP16_ACTIVE` | gauge | Ratio of cycles the fp16 pipes are active. | | |
| 69 | + | `DCGM_FI_PROF_PIPE_INT_ACTIVE` | gauge | Ratio of cycles the integer pipe is active. | | |
| 70 | + | `DCGM_FI_PROF_DRAM_ACTIVE` | gauge | Ratio of cycles the device memory interface is active sending or receiving data. | | |
| 71 | + | `DCGM_FI_PROF_PCIE_TX_BYTES` | counter | The number of bytes of active PCIe tx (transmit) data including both header and payload. | | |
| 72 | + | `DCGM_FI_PROF_PCIE_RX_BYTES` | counter | The number of bytes of active PCIe rx (read) data including both header and payload. | | |
| 73 | + | `DCGM_FI_DEV_SM_CLOCK` | gauge | SM clock frequency (in MHz). | | |
| 74 | + | `DCGM_FI_DEV_MEM_CLOCK` | gauge | Memory clock frequency (in MHz). | | |
| 75 | + | `DCGM_FI_DEV_MEMORY_TEMP` | gauge | Memory temperature (in C). | | |
| 76 | + | `DCGM_FI_DEV_GPU_TEMP` | gauge | GPU temperature (in C). | | |
| 77 | + | `DCGM_FI_DEV_POWER_USAGE` | gauge | Power draw (in W). | | |
| 78 | + | `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` | counter | Total energy consumption since boot (in mJ). | | |
| 79 | + | `DCGM_FI_DEV_PCIE_REPLAY_COUNTER` | counter | Total number of PCIe retries. | | |
| 80 | + | `DCGM_FI_DEV_XID_ERRORS` | gauge | Value of the last XID error encountered. | | |
| 81 | + | `DCGM_FI_DEV_POWER_VIOLATION` | counter | Throttling duration due to power constraints (in us). | | |
| 82 | + | `DCGM_FI_DEV_THERMAL_VIOLATION` | counter | Throttling duration due to thermal constraints (in us). | | |
| 83 | + | `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | counter | Throttling duration due to sync-boost constraints (in us). | | |
| 84 | + | `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION` | counter | Throttling duration due to board limit constraints (in us). | | |
| 85 | + | `DCGM_FI_DEV_LOW_UTIL_VIOLATION` | counter | Throttling duration due to low utilization (in us). | | |
| 86 | + | `DCGM_FI_DEV_RELIABILITY_VIOLATION` | counter | Throttling duration due to reliability constraints (in us). | | |
| 87 | + | `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | counter | Total number of single-bit volatile ECC errors. | | |
| 88 | + | `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | counter | Total number of double-bit volatile ECC errors. | | |
| 89 | + | `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | counter | Total number of single-bit persistent ECC errors. | | |
| 90 | + | `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | counter | Total number of double-bit persistent ECC errors. | | |
| 91 | + | `DCGM_FI_DEV_RETIRED_SBE` | counter | Total number of retired pages due to single-bit errors. | | |
| 92 | + | `DCGM_FI_DEV_RETIRED_DBE` | counter | Total number of retired pages due to double-bit errors. | | |
| 93 | + | `DCGM_FI_DEV_RETIRED_PENDING` | counter | Total number of pages pending retirement. | | |
| 94 | + | `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS` | counter | Number of remapped rows for uncorrectable errors. | | |
| 95 | + | `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS` | counter | Number of remapped rows for correctable errors. | | |
| 96 | + | `DCGM_FI_DEV_ROW_REMAP_FAILURE` | gauge | Whether remapping of rows has failed. | | |
| 97 | + | `DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL` | counter | Total number of NVLink flow-control CRC errors. | | |
| 98 | + | `DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL` | counter | Total number of NVLink data CRC errors. | | |
| 99 | + | `DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink retries. | | |
| 100 | + | `DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL` | counter | Total number of NVLink recovery errors. | | |
| 101 | + | `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` | counter | Total number of NVLink bandwidth counters for all lanes. | | |
| 102 | + | `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` | counter | The number of bytes of active NVLink rx or tx data including both header and payload. | | |
| 103 | + | `DCGM_FI_PROF_NVLINK_RX_BYTES` | counter | The number of bytes of active NVLink rx (read) data including both header and payload. | | |
| 104 | + | `DCGM_FI_PROF_NVLINK_TX_BYTES` | counter | The number of bytes of active NVLink tx (transmit) data including both header and payload. | | |
61 | 105 |
|
62 | | -Check [`dcgm/exporter.go`](https://github.com/dstackai/dstack/blob/master/runner/internal/shim/dcgm/exporter.go) for the list of metrics. |
63 | | - |
64 | | -| Label | Type | Examples | |
65 | | -|---|---|---| |
66 | | -| `dstack_project_name` | *string* | `main` | |
67 | | -| `dstack_user_name` | *string* | `alice` | |
68 | | -| `dstack_run_name` | *string* | `nccl-tests` | |
69 | | -| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` | |
70 | | -| `dstack_job_name` | *string* | `nccl-tests-0-0` | |
71 | | -| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | |
72 | | -| `dstack_job_num` | *integer* | `0` | |
73 | | -| `dstack_replica_num` | *integer* | `0` | |
| 106 | +=== "Labels" |
| 107 | + | Name | Type | Description | Examples | |
| 108 | + |-----------------------|-----------|:-----------------------|----------------------------------------| |
| 109 | + | `dstack_project_name` | *string* | Project name | `main` | |
| 110 | + | `dstack_user_name` | *string* | User name | `alice` | |
| 111 | + | `dstack_run_name` | *string* | Run name | `nccl-tests` | |
| 112 | + | `dstack_run_id` | *string* | Run ID | `51e837bf-fae9-4a37-ac9c-85c005606c22` | |
| 113 | + | `dstack_job_name` | *string* | Job name | `nccl-tests-0-0` | |
| 114 | + | `dstack_job_id` | *string* | Job ID | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` | |
| 115 | + | `dstack_job_num` | *integer* | Job number | `0` | |
| 116 | + | `dstack_replica_num` | *integer* | Replica number | `0` | |
| 117 | + | `dstack_run_type` | *string* | Run configuration type | `task`, `dev-environment` | |
| 118 | + | `dstack_backend` | *string* | Backend | `aws`, `runpod` | |
| 119 | + | `dstack_gpu` | *string?* | GPU name | `H100` | |
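
As a quick check of the setup described in the diff above, the `/metrics` endpoint can be fetched and filtered by metric prefix. The following is a minimal sketch, not part of `dstack`: the helper names `fetch_metrics` and `parse_samples` are hypothetical, the server URL `http://localhost:3000` is an assumed example value, and the sketch assumes the endpoint is reachable without extra authentication headers.

```python
import urllib.request


def fetch_metrics(server_url, timeout=10):
    """Download the raw Prometheus text exposition from <server_url>/metrics."""
    url = server_url.rstrip("/") + "/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text, prefix="dstack_"):
    """Return (series, value) pairs for samples whose metric name starts with prefix.

    `series` is the metric name together with its label set, exactly as exposed.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, "# HELP" / "# TYPE" comments, and other metric families.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            fields = line.split()
            series, rest = fields[0], fields[1:]
        # The first field after the series is the value; an optional timestamp may follow.
        samples.append((series, float(rest[0])))
    return samples


if __name__ == "__main__":
    # Assumed example URL; replace with the actual dstack server URL.
    body = fetch_metrics("http://localhost:3000")
    for series, value in parse_samples(body):
        print(series, value)
```

Passing `prefix="DCGM_"` to `parse_samples` selects the NVIDIA DCGM samples instead of the `dstack_*` ones.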