[Docs]: Update the Metrics guide (#2441)

peterschmidt85 · web-flow · commit bbae827fdb94 · 2025-03-20T00:03:04.000-07:00
diff --git a/docs/assets/stylesheets/extra.css b/docs/assets/stylesheets/extra.css
@@ -201,6 +201,8 @@
 .md-typeset__scrollwrap {
     margin-top: 0;
     margin-bottom: 0;
+    margin-block-start: 1em;
+    margin-block-end: 1em;
 }
 
 .md-typeset__table {
diff --git a/docs/docs/guides/metrics.md b/docs/docs/guides/metrics.md
@@ -1,73 +1,119 @@
-# Prometheus metrics
+# Metrics
 
-If enabled, `dstack` collects and exports Prometheus metrics. Metrics are available at the `/metrics` path.
+## Prometheus
 
-By default, metrics are disabled. To enable, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` variable.
+When enabled, `dstack` collects metrics from fleets and runs and exports them
+to Prometheus.
 
-!!! info "Convention"
-    *type?* denotes an optional type. If a type is optional, an empty string is a valid value.
+### Setup
 
-## Instance metrics
+To collect and export metrics to Prometheus,
+set the `DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable and configure Prometheus to scrape
+the `<dstack server URL>/metrics` endpoint.
 
-| Metric | Type | Description | Examples |
-|---|---|---|---|
-| `dstack_instance_duration_seconds_total` | *counter* | Total seconds the instance is running | `1123763.22` |
-| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0`|
-| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` |
+??? info "NVIDIA DCGM"
+    NVIDIA DCGM metrics are automatically collected for AWS, Azure, GCP, and OCI backends, as well as for SSH fleets.
+    
+    For NVIDIA DCGM metrics to be collected from SSH fleets, make sure the `datacenter-gpu-manager-4-core`,
+    `datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts.
 
-| Label | Type | Examples |
-|---|---|---|
-| `dstack_project_name` | *string* | `main` |
-| `dstack_fleet_name` | *string?* | `my-fleet` |
-| `dstack_fleet_id` | *string?* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
-| `dstack_instance_name` | *string* | `my-fleet-0` |
-| `dstack_instance_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
-| `dstack_instance_type` | *string?* | `g4dn.xlarge` |
-| `dstack_backend` | *string?* | `aws`, `runpod` |
-| `dstack_gpu` | *string?* | `T4` |
+### Fleets
 
-## Job metrics
+Fleet metrics include metrics for each instance within a fleet. This includes information such as the instance's running
+time, price, GPU name, and more.
 
-| Metric | Type | Description | Examples |
-|---|---|---|---|
-| `dstack_job_duration_seconds_total` | *counter* | Total seconds the job is running | `520.37` |
-| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0`|
-| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` |
+=== "Metrics"
+    | Name                                     | Type      | Description                       | Examples     |
+    |------------------------------------------|-----------|-----------------------------------|--------------|
+    | `dstack_instance_duration_seconds_total` | *counter* | Total instance runtime in seconds | `1123763.22` |
+    | `dstack_instance_price_dollars_per_hour` | *gauge*   | Instance price, USD/hour          | `16.0`       |
+    | `dstack_instance_gpu_count`              | *gauge*   | Instance GPU count                | `4.0`, `0.0` |
 
-| Label | Type | Examples |
-|---|---|---|
-| `dstack_project_name` | *string* | `main` |
-| `dstack_user_name` | *string* | `alice` |
-| `dstack_run_name` | *string* | `nccl-tests` |
-| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
-| `dstack_job_name` | *string* | `nccl-tests-0-0` |
-| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
-| `dstack_job_num` | *integer* | `0` |
-| `dstack_replica_num` | *integer* | `0` |
-| `dstack_run_type` | *string* | `task`, `dev-environment` |
-| `dstack_backend` | *string* | `aws`, `runpod` |
-| `dstack_gpu` | *string?* | `T4` |
+=== "Labels"
+    | Name                   | Type      | Description   | Examples                               |
+    |------------------------|-----------|:--------------|----------------------------------------|
+    | `dstack_project_name`  | *string*  | Project name  | `main`                                 |
+    | `dstack_fleet_name`    | *string?* | Fleet name    | `my-fleet`                             |
+    | `dstack_fleet_id`      | *string?* | Fleet ID      | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
+    | `dstack_instance_name` | *string*  | Instance name | `my-fleet-0`                           |
+    | `dstack_instance_id`   | *string*  | Instance ID   | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
+    | `dstack_instance_type` | *string?* | Instance type | `g4dn.xlarge`                          |
+    | `dstack_backend`       | *string?* | Backend       | `aws`, `runpod`                        |
+    | `dstack_gpu`           | *string?* | GPU name      | `H100`                                 |
 
-## NVIDIA DCGM job metrics
+### Runs
 
-A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} on supported cloud backends — AWS, Azure, GCP, OCI — and SSH fleets.
+Run metrics include metrics for each job within a run.
+This includes information such as job runtime, price, GPU name, DCGM metrics, and more.
 
-??? info "SSH fleets"
-    In order for DCGM metrics to work, the following packages must be installed on the instances:
+=== "Metrics"
 
-    * `datacenter-gpu-manager-4-core`
-    * `datacenter-gpu-manager-4-proprietary`
-    * `datacenter-gpu-manager-exporter`
+    | Name                                            | Type      | Description                                                                                | Examples     |
+    |-------------------------------------------------|-----------|--------------------------------------------------------------------------------------------|--------------|
+    | `dstack_job_duration_seconds_total`             | *counter* | Total job runtime in seconds                                                               | `520.37`     |
+    | `dstack_job_price_dollars_per_hour`             | *gauge*   | Job instance price, USD/hour                                                               | `8.0`        |
+    | `dstack_job_gpu_count`                          | *gauge*   | Job GPU count                                                                              | `2.0`, `0.0` |
+    | `DCGM_FI_DEV_GPU_UTIL`                          | gauge     | GPU utilization (in %).                                                                    |              |
+    | `DCGM_FI_DEV_MEM_COPY_UTIL`                     | gauge     | Memory utilization (in %).                                                                 |              |
+    | `DCGM_FI_DEV_ENC_UTIL`                          | gauge     | Encoder utilization (in %).                                                                |              |
+    | `DCGM_FI_DEV_DEC_UTIL`                          | gauge     | Decoder utilization (in %).                                                                |              |
+    | `DCGM_FI_DEV_FB_FREE`                           | gauge     | Framebuffer memory free (in MiB).                                                          |              |
+    | `DCGM_FI_DEV_FB_USED`                           | gauge     | Framebuffer memory used (in MiB).                                                          |              |
+    | `DCGM_FI_PROF_GR_ENGINE_ACTIVE`                 | gauge     | The ratio of cycles during which a graphics engine or compute engine remains active.       |              |
+    | `DCGM_FI_PROF_SM_ACTIVE`                        | gauge     | The ratio of cycles an SM has at least 1 warp assigned.                                    |              |
+    | `DCGM_FI_PROF_SM_OCCUPANCY`                     | gauge     | The ratio of number of warps resident on an SM.                                            |              |
+    | `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`               | gauge     | Ratio of cycles the tensor (HMMA) pipe is active.                                          |              |
+    | `DCGM_FI_PROF_PIPE_FP64_ACTIVE`                 | gauge     | Ratio of cycles the fp64 pipes are active.                                                 |              |
+    | `DCGM_FI_PROF_PIPE_FP32_ACTIVE`                 | gauge     | Ratio of cycles the fp32 pipes are active.                                                 |              |
+    | `DCGM_FI_PROF_PIPE_FP16_ACTIVE`                 | gauge     | Ratio of cycles the fp16 pipes are active.                                                 |              |
+    | `DCGM_FI_PROF_PIPE_INT_ACTIVE`                  | gauge     | Ratio of cycles the integer pipe is active.                                                |              |
+    | `DCGM_FI_PROF_DRAM_ACTIVE`                      | gauge     | Ratio of cycles the device memory interface is active sending or receiving data.           |              |
+    | `DCGM_FI_PROF_PCIE_TX_BYTES`                    | counter   | The number of bytes of active PCIe tx (transmit) data including both header and payload.   |              |
+    | `DCGM_FI_PROF_PCIE_RX_BYTES`                    | counter   | The number of bytes of active PCIe rx (read) data including both header and payload.       |              |
+    | `DCGM_FI_DEV_SM_CLOCK`                          | gauge     | SM clock frequency (in MHz).                                                               |              |
+    | `DCGM_FI_DEV_MEM_CLOCK`                         | gauge     | Memory clock frequency (in MHz).                                                           |              |
+    | `DCGM_FI_DEV_MEMORY_TEMP`                       | gauge     | Memory temperature (in C).                                                                 |              |
+    | `DCGM_FI_DEV_GPU_TEMP`                          | gauge     | GPU temperature (in C).                                                                    |              |
+    | `DCGM_FI_DEV_POWER_USAGE`                       | gauge     | Power draw (in W).                                                                         |              |
+    | `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION`          | counter   | Total energy consumption since boot (in mJ).                                               |              |
+    | `DCGM_FI_DEV_PCIE_REPLAY_COUNTER`               | counter   | Total number of PCIe retries.                                                              |              |
+    | `DCGM_FI_DEV_XID_ERRORS`                        | gauge     | Value of the last XID error encountered.                                                   |              |
+    | `DCGM_FI_DEV_POWER_VIOLATION`                   | counter   | Throttling duration due to power constraints (in us).                                      |              |
+    | `DCGM_FI_DEV_THERMAL_VIOLATION`                 | counter   | Throttling duration due to thermal constraints (in us).                                    |              |
+    | `DCGM_FI_DEV_SYNC_BOOST_VIOLATION`              | counter   | Throttling duration due to sync-boost constraints (in us).                                 |              |
+    | `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION`             | counter   | Throttling duration due to board limit constraints (in us).                                |              |
+    | `DCGM_FI_DEV_LOW_UTIL_VIOLATION`                | counter   | Throttling duration due to low utilization (in us).                                        |              |
+    | `DCGM_FI_DEV_RELIABILITY_VIOLATION`             | counter   | Throttling duration due to reliability constraints (in us).                                |              |
+    | `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL`                 | counter   | Total number of single-bit volatile ECC errors.                                            |              |
+    | `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`                 | counter   | Total number of double-bit volatile ECC errors.                                            |              |
+    | `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL`                 | counter   | Total number of single-bit persistent ECC errors.                                          |              |
+    | `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL`                 | counter   | Total number of double-bit persistent ECC errors.                                          |              |
+    | `DCGM_FI_DEV_RETIRED_SBE`                       | counter   | Total number of retired pages due to single-bit errors.                                    |              |
+    | `DCGM_FI_DEV_RETIRED_DBE`                       | counter   | Total number of retired pages due to double-bit errors.                                    |              |
+    | `DCGM_FI_DEV_RETIRED_PENDING`                   | counter   | Total number of pages pending retirement.                                                  |              |
+    | `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS`       | counter   | Number of remapped rows for uncorrectable errors.                                          |              |
+    | `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS`         | counter   | Number of remapped rows for correctable errors.                                            |              |
+    | `DCGM_FI_DEV_ROW_REMAP_FAILURE`                 | gauge     | Whether remapping of rows has failed.                                                      |              |
+    | `DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL` | counter   | Total number of NVLink flow-control CRC errors.                                            |              |
+    | `DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL` | counter   | Total number of NVLink data CRC errors.                                                    |              |
+    | `DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL`   | counter   | Total number of NVLink retries.                                                            |              |
+    | `DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL` | counter   | Total number of NVLink recovery errors.                                                    |              |
+    | `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL`            | counter   | Total number of NVLink bandwidth counters for all lanes.                                   |              |
+    | `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0`               | counter   | The number of bytes of active NVLink rx or tx data including both header and payload.      |              |
+    | `DCGM_FI_PROF_NVLINK_RX_BYTES`                  | counter   | The number of bytes of active NVLink rx (read) data including both header and payload.     |              |
+    | `DCGM_FI_PROF_NVLINK_TX_BYTES`                  | counter   | The number of bytes of active NVLink tx (transmit) data including both header and payload. |              |
 
-Check [`dcgm/exporter.go`](https://github.com/dstackai/dstack/blob/master/runner/internal/shim/dcgm/exporter.go) for the list of metrics.
-
-| Label | Type | Examples |
-|---|---|---|
-| `dstack_project_name` | *string* | `main` |
-| `dstack_user_name` | *string* | `alice` |
-| `dstack_run_name` | *string* | `nccl-tests` |
-| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
-| `dstack_job_name` | *string* | `nccl-tests-0-0` |
-| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
-| `dstack_job_num` | *integer* | `0` |
-| `dstack_replica_num` | *integer* | `0` |
+=== "Labels"
+    | Name                  | Type      | Description            | Examples                               |
+    |-----------------------|-----------|:-----------------------|----------------------------------------|
+    | `dstack_project_name` | *string*  | Project name           | `main`                                 |
+    | `dstack_user_name`    | *string*  | User name              | `alice`                                |
+    | `dstack_run_name`     | *string*  | Run name               | `nccl-tests`                           |
+    | `dstack_run_id`       | *string*  | Run ID                 | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
+    | `dstack_job_name`     | *string*  | Job name               | `nccl-tests-0-0`                       |
+    | `dstack_job_id`       | *string*  | Job ID                 | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
+    | `dstack_job_num`      | *integer* | Job number             | `0`                                    |
+    | `dstack_replica_num`  | *integer* | Replica number         | `0`                                    |
+    | `dstack_run_type`     | *string*  | Run configuration type | `task`, `dev-environment`              |
+    | `dstack_backend`      | *string*  | Backend                | `aws`, `runpod`                        |
+    | `dstack_gpu`          | *string?* | GPU name               | `H100`                                 |
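
Note on consuming these metrics: the guide describes a `/metrics` endpoint with a duration counter and a price gauge per instance. The sketch below is not part of the commit. It parses a small, hand-written sample in the Prometheus text exposition format and estimates the accumulated cost of one instance. The sample text, the helper names `parse_metrics` and `estimated_cost`, and the assumption that the price stayed constant over the whole runtime are all illustrative; the exact output of a real `dstack` server may differ.

```python
import re

# Hand-written sample in Prometheus text exposition format. Metric and label
# names are taken from the tables in the diff; the values are the guide's
# example values. What a real server emits is an assumption here.
SAMPLE = '''\
# TYPE dstack_instance_duration_seconds_total counter
dstack_instance_duration_seconds_total{dstack_project_name="main",dstack_instance_name="my-fleet-0"} 1123763.22
# TYPE dstack_instance_price_dollars_per_hour gauge
dstack_instance_price_dollars_per_hour{dstack_project_name="main",dstack_instance_name="my-fleet-0"} 16.0
'''

# Matches 'name{labels} value' or 'name value' (no trailing timestamp).
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# Matches one key="value" pair, allowing backslash escapes inside the value.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples; comments are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


def estimated_cost(samples, instance_name):
    """Estimate cost in USD as runtime hours times the current hourly price."""
    seconds = price = None
    for name, labels, value in samples:
        if labels.get('dstack_instance_name') != instance_name:
            continue
        if name == 'dstack_instance_duration_seconds_total':
            seconds = value
        elif name == 'dstack_instance_price_dollars_per_hour':
            price = value
    if seconds is None or price is None:
        raise KeyError(f'metrics for instance {instance_name!r} not found')
    return seconds / 3600.0 * price


if __name__ == '__main__':
    print(round(estimated_cost(parse_metrics(SAMPLE), 'my-fleet-0'), 2))
```

In practice the same calculation would more likely be written as a query in Prometheus itself; the Python version only shows how the counter and the gauge relate.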
