These essential CockroachDB metrics enable you to build custom dashboards with the following tools:
{% if include.deployment == 'self-hosted' %}
- [Grafana]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}#step-5-visualize-metrics-in-grafana)
- [Datadog Integration]({% link {{ page.version.version }}/datadog.md %}) - The Datadog Integration Metric Name column lists the corresponding Datadog metric, which requires the `cockroachdb.` prefix.
{% elsif include.deployment == 'advanced' %}
- [Datadog integration]({% link cockroachcloud/tools-page.md %}#monitor-cockroachdb-cloud-with-datadog) - The Datadog Integration Metric Name column lists the corresponding Datadog metric, which requires the `crdb_dedicated.` prefix.
- [Metrics export]({% link cockroachcloud/export-metrics-advanced.md %})
{% endif %}

The Usage column explains why each metric is important to visualize in a custom dashboard and how to make both practical and actionable use of the metric in a production deployment.
|
top command output. The metric value can be more than 1 (or 100%) on multi-core systems. It is best to combine user and system metrics. |
| sys.cpu.sys.percent | sys.cpu.sys.percent | Current system CPU percentage consumed by the CRDB process | This metric gives the CPU usage percentage at the system (Linux kernel) level by the CockroachDB process only. This is similar to the Linux top command output. The metric value can be more than 1 (or 100%) on multi-core systems. It is best to combine user and system metrics. |
| sys.rss | sys.rss | Current process memory (RSS) | This metric gives the amount of RAM used by the CockroachDB process. Persistently low values over an extended period of time suggest there is underutilized memory that can be put to work with adjusted [settings for --cache or --max-sql-memory]({% link {{ page.version.version }}/recommended-production-settings.md %}#cache-and-sql-memory-size) or both. Conversely, high utilization, even if only a temporary spike, indicates an increased risk of an [out-of-memory (OOM) crash]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#out-of-memory-oom-crash), particularly since [swap is generally disabled]({% link {{ page.version.version }}/recommended-production-settings.md %}#memory). |
| sql.mem.root.current | {% if include.deployment == 'self-hosted' %}sql.mem.root.current |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Current sql statement memory usage for root | This metric shows how memory set aside for temporary materializations, such as hash tables and intermediary result sets, is utilized. Use this metric to optimize memory allocations based on long-term observations. The maximum amount is set with [--max-sql-memory]({% link {{ page.version.version }}/recommended-production-settings.md %}#cache-and-sql-memory-size). If the utilization of SQL memory is persistently low, some portion of this memory allocation can perhaps be shifted to [--cache]({% link {{ page.version.version }}/recommended-production-settings.md %}#cache-and-sql-memory-size). |
| sys.host.disk.write.bytes | {% if include.deployment == 'self-hosted' %}sys.host.disk.write.bytes |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Bytes written to all disks since this process started | This metric reports the effective storage device write throughput (MB/s) rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.write.count | {% if include.deployment == 'self-hosted' %}sys.host.disk.write |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Disk write operations across all disks since this process started | This metric reports the effective storage device write IOPS rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.read.bytes | {% if include.deployment == 'self-hosted' %}sys.host.disk.read.bytes |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Bytes read from all disks since this process started | This metric reports the effective storage device read throughput (MB/s) rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.read.count | {% if include.deployment == 'self-hosted' %}sys.host.disk.read |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Disk read operations across all disks since this process started | This metric reports the effective storage device read IOPS rate. To confirm that storage is sufficiently provisioned, assess the I/O performance rates (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.iopsinprogress | {% if include.deployment == 'self-hosted' %}sys.host.disk.iopsinprogress |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} IO operations currently in progress on this host | This metric gives the average queue length of the storage device. It characterizes the storage device's performance capability. All I/O performance metrics are Linux counters and correspond to the avgqu-sz in the Linux iostat command output. You need to view the device queue graph in the context of the actual read/write IOPS and MBPS metrics that show the actual device utilization. If the device is not keeping up, the queue will grow. Values over 10 are bad. Values around 5 mean the device is working hard trying to keep up. For internal (on chassis) NVMe devices, the queue values are typically 0. For network-connected devices, such as AWS EBS volumes, the normal operating range of values is 1 to 2. Spikes in values are OK. They indicate an I/O spike where the device fell behind and then caught up. End users may experience inconsistent response times, but there should be no cluster stability issues. If the queue is greater than 5 for an extended period of time and IOPS or MBPS are low, then the storage is most likely not provisioned per Cockroach Labs guidance. On AWS, this is commonly caused by an EBS volume type, such as gp2, that is not suitable as primary database storage. If I/O is low and the queue is low, the most likely scenario is that the CPU is lacking and not driving I/O. One such case is a cluster with nodes with only 2 vCPUs, which is not a supported [sizing]({% link {{ page.version.version }}/recommended-production-settings.md %}#sizing) for production deployments. There are quite a few background processes in the database that take CPU away from the workload, so the workload is just not getting the CPU. Review [storage and disk I/O]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#storage-and-disk-i-o). |
| sys.host.net.recv.bytes | sys.host.net.recv.bytes | Bytes received on all network interfaces since this process started | Use this metric to chart the node's ingress/egress network transfer rates and look for flat sections, which may indicate insufficiently provisioned networking or high error rates. CockroachDB uses the reliable TCP/IP protocol, so errors result in delivery retries that create a "slow network" effect. |
| sys.host.net.send.bytes | sys.host.net.send.bytes | Bytes sent on all network interfaces since this process started | Use this metric to chart the node's ingress/egress network transfer rates and look for flat sections, which may indicate insufficiently provisioned networking or high error rates. CockroachDB uses the reliable TCP/IP protocol, so errors result in delivery retries that create a "slow network" effect. |
| clock-offset.meannanos | clock.offset.meannanos | Mean clock offset with other nodes | This metric gives the node's clock skew. In a well-configured environment, the actual clock skew would be in the sub-millisecond range. A skew exceeding 5 ms is likely due to an NTP service misconfiguration. Reducing the actual clock skew reduces the probability of uncertainty-related conflicts and corresponding retries, which has a positive impact on workload performance. Conversely, a larger actual clock skew increases the probability of retries due to uncertainty conflicts, with potentially measurable adverse effects on workload performance. |
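
The disk and network metrics in the table above are cumulative counters (for example, sys.host.disk.write.bytes counts bytes written since the process started), so a dashboard derives throughput and IOPS from the difference between two scrapes. The following Python sketch illustrates that calculation. The `Sample` type and the `rate_per_second` and `bytes_to_mb_per_s` helpers are hypothetical names used only for illustration; they are not part of CockroachDB or any monitoring tool. Treating a decreasing counter as a restart from zero is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of a cumulative counter (hypothetical helper type)."""
    timestamp_s: float  # scrape time in seconds
    value: float        # counter value, e.g. sys.host.disk.write.bytes


def rate_per_second(prev: Sample, curr: Sample) -> float:
    """Return the average per-second rate between two scrapes."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.value - prev.value
    if delta < 0:
        # Counter went backwards: assume the process restarted and the
        # counter began again from zero.
        delta = curr.value
    return delta / elapsed


def bytes_to_mb_per_s(bytes_per_s: float) -> float:
    """Convert bytes/s to MB/s (using 1 MB = 1,000,000 bytes)."""
    return bytes_per_s / 1_000_000


if __name__ == "__main__":
    earlier = Sample(timestamp_s=0.0, value=1_000_000_000)
    later = Sample(timestamp_s=10.0, value=1_500_000_000)
    print(bytes_to_mb_per_s(rate_per_second(earlier, later)))
```

Applying the same helper to sys.host.disk.write.count or sys.host.disk.read.count yields an IOPS figure that can be viewed alongside sys.host.disk.iopsinprogress.
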
(add `cockroachdb.` prefix)
(add `crdb_dedicated.` prefix)
|
storage.l0-num-files, storage.l0-sublevels or rocksdb.read-amplification directly. A healthy LSM shape is defined as “read-amp < 20” and “L0-files < 1000”, looking at [cluster settings]({% link {{ page.version.version }}/cluster-settings.md %}) admission.l0_sub_level_count_overload_threshold and admission.l0_file_count_overload_threshold respectively. |
| admission.wait_durations.kv-p75 | {% if include.deployment == 'self-hosted' %}admission.wait.durations.kv |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Wait time durations for requests that waited | This metric shows whether the CPU utilization-based admission control feature is working effectively or is potentially overaggressive. This is a latency histogram of how much delay was added to the workload due to throttling by CPU control. If you observe waits of over 100ms for over 5 seconds while excess CPU capacity was available, then admission control is overly aggressive. |
| admission.wait_durations.kv-stores-p75 | {% if include.deployment == 'self-hosted' %}admission.wait.durations.kv_stores |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Wait time durations for requests that waited | This metric shows whether the I/O utilization-based admission control feature is working effectively or is potentially overaggressive. This is a latency histogram of how much delay was added to the workload due to throttling by I/O control. If you observe waits of over 100ms for over 5 seconds while excess I/O capacity was available, then admission control is overly aggressive. |
| sys.runnable.goroutines.per.cpu | {% if include.deployment == 'self-hosted' %}sys.runnable.goroutines.per_cpu |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Average number of goroutines that are waiting to run, normalized by number of cores | If this metric has a value over 30, it indicates a CPU overload. If the condition lasts a short period of time (a few seconds), the database users are likely to experience inconsistent response times. If the condition persists for an extended period of time (tens of seconds, or minutes) the cluster may start developing stability issues. Review [CPU planning]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#cpu). |
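
Several rows above distinguish a brief excursion from a sustained one, such as sys.runnable.goroutines.per.cpu staying over 30 for a few seconds versus tens of seconds. The following Python sketch shows one way to measure how long a series stayed above a threshold. The function name `longest_overload_seconds` and the `(timestamp, value)` input shape are assumptions made for illustration, not an existing API.

```python
from typing import Iterable, Optional, Tuple

OVERLOAD_THRESHOLD = 30.0  # runnable goroutines per CPU, from the table above


def longest_overload_seconds(
    samples: Iterable[Tuple[float, float]],
    threshold: float = OVERLOAD_THRESHOLD,
) -> float:
    """Return the longest continuous span (seconds) with value > threshold.

    `samples` is an iterable of (timestamp_seconds, value) pairs in
    increasing time order.
    """
    longest = 0.0
    run_start: Optional[float] = None  # where the current overloaded run began
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp
            longest = max(longest, timestamp - run_start)
        else:
            run_start = None
    return longest


if __name__ == "__main__":
    series = [(0, 10.0), (10, 35.0), (20, 40.0), (30, 32.0), (40, 5.0)]
    print(longest_overload_seconds(series))
```

A dashboard or alert rule would compare the result with its own tolerance, for example treating a span of a few seconds as a latency concern and a span of tens of seconds as a potential stability concern.
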
{% if include.deployment == 'self-hosted' %}
| CockroachDB Metric Name | Datadog Integration Metric Name (add `cockroachdb.` prefix) | Description | Usage |
|---|---|---|---|
| rpc.connection.avg_round_trip_latency | rpc.connection.avg_round_trip_latency | Sum of exponentially weighted moving average of round-trip latencies, as measured through a gRPC RPC. Dividing this gauge by rpc.connection.healthy gives an approximation of average latency, but the top-level round-trip-latency histogram is more useful. Instead, users should consult the label families of this metric if they are available (which requires Prometheus and the cluster setting server.child_metrics.enabled); these provide per-peer moving averages. This metric does not track failed connections. A failed connection's contribution is reset to zero. | This metric is helpful in understanding general network issues outside of CockroachDB that could be impacting the user’s workload. |
| rpc.connection.failures | rpc.connection.failures.count | Counter of failed connections. This includes both the event in which a healthy connection terminates as well as unsuccessful reconnection attempts. Connections that are terminated as part of local node shutdown are excluded. Decommissioned peers are excluded. | See Description. |
| rpc.connection.healthy | rpc.connection.healthy | Gauge of current connections in a healthy state (i.e., bidirectionally connected and heartbeating). | See Description. |
| rpc.connection.healthy_nanos | rpc.connection.healthy_nanos | Gauge of nanoseconds of healthy connection time. On the Prometheus endpoint scraped when the cluster setting server.child_metrics.enabled is set, this gauge allows you to see the duration for which a given peer has been connected in a healthy state. | This can be useful for monitoring the stability and health of connections within your CockroachDB cluster. |
| rpc.connection.heartbeats | rpc.connection.heartbeats.count | Counter of successful heartbeats. | See Description. |
| rpc.connection.unhealthy | rpc.connection.unhealthy | Gauge of current connections in an unhealthy state (not bidirectionally connected or heartbeating). | If the value of this metric is greater than 0, this could indicate a network partition. |
| rpc.connection.unhealthy_nanos | rpc.connection.unhealthy_nanos | Gauge of nanoseconds of unhealthy connection time. On the Prometheus endpoint scraped when the cluster setting server.child_metrics.enabled is set, this gauge allows you to see the duration for which a given peer has been unreachable. | If this duration is greater than 0, this could indicate how long a network partition has been occurring. |
{% endif %}
{% if include.deployment == 'self-hosted' %}
|
|
rebalancing.queriespersecond) or CPU usage (rebalancing.cpunanospersecond), depending on the value of the kv.allocator.load_based_rebalancing.objective [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates that more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (rebalancing.queriespersecond) or CPU usage (rebalancing.cpunanospersecond), depending on the value of the kv.allocator.load_based_rebalancing.objective [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates that more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the KV-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram that maintains buckets that can be queried for, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also: rebalancing_replicas_cpunanospersecond. |
| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram that maintains buckets that can be queried for, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also the non-histogram variant: rebalancing.cpunanospersecond. |
| rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
| rebalancing.cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of rebalancing_replicas_cpunanospersecond. | See usage of rebalancing_replicas_cpunanospersecond. |
| ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
| replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
| ranges.underreplicated | ranges.underreplicated | Number of ranges with fewer live replicas than the replication target | This metric is an indicator of [replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). It shows whether the cluster has data that is not conforming to resilience goals. The next step is to determine the corresponding database object, such as the table or index, of these under-replicated ranges and whether the under-replication is temporarily expected. Use the statement `SELECT table_name, index_name FROM [SHOW RANGES WITH INDEXES] WHERE range_id = {id of under-replicated range};` |
| ranges.unavailable | ranges.unavailable | Number of ranges with fewer live replicas than needed for quorum | This metric is an indicator of [replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). It shows whether the cluster is unhealthy and can impact workload. If an entire range is unavailable, then it will be unable to process queries. |
| queue.replicate.replacedecommissioningreplica.error | {% if include.deployment == 'self-hosted' %}queue.replicate.replacedecommissioningreplica.error.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of failed decommissioning replica replacements processed by the replicate queue | Refer to [Decommission the node]({% link {{ page.version.version }}/node-shutdown.md %}?filters=decommission#decommission-the-node). |
| range.splits | {% if include.deployment == 'self-hosted' %}range.splits.total |{% elsif include.deployment == 'advanced' %}range.splits |{% endif %} Number of range splits | This metric indicates how fast a workload is scaling up. Spikes can indicate resource hot spots since the [split heuristic is based on QPS]({% link {{ page.version.version }}/load-based-splitting.md %}#control-load-based-splitting-threshold). To understand whether hot spots are an issue and with which tables and indexes they are occurring, correlate this metric with other metrics, such as the CPU usage metric sys.cpu.combined.percent-normalized, or use the [Hot Ranges page]({% link {{ page.version.version }}/ui-hot-ranges-page.md %}). |
| range.merges | {% if include.deployment == 'self-hosted' %}range.merges.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of range merges | This metric indicates how fast a workload is scaling down. Merges are a CockroachDB performance optimization. A nonzero value indicates that there have been deletes in the workload. |
|
SHOW FULL TABLE SCANS or the [SQL Activity Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}) with the corresponding metric time frame. The Statements page also includes [explain plans]({% link {{ page.version.version }}/ui-statements-page.md %}#explain-plans) and [index recommendations]({% link {{ page.version.version }}/ui-statements-page.md %}#insights). Not all full scans are necessarily bad, especially over smaller tables. |
| sql.insert.count | sql.insert.count | Number of SQL INSERT statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the [SQL Activity pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate interesting outliers or patterns. For example, on the [Transactions page]({% link {{ page.version.version }}/ui-transactions-page.md %}) and the [Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}), sort on the Execution Count column. To find problematic sessions, on the [Sessions page]({% link {{ page.version.version }}/ui-sessions-page.md %}), sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. |
| sql.update.count | sql.update.count | Number of SQL UPDATE statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the [SQL Activity pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate interesting outliers or patterns. For example, on the [Transactions page]({% link {{ page.version.version }}/ui-transactions-page.md %}) and the [Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}), sort on the Execution Count column. To find problematic sessions, on the [Sessions page]({% link {{ page.version.version }}/ui-sessions-page.md %}), sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. |
| sql.delete.count | sql.delete.count | Number of SQL DELETE statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the [SQL Activity pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate interesting outliers or patterns. For example, on the [Transactions page]({% link {{ page.version.version }}/ui-transactions-page.md %}) and the [Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}), sort on the Execution Count column. To find problematic sessions, on the [Sessions page]({% link {{ page.version.version }}/ui-sessions-page.md %}), sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. |
| sql.select.count | sql.select.count | Number of SQL SELECT statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the [SQL Activity pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate interesting outliers or patterns. For example, on the [Transactions page]({% link {{ page.version.version }}/ui-transactions-page.md %}) and the [Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}), sort on the Execution Count column. To find problematic sessions, on the [Sessions page]({% link {{ page.version.version }}/ui-sessions-page.md %}), sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. |
| sql.ddl.count | sql.ddl.count | Number of SQL DDL statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the [SQL Activity pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate interesting outliers or patterns. For example, on the [Transactions page]({% link {{ page.version.version }}/ui-transactions-page.md %}) and the [Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}), sort on the Execution Count column. To find problematic sessions, on the [Sessions page]({% link {{ page.version.version }}/ui-sessions-page.md %}), sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. |
| sql.txn.begin.count | sql.txn.begin.count | Number of SQL transaction BEGIN statements successfully executed | This metric reflects workload volume by counting explicit [transactions]({% link {{ page.version.version }}/transactions.md %}). Use this metric to determine whether explicit transactions can be refactored as implicit transactions (individual statements). |
| sql.txn.commit.count | sql.txn.commit.count | Number of SQL transaction COMMIT statements successfully executed | This metric shows the number of [transactions]({% link {{ page.version.version }}/transactions.md %}) that completed successfully. This metric can be used as a proxy to measure the number of successful explicit transactions. |
| sql.txn.rollback.count | sql.txn.rollback.count | Number of SQL transaction ROLLBACK statements successfully executed | This metric shows the number of orderly transaction [rollbacks]({% link {{ page.version.version }}/rollback-transaction.md %}). A persistently high number of rollbacks may negatively impact the workload performance and needs to be investigated. |
| sql.txn.abort.count | sql.txn.abort.count | Number of SQL transaction abort errors | This high-level metric reflects workload performance. A persistently high number of SQL transaction abort errors may negatively impact the workload performance and needs to be investigated. |
| sql.service.latency-p90, sql.service.latency-p99 | sql.service.latency | Latency of SQL request execution | These high-level metrics reflect workload performance. Monitor these metrics to understand latency over time. If abnormal patterns emerge, apply the metric's time range to the [SQL Activity pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate interesting outliers or patterns. The [Statements page]({% link {{ page.version.version }}/ui-statements-page.md %}) has P90 Latency and P99 Latency columns to enable correlation with this metric. |
| sql.txn.latency-p90, sql.txn.latency-p99 | sql.txn.latency | Latency of SQL transactions | These high-level metrics provide a latency histogram of all executed SQL transactions. These metrics provide an overview of the current SQL workload. |
| txnwaitqueue.deadlocks_total | {% if include.deployment == 'self-hosted' %}txnwaitqueue.deadlocks.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of deadlocks detected by the transaction wait queue | Alert on this metric if its value is greater than zero, especially if transaction throughput is lower than expected. Applications should be able to detect and recover from deadlock errors. However, transaction performance and throughput can be maximized if the application logic avoids deadlock conditions in the first place, for example, by keeping transactions as short as possible. |
| sql.distsql.contended_queries.count | {% if include.deployment == 'self-hosted' %}sql.distsql.contended.queries |{% elsif include.deployment == 'advanced' %} sql.distsql.contended.queries |{% endif %} Number of SQL queries that experienced contention | This metric is incremented whenever a statement experiences a non-trivial amount of contention, whether from read-write or write-write conflicts. Monitor this metric to correlate possible workload performance issues to contention conflicts. |
| sql.conn.failures | sql.conn.failures.count | Number of SQL connection failures | This metric is incremented whenever a connection attempt fails for any reason, including timeouts. |
| sql.conn.latency-p90, sql.conn.latency-p99 | sql.conn.latency | Latency to establish and authenticate a SQL connection | These metrics characterize the database connection latency, which can affect application performance, for example, by causing slow startup times. Connection failures are not recorded in these metrics. |
| txn.restarts.serializable | txn.restarts.serializable | Number of restarts due to a forwarded commit timestamp and isolation=SERIALIZABLE | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review [transaction contention best practices]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and [performance tuning recipes]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention). Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated. |
| txn.restarts.writetooold | txn.restarts.writetooold | Number of restarts due to a concurrent writer committing first | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review [transaction contention best practices]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and [performance tuning recipes]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention). Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated. |
| txn.restarts.writetoooldmulti | {% if include.deployment == 'self-hosted' %}txn.restarts.writetoooldmulti.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of restarts due to multiple concurrent writers committing first | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review [transaction contention best practices]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and [performance tuning recipes]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention). Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated. |
| txn.restarts.unknown | {% if include.deployment == 'self-hosted' %}txn.restarts.unknown.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of restarts due to unknown reasons | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review [transaction contention best practices]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and [performance tuning recipes]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention). Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated. |
| txn.restarts.txnpush | {% if include.deployment == 'self-hosted' %}txn.restarts.txnpush.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of restarts due to a transaction push failure | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review [transaction contention best practices]({% link {{ page.version.version }}/performance-best-practices-overview.md %}#transaction-contention) and [performance tuning recipes]({% link {{ page.version.version }}/performance-recipes.md %}#transaction-contention). Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should be investigated. |
| txn.restarts.txnaborted | {% if include.deployment == 'self-hosted' %}txn.restarts.txnaborted.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of restarts due to an abort by a concurrent transaction | The errors tracked by this metric are generally due to deadlocks. Deadlocks can often be prevented with a considered transaction design. Identify the conflicting transactions involved in the deadlocks, then, if possible, redesign the business logic implementation prone to deadlocks. |
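
The "tens of restarts per minute" guidance above can be checked by converting two samples of a restart metric into a per-minute rate. The following is a minimal sketch, assuming the metric is scraped as a cumulative counter; the function names and the threshold of 10 are hypothetical and should be tuned to the workload.

```python
# Hypothetical threshold: "tens of restarts per minute" may indicate contention.
RESTARTS_PER_MINUTE_THRESHOLD = 10.0


def restarts_per_minute(prev_count, curr_count, interval_seconds):
    """Convert two samples of a cumulative counter into a per-minute rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # The counter went backwards, which usually means the node restarted.
        # Treat the current value as the increase since the reset.
        delta = curr_count
    return delta * 60.0 / interval_seconds


def is_elevated(rate, threshold=RESTARTS_PER_MINUTE_THRESHOLD):
    """Return True when the restart rate meets or exceeds the threshold."""
    return rate >= threshold


# Example: 30 restarts observed over a 2-minute scrape gap.
rate = restarts_per_minute(100, 130, 120)
print(rate, is_elevated(rate))
```

In a Prometheus-based setup, the same calculation is normally done in the query layer rather than in application code.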
|
|
0 until it completes a backup. If all nodes are restarted, max() is 0 until a node completes a backup. To make use of this metric, first take the maximum on each node over a rolling window equal to or greater than the backup frequency, then take the maximum of those values across nodes. For example, with a backup frequency of 60 minutes, monitor
time() - max_across_nodes(max_over_time(schedules_BACKUP_last_completed_time, 60min)). |
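
The two-step maximum described above can be sketched in plain Python. This is an illustration under stated assumptions, not a monitoring integration: `samples_by_node` is a hypothetical mapping from node name to the Unix timestamps sampled from the metric during the rolling window, and returning `None` when no backup has been observed is a choice made for this sketch.

```python
import time


def seconds_since_last_backup(samples_by_node, now=None):
    """Return seconds since the most recent completed backup, or None.

    samples_by_node maps a node name to the metric values (Unix timestamps,
    0 after a restart) sampled over a window >= the backup frequency.
    """
    if now is None:
        now = time.time()
    # Step 1: maximum over the rolling window, per node.
    per_node_max = [max(samples) for samples in samples_by_node.values() if samples]
    if not per_node_max:
        return None
    # Step 2: maximum of those values across nodes.
    last_completed = max(per_node_max)
    if last_completed == 0:
        # Every node reported 0: no completed backup was seen in the window.
        return None
    return now - last_completed


# Example: node n2 restarted and still reports 0; n1 completed a backup at t=1000.
print(seconds_since_last_backup({"n1": [0, 1000], "n2": [0, 0]}, now=1060))
```

Taking the per-node window maximum first means that a node which restarted after completing a backup does not mask that backup with its post-restart value of 0.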
If [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) are created in a CockroachDB cluster, monitor these additional metrics in your custom dashboards:
|
If [Row-Level TTL]({% link {{ page.version.version }}/row-level-ttl.md %}) is configured for any table in a CockroachDB cluster, monitor these additional metrics in your custom dashboards:
|
ttl_cron setting that was chosen. If this metric is zero, the job is not running. |
| jobs.row_level_ttl.resume_failed | {% if include.deployment == 'self-hosted' %}jobs.row.level.ttl.resume_failed.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of row_level_ttl jobs which failed with a non-retryable error | This metric should remain at zero. Repeated errors mean the Row Level TTL job is not deleting data. |
| jobs.row_level_ttl.rows_selected | {% if include.deployment == 'self-hosted' %}jobs.row.level.ttl.rows_selected.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of rows selected for deletion by the row level TTL job. | Correlate this metric with the metric jobs.row_level_ttl.rows_deleted to ensure all the rows that should be deleted are actually getting deleted. |
| jobs.row_level_ttl.rows_deleted | {% if include.deployment == 'self-hosted' %}jobs.row.level.ttl.rows_deleted.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of rows deleted by the row level TTL job. | Correlate this metric with the metric jobs.row_level_ttl.rows_selected to ensure all the rows that should be deleted are actually getting deleted. |
| jobs.row_level_ttl.currently_paused | {% if include.deployment == 'self-hosted' %}jobs.row.level.ttl.currently_paused |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of row_level_ttl jobs currently considered Paused | Monitor this metric to ensure the Row Level TTL job does not remain paused inadvertently for an extended period. |
| jobs.row_level_ttl.currently_running | {% if include.deployment == 'self-hosted' %}jobs.row.level.ttl.currently_running |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of row_level_ttl jobs currently running | Monitor this metric to ensure there are not too many Row Level TTL jobs running at the same time. Generally, this metric should be in the low single digits. |
| schedules.scheduled-row-level-ttl-executor.failed | {% if include.deployment == 'self-hosted' %}schedules.scheduled.row.level.ttl.executor_failed.count |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of scheduled-row-level-ttl-executor jobs failed | Monitor this metric to ensure the Row Level TTL job is running. If it is non-zero, it means the job could not be created. |
| jobs.row_level_ttl.span_total_duration | NOT AVAILABLE | Duration for processing a span during row level TTL. | See Description. |
| jobs.row_level_ttl.select_duration | NOT AVAILABLE | Duration for select requests during row level TTL. | See Description. |
| jobs.row_level_ttl.delete_duration | NOT AVAILABLE | Duration for delete requests during row level TTL. | See Description. |
| jobs.row_level_ttl.num_active_spans | NOT AVAILABLE | Number of active spans the TTL job is deleting from. | See Description. |
| jobs.row_level_ttl.total_rows | NOT AVAILABLE | Approximate number of rows on the TTL table. | See Description. |
| jobs.row_level_ttl.total_expired_rows | NOT AVAILABLE | Approximate number of rows that have expired the TTL on the TTL table. | See Description. |
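
One way to correlate jobs.row_level_ttl.rows_selected with jobs.row_level_ttl.rows_deleted, as suggested in the table above, is to compare how much each counter increased over the same interval. The sketch below assumes both increases have already been computed; the function names are hypothetical.

```python
def ttl_deletion_gap(selected_delta, deleted_delta):
    """Rows selected for deletion but not deleted during the interval."""
    return max(selected_delta - deleted_delta, 0)


def ttl_deletion_ratio(selected_delta, deleted_delta):
    """Fraction of selected rows that were deleted during the interval."""
    if selected_delta == 0:
        # Nothing was selected, so there was nothing left undeleted.
        return 1.0
    return deleted_delta / selected_delta


# Example: 1000 rows selected and 950 rows deleted in the same interval.
print(ttl_deletion_gap(1000, 950), ttl_deletion_ratio(1000, 950))
```

A ratio that stays well below 1 over several intervals suggests that selected rows are not being deleted and is worth investigating.
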
- [Available Metrics]({% link {{ page.version.version }}/metrics.md %}#available-metrics)
- [Monitor CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %})
- [Visualize metrics in Grafana]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}#step-5-visualize-metrics-in-grafana)
- [Custom Chart Debug Page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %})
- [Cluster API]({% link {{ page.version.version }}/cluster-api.md %})
- [Essential Alerts]({% link {{ page.version.version }}/essential-alerts-{{ include.deployment}}.md %})
- CockroachDB Source Code - DB Console metrics to graphs mappings (in *.tsx files)