detect/uptime-monitoring/heartbeat-monitors/overview.mdx
2 additions & 0 deletions
@@ -78,6 +78,8 @@ Heartbeat monitors provide different metrics and insights than other types of ch
- **Alert Timeline**: When alerts were triggered and resolved
- **Source Tracking**: Which systems or processes sent pings

+Heartbeat metrics are also available via the [Prometheus V2 integration](/integrations/observability/prometheus-v2#heartbeat-metrics), including dead man's switch gauges for alerting in Grafana when a ping is overdue.
+
<Warning>
Remember: Heartbeat monitors detect when jobs fail to complete, but they can't tell you why a job failed. Combine heartbeat monitoring with application logging and error tracking for complete observability.
| `checkly_icmp_monitor_timing_seconds` | Histogram | The timing phases of the ICMP monitor, including DNS resolution and latency measurements (avg, min, max). |
+| `checkly_icmp_monitor_packet_loss_ratio` | Gauge | The average packet loss ratio (0-1) for the ICMP monitor. |
+| `checkly_dns_monitor_timing_seconds` | Histogram | The total duration of the DNS monitor. |
| `checkly_multistep_check_duration_seconds` | Histogram | The total check duration. This includes all requests done and any waits. |
+| `checkly_heartbeat_last_success_timestamp_seconds` | Gauge | The Unix timestamp of the last successful heartbeat ping. A value of `0` means no successful ping has ever been received. |
+| `checkly_heartbeat_expected_interval_seconds` | Gauge | The configured maximum interval between heartbeat pings, in seconds. Derived from the heartbeat monitor's period and period unit settings. |
| `checkly_time_to_ssl_expiry_seconds` | Gauge | The amount of time remaining before the SSL certificate of the monitored domain expires. See the [SSL certificate expiration docs](/alerting-and-retries/ssl-expiration/) for more information on monitoring SSL certificates with checks. |

The `checkly_check_status` and `checkly_check_result_total` metrics contain a `status` label with values `passing`, `failing`, and `degraded`.
`checkly_check_status` can be useful for viewing the current status of a check, whereas `checkly_check_result_total` can be useful for calculating overall statistics.
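
As a rough illustration of that distinction, the queries below sketch one way to use each metric. They assume that `checkly_check_status` reports `1` on the series whose `status` label matches the check's current state, and that `checkly_check_result_total` can be summed per `status`; neither detail is spelled out here, so treat both as assumptions to verify against your own data.

```promql
# Assumption: a value of 1 marks the series for the check's current status.
checkly_check_status{status="failing"} == 1

# Share of failing results per check, computed from the result totals.
sum by (name) (checkly_check_result_total{status="failing"})
  / sum by (name) (checkly_check_result_total)
```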
-The metrics `checkly_browser_check_web_vitals_seconds`, `checkly_browser_check_errors`, and `checkly_api_check_timing_seconds` contain a `type` label.
+The metrics `checkly_browser_check_web_vitals_seconds`, `checkly_browser_check_errors`, `checkly_api_check_timing_seconds`, `checkly_tcp_monitor_timing_seconds`, `checkly_icmp_monitor_timing_seconds`, and `checkly_dns_monitor_timing_seconds` contain a `type` label.
This label indicates the different Web Vitals, error types, and timing phases being measured.

+For `checkly_tcp_monitor_timing_seconds`, the `type` label has the following values: `dns`, `connection`, `data`, and `total`.
+For `checkly_icmp_monitor_timing_seconds`, the `type` label has the following values: `dns` (DNS resolution time), `latency_avg`, `latency_min`, and `latency_max`.
+For `checkly_dns_monitor_timing_seconds`, the `type` label has the value `total`.
+
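
The `type` values listed above can be used to break a histogram down by timing phase. The sketch below follows the same averaging pattern as the Web Vitals example elsewhere on this page, applied to the TCP monitor; `"Check Name"` is a placeholder, and the `_sum` and `_count` series are the standard Prometheus histogram companions.

```promql
# Average duration of each TCP timing phase (dns, connection, data, total) over 30 minutes.
sum by(type) (rate(checkly_tcp_monitor_timing_seconds_sum{name="Check Name"}[30m]))
  / sum by(type) (rate(checkly_tcp_monitor_timing_seconds_count{name="Check Name"}[30m]))
```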
`checkly_time_to_ssl_expiry_seconds` contains a `domain` label giving the domain of the monitored SSL certificate.

In addition, the check metrics all contain the following labels:
@@ -85,7 +95,7 @@ In addition, the check metrics all contain the following labels:
|-------|-------------|
| `name` | The name of the check. |
| `check_id` | The unique UUID of the check. |
-| `check_type` | Either `api` or `browser`. |
+| `check_type` | The type of check: `api`, `browser`, `multi_step`, `playwright`, `heartbeat`, `tcp`, `icmp`, `dns`, `url`. |
| `muted` | Whether the check is muted (configured not to send alerts). |
| `activated` | Whether the check is activated. Deactivated checks aren't run. |
| `group` | The name of the check group. |
@@ -100,7 +110,7 @@ In addition, the check metrics all contain the following labels:

To avoid creating a high volume of metrics, the metrics don't include a label for the check run location by default. To enable it, add the query parameter `locationLabelEnabled=true` to your API request. This adds a `location` label giving the location where each check ran.

-Since check status and SSL days remaining is only tracked on a per-check basis rather than by location, `checkly_check_status` and `checkly_time_to_ssl_expiry_seconds` do not have the `location` label included. All other [check metrics](#check-metrics) will have the `location` label added.
+Since check status and SSL days remaining are only tracked on a per-check basis rather than by location, `checkly_check_status` and `checkly_time_to_ssl_expiry_seconds` do not have the `location` label included. Heartbeat metrics (`checkly_heartbeat_last_success_timestamp_seconds`, `checkly_heartbeat_expected_interval_seconds`, and heartbeat entries in `checkly_check_result_total`) also do not have a `location` label, since heartbeat monitors receive pings rather than running from specific locations. All other [check metrics](#check-metrics) will have the `location` label added.

Here is an example of how to set this in your `prometheus.yml` config:

@@ -146,6 +156,49 @@ The different histogram metrics can all be used to compute averages. For example
sum by(type) (rate(checkly_browser_check_web_vitals_seconds_sum{name="Check Name"}[30m])) / sum by(type) (rate(checkly_browser_check_web_vitals_seconds_count{name="Check Name"}[30m]))
```

+## Heartbeat Metrics
+
+[Heartbeat monitors](/detect/uptime-monitoring/heartbeat-monitors/overview) are passive checks that wait for your scheduled jobs and automated processes to send a ping. The Prometheus exporter includes heartbeat checks in the standard `checkly_check_status` and `checkly_check_result_total` metrics, plus two heartbeat-specific gauges designed for [dead man's switch](https://en.wikipedia.org/wiki/Dead_man%27s_switch) alerting.
+
+- **`checkly_heartbeat_last_success_timestamp_seconds`**: The Unix timestamp of the last successful ping, searched across the entire event history (no time window). A value of `0` means no successful ping has ever been received. Because `time() - 0` equals the full current Unix time, a heartbeat that has never pinged exceeds any interval threshold and immediately triggers a dead man's switch alert.
+- **`checkly_heartbeat_expected_interval_seconds`**: The maximum expected time between pings, derived from the heartbeat monitor's configured period and period unit (e.g., `5 minutes` = `300` seconds). Use this as a threshold for alerting.
+
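
To make the two gauges concrete, the queries below sketch how the last-success timestamp might be inspected on its own. They use only the metric names documented above and the built-in PromQL `time()` function.

```promql
# Seconds elapsed since the last successful ping, per heartbeat monitor.
time() - checkly_heartbeat_last_success_timestamp_seconds

# Heartbeat monitors that have never received a successful ping.
checkly_heartbeat_last_success_timestamp_seconds == 0
```

The first query is convenient for a Grafana panel showing ping age; the second isolates monitors that were created but never pinged.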
+### How heartbeats map to `checkly_check_result_total`
+
+Heartbeat checks contribute to `checkly_check_result_total` using a 1-hour count window, the same as other check types. The `status` label values map to heartbeat event states as follows:
+
+| `status` label | Heartbeat states counted |
+|----------------|--------------------------|
+| `passing` | `RECEIVED`, `EARLY`, `GRACE`, `LATE` (any ping that arrived) |
+| `failing` | `FAILING` (no ping received before the grace period expired) |
+| `degraded` | Always `0`, because heartbeat monitors do not have a degraded state |
+
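
Based on the mapping above, a query along the following lines could surface heartbeats that missed at least one ping within the count window. It is a sketch that assumes the `check_type="heartbeat"` label value from the labels table and that the per-status counts can be compared directly against zero.

```promql
# Heartbeat monitors with at least one missed ping in the count window.
sum by (name) (checkly_check_result_total{check_type="heartbeat", status="failing"}) > 0
```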
+### Dead Man's Switch PromQL Examples
+
+#### Alert when a heartbeat ping is overdue
+
+Compare the current time against the last success timestamp. If the difference exceeds the expected interval, the job has missed its window:
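
A minimal sketch of that comparison follows. It assumes both heartbeat gauges are exported with identical label sets, so that Prometheus can match them one-to-one; if the label sets differ in your setup, a vector matching modifier such as `on(check_id)` would be needed.

```promql
# Returns a series for every heartbeat whose last successful ping is older than its expected interval.
# Assumption: both gauges carry the same labels, so default one-to-one matching applies.
time() - checkly_heartbeat_last_success_timestamp_seconds
  > checkly_heartbeat_expected_interval_seconds
```

Since a never-pinged heartbeat reports `0`, the left-hand side becomes the current Unix time and the expression matches it as well.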
The Prometheus exporter also contains metrics for monitoring [Private Locations](/platform/private-locations/overview). These metrics can be used to ensure that your Private Locations have enough Checkly Agent instances running to execute all of your checks.