Commit d343e63

Merge pull request #186 from checkly/feat/prom-docs-icmp
docs: add and rename Prometheus V2 metrics (ICMP, DNS, TCP, heartbeat)
2 parents 45e5f31 + 140a188 commit d343e63

4 files changed

Lines changed: 70 additions & 15 deletions

concepts/metrics.mdx

Lines changed: 8 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -16,7 +16,7 @@ Response time metrics measure how quickly your services respond to requests, pro
1616

1717
**Unit:** Milliseconds
1818
**Precision:** 2 decimal places
19-
**Available for:** API, Browser, Heartbeat, TCP, Multistep, URL checks
19+
**Available for:** API, Browser, TCP, DNS, Multistep, URL checks
2020

2121
<Accordion title="Use cases">
2222
- Establishing performance baselines
@@ -31,7 +31,7 @@ Response time metrics measure how quickly your services respond to requests, pro
3131

3232
**Unit:** Milliseconds
3333
**Precision:** 2 decimal places
34-
**Available for:** API, Browser, Heartbeat, TCP, Multistep, URL checks
34+
**Available for:** API, Browser, TCP, DNS, Multistep, URL checks
3535

3636
<Accordion title="Use cases">
3737
- SLA monitoring and compliance
@@ -46,7 +46,7 @@ Response time metrics measure how quickly your services respond to requests, pro
4646

4747
**Unit:** Milliseconds
4848
**Precision:** 2 decimal places
49-
**Available for:** API, Browser, Heartbeat, TCP, Multistep, URL checks
49+
**Available for:** API, Browser, TCP, DNS, Multistep, URL checks
5050

5151
<Accordion title="Use cases">
5252
- Understanding worst-case user experience
@@ -60,7 +60,7 @@ Response time metrics measure how quickly your services respond to requests, pro
6060

6161
**Unit:** Milliseconds
6262
**Precision:** 2 decimal places
63-
**Available for:** API, Browser, Heartbeat, TCP, Multistep, URL checks
63+
**Available for:** API, Browser, TCP, DNS, Multistep, URL checks
6464

6565
<Accordion title="Use cases">
6666
- Understanding typical user experience
@@ -75,7 +75,7 @@ Response time metrics measure how quickly your services respond to requests, pro
7575

7676
**Unit:** Milliseconds
7777
**Precision:** 2 decimal places
78-
**Available for:** API, Browser, Heartbeat, TCP, Multistep, URL checks
78+
**Available for:** API, Browser, TCP, DNS, Multistep, URL checks
7979

8080
<Accordion title="Use cases">
8181
- Understanding best-case performance
@@ -90,7 +90,7 @@ Response time metrics measure how quickly your services respond to requests, pro
9090

9191
**Unit:** Milliseconds
9292
**Precision:** 2 decimal places
93-
**Available for:** API, Browser, Heartbeat, TCP, Multistep, URL checks
93+
**Available for:** API, Browser, TCP, DNS, Multistep, URL checks
9494

9595
<Accordion title="Use cases">
9696
- Identifying performance spikes
@@ -188,7 +188,7 @@ Core Web Vitals are a set of metrics defined by Google that measure real-world u
188188
**Available for:** Browser, Multistep checks only
189189

190190
**Google's thresholds:**
191-
- Good: d 1.8 seconds
191+
- Good: ≤ 1.8 seconds
192192
- Needs improvement: 1.8 - 3.0 seconds
193193
- Poor: > 3.0 seconds
194194

@@ -401,7 +401,7 @@ These metrics are specific to heartbeat checks, which monitor the availability a
401401

402402
**Unit:** Milliseconds
403403
**Precision:** 2 decimal places
404-
**Available for:** API, Browser, URL, TCP checks (when hostname is used)
404+
**Available for:** API, Browser, URL, TCP, DNS, ICMP checks (when hostname is used)
405405

406406
<Accordion title="Use cases">
407407
- DNS performance monitoring

detect/uptime-monitoring/heartbeat-monitors/overview.mdx

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,8 @@ Heartbeat monitors provide different metrics and insights than other types of ch
7878
- **Alert Timeline**: When alerts were triggered and resolved
7979
- **Source Tracking**: Which systems or processes sent pings
8080

81+
Heartbeat metrics are also available via the [Prometheus V2 integration](/integrations/observability/prometheus-v2#heartbeat-metrics), including dead man's switch gauges for alerting in Grafana when a ping is overdue.
82+
8183
<Warning>
8284
Remember: Heartbeat monitors detect when jobs fail to complete, but they can't tell you why a job failed. Combine heartbeat monitoring with application logging and error tracking for complete observability.
8385
</Warning>

guides/uptime-monitoring.mdx

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -91,11 +91,11 @@ Another type of service you’ll want to monitor is one that doesn’t accept an
9191

9292
There’s not much to getting one to work: just set up a heartbeat in the Checkly UI or with the Checkly CLI. For example, the following code
9393

94-
```tsx UptimeMonitoring.check.ts
94+
```tsx UptimeMonitoring.check.ts
9595
//heartbeat.check.ts
96-
import { HeartbeatCheck } from 'checkly/constructs'
96+
import { HeartbeatMonitor } from 'checkly/constructs'
9797

98-
new HeartbeatCheck('heartbeat-check-1', {
98+
new HeartbeatMonitor('heartbeat-check-1', {
9999
name: 'Send weekly newsletter job',
100100
period: 7,
101101
periodUnit: 'days',

integrations/observability/prometheus-v2.mdx

Lines changed: 57 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -58,8 +58,14 @@ The following metrics are available to monitor checks:
5858
| `checkly_browser_check_errors` | Histogram | The errors encountered during a full browser session. |
5959
| `checkly_api_check_timing_seconds` | Histogram | The response time for the API request, as well as the duration of the different phases. |
6060
| `checkly_url_monitor_timing_seconds` | Histogram | The response time for the HTTP request, as well as the duration of the different phases. |
61-
| `checkly_tcp_check_timing_seconds` | Histogram | The response time for the TCP request, as well as the duration of the different phases. |
61+
| `checkly_tcp_monitor_timing_seconds` | Histogram | The response time for the TCP request, as well as the duration of the different phases. |
62+
| `checkly_tcp_check_timing_seconds` | Histogram | **Deprecated:** use `checkly_tcp_monitor_timing_seconds` instead. Emits identical values. |
63+
| `checkly_icmp_monitor_timing_seconds` | Histogram | The timing phases of the ICMP monitor, including DNS resolution and latency measurements (avg, min, max). |
64+
| `checkly_icmp_monitor_packet_loss_ratio` | Gauge | The average packet loss ratio (0-1) for the ICMP monitor. |
65+
| `checkly_dns_monitor_timing_seconds` | Histogram | The total duration of the DNS monitor. |
6266
| `checkly_multistep_check_duration_seconds` | Histogram | The total check duration. This includes all requests done and any waits. |
67+
| `checkly_heartbeat_last_success_timestamp_seconds` | Gauge | The Unix timestamp of the last successful heartbeat ping. A value of `0` means no successful ping has ever been received. |
68+
| `checkly_heartbeat_expected_interval_seconds` | Gauge | The configured maximum interval between heartbeat pings, in seconds. Derived from the heartbeat monitor's period and period unit settings. |
6369
| `checkly_time_to_ssl_expiry_seconds` | Gauge | The amount of time remaining before the SSL certificate of the monitored domain expires. See the [SSL certificate expiration docs](/alerting-and-retries/ssl-expiration/) for more information on monitoring SSL certificates with checks. |
6470

6571
The `checkly_check_status` and `checkly_check_result_total` metrics contain a `status` label with values `passing`, `failing`, and `degraded`.
@@ -74,9 +80,13 @@ checkly_check_status{name="Passing Browser Check",status="degraded"} 0
7480

7581
`checkly_check_status` can be useful for viewing the current status of a check, whereas `checkly_check_result_total` can be useful for calculating overall statistics.
7682

77-
The metrics `checkly_browser_check_web_vitals_seconds`, `checkly_browser_check_errors`, and `checkly_api_check_timing_seconds` contain a `type` label.
83+
The metrics `checkly_browser_check_web_vitals_seconds`, `checkly_browser_check_errors`, `checkly_api_check_timing_seconds`, `checkly_tcp_monitor_timing_seconds`, `checkly_icmp_monitor_timing_seconds`, and `checkly_dns_monitor_timing_seconds` contain a `type` label.
7884
This label indicates the different Web Vitals, error types, and timing phases being measured.
7985

86+
For `checkly_tcp_monitor_timing_seconds`, the `type` label has the following values: `dns`, `connection`, `data`, and `total`.
87+
For `checkly_icmp_monitor_timing_seconds`, the `type` label has the following values: `dns` (DNS resolution time), `latency_avg`, `latency_min`, and `latency_max`.
88+
For `checkly_dns_monitor_timing_seconds`, the `type` label has the value `total`.
89+
8090
`checkly_time_to_ssl_expiry_seconds` contains a `domain` label giving the domain of the monitored SSL certificate.
8191

8292
In addition, the check metrics all contain the following labels:
@@ -85,7 +95,7 @@ In addition, the check metrics all contain the following labels:
8595
|-------|-------------|
8696
| `name` | The name of the check. |
8797
| `check_id` | The unique UUID of the check. |
88-
| `check_type` | Either `api` or `browser`. |
98+
| `check_type` | The type of check: `api`, `browser`, `multi_step`, `playwright`, `heartbeat`, `tcp`, `icmp`, `dns`, `url`. |
8999
| `muted` | Whether the check is muted, configured to not send alerts. |
90100
| `activated` | Whether the check is activated. Deactivated checks aren't run. |
91101
| `group` | The name of the check group. |
@@ -100,7 +110,7 @@ In addition, the check metrics all contain the following labels:
100110

101111
To avoid creating a high volume of metrics, by default the metrics don't include a label for the check run location. It is possible to enable this by adding the query param `locationLabelEnabled=true` to your API request. This will add a `location` label giving the location where the checks ran.
102112

103-
Since check status and SSL days remaining is only tracked on a per-check basis rather than by location, `checkly_check_status` and `checkly_time_to_ssl_expiry_seconds` do not have the `location` label included. All other [check metrics](#check-metrics) will have the `location` label added.
113+
Since check status and SSL days remaining are only tracked on a per-check basis rather than by location, `checkly_check_status` and `checkly_time_to_ssl_expiry_seconds` do not include the `location` label. Heartbeat metrics (`checkly_heartbeat_last_success_timestamp_seconds`, `checkly_heartbeat_expected_interval_seconds`, and heartbeat entries in `checkly_check_result_total`) also do not have a `location` label, since heartbeat monitors receive pings rather than running from specific locations. All other [check metrics](#check-metrics) will have the `location` label added.
104114

105115
Here is an example for how to set this in your `prometheus.yml` config:
106116

@@ -146,6 +156,49 @@ The different histogram metrics can all be used to compute averages. For example
146156
sum by(type) (rate(checkly_browser_check_web_vitals_seconds_sum{name="Check Name"}[30m])) / sum by(type) (rate(checkly_browser_check_web_vitals_seconds_count{name="Check Name"}[30m]))
147157
```
148158

159+
## Heartbeat Metrics
160+
161+
[Heartbeat monitors](/detect/uptime-monitoring/heartbeat-monitors/overview) are passive checks that wait for your scheduled jobs and automated processes to send a ping. The Prometheus exporter includes heartbeat checks in the standard `checkly_check_status` and `checkly_check_result_total` metrics, plus two heartbeat-specific gauges designed for [dead man's switch](https://en.wikipedia.org/wiki/Dead_man%27s_switch) alerting.
162+
163+
- **`checkly_heartbeat_last_success_timestamp_seconds`** — The Unix timestamp of the last successful ping, searched across the entire event history (no time window). A value of `0` means no successful ping has ever been received — this immediately triggers a dead man's switch alert since `time() - 0` produces a very large value.
164+
- **`checkly_heartbeat_expected_interval_seconds`** — The maximum expected time between pings, derived from the heartbeat monitor's configured period and period unit (e.g., `5 minutes` = `300` seconds). Use this as a threshold for alerting.
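As a rough illustration of how a period and period unit could translate into the value of `checkly_heartbeat_expected_interval_seconds`, here is a minimal TypeScript sketch. The helper name `expectedIntervalSeconds` and the `PeriodUnit` type are hypothetical and not part of the Checkly exporter; the set of supported units is an assumption, and the sketch assumes the grace period is not included in the interval.

```typescript
// Hypothetical helper, not part of Checkly: converts a heartbeat period
// and unit into seconds. The list of units is an assumption.
type PeriodUnit = 'seconds' | 'minutes' | 'hours' | 'days'

const SECONDS_PER_UNIT: Record<PeriodUnit, number> = {
  seconds: 1,
  minutes: 60,
  hours: 3600,
  days: 86400,
}

function expectedIntervalSeconds(period: number, periodUnit: PeriodUnit): number {
  // Multiply the configured period by the number of seconds in one unit.
  return period * SECONDS_PER_UNIT[periodUnit]
}
```

With this sketch, a period of `5` with unit `minutes` yields `300`, matching the example in the list above.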
165+
166+
### How heartbeats map to `checkly_check_result_total`
167+
168+
Heartbeat checks contribute to `checkly_check_result_total` using a 1-hour count window, the same as other check types. The `status` label values map to heartbeat event states as follows:
169+
170+
| `status` label | Heartbeat states counted |
171+
|----------------|--------------------------|
172+
| `passing` | `RECEIVED`, `EARLY`, `GRACE`, `LATE` — any ping that arrived |
173+
| `failing` | `FAILING` — no ping received before the grace period expired |
174+
| `degraded` | Always `0` — heartbeat monitors do not have a degraded state |
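The mapping in the table can be sketched in TypeScript as follows. The function names `statusLabelFor` and `countByStatus` are hypothetical and only illustrate the table; they do not describe how the exporter is implemented.

```typescript
// Heartbeat event states and status labels as listed in the table above.
type HeartbeatState = 'RECEIVED' | 'EARLY' | 'GRACE' | 'LATE' | 'FAILING'
type StatusLabel = 'passing' | 'failing' | 'degraded'

// Hypothetical helper: any ping that arrived counts as passing,
// only FAILING counts as failing.
function statusLabelFor(state: HeartbeatState): StatusLabel {
  return state === 'FAILING' ? 'failing' : 'passing'
}

// Hypothetical helper: tallies events per status label.
// `degraded` stays at 0 because no heartbeat state maps to it.
function countByStatus(states: HeartbeatState[]): Record<StatusLabel, number> {
  const counts: Record<StatusLabel, number> = { passing: 0, failing: 0, degraded: 0 }
  for (const state of states) {
    counts[statusLabelFor(state)] += 1
  }
  return counts
}
```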
175+
176+
### Dead Man's Switch PromQL Examples
177+
178+
#### Alert when a heartbeat ping is overdue
179+
180+
Compare the current time against the last success timestamp. If the difference exceeds the expected interval, the job has missed its window:
181+
182+
```bash
183+
time() - checkly_heartbeat_last_success_timestamp_seconds > checkly_heartbeat_expected_interval_seconds
184+
```
185+
186+
#### Alert with a safety margin
187+
188+
Add a grace period multiplier (e.g., 1.5x) to avoid false alarms from minor delays:
189+
190+
```bash
191+
time() - checkly_heartbeat_last_success_timestamp_seconds > 1.5 * checkly_heartbeat_expected_interval_seconds
192+
```
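For readers who want to reason about the comparison outside PromQL, the two alert expressions above can be sketched as a small TypeScript function. The name `isHeartbeatOverdue` and its parameters are hypothetical; the inputs stand in for `time()`, `checkly_heartbeat_last_success_timestamp_seconds`, and `checkly_heartbeat_expected_interval_seconds`.

```typescript
// Hypothetical helper mirroring the PromQL comparison:
//   time() - last_success > margin * expected_interval
// A margin of 1 corresponds to the plain overdue alert.
function isHeartbeatOverdue(
  nowSeconds: number,
  lastSuccessSeconds: number,
  expectedIntervalSeconds: number,
  margin: number = 1,
): boolean {
  return nowSeconds - lastSuccessSeconds > margin * expectedIntervalSeconds
}
```

Note that a last success timestamp of `0` (no ping ever received) makes the difference equal to the current Unix time, so the function reports the heartbeat as overdue for any realistic interval.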
193+
194+
#### Grafana table of heartbeat health
195+
196+
Show how long ago each heartbeat was last seen:
197+
198+
```bash
199+
time() - checkly_heartbeat_last_success_timestamp_seconds
200+
```
201+
149202
## Private Location Metrics
150203

151204
The Prometheus exporter also contains metrics for monitoring [Private Locations](/platform/private-locations/overview). These metrics can be used to ensure that your Private Locations have enough Checkly Agent instances running to execute all of your checks.
