Description
The "WAL" health stat panel (Panel ID 612) in the cluster dashboard shows "Unsynced" (red) on idle databases where no WAL activity is occurring, even though the archiver is functioning correctly.
Root Cause
The panel query computes wall-clock seconds since the last WAL archival:
max((1 - cnpg_pg_replication_in_recovery) * (time() - timestamp(cnpg_pg_stat_archiver_seconds_since_last_archival) + cnpg_pg_stat_archiver_seconds_since_last_archival))
The thresholds are:
- Healthy: 0–360s
- Delayed: 360–900s
- Unsynced: 900s+
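To make the arithmetic concrete, here is a small Python model of the expression and the thresholds for a single instance. It is only a sketch: the function and parameter names are made up for illustration, and it assumes a value sitting exactly on a boundary (360 or 900) falls into the higher band.

```python
def elapsed_since_last_archival(now, sample_time, sampled_seconds, in_recovery):
    """Model of the panel expression for one instance (illustrative names).

    now             -> time()
    sample_time     -> timestamp(<metric>), i.e. when the metric was scraped
    sampled_seconds -> value of seconds_since_last_archival at scrape time
    in_recovery     -> 1 on a replica, 0 on the primary
    """
    # The age of the sample is added so the result keeps growing between scrapes.
    return (1 - in_recovery) * (now - sample_time + sampled_seconds)


def classify(seconds):
    """Map elapsed seconds onto the panel's three bands."""
    if seconds < 360:
        return "Healthy"
    if seconds < 900:
        return "Delayed"
    return "Unsynced"


# Primary, metric scraped 30s ago with a value of 20s -> 50s elapsed.
print(classify(elapsed_since_last_archival(1000, 970, 20, 0)))
# Idle primary whose last archival was many hours ago.
print(classify(elapsed_since_last_archival(1000, 970, 50000, 0)))
```

Replicas always evaluate to 0 because of the `(1 - in_recovery)` factor, so only the primary can push the panel out of "Healthy".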
On an idle database (e.g., staging/dev environments with low traffic), PostgreSQL's archive_timeout only forces a WAL segment switch when there are WAL records in the current segment. If no writes occur, no new WAL segments are produced, seconds_since_last_archival grows indefinitely, and the panel turns red — even though there is nothing wrong and no data is at risk.
Evidence
Observed across multiple CNPG clusters in a staging environment:
- cnpg_pg_stat_archiver_seconds_since_last_archival reached 50,000+ seconds on three clusters simultaneously
- cnpg_pg_stat_archiver_failed_count = 0 (no archive failures)
- cnpg_collector_pg_wal_archive_status{value="ready"} = 0 (no WALs pending)
- rate(cnpg_collector_wal_bytes[6h]) = 0 (zero WAL generation)
- After a single write transaction, archiving resumed immediately (seconds_since_last_archival dropped to ~20s, archived_count incremented)
This confirms the archiver is healthy — there was simply no WAL to archive.
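The same reasoning can be written down as a quick check. This is a hypothetical helper, not part of CNPG or the dashboard; the argument names simply mirror the three metrics listed above, and it deliberately treats any recorded failure as a reason not to call the archiver idle.

```python
def archiver_looks_idle(failed_count, ready_wals, wal_bytes_rate):
    """Return True when the metrics point to 'nothing to archive'.

    failed_count   -> cnpg_pg_stat_archiver_failed_count
    ready_wals     -> cnpg_collector_pg_wal_archive_status{value="ready"}
    wal_bytes_rate -> rate(cnpg_collector_wal_bytes[6h])
    """
    return failed_count == 0 and ready_wals == 0 and wal_bytes_rate == 0


# Values observed on the staging clusters.
print(archiver_looks_idle(failed_count=0, ready_wals=0, wal_bytes_rate=0.0))
# WAL files waiting while the timer grows would be a real problem.
print(archiver_looks_idle(failed_count=0, ready_wals=4, wal_bytes_rate=1024.0))
```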
Suggested Fix
Condition the panel query on whether there are actually WAL files pending archival (ready > 0). When no WALs are pending, the archiver has nothing to do, so the status should remain "Healthy" rather than degrading over time.
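For example, multiply by (ready > bool 0) so the result is 0 (Healthy) when no WALs are pending. The parentheses around the comparison matter: in PromQL, * binds tighter than >, so without them the whole product would be compared to 0 and the panel would only ever receive 0 or 1 instead of the elapsed seconds.
max(
  (1 - cnpg_pg_replication_in_recovery{namespace=~"$namespace", pod=~"$instances"})
  * (time() - timestamp(cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
     + cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
)
* on() group_left() (clamp_min(max(cnpg_collector_pg_wal_archive_status{namespace=~"$namespace", pod=~"$instances", value="ready"}), 0) > bool 0)
or
max(
  (1 - cnpg_pg_replication_in_recovery{namespace=~"$namespace", pod=~"$instances"}) * 0
)
The trailing "or" branch only takes effect when the ready series is missing entirely, in which case the panel falls back to 0 (Healthy).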
Alternatively, show "Idle" (green/neutral) when ready == 0 and seconds_since_last_archival > threshold.
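Both options can be sketched in Python to show the intended behavior. This models the logic only and is not the dashboard query; the function names and the `idle_after` parameter are hypothetical, and since the threshold for "Idle" is not fixed above, the sketch assumes the 360s "Delayed" boundary.

```python
def gated_seconds(elapsed, ready_wals):
    """Option 1: report 0 unless WAL files are waiting to be archived."""
    return elapsed if ready_wals > 0 else 0.0


def status_with_idle(elapsed, ready_wals, idle_after=360.0):
    """Option 2: add an 'Idle' state instead of hiding the elapsed time."""
    if ready_wals == 0 and elapsed > idle_after:
        return "Idle"  # nothing pending, so a long gap is not a failure
    if elapsed < 360:
        return "Healthy"
    if elapsed < 900:
        return "Delayed"
    return "Unsynced"


# Idle staging cluster from the evidence above: long gap, nothing pending.
print(gated_seconds(50000.0, 0), status_with_idle(50000.0, 0))
# Stuck archiver: long gap while WAL files are waiting.
print(gated_seconds(1200.0, 3), status_with_idle(1200.0, 3))
```

Option 1 keeps the panel's existing three states; option 2 keeps the elapsed time visible but needs an extra state (and a value mapping) in the panel.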
Happy to submit a PR if desired.
Environment
- CNPG Operator: 1.25.x
- PostgreSQL: 17
- Grafana: 11.x
- Dashboard: cloudnative-pg/grafana-dashboards cluster chart