WAL health panel shows 'Unsynced' on idle databases (false positive) #55

@christophebeling

Description

The "WAL" health stat panel (Panel ID 612) in the cluster dashboard shows "Unsynced" (red) on idle databases where no WAL activity is occurring, even though the archiver is functioning correctly.

Root Cause

The panel query computes wall-clock seconds since the last WAL archival:

max((1 - cnpg_pg_replication_in_recovery) * (time() - timestamp(cnpg_pg_stat_archiver_seconds_since_last_archival) + cnpg_pg_stat_archiver_seconds_since_last_archival))

The thresholds are:

  • Healthy: 0–360s
  • Delayed: 360–900s
  • Unsynced: 900s+

On an idle database (e.g., staging/dev environments with low traffic), PostgreSQL's archive_timeout only forces a WAL segment switch when the current segment contains WAL records. If no writes occur, no new WAL segments are produced, seconds_since_last_archival keeps growing until the next write, and the panel turns red, even though nothing is wrong and no data is at risk.

Evidence

Observed across multiple CNPG clusters in a staging environment:

  • cnpg_pg_stat_archiver_seconds_since_last_archival reached 50,000+ seconds on three clusters simultaneously
  • cnpg_pg_stat_archiver_failed_count = 0 (no archive failures)
  • cnpg_collector_pg_wal_archive_status{value="ready"} = 0 (no WALs pending)
  • rate(cnpg_collector_wal_bytes[6h]) = 0 (zero WAL generation)
  • After a single write transaction, archiving resumed immediately (seconds_since_last_archival dropped to ~20s, archived_count incremented)

This confirms the archiver is healthy — there was simply no WAL to archive.
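
The checks above can be combined into one ad hoc query when triaging a red panel. This is only a sketch: it assumes the three metrics carry matching namespace and pod labels (as the selectors in the panel query suggest) and reuses the panel's 900s "Unsynced" threshold. Any series it returns is a primary that looks stale but has nothing to archive and no recorded failures.

```promql
# Primaries whose last archival is older than the 900s "Unsynced" threshold
# (replicas are zeroed out by the in_recovery factor) ...
(
  (1 - cnpg_pg_replication_in_recovery)
  * cnpg_pg_stat_archiver_seconds_since_last_archival
) > 900
# ... that have no WAL files waiting for the archiver ...
and on (namespace, pod)
  cnpg_collector_pg_wal_archive_status{value="ready"} == 0
# ... and have recorded no archive failures.
and on (namespace, pod)
  cnpg_pg_stat_archiver_failed_count == 0
```

An empty result for a cluster with a red panel would point to a real archiving problem rather than idleness.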

Suggested Fix

Condition the panel query on whether there are actually WAL files pending archival (ready > 0). When no WALs are pending, the archiver has nothing to do, so the status should remain "Healthy" rather than degrading over time.

For example, multiply by (ready > bool 0) so the result is 0 (Healthy) when no WALs are pending. The comparison has to be parenthesized: in PromQL, * binds tighter than >, so without the parentheses the whole product is compared to 0 and the panel receives 0 or 1 instead of a number of seconds.

max(
  (1 - cnpg_pg_replication_in_recovery{namespace=~"$namespace", pod=~"$instances"})
  * (time() - timestamp(cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
     + cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
)
# 1 when at least one WAL file is ready for archival, 0 otherwise
* on() group_left() (
  max(cnpg_collector_pg_wal_archive_status{namespace=~"$namespace", pod=~"$instances", value="ready"}) > bool 0
)
or
# Fallback: report 0 if the product above returns no series
max(
  (1 - cnpg_pg_replication_in_recovery{namespace=~"$namespace", pod=~"$instances"}) * 0
)

Alternatively, show "Idle" (green/neutral) when ready == 0 and seconds_since_last_archival exceeds the threshold.
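
A rough sketch of the "Idle" variant follows. It is not a tested dashboard change. It assumes a sentinel value of -1, which would need a Grafana value mapping to the text "Idle" on the panel, and it takes 360s (the upper bound of "Healthy") as the threshold; both choices are illustrative.

```promql
# Branch 1: archival age is above 360s AND no WAL files are ready -> emit -1 ("Idle").
(
  (
    max(
      (1 - cnpg_pg_replication_in_recovery{namespace=~"$namespace", pod=~"$instances"})
      * (time() - timestamp(cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
         + cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
    ) > 360
    and on()
    max(cnpg_collector_pg_wal_archive_status{namespace=~"$namespace", pod=~"$instances", value="ready"}) == 0
  ) * 0 - 1
)
or
# Branch 2: otherwise report the archival age in seconds, as the current panel does.
max(
  (1 - cnpg_pg_replication_in_recovery{namespace=~"$namespace", pod=~"$instances"})
  * (time() - timestamp(cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
     + cnpg_pg_stat_archiver_seconds_since_last_archival{namespace=~"$namespace", pod=~"$instances"})
)
```

Both branches aggregate to a single series with no labels, so "or" returns the second branch only when the first is empty. The existing Healthy/Delayed/Unsynced thresholds then still apply whenever WAL files are actually pending.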

Happy to submit a PR if desired.

Environment

  • CNPG Operator: 1.25.x
  • PostgreSQL: 17
  • Grafana: 11.x
  • Dashboard: cloudnative-pg/grafana-dashboards cluster chart
