- ADR: ADR 0098
- Title: Patroni streaming replication with keepalived VIP providing automatic Postgres failover on a second VM (postgres-replica, VMID 151)
- Status: merged
- Branch: `codex/adr-0098-postgres-ha`
- Worktree: `../proxmox-host_server-postgres-ha`
- Owner: codex
- Depends On: `adr-0026-postgres-vm`, `adr-0064-health-probes`, `adr-0085-opentofu-vm-lifecycle`, `adr-0096-slo-tracking`, `adr-0097-alerting-routing`
- Conflicts With: none
- Shared Surfaces: `tofu/environments/production/`, `inventory/`, `inventory/group_vars/platform.yml`, `collections/ansible_collections/lv3/platform/roles/postgres_ha/`, `config/service-capability-catalog.json`
- add `postgres-replica` VM to `tofu/environments/production/main.tf` (VMID 151, clone of Debian 13 template)
- write Ansible role `postgres_ha`: installs Patroni, manages the PostgreSQL HA configuration, and ships Patroni metrics from both Postgres VMs
- write Ansible role `linux_keepalived`: installs keepalived with VIP `10.10.10.55` on both Postgres VMs
- write Ansible role `etcd_cluster_member`: provides the three-member DCS quorum on `postgres`, `postgres-replica`, and `monitoring`
- add `postgres-replica` to `inventory/hosts.yml` and the canonical guest source-of-truth files
- update all service roles that set Postgres connection strings to use `database.example.com`
- move the existing `database.example.com` DNS target to the HA VIP `10.10.10.55`
- update health probes, Uptime Kuma, and Grafana dashboards to reflect the VIP and Patroni role metrics
- write `docs/runbooks/postgres-failover.md`: switchover and failover procedures
- SLO automation for the Postgres VIP
- Connection pooling (PgBouncer) — separate concern for a future ADR
- Multi-master replication (streaming replication is primary → standby only)
- `tofu/environments/production/main.tf` (patched: VMID 151 added)
- `collections/ansible_collections/lv3/platform/roles/postgres_ha/`
- `collections/ansible_collections/lv3/platform/roles/linux_keepalived/`
- `collections/ansible_collections/lv3/platform/roles/etcd_cluster_member/`
- `inventory/hosts.yml` (patched: postgres-replica added)
- `inventory/host_vars/proxmox-host.yml`
- `inventory/group_vars/platform.yml`
- `config/health-probe-catalog.json` (patched)
- `config/uptime-kuma/monitors.json` (patched)
- `collections/ansible_collections/lv3/platform/roles/monitoring_vm/templates/lv3-platform-overview.json.j2`
- `collections/ansible_collections/lv3/platform/roles/monitoring_vm/templates/lv3-vm-detail.json.j2`
- `docs/runbooks/postgres-failover.md`
- `docs/adr/0098-postgres-high-availability.md`
- `docs/workstreams/adr-0098-postgres-ha.md`
- VMID 151 (`postgres-replica`) is running at `10.10.10.51`
- `patronictl -c /etc/patroni/patroni.yml list` shows two nodes: one Leader, one Replica
- VIP `10.10.10.55` is reachable through `database.example.com`
- Replication lag < 1 MB (Patroni failover threshold)
- All five dependent services (Keycloak, Windmill, NetBox, OpenBao, Mattermost) are connected to the VIP
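
The cluster-state expectations above (exactly one Leader, one Replica, replication lag under 1 MB) could be checked mechanically. The following is a minimal sketch, not part of the repo: the function name `cluster_problems` is hypothetical, and the input shape (a `members` list whose entries carry `name`, `role`, and a byte-valued `lag`) is an assumption modelled on Patroni's `GET /cluster` REST response that should be verified against the installed Patroni version.

```python
# Threshold from the list above: replication lag must stay under 1 MB.
MAX_LAG_BYTES = 1_048_576


def cluster_problems(cluster, max_lag_bytes=MAX_LAG_BYTES):
    """Return a list of problems found in a cluster document.

    `cluster` is assumed to look like {"members": [{"name": ..., "role": ...,
    "lag": ...}, ...]}. An empty result means the expectations hold.
    """
    members = cluster.get("members", [])
    leaders = [m for m in members if m.get("role") == "leader"]
    replicas = [m for m in members if m.get("role") == "replica"]

    problems = []
    if len(leaders) != 1:
        problems.append(f"expected exactly 1 leader, found {len(leaders)}")
    if len(replicas) != 1:
        problems.append(f"expected exactly 1 replica, found {len(replicas)}")

    for member in replicas:
        lag = member.get("lag")
        if not isinstance(lag, int):
            # A missing or non-numeric lag is treated as a failure, not as zero.
            problems.append(f"{member.get('name')}: lag is unknown ({lag!r})")
        elif lag >= max_lag_bytes:
            problems.append(f"{member.get('name')}: lag {lag} B >= {max_lag_bytes} B")
    return problems
```

Returning a list of findings rather than a single boolean keeps the output usable in a runbook or probe log, where the reason for a failed check matters as much as the failure itself.
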
- `patronictl list` shows both nodes healthy
- `psql -h database.example.com -U postgres -c "SELECT pg_is_in_recovery();"` → `f` (primary)
- Run planned switchover: `patronictl switchover postgres-ha --master postgres --candidate postgres-replica` → verify Keycloak, NetBox, Mattermost remain healthy within 30 seconds
- Switch back to original primary
- Verify replication lag metric appears in the Postgres HA Grafana dashboard
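
The "healthy within 30 seconds" check in the switchover step can be made repeatable with a small polling helper. This is a sketch under stated assumptions: `wait_until_healthy` and its parameters are hypothetical names, and `probe` stands in for whatever per-service health check is used (for example an HTTP probe against Keycloak); none of this is taken from the repo.

```python
import time


def wait_until_healthy(probe, deadline_s=30.0, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True.

    Returns the elapsed seconds on success, or None if another wait would
    exceed `deadline_s`. `clock` and `sleep` are injectable so the helper
    can be tested without real waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start + interval_s > deadline_s:
            return None
        sleep(interval_s)
```

Recording the elapsed time, rather than only pass or fail, gives a number to compare against the 30-second target each time the switchover is rehearsed.
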
- Both VMs provisioned and Patroni reports both as healthy
- VIP at `10.10.10.55` works and moves on switchover
- All five dependent services connect to VIP (not to bare IP)
- Replication lag Grafana panel shows data
- Planned switchover tested and services recovered within 30 seconds
- Repo implementation is merged; the remaining work is live rollout from `main`
- As of `2026-03-27`, production no longer has VMID `151` in `qm list`, so active production host targeting must not include `postgres-replica` until ADR 0098 is actually live-applied again.
- Run `tofu apply` before any Ansible work: the replica VM must exist before Patroni can be configured
- The keepalived `check_patroni_leader.sh` script must use `curl -sf http://localhost:8008/leader` (returns 200 if leader, 503 if not); do not use `patronictl` in the script, because it is too slow for keepalived's 2-second check interval
- When updating service connection strings to use the VIP DNS, do a rolling restart (one service at a time) rather than all at once; some services require a full container restart to pick up the new connection string
- After switchover, the old primary becomes a standby; if keepalived is not running on the old primary, the VIP will not move back on the next planned switchover; ensure keepalived starts automatically on both VMs
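
The contract that the keepalived check relies on (HTTP 200 from `/leader` means "hold the VIP", anything else means "release it") can be illustrated as follows. This is only a sketch of the status-to-exit-code mapping, useful for example when probing a node from a workstation; the production check script stays `curl`-based as the note above requires. The function name `leader_exit_code`, the 1-second timeout, and the injectable `opener` parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Endpoint named in the note above; Patroni answers 200 on the leader, 503 otherwise.
LEADER_URL = "http://localhost:8008/leader"


def leader_exit_code(url=LEADER_URL, timeout=1.0, opener=urllib.request.urlopen):
    """Return 0 if the node reports itself as leader, 1 in every other case.

    Mirrors `curl -sf`: HTTP errors, timeouts, and refused connections all
    map to a non-zero result. The timeout is kept below the 2-second
    keepalived check interval.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, OSError):
        # HTTPError (such as the 503 from a replica) is a URLError subclass;
        # socket timeouts and connection failures are OSError subclasses.
        return 1
```

Treating an unreachable Patroni API the same as "not leader" is the conservative choice here: a node whose state cannot be confirmed should not hold the VIP.
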