How to operate, maintain, and troubleshoot your PostgreSQL HA cluster.
# Quick health check
curl -s http://localhost:8008/leader | python3 -m json.tool
# Full cluster view
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E '"name"|"state"|"role"|"lag"'
# Check all containers
docker ps | grep -E 'pg-node|pgbouncer|etcd|vault|dbhub|datadog'
# Expected containers:
# pg-node-1, pg-node-2, pg-node-3 (PostgreSQL + Patroni)
# pgbouncer-1, pgbouncer-2 (connection poolers)
# etcd (distributed config store)
# vault (HashiCorp Vault, Raft backend — optional)
# vault-agent (Vault Agent sidecar — optional)
# datadog-agent (Datadog Agent — optional, datadog_enabled)
# pg-dashboard (nginx status dashboard — optional, dashboard_enabled)
# postgres-exporter-1/2/3 (pg metrics exporters — optional, monitoring_enabled)
# pgbouncer-exporter-1/2 (pgbouncer metrics exporters — optional, monitoring_enabled)
# prometheus (metrics store — optional, monitoring_enabled)
# grafana (dashboards — optional, monitoring_enabled)
# dbhub (Bytebase web UI)# Connect to admin console
psql -h localhost -p 6432 -U pgadmin -d pgbouncer
# Common commands (inside admin console)
pgbouncer> SHOW POOLS; # Pool statistics
pgbouncer> SHOW STATS; # Detailed statistics
pgbouncer> SHOW CLIENTS; # Active clients
pgbouncer> SHOW CONFIG; # Current configuration
pgbouncer> \q # Exit# Primary node
docker logs pg-node-1 -f
# Other nodes
docker logs pg-node-2 -f
docker logs pg-node-3 -f
# PgBouncer logs
docker logs pgbouncer-1 -f
docker logs pgbouncer-2 -f
# etcd logs
docker logs etcd -f
# Datadog Agent logs (if datadog_enabled = true)
docker logs datadog-agent -fNote: The primary node is dynamically elected by Patroni. Do not assume pg-node-1 is always primary. Find the current primary first:
# Find current primary
curl -s http://localhost:8008/leader | python3 -m json.tool | grep '"name"'# Check on primary (replace pg-node-1 with the actual primary node name)
docker exec pg-node-1 psql -U postgres -d postgres << 'EOF'
SELECT
application_name,
client_addr,
state,
sync_state,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
EOF# View etcd cluster status
curl -s http://localhost:2379/v3/cluster/member/list | python3 -m json.tool
# Check Patroni consensus
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E '"dcs_last_seen"|"timeline"'# Docker volume sizes
docker volume ls | grep pg
# Check specific volume usage
docker exec pg-node-1 du -sh /var/lib/postgresql/18/main
# Check available disk
docker exec pg-node-1 df -h /Passwords are auto-generated by Terraform — there are no hardcoded defaults. When Terraform regenerates passwords (e.g., after terraform apply on a fresh run or variable change), you must sync the new password into the running PostgreSQL cluster.
# Step 1: View the newly generated passwords
terraform output -json generated_passwords
# Step 2: Apply Terraform (generates/rotates passwords and redeploys containers)
terraform apply -var-file=ha-test.tfvars
# Step 3: Sync the new pgadmin password into PostgreSQL on the primary node
# (replace pg-node-1 with the actual current primary)
docker exec -it pg-node-1 psql -U postgres -d postgres -c \
"ALTER USER pgadmin PASSWORD 'new-password-from-output';"
# Step 4: Restart PgBouncer containers to pick up the new password
docker restart pgbouncer-1 pgbouncer-2
# Step 5: Verify the connection works
psql -h localhost -p 6432 -U pgadmin -d postgres -c "SELECT 1;"If a node shows start failed in Patroni due to timeline divergence, use patronictl reinit from the primary node:
# Step 1: Find the current primary
curl -s http://localhost:8008/leader | python3 -m json.tool | grep '"name"'
# Step 2: Run reinit targeting the failed node (e.g. pg-node-2)
docker exec <primary-node> patronictl -c /etc/patroni/patroni.yml \
reinit pg-ha-cluster <failed-node> --force
# Example:
docker exec pg-node-1 patronictl -c /etc/patroni/patroni.yml \
reinit pg-ha-cluster pg-node-2 --force
# Step 3: Verify the node rejoins
sleep 30
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E '"name"|"state"|"role"'# Step 1: Record current leader (query /cluster — works from any node regardless of role)
LEADER=$(curl -s http://localhost:8008/cluster | python3 -c "import sys,json; d=json.load(sys.stdin); print(next(m['name'] for m in d['members'] if m['role']=='leader'))")
echo "Current primary: $LEADER"
# Step 2: Stop the primary
docker stop $LEADER
# Step 3: Wait 30 seconds
sleep 30
# Step 4: Check new leader elected
curl -s http://localhost:8008/leader | python3 -m json.tool | grep '"name"'
# Should show pg-node-2 or pg-node-3
# Step 5: Verify applications still work
psql -h localhost -p 6432 -U pgadmin -d postgres -c "SELECT version();"
# Step 6: Bring primary back online
docker start $LEADER
# Step 7: Verify cluster heals
sleep 30
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -c '"role"'
# Should show 3 members# Connect to primary
docker exec -it pg-node-1 psql -U postgres -d postgres
# Enable logging (if not already)
postgres> CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
postgres> SET log_min_duration_statement = 1000; -- Log queries > 1s
# Find slow queries
postgres> SELECT query, mean_exec_time FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;
postgres> \q# List backups
docker exec pg-node-1 pgbackrest info 2>/dev/null || echo "pgBackRest not configured"
# Create ad-hoc backup (if needed)
docker exec pg-node-1 pgbackrest backup --backup-standby 2>/dev/null || echo "Configure backup first"# Create table
psql -h localhost -p 6432 -U pgadmin -d postgres << 'EOF'
CREATE TABLE production_data AS
SELECT
generate_series(1, 1000000) as id,
md5(random()::text) as data,
NOW() as created_at;
CREATE INDEX idx_prod_id ON production_data(id);
EOF
# Verify on replica
psql -h localhost -p 5433 -U pgadmin -d postgres << 'EOF'
SELECT COUNT(*) FROM production_data;
EOF# Check current pool usage
psql -h localhost -p 6432 -U pgadmin -d pgbouncer << 'EOF'
SELECT
database,
user,
pool_mode,
cl_active,
cl_waiting,
sv_active,
sv_idle
FROM pgbouncer.pools;
EOF
# If approaching limits, increase in ha-test.tfvars:
# pgbouncer_default_pool_size = 50 (from 25)
# pgbouncer_max_client_conn = 2000 (from 1000)
# Then redeploy: terraform apply -var-file="ha-test.tfvars"Edit patroni/patroni-node-*.yml:
postgresql:
parameters:
# Connection handling
max_connections: 100 # Default: 100 (increase if needed)
superuser_reserved_connections: 3 # Default: 3
# Memory settings
shared_buffers: 256MB # Default: 256MB (tune for your RAM)
effective_cache_size: 1GB # Set to ~25% of total RAM
work_mem: 4MB # Per operation memory
# Replication
max_wal_senders: 5 # Default: 5 (one for each replica + margin)
wal_keep_size: 1GB # Default: 0 (keep WAL for streaming)
max_replication_slots: 10 # Default: 10Then redeploy:
terraform apply -var-file="ha-test.tfvars"Edit pgbouncer/pgbouncer.ini:
;; For high-throughput (many connections)
default_pool_size = 50 # Increase from 25
min_pool_size = 10 # Increase from 5
reserve_pool_size = 10 # Increase from 5
max_client_conn = 2000 # Increase from 1000
;; For low-latency
pool_mode = session # Instead of transaction
;; For debugging
log_connections = 1
log_disconnections = 1Then redeploy:
terraform apply -var-file="ha-test.tfvars"# Full database backup
docker exec pg-node-1 pg_dump -U postgres -F custom -f /tmp/backup.dump postgres
# Copy out of container
docker cp pg-node-1:/tmp/backup.dump ./backup.dump
# Compress
gzip backup.dump# Stop cluster
docker-compose down
# Rebuild from backup (advanced - see PostgreSQL docs)
# This requires careful coordination with Patroni and etcd
# Simpler: Use Patroni's built-in recovery
# Restore primary, replicas sync automatically# 1. Save state
docker ps -a
# 2. Update Dockerfile.patroni (change base image version)
# 3. Redeploy
terraform apply -var-file="ha-test.tfvars"
# 4. Verify
curl -s http://localhost:8008/leader | python3 -m json.tool# Update Dockerfile.patroni to new version
# Redeploy
terraform apply -var-file="ha-test.tfvars"
# Verify
docker exec pg-node-1 patronictl --versionIf the primary won't start:
# 1. Check if replica can be promoted
docker exec pg-node-2 psql -U postgres -c "SELECT pg_is_in_recovery();"
# Result: f = primary, t = replica
# 2. Wait - Patroni will auto-elect new primary
sleep 30
# 3. Verify
curl -s http://localhost:8008/leader
# 4. Fix original primary
docker logs pg-node-1
# Apply fixes, restart
docker start pg-node-1
sleep 30
# 5. Verify recovery
curl -s http://localhost:8008/cluster | python3 -m json.toolIf a node is isolated from etcd:
# 1. Node detects it's in minority partition
docker logs pg-node-1 | grep "promote"
# 2. Node shuts down to prevent split-brain
docker ps | grep pg-node-1
# 3. Restore network
# Node automatically rejoins after network is fixed
# 4. If not rejoining automatically
docker start pg-node-1
sleep 30
curl -s http://localhost:8008/cluster# 1. Check for data corruption
for i in 1 2 3; do
echo "=== Node $i ==="
docker logs pg-node-$i | tail -20
done
# 2. Start primary only
docker start pg-node-1
sleep 60
# 3. Start replicas
docker start pg-node-2 pg-node-3
sleep 60
# 4. Verify recovery
curl -s http://localhost:8008/cluster | python3 -m json.toolRequires
datadog_enabled = trueinha-test.tfvarsandTF_VAR_datadog_api_keyset.
Run the health check script after every deployment or at the start of each shift:
# Full 9-section report (recommended)
bash datadog-health-check.sh
# Compact status summary
bash datadog-health-check.sh --status
# Integration check results only
bash datadog-health-check.sh --checksWhat each section verifies:
| Section | Checks |
|---|---|
| 1. Container Status | datadog-agent is running and healthy |
| 2. Agent Connectivity | API key validated; agent hostname reported |
| 3. Integration Checks | postgres, pgbouncer, http_check — each shows metrics collected |
| 4. Patroni REST API | /liveness → HTTP 200 on all 3 nodes |
| 5. PostgreSQL TCP | Port 5432 reachable on all 3 nodes |
| 6. PgBouncer TCP | Port 6432 reachable on both poolers |
| 7. etcd Health | /health → HTTP 200 |
| 8. Vault Health | /v1/sys/health → HTTP 200 (if vault_enabled) |
| 9. Agent Errors | No actionable errors in last 100 log lines |
# PostgreSQL metrics (all 3 Patroni nodes)
docker exec datadog-agent agent check postgres
# PgBouncer pool metrics (both instances)
docker exec datadog-agent agent check pgbouncer
# HTTP liveness checks (Patroni + etcd + Vault)
docker exec datadog-agent agent check http_check
# Full verbose agent status
docker exec datadog-agent agent status# Enable (set API key first)
export TF_VAR_datadog_api_key="<your-real-api-key>"
# Edit ha-test.tfvars: datadog_enabled = true
terraform apply -var-file="ha-test.tfvars" -auto-approve
# Disable (removes the container, frees ~512 MB RAM)
# Edit ha-test.tfvars: datadog_enabled = false
terraform apply -var-file="ha-test.tfvars" -auto-approve
# Replace agent only (config change, no cluster disruption)
terraform apply -var-file="ha-test.tfvars" \
-replace='docker_container.datadog_agent[0]' -auto-approve# Step 1: Export the new key
export TF_VAR_datadog_api_key="<new-api-key>"
# Step 2: Re-apply — replaces only the agent container
terraform apply -var-file="ha-test.tfvars" -auto-approve
# Step 3: Verify connectivity
docker exec datadog-agent agent status | grep "API key"The rendered configs in datadog/rendered/ contain plaintext passwords. Check them only when debugging integration failures:
# List rendered files
ls -la datadog/rendered/
# Verify postgres check targets all 3 nodes
grep "host:" datadog/rendered/postgres.yaml
# Verify pgbouncer check targets both instances
grep "host:" datadog/rendered/pgbouncer.yaml
# Verify http_check has all Patroni + etcd + Vault URLs
grep "url:" datadog/rendered/http_check.yaml# Check every 10 seconds
watch -n 10 'curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E "\"name\"|\"state\"|\"role\"'# Simple script to monitor and alert
cat > monitor.sh << 'EOF'
#!/bin/bash
while true; do
STATUS=$(curl -s http://localhost:8008/leader | python3 -c "import sys, json; print(json.load(sys.stdin).get('state', 'unknown'))")
if [ "$STATUS" != "running" ]; then
echo "ALERT: Cluster status is $STATUS"
# Send email, Slack, etc.
fi
sleep 30
done
EOF
chmod +x monitor.sh
./monitor.sh &# 1. Take primary out of rotation
docker stop pg-node-1
sleep 30
# 2. Let replica become primary
curl -s http://localhost:8008/leader
# 3. Do maintenance on stopped primary
docker exec pg-node-1 reboot # or update, etc.
# 4. Restart primary
docker start pg-node-1
sleep 30
# 5. Verify cluster health
curl -s http://localhost:8008/cluster- Core containers running: pg-node-1/2/3, pgbouncer-1/2, etcd, dbhub
- Optional: vault, vault-agent (if
vault_enabled = true) - Optional: datadog-agent (if
datadog_enabled = true) - All 3 PostgreSQL nodes running
- All 2 PgBouncer instances healthy
- etcd cluster has quorum (2/2 or 3/3)
- Vault healthy:
curl -s http://localhost:8200/v1/sys/health - Datadog health check all green:
bash datadog-health-check.sh - All 3 Datadog integration checks collecting metrics
- Replication lag < 100ms
- No connection pool exhaustion
- No slow queries > 5s
- Disk usage < 80%
- All containers memory healthy
- Failover test successful
- Backups running (if configured)