Skip to content

Latest commit

 

History

History
617 lines (455 loc) · 15.9 KB

File metadata and controls

617 lines (455 loc) · 15.9 KB

🛠️ Operations & Maintenance Guide

How to operate, maintain, and troubleshoot your PostgreSQL HA cluster.

Daily Operations

Check Cluster Status

# Quick health check
curl -s http://localhost:8008/leader | python3 -m json.tool

# Full cluster view
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E '"name"|"state"|"role"|"lag"'

# Check all containers
docker ps | grep -E 'pg-node|pgbouncer|etcd|vault|dbhub|datadog'

# Expected containers:
#   pg-node-1, pg-node-2, pg-node-3   (PostgreSQL + Patroni)
#   pgbouncer-1, pgbouncer-2           (connection poolers)
#   etcd                               (distributed config store)
#   vault                              (HashiCorp Vault, Raft backend — optional)
#   vault-agent                        (Vault Agent sidecar — optional)
#   datadog-agent                      (Datadog Agent — optional, datadog_enabled)
#   pg-dashboard                       (nginx status dashboard — optional, dashboard_enabled)
#   postgres-exporter-1/2/3            (pg metrics exporters — optional, monitoring_enabled)
#   pgbouncer-exporter-1/2             (pgbouncer metrics exporters — optional, monitoring_enabled)
#   prometheus                         (metrics store — optional, monitoring_enabled)
#   grafana                            (dashboards — optional, monitoring_enabled)
#   dbhub                              (Bytebase web UI)

Monitor PgBouncer

# Connect to admin console
psql -h localhost -p 6432 -U pgadmin -d pgbouncer

# Common commands (inside admin console)
pgbouncer> SHOW POOLS;       # Pool statistics
pgbouncer> SHOW STATS;       # Detailed statistics
pgbouncer> SHOW CLIENTS;     # Active clients
pgbouncer> SHOW CONFIG;      # Current configuration
pgbouncer> \q                # Exit

View Logs

# Primary node
docker logs pg-node-1 -f

# Other nodes
docker logs pg-node-2 -f
docker logs pg-node-3 -f

# PgBouncer logs
docker logs pgbouncer-1 -f
docker logs pgbouncer-2 -f

# etcd logs
docker logs etcd -f

# Datadog Agent logs (if datadog_enabled = true)
docker logs datadog-agent -f

Weekly Maintenance Tasks

Verify Replication

Note: The primary node is dynamically elected by Patroni. Do not assume pg-node-1 is always primary. Find the current primary first:

# Find current primary
curl -s http://localhost:8008/leader | python3 -m json.tool | grep '"name"'
# Check on primary (replace pg-node-1 with the actual primary node name)
docker exec pg-node-1 psql -U postgres -d postgres << 'EOF'
SELECT 
  application_name,
  client_addr,
  state,
  sync_state,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
EOF

Check Cluster Consensus

# View etcd cluster status
curl -s http://localhost:2379/v3/cluster/member/list | python3 -m json.tool

# Check Patroni consensus
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E '"dcs_last_seen"|"timeline"'

Disk Usage Monitoring

# Docker volume sizes
docker volume ls | grep pg

# Check specific volume usage
docker exec pg-node-1 du -sh /var/lib/postgresql/18/main

# Check available disk
docker exec pg-node-1 df -h /

Monthly Tasks

Rotate Passwords

Passwords are auto-generated by Terraform — there are no hardcoded defaults. When Terraform regenerates passwords (e.g., after terraform apply on a fresh run or variable change), you must sync the new password into the running PostgreSQL cluster.

# Step 1: View the newly generated passwords
terraform output -json generated_passwords

# Step 2: Apply Terraform (generates/rotates passwords and redeploys containers)
terraform apply -var-file=ha-test.tfvars

# Step 3: Sync the new pgadmin password into PostgreSQL on the primary node
# (replace pg-node-1 with the actual current primary)
docker exec -it pg-node-1 psql -U postgres -d postgres -c \
  "ALTER USER pgadmin PASSWORD 'new-password-from-output';"

# Step 4: Restart PgBouncer containers to pick up the new password
docker restart pgbouncer-1 pgbouncer-2

# Step 5: Verify the connection works
psql -h localhost -p 6432 -U pgadmin -d postgres -c "SELECT 1;"

Patroni Node Reinit (Timeline Divergence)

If a node shows start failed in Patroni due to timeline divergence, use patronictl reinit from the primary node:

# Step 1: Find the current primary
curl -s http://localhost:8008/leader | python3 -m json.tool | grep '"name"'

# Step 2: Run reinit targeting the failed node (e.g. pg-node-2)
docker exec <primary-node> patronictl -c /etc/patroni/patroni.yml \
  reinit pg-ha-cluster <failed-node> --force

# Example:
docker exec pg-node-1 patronictl -c /etc/patroni/patroni.yml \
  reinit pg-ha-cluster pg-node-2 --force

# Step 3: Verify the node rejoins
sleep 30
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E '"name"|"state"|"role"'

Test Failover

⚠️ IMPORTANT: Do this during low-traffic windows only

# Step 1: Record current leader (query /cluster — works from any node regardless of role)
LEADER=$(curl -s http://localhost:8008/cluster | python3 -c "import sys,json; d=json.load(sys.stdin); print(next(m['name'] for m in d['members'] if m['role']=='leader'))")
echo "Current primary: $LEADER"

# Step 2: Stop the primary
docker stop $LEADER

# Step 3: Wait 30 seconds
sleep 30

# Step 4: Check new leader elected
curl -s http://localhost:8008/leader | python3 -m json.tool | grep '"name"'
# Should show pg-node-2 or pg-node-3

# Step 5: Verify applications still work
psql -h localhost -p 6432 -U pgadmin -d postgres -c "SELECT version();"

# Step 6: Bring primary back online
docker start $LEADER

# Step 7: Verify cluster heals
sleep 30
curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -c '"role"'
# Should show 3 members

Review Slow Queries

# Connect to primary
docker exec -it pg-node-1 psql -U postgres -d postgres

# Enable logging (if not already)
postgres> CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
postgres> SET log_min_duration_statement = 1000;  -- Log queries > 1s

# Find slow queries
postgres> SELECT query, mean_exec_time FROM pg_stat_statements 
          ORDER BY mean_exec_time DESC LIMIT 10;

postgres> \q

Check Backup Status (if pgBackRest configured)

# List backups
docker exec pg-node-1 pgbackrest info 2>/dev/null || echo "pgBackRest not configured"

# Create ad-hoc backup (if needed)
docker exec pg-node-1 pgbackrest backup --backup-standby 2>/dev/null || echo "Configure backup first"

Scaling Operations

Add More Data

# Create table
psql -h localhost -p 6432 -U pgadmin -d postgres << 'EOF'
CREATE TABLE production_data AS
SELECT 
  generate_series(1, 1000000) as id,
  md5(random()::text) as data,
  NOW() as created_at;
CREATE INDEX idx_prod_id ON production_data(id);
EOF

# Verify on replica
psql -h localhost -p 5433 -U pgadmin -d postgres << 'EOF'
SELECT COUNT(*) FROM production_data;
EOF

Monitor Connection Pools

# Check current pool usage
psql -h localhost -p 6432 -U pgadmin -d pgbouncer << 'EOF'
SELECT 
  database,
  user,
  pool_mode,
  cl_active,
  cl_waiting,
  sv_active,
  sv_idle
FROM pgbouncer.pools;
EOF

# If approaching limits, increase in ha-test.tfvars:
# pgbouncer_default_pool_size = 50  (from 25)
# pgbouncer_max_client_conn = 2000   (from 1000)
# Then redeploy: terraform apply -var-file="ha-test.tfvars"

Performance Tuning

PostgreSQL Connection Parameters

Edit patroni/patroni-node-*.yml:

postgresql:
  parameters:
    # Connection handling
    max_connections: 100                 # Default: 100 (increase if needed)
    superuser_reserved_connections: 3    # Default: 3

    # Memory settings
    shared_buffers: 256MB               # Default: 256MB (tune for your RAM)
    effective_cache_size: 1GB           # Set to ~25% of total RAM
    work_mem: 4MB                       # Per operation memory

    # Replication
    max_wal_senders: 5                  # Default: 5 (one for each replica + margin)
    wal_keep_size: 1GB                  # Default: 0 (keep WAL for streaming)
    max_replication_slots: 10           # Default: 10

Then redeploy:

terraform apply -var-file="ha-test.tfvars"

PgBouncer Tuning

Edit pgbouncer/pgbouncer.ini:

;; For high-throughput (many connections)
default_pool_size = 50              # Increase from 25
min_pool_size = 10                  # Increase from 5
reserve_pool_size = 10              # Increase from 5
max_client_conn = 2000              # Increase from 1000

;; For low-latency
pool_mode = session                 # Instead of transaction

;; For debugging
log_connections = 1
log_disconnections = 1

Then redeploy:

terraform apply -var-file="ha-test.tfvars"

Backup & Recovery

Manual Backup

# Full database backup
docker exec pg-node-1 pg_dump -U postgres -F custom -f /tmp/backup.dump postgres

# Copy out of container
docker cp pg-node-1:/tmp/backup.dump ./backup.dump

# Compress
gzip backup.dump

Restore from Backup

# Stop cluster
docker-compose down

# Rebuild from backup (advanced - see PostgreSQL docs)
# This requires careful coordination with Patroni and etcd

# Simpler: Use Patroni's built-in recovery
# Restore primary, replicas sync automatically

Upgrades & Maintenance Windows

PostgreSQL Minor Version Upgrade

# 1. Save state
docker ps -a

# 2. Update Dockerfile.patroni (change base image version)
# 3. Redeploy
terraform apply -var-file="ha-test.tfvars"

# 4. Verify
curl -s http://localhost:8008/leader | python3 -m json.tool

Patroni/etcd Upgrade

# Update Dockerfile.patroni to new version
# Redeploy
terraform apply -var-file="ha-test.tfvars"

# Verify
docker exec pg-node-1 patronictl --version

Emergency Procedures

Primary Node Completely Down

If the primary won't start:

# 1. Check if replica can be promoted
docker exec pg-node-2 psql -U postgres -c "SELECT pg_is_in_recovery();"
# Result: f = primary, t = replica

# 2. Wait - Patroni will auto-elect new primary
sleep 30

# 3. Verify
curl -s http://localhost:8008/leader

# 4. Fix original primary
docker logs pg-node-1
# Apply fixes, restart

docker start pg-node-1
sleep 30

# 5. Verify recovery
curl -s http://localhost:8008/cluster | python3 -m json.tool

Network Partition

If a node is isolated from etcd:

# 1. Node detects it's in minority partition
docker logs pg-node-1 | grep "promote"

# 2. Node shuts down to prevent split-brain
docker ps | grep pg-node-1

# 3. Restore network
# Node automatically rejoins after network is fixed

# 4. If not rejoining automatically
docker start pg-node-1
sleep 30
curl -s http://localhost:8008/cluster

All Nodes Down (Disaster Recovery)

# 1. Check for data corruption
for i in 1 2 3; do
  echo "=== Node $i ==="
  docker logs pg-node-$i | tail -20
done

# 2. Start primary only
docker start pg-node-1
sleep 60

# 3. Start replicas
docker start pg-node-2 pg-node-3
sleep 60

# 4. Verify recovery
curl -s http://localhost:8008/cluster | python3 -m json.tool

Datadog Observability (Monitoring)

Requires datadog_enabled = true in ha-test.tfvars and TF_VAR_datadog_api_key set.

Daily Datadog Health Check

Run the health check script after every deployment or at the start of each shift:

# Full 9-section report (recommended)
bash datadog-health-check.sh

# Compact status summary
bash datadog-health-check.sh --status

# Integration check results only
bash datadog-health-check.sh --checks

What each section verifies:

Section Checks
1. Container Status datadog-agent is running and healthy
2. Agent Connectivity API key validated; agent hostname reported
3. Integration Checks postgres, pgbouncer, http_check — each shows metrics collected
4. Patroni REST API /liveness → HTTP 200 on all 3 nodes
5. PostgreSQL TCP Port 5432 reachable on all 3 nodes
6. PgBouncer TCP Port 6432 reachable on both poolers
7. etcd Health /health → HTTP 200
8. Vault Health /v1/sys/health → HTTP 200 (if vault_enabled)
9. Agent Errors No actionable errors in last 100 log lines

Re-run Individual Integration Checks

# PostgreSQL metrics (all 3 Patroni nodes)
docker exec datadog-agent agent check postgres

# PgBouncer pool metrics (both instances)
docker exec datadog-agent agent check pgbouncer

# HTTP liveness checks (Patroni + etcd + Vault)
docker exec datadog-agent agent check http_check

# Full verbose agent status
docker exec datadog-agent agent status

Enable / Disable Datadog

# Enable (set API key first)
export TF_VAR_datadog_api_key="<your-real-api-key>"
# Edit ha-test.tfvars: datadog_enabled = true
terraform apply -var-file="ha-test.tfvars" -auto-approve

# Disable (removes the container, frees ~512 MB RAM)
# Edit ha-test.tfvars: datadog_enabled = false
terraform apply -var-file="ha-test.tfvars" -auto-approve

# Replace agent only (config change, no cluster disruption)
terraform apply -var-file="ha-test.tfvars" \
  -replace='docker_container.datadog_agent[0]' -auto-approve

Update Datadog API Key

# Step 1: Export the new key
export TF_VAR_datadog_api_key="<new-api-key>"

# Step 2: Re-apply — replaces only the agent container
terraform apply -var-file="ha-test.tfvars" -auto-approve

# Step 3: Verify connectivity
docker exec datadog-agent agent status | grep "API key"

Inspect Rendered Integration Configs

The rendered configs in datadog/rendered/ contain plaintext passwords. Check them only when debugging integration failures:

# List rendered files
ls -la datadog/rendered/

# Verify postgres check targets all 3 nodes
grep "host:" datadog/rendered/postgres.yaml

# Verify pgbouncer check targets both instances
grep "host:" datadog/rendered/pgbouncer.yaml

# Verify http_check has all Patroni + etcd + Vault URLs
grep "url:" datadog/rendered/http_check.yaml

Monitoring Health

Create Monitoring Dashboard

# Check every 10 seconds
watch -n 10 'curl -s http://localhost:8008/cluster | python3 -m json.tool | grep -E "\"name\"|\"state\"|\"role\"'

Set Up Alerts

# Simple script to monitor and alert
cat > monitor.sh << 'EOF'
#!/bin/bash
while true; do
  STATUS=$(curl -s http://localhost:8008/leader | python3 -c "import sys, json; print(json.load(sys.stdin).get('state', 'unknown'))")
  if [ "$STATUS" != "running" ]; then
    echo "ALERT: Cluster status is $STATUS"
    # Send email, Slack, etc.
  fi
  sleep 30
done
EOF

chmod +x monitor.sh
./monitor.sh &

Maintenance Windows

Safe Maintenance Procedure

# 1. Take primary out of rotation
docker stop pg-node-1
sleep 30

# 2. Let replica become primary
curl -s http://localhost:8008/leader

# 3. Do maintenance on stopped primary
docker exec pg-node-1 reboot  # or update, etc.

# 4. Restart primary
docker start pg-node-1
sleep 30

# 5. Verify cluster health
curl -s http://localhost:8008/cluster

Regular Inspections

Monthly Health Check Checklist

  • Core containers running: pg-node-1/2/3, pgbouncer-1/2, etcd, dbhub
  • Optional: vault, vault-agent (if vault_enabled = true)
  • Optional: datadog-agent (if datadog_enabled = true)
  • All 3 PostgreSQL nodes running
  • All 2 PgBouncer instances healthy
  • etcd cluster has quorum (2/2 or 3/3)
  • Vault healthy: curl -s http://localhost:8200/v1/sys/health
  • Datadog health check all green: bash datadog-health-check.sh
  • All 3 Datadog integration checks collecting metrics
  • Replication lag < 100ms
  • No connection pool exhaustion
  • No slow queries > 5s
  • Disk usage < 80%
  • All containers memory healthy
  • Failover test successful
  • Backups running (if configured)

Documentation References