Commit 5a48b09

Merge pull request #29 from randoneering/feature/system_info_checks
Feature/Adding Info and System Level Checks
2 parents 5e77b75 + 998dbf2 commit 5a48b09

30 files changed

Lines changed: 1885 additions & 151 deletions

.github/pull_request_template.md

Lines changed: 4 additions & 2 deletions
@@ -43,8 +43,10 @@
 **Has this code been deployed and tested on the following platforms?**
 
 - [ ] Amazon RDS for PostgreSQL
-- [ ] Google Cloud SQL for PostgreSQL
-- [ ] Azure Database for PostgreSQL
+- [ ] Google Cloud SQL for PostgreSQL (currently unable to test)
+- [ ] Azure Database for PostgreSQL (currently unable to test)
+- [ ] Neon
+- [ ] Supabase
 - [ ] Self-managed PostgreSQL
 
 **Platform-specific notes:**

.gitignore

Lines changed: 14 additions & 3 deletions
@@ -4,23 +4,34 @@
 terraform.tfstate
 terraform.tfstate.backup
 *.tfvars
+*.tfvars.json
 !terraform.tfvars.example
 *.tfstate
+*.tfstate.*
 *.tfstate.backup
+crash.log
+crash.*.log
+override.tf
+override.tf.json
+*_override.tf
+*_override.tf.json
 # Sensitive files
 *.pem
 *.key
 credentials.json
 .pgpass
 pgpass
 # AI dirs
-.claude
 .agent
 .copilot
-
-# AI Files
+.claude
+# AI files
+AGENTS.md
 CLAUDE.md
 
+# Local docs
+docs/superpowers/plans/
+
 # Python
 __pycache__/
 .pytest_cache/

README.md

Lines changed: 6 additions & 3 deletions
@@ -75,7 +75,7 @@ That's it! No configuration needed. Deploy as a user with the highest possible p
 
 - **Table Bloat** - Tables with >20% bloat affecting performance (tables >100MB)
 - **Missing Statistics** - Tables never analyzed, leaving the query planner without statistics
-- **Duplicate Indexes** - Multiple indexes with identical or overlapping column sets
+- **Duplicate Indexes** - Indexes with the same structure, including predicates and expressions
 - **Inactive Replication Slots** - Identifies replication slots that are inactive and can be removed if no longer needed
 - **Tables Larger Than 100GB** - Identifies tables that are larger than 100GB
 - **Tables With More Than 200 Columns** - List tables with more than 200 columns. You should probably look into those...
@@ -121,8 +121,8 @@ That's it! No configuration needed. Deploy as a user with the highest possible p
 - **PostgreSQL Version** - Version information and configuration details
 - **Installed Extensions** - Lists installed extensions on the Server
 - **Server Uptime** - Server uptime since last restart
-- **Log Directory** - Location of Log File(s). Results will vary for managed services like AWS RDS. (note: need access to AWS/Azure/GCP environments where I can test against!)
-- **Log File Sizes** - The size of the log files. Again, this will vary for managed services.
+- **Log Directory** - Current log directory when the platform exposes it
+- **Log File Sizes** - Current log file sizes when the platform exposes them
 
 ## Usage Tips
 
@@ -210,11 +210,14 @@ pgFirstAid is designed to be lightweight and safe to run on production systems:
 - A coverage guard ensures every `check_name` in `pgFirstAid.sql` is referenced by at least one pgTAP assertion.
 - Managed database validation is exercised through the reusable workflow in `.github/workflows/managed-db-validate.yml`.
 
+> **Important:** We currently validate managed-database testing against AWS, but we do not have the funding or credits needed to keep Azure and GCP test environments running. If you have access to Azure Database for PostgreSQL or GCP Cloud SQL and want to help validate pgFirstAid there, we would be happy to have the help.
+
 ## Compatibility
 
 - **PostgreSQL 10+** - Supported, with active automated validation focused on PostgreSQL 15-18
 - **PostgreSQL 9.x** - Most features work (minor syntax adjustments may be needed)
 - Works with PostgreSQL-compatible databases, including Amazon RDS, Aurora, Azure Database for PostgreSQL, GCP Cloud SQL, and self-hosted PostgreSQL
+- Automated managed-database validation is active for AWS today. Azure and GCP support is best-effort until we can fund those test environments.
 
 ## Contributing
 
pgFirstAid.sql

Lines changed: 182 additions & 0 deletions
@@ -267,6 +267,38 @@ with pss as (
 end;
 $$ language plpgsql;
 
+-- Helper: returns formatted checkpoint stats compatible with PG15/16 (pg_stat_bgwriter)
+-- and PG17+ (pg_stat_checkpointer, which replaced the checkpoint columns in pg_stat_bgwriter)
+create or replace
+function _pg_firstaid_checkpoint_stats()
+returns text
+language plpgsql
+stable
+as $$
+declare
+v_timed bigint;
+v_forced bigint;
+begin
+if current_setting('server_version_num')::int >= 170000 then
+select num_timed, num_requested
+into v_timed, v_forced
+from pg_stat_checkpointer;
+else
+select checkpoints_timed, checkpoints_req
+into v_timed, v_forced
+from pg_stat_bgwriter;
+end if;
+
+return 'timed: ' || v_timed::text ||
+', forced: ' || v_forced::text ||
+', forced ratio: ' ||
+case
+when v_timed + v_forced = 0 then '0%'
+else round(100.0 * v_forced / (v_timed + v_forced), 1)::text || '%'
+end;
+end;
+$$;
+
 create or replace
 function pg_firstAid()
 returns table (
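
The return expression of `_pg_firstaid_checkpoint_stats()` above is plain arithmetic on two counters, so it can be checked outside the database. The following is a minimal Python sketch of that formatting only, not part of the commit: `checkpoint_summary` is a hypothetical name, and the half-up rounding is an assumption about how `round(numeric, 1)` behaves for these values.

```python
from decimal import Decimal, ROUND_HALF_UP


def checkpoint_summary(timed: int, forced: int) -> str:
    """Mirror the string built by the SQL helper's return expression."""
    total = timed + forced
    if total == 0:
        # Same guard as the SQL CASE branch: avoid dividing by zero.
        ratio = "0%"
    else:
        pct = (Decimal(100) * forced / total).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        ratio = f"{pct}%"
    return f"timed: {timed}, forced: {forced}, forced ratio: {ratio}"


print(checkpoint_summary(90, 30))  # timed: 90, forced: 30, forced ratio: 25.0%
print(checkpoint_summary(0, 0))    # timed: 0, forced: 0, forced ratio: 0%
```

A forced ratio above the 50% mentioned in the check's recommendation would show up here as, for example, `checkpoint_summary(10, 30)`.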
@@ -1433,6 +1465,156 @@ order by
 'Keep PostgreSQL updated and review configuration settings' as recommended_action,
 'https://www.postgresql.org/docs/current/upgrading.html' as documentation_link,
 5 as severity_order;
+-- INFO: shared_buffers current value
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'shared_buffers Setting' as check_name,
+'System' as object_name,
+'Current value of shared_buffers. Recommended: ~25% of total system RAM for dedicated database servers.' as issue_description,
+current_setting('shared_buffers') as current_value,
+'No action needed if already tuned. For dedicated DB servers with 8GB+ RAM, target 25% of total RAM. Changes require a PostgreSQL restart.' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS' as documentation_link,
+5 as severity_order;
+-- HIGH: shared_buffers still at 128MB PostgreSQL default
+insert into health_results
+select
+'HIGH' as severity,
+'System Health' as category,
+'shared_buffers At Default' as check_name,
+'System' as object_name,
+'shared_buffers is set to the PostgreSQL default of 128MB. On any real workload this is almost certainly too low.' as issue_description,
+current_setting('shared_buffers') as current_value,
+'Set shared_buffers to approximately 25% of total system RAM (e.g., 2GB on an 8GB server). Requires a PostgreSQL restart.' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS' as documentation_link,
+2 as severity_order
+where pg_size_bytes(current_setting('shared_buffers')) = pg_size_bytes('128MB');
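
The "At Default" checks compare byte counts rather than setting strings, so a value reported in a different unit (for example `131072kB`) would still match `128MB`. A rough Python sketch of that idea follows. `size_bytes` and `is_at_default` are hypothetical names, and the parser is a deliberate simplification that only assumes 1024-based units; it is not a reimplementation of PostgreSQL's `pg_size_bytes`.

```python
import re

# Simplified unit table (assumption: binary multiples, case-insensitive).
_UNITS = {"b": 1, "bytes": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def size_bytes(text: str) -> int:
    """Convert a size string such as '128MB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    key = unit.lower() or "bytes"  # a bare number is treated as bytes
    if key not in _UNITS:
        raise ValueError(f"unrecognized unit: {unit!r}")
    return int(float(number) * _UNITS[key])


def is_at_default(setting: str, default: str) -> bool:
    """Compare two size strings by value, not by spelling."""
    return size_bytes(setting) == size_bytes(default)


print(is_at_default("128MB", "128MB"))     # True
print(is_at_default("131072kB", "128MB"))  # True: same size, different unit
print(is_at_default("2GB", "128MB"))       # False
```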
+
+-- INFO: work_mem current value
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'work_mem Setting' as check_name,
+'System' as object_name,
+'Current value of work_mem. Allocated per sort/hash operation per session — multiply by max_connections and parallel workers to estimate peak memory consumption.' as issue_description,
+current_setting('work_mem') || ' (max_connections: ' || current_setting('max_connections') || ')' as current_value,
+'For OLTP workloads, 16-32MB is a common starting point. Monitor pg_stat_statements for temp file spills to determine if higher is warranted. Use SET work_mem per-session for large one-off queries rather than setting globally.' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM' as documentation_link,
+5 as severity_order;
+
+-- MEDIUM: work_mem still at 4MB PostgreSQL default
+insert into health_results
+select
+'MEDIUM' as severity,
+'System Health' as category,
+'work_mem At Default' as check_name,
+'System' as object_name,
+'work_mem is set to the PostgreSQL default of 4MB. On modern hardware this often causes unnecessary sort and hash spills to disk.' as issue_description,
+current_setting('work_mem') as current_value,
+'Consider raising work_mem to 16-32MB for OLTP workloads. Be aware that work_mem is allocated per operation per session — high concurrency multiplies total memory usage.' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM' as documentation_link,
+3 as severity_order
+where pg_size_bytes(current_setting('work_mem')) = pg_size_bytes('4MB');
+
+-- INFO: effective_cache_size current value
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'effective_cache_size Setting' as check_name,
+'System' as object_name,
+'Current value of effective_cache_size. Tells the query planner how much memory is available for disk caching. Does not allocate memory — purely advisory.' as issue_description,
+current_setting('effective_cache_size') as current_value,
+'Set to ~50-75% of total system RAM (shared_buffers + expected OS page cache). Underestimates cause the planner to prefer nested loops over index scans.' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE' as documentation_link,
+5 as severity_order;
+
+-- INFO: maintenance_work_mem current value
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'maintenance_work_mem Setting' as check_name,
+'System' as object_name,
+'Current value of maintenance_work_mem. Used by VACUUM, CREATE INDEX, ALTER TABLE, and each autovacuum worker.' as issue_description,
+current_setting('maintenance_work_mem') as current_value,
+'Consider 256MB-1GB on modern hardware. Higher values speed up index builds and autovacuum on large tables. Changes take effect immediately for new sessions.' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM' as documentation_link,
+5 as severity_order;
+
+-- INFO: Transaction ID wraparound risk per database
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'Transaction ID Wraparound Risk' as check_name,
+datname as object_name,
+'Age of the oldest unfrozen transaction ID in this database. PostgreSQL must freeze XIDs before reaching ~2.1 billion to prevent data loss from wraparound.' as issue_description,
+datname || ': XID age ' || trim(to_char(age(datfrozenxid), 'FM999,999,999,990')) ||
+' (' || round(age(datfrozenxid)::numeric * 100 / 2000000000, 1)::text ||
+'% of wraparound window, ~' ||
+trim(to_char(greatest(2000000000::bigint - age(datfrozenxid)::bigint, 0), 'FM999,999,999,990')) ||
+' remaining)' as current_value,
+'Run VACUUM FREEZE on databases approaching high XID age. Ensure autovacuum is enabled and not blocked. Monitor databases with age > 500,000,000.' as recommended_action,
+'https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND' as documentation_link,
+5 as severity_order
+from
+pg_database
+where
+datallowconn = true;
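
The `current_value` expression in the wraparound check above reduces to a percentage of a fixed 2,000,000,000 window plus a remaining count clamped at zero. Here is a small Python sketch of that arithmetic, offered as an illustration rather than as code from the commit: `xid_summary` and the database name `appdb` are hypothetical, and half-up rounding is an assumption about `round(numeric, 1)`.

```python
from decimal import Decimal, ROUND_HALF_UP

# The divisor used by the check; the description text cites ~2.1 billion as the hard limit.
WRAPAROUND_WINDOW = 2_000_000_000


def xid_summary(datname: str, xid_age: int) -> str:
    """Mirror the current_value string assembled by the wraparound check."""
    pct = (Decimal(xid_age) * 100 / WRAPAROUND_WINDOW).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )
    # greatest(..., 0) in the SQL: never report a negative remainder.
    remaining = max(WRAPAROUND_WINDOW - xid_age, 0)
    return (
        f"{datname}: XID age {xid_age:,} "
        f"({pct}% of wraparound window, ~{remaining:,} remaining)"
    )


print(xid_summary("appdb", 500_000_000))
# appdb: XID age 500,000,000 (25.0% of wraparound window, ~1,500,000,000 remaining)
```

At the 500,000,000 threshold named in the recommended action, the check would therefore report a quarter of the window as used.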
+
+-- INFO: Checkpoint statistics (PG15/16: pg_stat_bgwriter, PG17+: pg_stat_checkpointer)
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'Checkpoint Stats' as check_name,
+'System' as object_name,
+'Checkpoint activity since stats last reset. Forced checkpoints occur when WAL fills up before the scheduled interval — high ratios suggest max_wal_size may be too small. PG15/16 reads from pg_stat_bgwriter; PG17+ reads from pg_stat_checkpointer.' as issue_description,
+_pg_firstaid_checkpoint_stats() as current_value,
+'If forced checkpoints are consistently above 50% of total, consider increasing max_wal_size. Reset stats with: SELECT pg_stat_reset_shared(''' ||
+case
+when current_setting('server_version_num')::int >= 170000 then 'checkpointer'
+else 'bgwriter'
+end ||
+''').' as recommended_action,
+'https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-BGWRITER-VIEW' as documentation_link,
+5 as severity_order;
+
+-- INFO: Server role (primary vs standby)
+insert into health_results
+select
+'INFO' as severity,
+'System Info' as category,
+'Server Role' as check_name,
+'System' as object_name,
+'Whether this server is operating as a primary or standby replica. Context for interpreting other checks — some checks are only relevant on standbys.' as issue_description,
+case
+when pg_is_in_recovery() then 'Standby (replica)'
+else 'Primary'
+end as current_value,
+'No action needed — informational.' as recommended_action,
+'https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-RECOVERY-INFO-TABLE' as documentation_link,
+5 as severity_order;
+
+-- INFO: Connection utilization
+insert into health_results
+select
+'INFO' as severity,
+'System Health' as category,
+'Connection Utilization' as check_name,
+'System' as object_name,
+'Current connection usage as a percentage of max_connections. Includes all connection states, not just active queries.' as issue_description,
+count(*)::text || ' total / ' || current_setting('max_connections') || ' max (' ||
+round(100.0 * count(*) / current_setting('max_connections')::int, 1)::text || '% used)' as current_value,
+'If consistently above 80%, consider a connection pooler such as PgBouncer. Reserve headroom for superuser connections (superuser_reserved_connections).' as recommended_action,
+'https://www.postgresql.org/docs/current/runtime-config-connection.html' as documentation_link,
+5 as severity_order
+from
+pg_stat_activity;
+
 -- INFO: Installed Extensions
 insert
 into

testing/.flox/.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+env/manifest.lock linguist-generated=true linguist-language=JSON

testing/.flox/.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+run/
+cache/
+lib/
+log/
+!env/

testing/.flox/env.json

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+"name": "pgFirstAid",
+"version": 1
+}
