Skip to content

Code review: findings across CLI, reporter, Flask backend, infra, docs & CI #58

@NikolayS

Description

@NikolayS

Summary

Comprehensive review of the postgresai repo covering architecture, code quality, tests, CI, docs, dependency management, and security. Most of the project is in decent shape — a strong quality framework exists, CodeQL + gitleaks are wired up, and the bulk of the checkup/reporter logic is well commented. However, there are a number of real issues worth fixing. They are grouped below by severity.

Scope examined: cli/ (TypeScript), reporter/ (Python), monitoring_flask_backend/ (Python Flask), config/, docker-compose*.yml, .gitlab-ci.yml, quality/, tests/, terraform/, postgres_ai_helm/, top-level docs.

Tooling run locally: ruff check (0.15.8), git log, module import checks, grep/find across the tree.


🔴 Critical / High

1. PostgreSQL version support gap — CI, conftest, and README disagree

The README advertises PostgreSQL 14-18 (badge + Requirements section), but:

  • conftest.py:9 only searches for binaries in versions ["16", "15", "14", "13", "12", "11"] — never finds PG 17 or PG 18 (the current latest stable).
  • quality/gitlab-ci-quality.yml:225 quality:pg-version-matrix runs only PG_VERSION: ["14", "15", "16"].
  • No CI coverage for PG 17 or PG 18.

This is a correctness/QA claim the project cannot currently back up. Either update the matrix + conftest to cover 17/18, or adjust the README badge to the versions actually tested.

2. Large "planning" doc committed to repo root

TMP_MON_PREVIEWS.md (57 KB, 1000+ lines) is a dated (2026-01-25, v1.7.0) implementation plan for preview environments. The TMP_ prefix and the content ("Status: Implementation Complete - Awaiting VM Provisioning", embedded docker-compose samples at lines 1074, 1347) strongly suggest this should not be checked in. The pre-commit check-added-large-files rule (.pre-commit-config.yaml:11 — maxkb=500) would now block this file, but it was committed earlier.

Action: move to the wiki / issue / internal docs and delete from the repo.

3. README is structurally duplicated / inconsistent

README.md reads like two READMEs glued together:

  • Lines 28-213: modern/lean postgresai README.
  • Lines 228-534: emoji-heavy legacy doc (## 🎯 Use cases, ## 🔧 Management commands, ## 🔄 Upgrading, ## ☸️ Kubernetes deployment, ## 📋 Checkup reports, ## 🌐 Access points, etc.) repeating and partly contradicting the earlier sections.

Concrete problems:

  • Two Roadmaps (one in the clean half, one further down at line 456).
  • A stray <div align="center"> opens at line 224 and closes only at line 535, so the entire bottom half is inadvertently centre-aligned.
  • The testing section at line 464 says "Python-based report generation lives under reporter/" but the CLI is TypeScript and the project's CLI coverage badge (line 10) points at cli:node:tests. Both are true — the README just doesn't explain the mix.
  • The 🧪 Testing section references tests/reporter/ as the entire suite; there are actually 6 test roots (tests/reporter/, tests/monitoring_flask_backend/, tests/e2e/, tests/lock_waits/, tests/settings/, tests/compliance_vectors/).
  • .env.example:6 pins PGAI_TAG=0.14.0, README line 332 also mentions 0.14.0, but cli/package.json:3 and pgai/package.json:3 both have "version": "0.0.0-dev.0". There's no single source of truth for the shipped version.

Action: de-duplicate README.md, remove the stray centre <div>, and document the Python/TypeScript split in one place.

4. Repository URLs point to GitLab only

cli/package.json:9, cli/package.json:12, pgai/package.json:9, pgai/package.json:12 all point to gitlab.com/postgres-ai/postgres_ai. The package is also hosted at github.com/postgres-ai/postgresai (this issue is being filed there), and the README references GitLab for issues while the GitHub repo is the mirror target (.github/workflows/mirror-to-gitlab.yml). Decide which is canonical and make it consistent — at minimum fix bugs.url/repository.url in the npm manifests.

5. Dependency fragmentation & drift

Three separate Python dependency files, no pyproject.toml:

File pytest boto3 requests-aws4auth
reporter/requirements.txt (not pinned) 1.34.69 1.2.3
reporter/requirements-dev.txt 8.3.5
monitoring_flask_backend/requirements.txt 9.0.2 1.34.69 1.3.1

Two different pytest majors, two different requests-aws4auth versions, and boto3 pinned to a ~1-year-old release on both sides. No top-level pyproject.toml → no single dev-env bootstrap. Consider consolidating into pyproject.toml with optional extras ([project.optional-dependencies]) and aligning pins.

6. Massive single-file modules (god-objects)

File LOC Notes
cli/bin/postgres-ai.ts 4,836 Single CLI entry file; commander tree likely needs splitting into subcommand modules.
reporter/postgres_reports.py 5,166 One class (PostgresReportGenerator) with 78 methods; several functions >150 LOC.
cli/lib/checkup.ts 1,744
monitoring_flask_backend/app.py 1,536 10 routes + SQL allowlist logic + truncation helpers all in one module.
cli/lib/init.ts 1,131
cli/lib/issues.ts 1,060

These aren't bugs, but they are the main obstacle to onboarding new contributors and to targeted testing. Suggest extracting one check family per module (e.g. reporter/generators/hX.py, reporter/generators/fX.py) and splitting the Commander tree in postgres-ai.ts into per-command files.


🟠 Medium

7. zip() without strict= on paired time series

reporter/postgres_reports.py:2002, 2021, 2379 (and two more, B905 ×5 total) zip exec/plan or read/write values that are assumed to line up hour-by-hour. If Prometheus returns a different number of points for the two series (stale scrape, gap), results are silently wrong, not errored:

hourly_total_time = [e + p for e, p in zip(exec_values, plan_values)]

Add strict=True (Python 3.10+) or explicit length validation.

8. Bare except Exception: pass swallowing failures

Silent swallows with no logging:

  • reporter/postgres_reports.py:178 — file read failure returns None silently.
  • reporter/postgres_reports.py:4407S110 — metrics parsing failure hidden by comment-only justification.
  • monitoring_flask_backend/app.py:228 — smart query truncation falls back silently.
  • tests/lock_waits/test_lock_waits_metric.py:170, 334, 342 — three occurrences in a single test.

At minimum add logger.debug(…, exc_info=True) so these failures are traceable in production.

9. Flask backend posture

  • monitoring_flask_backend/app.py:403 /health leaks PROMETHEUS_URL to any caller (unauthenticated) — could help an attacker pivot to internal Prometheus.
  • monitoring_flask_backend/app.py:272 Prometheus client defaults to disable_ssl=True; OK for a Docker-internal URL but nothing validates that PROMETHEUS_URL actually points at localhost before disabling TLS. A misconfigured env var would MITM happily.
  • No security headers (X-Content-Type-Options, X-Frame-Options, basic CSP) on the CSV/JSON endpoints. Add a @app.after_request handler.
  • Most routes (/pgss_metrics/csv, /query_texts, /btree_bloat/csv, /table_info/csv, /query_info_metrics, /metrics, /debug/metrics, /version) have no authentication. If the container is not behind an authenticating reverse proxy, these are open. Document the required deployment contract or add optional basic-auth.
  • monitoring_flask_backend/app.py:1536 app.run(host='0.0.0.0', …) — only used when run directly (gunicorn is the real entrypoint) but still ruff-flagged S104; pin to localhost for dev.

10. No Python linting or type-checking in CI

Release-readiness runs bun run typecheck for the CLI, but there is no ruff/mypy/black/flake8 gate for the 6 700+ lines of Python. Running ruff check . --select E,W,B,S surfaces 3 158 issues today, including:

  • 5 × S608 hardcoded SQL (monitoring_flask_backend/app.py:181, 185, 218, test code)
  • 4 × S110 try/except/pass
  • 5 × B905 zip-without-strict (see Text review for errors #7)
  • 9 × B007 unused loop control variable
  • 27 × F841 unused variable
  • 35 × F401 unused import
  • 419 × W293 blank-line-with-whitespace
  • 1 071 × E501 line-too-long
  • 1 × S104 bind-all-interfaces

These are individually small but cumulatively indicate code hygiene drift. Recommend adding ruff check + mypy to .pre-commit-config.yaml and to .gitlab-ci.yml (non-blocking at first, then block on a pinned baseline).

11. CLI smoke job doesn't actually test behaviour

.gitlab-ci.yml:220-250 (cli:node:smoke) validates that the binary runs (--help, mon status --help, prepare-db --print-sql | grep CREATE). It does not assert any CLI output correctness. The real unit tests live in cli:node:tests (bun test) — that's fine, but then the smoke job is misnamed; it's closer to a "binary sanity" check.

12. quality:sql-safety and quality:connection-safety are advisory only

Both have allow_failure: true (quality/gitlab-ci-quality.yml:69, 153). For a PostgreSQL observability tool that executes SQL against customer databases, these checks should eventually be hard gates on main.

13. Test-file names leak coverage-farming history

tests/reporter/test_final_coverage_push.py and tests/reporter/test_final_push_to_85.py exist as top-level files. The names advertise that they were written to hit a coverage threshold rather than to verify behaviour. They should be renamed by concern (e.g. test_parse_memory_value.py, test_format_bytes.py) and merged into the existing per-feature suites.

14. Commander version jump + Bun/Node dual runtime

cli/package.json:44 has "commander": "^14.0.3" and "typescript": "^6.0.2" — both are noticeably ahead of what's typical and may not yet be widely tested with bun. Also the build script rewrites the shebang from #!/usr/bin/env bun to #!/usr/bin/env node via inline node -e; this works but is fragile. Consider a small scripts/postbuild.ts instead.


🟡 Low / polish

15. Makefile is trivial

Makefile (lines 1-15) only exposes up, up-local, down, logs. No test, lint, format, build, install-dev. Given how scattered the build steps are (bun in cli/, pip in reporter/ + monitoring_flask_backend/, helm in postgres_ai_helm/), a thin Makefile would materially help contributors.

16. No issue or PR templates

.github/ has dependabot.yml, codeql-analysis.yml, mirror-to-gitlab.yml, but no ISSUE_TEMPLATE/*.md and no pull_request_template.md. The ecosystem (Cursor, Claude Code) discovers these and they raise the quality of first-issue reports.

17. Cross-module import reaches into Flask backend from reporter

reporter/postgres_reports.py:47 imports monitoring_flask_backend.promql_utils.escape_promql_label with a try/except fallback. Putting the small utility in a shared common/ or publishing it once would eliminate the cross-component coupling and the inline fallback.

18. requirements-aws4auth and boto3 used for AMP auth are not gated behind a feature flag

Both packages ship unconditionally in reporter/requirements.txt even though AMP (AWS-managed Prometheus) support is optional. Moving them to an [amp] extra keeps the default install slim.

19. Misc code smells found by ruff

  • reporter/postgres_reports.py:426 B007 — loop variable label_name never used.
  • tests/compliance_vectors/test_vm_auth.py:116, 125 S607subprocess.run(['helm', …]) uses partial path.
  • Trailing-whitespace / blank-line whitespace spread across hundreds of lines; a single ruff --fix pass will clean it up.

20. Infra nits

  • docker-compose.yml:11postgres-ai-configs:${PGAI_TAG:?…} hard-fails if PGAI_TAG is missing, but there's no hint in the error message to point at .env.example. README has recovery guidance but the error message does not.
  • docker-compose.yml:261 cAdvisor runs privileged: true — necessary for cAdvisor, but worth an inline comment justifying it for security auditors.
  • docker-compose.yml:148 --sink=prometheus://0.0.0.0:9091/pgwatch — literal 0.0.0.0 as a client target reads like a typo; confirm this is "listen on all" vs "connect to all".

What worked well (so we don't regress)

  • quality/QUALITY_ENGINEERING_GUIDE.md + quality/scripts/release-readiness.sh are a solid gate: 70 % Python / 60 % CLI coverage minimums enforced, with 9 other checks (build, types, schemas, gitleaks, large-file, docs).
  • Secret-handling is genuinely careful: gitleaks pre-commit (.githooks/pre-commit), gitleaks pre-commit hook repo, gitleaks.toml with synthetic-secret allowlist, no real secrets committed.
  • /execute-query debug endpoint is defence-in-depth'd: hidden behind ENABLE_DEBUG, constant-time hmac.compare_digest auth, SQL allowlist + quote-aware comment stripper + multi-statement reject + connect_timeout=10 + statement_timeout=10s + read-only transaction. This is the highest-quality code in the repo.
  • CLI schema validation (cli/test/schema-validation.test.ts) against JSON schemas in reporter/schemas/ gives cross-language guarantees.
  • CodeQL + dependabot + mirror-to-gitlab are wired up.

Suggested triage

Priority Items
Do first #1 PG-version mismatch, #2 remove TMP_MON_PREVIEWS.md, #3 de-duplicate README, #4 fix repo URLs
Next #5 consolidate deps, #7 zip(strict=), #8 except-pass logging, #9 Flask hardening, #10 ruff/mypy in CI
Backlog #6 split god-modules, #11-14 CI polish, #15-20 DX & cleanups

Happy to open PRs for any of the "Do first" items — they are small and mostly mechanical.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions