Skip to content

fix: validate pg_dumpall globals dump exit code#479

Open
alwynpan wants to merge 1 commit into
nfrastack:mainfrom
alwynpan:fix/pgsql-globals-exit-code-check
Open

fix: validate pg_dumpall globals dump exit code#479
alwynpan wants to merge 1 commit into
nfrastack:mainfrom
alwynpan:fix/pgsql-globals-exit-code-check

Conversation

@alwynpan
Copy link
Copy Markdown
Contributor

@alwynpan alwynpan commented May 30, 2026

Problem

backup_pgsql_globals() calls check_exit_code without the mandatory first positional argument:

check_exit_code "${backup_job_filename}"

check_exit_code switches on $1 and only acts on a backup or move literal (there is no default case):

check_exit_code() {
    case "${1}" in
        backup ) ... ;;   # inspects ${exit_code}, notifies, sets master_exit_code=1
        move )   ... ;;   # inspects ${move_exit_code}
    esac
}

With $1 set to the filename, no branch matches and the body is skipped entirely — the pg_dumpall -g exit code captured on the previous line is never inspected.

This is the only one of the 13 backup-phase check_exit_code call sites missing the backup literal; every other site (and the move check in this same function at the next line) passes it.

Impact

A failed PostgreSQL globals dump (roles/grants/tablespaces) is completely silent:

  • no error logged, no failure notification,
  • master_exit_code stays 0 so the job reports success,
  • and cleanup_old_data still prunes older good global backups (it only skips pruning when master_exit_code == 1).

Fix

Pass the backup literal so the dump's exit code is evaluated like every other backup call site:

-        check_exit_code "${backup_job_filename}"
+        check_exit_code backup "${backup_job_filename}"

Test plan

  • bash -n install/assets/functions/10-db-backup — syntax OK
  • Force a pg_dumpall -g failure (e.g. insufficient role privileges) and confirm the job now logs the error, fires a notification, sets master_exit_code=1, and skips cleanup

Validation

Tested locally against a real PostgreSQL backend: a postgres:16 container with the actual (unmodified) backup_pgsql_globals() and check_exit_code() functions — extracted from both the pre-fix commit and this fix — driving a real pg_dumpall -g. Only peripheral orchestration helpers were stubbed; check_exit_code and the failure path are the real code. A genuine dump failure was triggered via bad credentials (real password authentication failed from pg_dumpall).

Scenario master_exit_code notification Result
Pre-fix, dump fails 0 none bug reproduced — silent failure; cleanup runs and would prune good globals
Fixed, dump fails 1 fired ([FATAL] ... Backup failed completely) ✅ error logged, notified, cleanup skipped
Pre-fix, dump succeeds 0 none ✅ no false alarm
Fixed, dump succeeds 0 none ✅ no false alarm (now also logs success)

In the pre-fix failure run the backup branch of check_exit_code never executes — control jumps straight from the dump to the move check — which is exactly the defect.

backup_pgsql_globals called check_exit_code without the "backup" arg, so
a failed globals dump went undetected. Pass "backup" like other sites.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant