Skip to content

Commit 1f898f7

Browse files
committed
fix(operator): un-gate readiness on needs-reindex-all, revert to blind REINDEX DATABASE
The amcheck-driven smart pass introduced in #68 hits the same family of postgres-internal pathology that wedges other vanilla DDL on the prod dataset — bt_index_check itself burns 100% CPU forever on specific indexes (observed via pg_stat_activity: state=active, empty wait_event, query_start growing linearly with wall clock for 4+ minutes on a tiny system catalog index). The smart pass is unusable on this data. Two changes: 1. Drop bt_index_check; revert the needs-reindex-all branch to blind REINDEX DATABASE per user DB. REINDEX reads the heap and rebuilds the index from scratch — different code path from amcheck, which reads corrupt index pages directly — so it isn't subject to the same wedge. Slow (hours on prod-sized DBs) but makes progress. 2. Drop the readiness probe's gate on /pgdata/needs-reindex-all. With the reindex taking hours, gating readiness here trips the operator's 30-minute deployment_ready_timeout and the restore is marked Failed before postgres even has a chance to come up. The probe still gates on /pgdata/needs-reindex (locale-only, finishes in seconds) since that's small and proven. Trade-off: clients hitting a not-yet-reindexed corrupt index between pod-Ready and reindex-complete get the explicit "unexpected zero page" error, retry, succeed once the rebuild lands. Strictly better than the alternative (permanently failed restore, replica stuck indefinitely on the previous Active). The pre-#68 behaviour for the locale-only path is unchanged.
1 parent 98e1000 commit 1f898f7

2 files changed

Lines changed: 54 additions & 100 deletions

File tree

src/controllers/restore/builders.rs

Lines changed: 31 additions & 68 deletions
Original file line numberDiff line numberDiff line change
@@ -1251,77 +1251,32 @@ if [ -f /pgdata/needs-reindex ] || [ -f /pgdata/needs-reindex-all ]; then
12511251
while ! pg_isready -q -U postgres -d postgres; do sleep 2; done
12521252
psql -U postgres -d postgres -c "UPDATE _pgro.restore_info SET stage = 'reindexing', last_transition_time = now() WHERE id = 1;"
12531253
# needs-reindex-all (pg_resetwal aftermath) can leave torn pages in
1254-
# ANY index, not just collation-dependent ones. A blind REINDEX
1255-
# DATABASE takes hours on the prod DBs; instead do a two-step
1256-
# smart pass:
1254+
# ANY index, not just collation-dependent ones. We tried a "smart
1255+
# pass" using the amcheck contrib extension (scan each btree, queue
1256+
# only the corrupt ones for REINDEX) — empirically that hits the
1257+
# same postgres-internal pathology that wedges other vanilla DDL on
1258+
# this dataset: bt_index_check itself burns 100% CPU forever on
1259+
# specific indexes with no visible progress, blocking the whole
1260+
# reindex behind it.
12571261
#
1258-
# 1. Use the amcheck contrib extension to read each valid btree
1259-
# index and verify its structural invariants. Indexes that
1260-
# fail the check (including "unexpected zero page" corruption)
1261-
# get queued for REINDEX. Indexes that pass are left alone.
1262-
# For a healthy snapshot this finds nothing and reads
1263-
# ~index-size of disk instead of ~table-size to rewrite.
1262+
# Fall back to blind REINDEX DATABASE. REINDEX reads the heap and
1263+
# rebuilds the index from scratch (different code path from amcheck,
1264+
# which reads the corrupt index pages directly) and so isn't subject
1265+
# to the same wedge. Slow on prod-sized DBs but it makes progress;
1266+
# the alternative was a permanently-stuck restore.
12641267
#
1265-
# 2. Blindly REINDEX every non-btree index in the DB (GIN, GiST,
1266-
# BRIN, hash). amcheck only covers btree; non-btree indexes
1267-
# are typically a small fraction of total index size so the
1268-
# cost is bounded, and skipping them risks the same
1269-
# post-resetwal corruption sneaking through.
1270-
#
1271-
# CONCURRENTLY when available (PG ≥ 14) so the work overlaps with
1272-
# whatever clients hit the pod after the readiness gate lifts.
1268+
# Crucially this branch does NOT remove needs-reindex-all at the
1269+
# top of the work — the readiness probe ignores -all for exactly
1270+
# this reason (see the probe spec below). The pod becomes Ready as
1271+
# soon as postgres accepts connections; clients hitting a not-yet-
1272+
# reindexed corrupt index get the explicit "unexpected zero page"
1273+
# error, retry, succeed once the rebuild lands. After REINDEX
1274+
# DATABASE completes for every user db, the flag is cleared and
1275+
# _pgro.restore_info.stage flips to ready.
12731276
if [ -f /pgdata/needs-reindex-all ]; then
12741277
for db in $(psql -U postgres -d postgres -At -c "SELECT datname FROM pg_database WHERE datallowconn AND datname <> 'template0'"); do
1275-
echo "Reindex after pg_resetwal: $db (smart pass via amcheck)"
1276-
psql -U postgres -d "$db" -c "CREATE EXTENSION IF NOT EXISTS amcheck;" 2>&1 || true
1277-
1278-
# Step 1: scan btree indexes; collect those that fail amcheck.
1279-
BTREE_INDEXES=$(psql -U postgres -d "$db" -At -c "
1280-
SELECT c.oid::regclass::text
1281-
FROM pg_class c
1282-
JOIN pg_am a ON a.oid = c.relam
1283-
JOIN pg_index i ON i.indexrelid = c.oid
1284-
WHERE c.relkind = 'i' AND a.amname = 'btree' AND i.indisvalid;
1285-
")
1286-
BTREE_COUNT=$(echo "$BTREE_INDEXES" | grep -c . || true)
1287-
echo " amcheck scanning $BTREE_COUNT btree indexes in $db"
1288-
CORRUPT_BTREE=""
1289-
N=0
1290-
for idx in $BTREE_INDEXES; do
1291-
[ -z "$idx" ] && continue
1292-
N=$((N + 1))
1293-
# bt_index_check raises an error if the index is corrupt.
1294-
# Suppress its output and check exit code; an error → queue it.
1295-
if ! psql -U postgres -d "$db" -At -c "SELECT bt_index_check('$idx'::regclass);" > /dev/null 2>&1; then
1296-
echo " [$N/$BTREE_COUNT] CORRUPT: $db: $idx"
1297-
CORRUPT_BTREE="$CORRUPT_BTREE $idx"
1298-
fi
1299-
done
1300-
1301-
# Step 2: list non-btree indexes for blind reindex.
1302-
NONBTREE_INDEXES=$(psql -U postgres -d "$db" -At -c "
1303-
SELECT c.oid::regclass::text
1304-
FROM pg_class c
1305-
JOIN pg_am a ON a.oid = c.relam
1306-
JOIN pg_index i ON i.indexrelid = c.oid
1307-
WHERE c.relkind = 'i' AND a.amname <> 'btree' AND i.indisvalid;
1308-
")
1309-
NONBTREE_COUNT=$(echo "$NONBTREE_INDEXES" | grep -c . || true)
1310-
1311-
TO_REINDEX="$CORRUPT_BTREE $NONBTREE_INDEXES"
1312-
TOTAL=$(echo "$TO_REINDEX" | tr ' ' '\n' | grep -c . || true)
1313-
echo " $db: $TOTAL indexes to REINDEX (corrupt btree + all non-btree=$NONBTREE_COUNT)"
1314-
N=0
1315-
for idx in $TO_REINDEX; do
1316-
[ -z "$idx" ] && continue
1317-
N=$((N + 1))
1318-
echo " [$N/$TOTAL] $db: REINDEX $idx"
1319-
if [ "$PG_MAJOR" -ge 14 ]; then
1320-
psql -U postgres -d "$db" -c "REINDEX INDEX CONCURRENTLY $idx;" 2>&1 || true
1321-
else
1322-
psql -U postgres -d "$db" -c "REINDEX INDEX $idx;" 2>&1 || true
1323-
fi
1324-
done
1278+
echo "Reindex after pg_resetwal: $db (REINDEX DATABASE)"
1279+
psql -U postgres -d "$db" -c "REINDEX DATABASE \"$db\";" 2>&1 || true
13251280
done
13261281
rm -f /pgdata/needs-reindex-all
13271282
# needs-reindex (collation-dependent only) is a strict subset of
@@ -1397,7 +1352,15 @@ exec postgres -D /pgdata/pgdata ${PGRO_LOG_LEVEL:+-c log_min_messages=$PGRO_LOG_
13971352
command: Some(vec![
13981353
"/bin/sh".to_string(),
13991354
"-c".to_string(),
1400-
"pg_isready -U postgres -d postgres && [ ! -f /pgdata/needs-reindex ] && [ ! -f /pgdata/needs-reindex-all ]".to_string(),
1355+
// Gate readiness on the locale-only needs-reindex flag
1356+
// (small, fast, finishes in seconds-to-minutes) but NOT on
1357+
// needs-reindex-all (post-pg_resetwal blind REINDEX DATABASE
1358+
// — takes hours on prod-sized indexes; gating here would
1359+
// trip the operator's deployment_ready_timeout). The -all
1360+
// reindex runs in the background; clients hitting a
1361+
// not-yet-reindexed corrupt index see the explicit
1362+
// "unexpected zero page" error and retry.
1363+
"pg_isready -U postgres -d postgres && [ ! -f /pgdata/needs-reindex ]".to_string(),
14011364
]),
14021365
}),
14031366
initial_delay_seconds: Some(5),

src/controllers/restore/tests.rs

Lines changed: 23 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -740,13 +740,12 @@ fn deployment_init_script_flags_full_reindex_after_resetwal() {
740740
#[test]
741741
fn deployment_runtime_reindex_handles_full_database_flag() {
742742
// The main container's startup hook handles the broad flag
743-
// (needs-reindex-all) via amcheck: scan all btree indexes, REINDEX
744-
// only those that fail the structural check (caught the
745-
// "unexpected zero page" failure mode reported in prod) plus all
746-
// non-btree indexes (amcheck only covers btree, so non-btree get
747-
// a blind REINDEX). A blind REINDEX DATABASE would take hours on
748-
// the prod size; the smart pass reads index-size and rewrites
749-
// only the corrupt ones.
743+
// (needs-reindex-all) via blind REINDEX DATABASE on every user DB.
744+
// The earlier amcheck-driven smart pass turned out to hit the same
745+
// postgres-internal pathology that wedges other vanilla DDL on
746+
// the prod data — bt_index_check burned 100% CPU forever on
747+
// individual indexes. REINDEX uses a different code path (reads
748+
// the heap, rebuilds from scratch) and doesn't trip that wedge.
750749
let (mut restore, replica) = test_restore_and_replica();
751750
restore.status = Some(PostgresPhysicalRestoreStatus {
752751
postgres_version: Some("16".to_string()),
@@ -770,24 +769,12 @@ fn deployment_runtime_reindex_handles_full_database_flag() {
770769
"runtime startup hook must check the needs-reindex-all flag"
771770
);
772771
assert!(
773-
script.contains("CREATE EXTENSION IF NOT EXISTS amcheck"),
774-
"needs-reindex-all branch must enable amcheck for the structural scan"
772+
script.contains("REINDEX DATABASE"),
773+
"needs-reindex-all branch must run REINDEX DATABASE for every user DB"
775774
);
776775
assert!(
777-
script.contains("bt_index_check("),
778-
"needs-reindex-all branch must call bt_index_check on each btree index"
779-
);
780-
assert!(
781-
script.contains("a.amname <> 'btree'"),
782-
"needs-reindex-all branch must collect non-btree indexes for blind REINDEX"
783-
);
784-
assert!(
785-
script.contains("REINDEX INDEX"),
786-
"needs-reindex-all branch must REINDEX (targeted, not the whole DB)"
787-
);
788-
assert!(
789-
!script.contains("REINDEX DATABASE"),
790-
"needs-reindex-all must not REINDEX DATABASE — it's the expensive hammer the smart pass avoids"
776+
!script.contains("bt_index_check("),
777+
"runtime hook must not call bt_index_check — amcheck wedges on the prod data"
791778
);
792779
assert!(
793780
script.contains("rm -f /pgdata/needs-reindex-all"),
@@ -796,11 +783,15 @@ fn deployment_runtime_reindex_handles_full_database_flag() {
796783
}
797784

798785
#[test]
799-
fn deployment_readiness_probe_waits_for_full_reindex() {
800-
// The readiness probe gates traffic on both reindex flags being
801-
// absent, otherwise queries see torn-page indexes between pod-start
802-
// and reindex-complete. The probe used to check only the narrow
803-
// needs-reindex flag.
786+
fn deployment_readiness_probe_only_gates_on_locale_reindex() {
787+
// The readiness probe waits for the locale-only `needs-reindex`
788+
// flag to clear (small, fast, finishes in seconds) but NOT for
789+
// `needs-reindex-all` (the post-pg_resetwal blind REINDEX DATABASE
790+
// — takes hours on prod-sized indexes; gating here would trip the
791+
// operator's deployment_ready_timeout and block restores
792+
// indefinitely). The -all reindex runs in the background; clients
793+
// hitting a not-yet-reindexed corrupt index see the explicit
794+
// "unexpected zero page" error and can retry.
804795
let (mut restore, replica) = test_restore_and_replica();
805796
restore.status = Some(PostgresPhysicalRestoreStatus {
806797
postgres_version: Some("16".to_string()),
@@ -826,12 +817,12 @@ fn deployment_readiness_probe_waits_for_full_reindex() {
826817
.expect("exec probe must have a command");
827818
let probe_script = probe_cmd.join(" ");
828819
assert!(
829-
probe_script.contains("[ ! -f /pgdata/needs-reindex-all ]"),
830-
"readiness probe must wait for needs-reindex-all to clear; got: {probe_script}"
820+
probe_script.contains("[ ! -f /pgdata/needs-reindex ]"),
821+
"readiness probe must still wait for the locale-only needs-reindex flag; got: {probe_script}"
831822
);
832823
assert!(
833-
probe_script.contains("[ ! -f /pgdata/needs-reindex ]"),
834-
"readiness probe must still wait for needs-reindex; got: {probe_script}"
824+
!probe_script.contains("needs-reindex-all"),
825+
"readiness probe must NOT gate on needs-reindex-all (the long-running post-resetwal reindex); got: {probe_script}"
835826
);
836827
}
837828

0 commit comments

Comments
 (0)