Skip to content

fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY#69

Merged
passcod merged 1 commit into
mainfrom
operator-reindex-no-readiness-gate
Jun 10, 2026
Merged

fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY#69
passcod merged 1 commit into
mainfrom
operator-reindex-no-readiness-gate

Conversation

@passcod

@passcod passcod commented Jun 10, 2026

Copy link
Copy Markdown
Member

🤖

Summary

Two changes that together unblock nauru-prod (and any replica whose snapshot triggers pg_resetwal):

  1. Drop the amcheck-driven smart pass for needs-reindex-all; replace it with a blind REINDEX DATABASE per user DB — CONCURRENTLY on PG ≥ 12 so clients can keep using the old indexes during the rebuild.
  2. Drop the readiness probe's gate on /pgdata/needs-reindex-all. Keep the gate on the locale-only /pgdata/needs-reindex (small, fast, proven).

Why

Verified live on nauru-prod after #68 deployed. Every restore attempt failed at the operator's 30-minute deployment_ready_timeout. Tracing inside a fresh pod:

  • /pgdata/needs-reindex-all flag present (pg_resetwal did fire — expected on this source).
  • pg_stat_activity shows the background reindex hook running.
  • bt_index_check('pg_type_oid_index'::regclass) (a tiny system catalog index, normally a few KB) — state active, empty wait_event, query_start growing linearly with wall clock. 4+ minutes on that one index with no visible progress.

Same shape as the CREATE TABLE spinloop documented earlier on this dataset: vanilla DDL pegs 100% CPU forever inside postgres with no visible reason. The smart pass via amcheck can't avoid the hammer — it IS a hammer that triggers the same wedge.

REINDEX DATABASE CONCURRENTLY reads the heap (not the index) to rebuild from scratch — a completely different code path that isn't subject to whatever amcheck is hitting in the corrupt index pages. PG ≥ 12 syntax; falls back to blocking REINDEX DATABASE on older.

But REINDEX DATABASE on the prod-sized DBs takes hours. Gating readiness on it makes deployment_ready_timeout (30 min default) trip every time and the restore fails. So the second change is necessary: let the pod become Ready as soon as postgres accepts connections; let the reindex run in background. CONCURRENTLY means the corrupt indexes stay queryable until the atomic swap, narrowing the window where clients can see corruption.

Caveats

  • REINDEX DATABASE CONCURRENTLY skips system catalogs (PG won't CONCURRENTLY them). For an analytics replica that's the right trade — user-data indexes are what client queries hit; system catalog corruption surfaces as different errors and is rare.

Trade-off

Clients hitting a not-yet-reindexed corrupt index get the explicit unexpected zero page at block N error from postgres — same error the downstream users originally reported. With CONCURRENTLY the window is narrow (queries hit the old index until the atomic swap); they retry, succeed once the rebuild lands.

Strictly better than the alternative (permanently-failed restore, replica stuck indefinitely on the previous Active).

Shape

  • builders.rs: the needs-reindex-all branch in the postgres container startup hook replaces the amcheck loop with REINDEX DATABASE CONCURRENTLY "$db" per user database (with a PG < 12 fallback). Comment explains the wedge and why amcheck is wrong here.
  • builders.rs: readiness probe drops the needs-reindex-all check. Comment explains.
  • Tests: updated assertions — runtime hook uses REINDEX DATABASE CONCURRENTLY, gates on PG_MAJOR ≥ 12, no bt_index_check; readiness probe gates only on the locale-only flag.

Recovery

For replicas already in the failure loop (pre-fix gate), the readiness probe will start returning success on its next check after the operator rolls the new image (the needs-reindex-all file may still be there, but the probe no longer cares). The background reindex keeps churning. For the current live nauru-prod pod, I removed the flag by hand to unblock immediately — the next restore will exercise the new code path.

…SE CONCURRENTLY

The amcheck-driven smart pass introduced in #68 hits the same family
of postgres-internal pathology that wedges other vanilla DDL on the
prod dataset — bt_index_check itself burns 100% CPU forever on
specific indexes (observed via pg_stat_activity: state=active,
empty wait_event, query_start growing linearly with wall clock for
4+ minutes on a tiny system catalog index). The smart pass is
unusable on this data.

Two changes:

1. Drop bt_index_check; revert the needs-reindex-all branch to a
   blind REINDEX DATABASE per user DB, CONCURRENTLY on PG ≥ 12 so
   clients can keep using the old indexes during the rebuild and
   the atomic swap makes corruption disappear without downtime.
   REINDEX reads the heap and rebuilds the index from scratch — a
   different code path from amcheck (which reads corrupt index
   pages directly) — so it isn't subject to the same wedge. Slow
   on prod-sized DBs (hours) but makes progress.

2. Drop the readiness probe's gate on /pgdata/needs-reindex-all.
   With the reindex taking hours, gating readiness here trips the
   operator's 30-minute deployment_ready_timeout and the restore is
   marked Failed before postgres even has a chance to come up. The
   probe still gates on /pgdata/needs-reindex (locale-only,
   finishes in seconds) since that's small and proven.

Trade-off: clients hitting a not-yet-reindexed corrupt index in the
window between pod-Ready and reindex-complete get the explicit
"unexpected zero page" error from postgres. With CONCURRENTLY the
window is narrow (queries hit the old index until the atomic swap)
and clients can retry. Strictly better than the alternative
(permanently failed restore, replica stuck indefinitely on the
previous Active).

The pre-#68 behaviour for the locale-only path is unchanged.
@passcod passcod force-pushed the operator-reindex-no-readiness-gate branch from 1f898f7 to 6987f64 Compare June 10, 2026 03:53
@passcod passcod changed the title fix(operator): un-gate readiness on needs-reindex-all, revert to blind REINDEX DATABASE fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY Jun 10, 2026
@passcod passcod enabled auto-merge June 10, 2026 03:59
@passcod passcod merged commit e9e0b9f into main Jun 10, 2026
18 checks passed
@passcod passcod deleted the operator-reindex-no-readiness-gate branch June 10, 2026 04:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant