fix(operator): bound the schema-migration timeout cleanup to 60s by passcod · Pull Request #62 · beyondessential/postgres-restore-operator

passcod · 2026-06-05T13:17:04Z

🤖

Summary

Follow-up to #61. The timeout-skip path runs `DROP SCHEMA … CASCADE` to clear stale persistent schemas on the new restore before switchover, but the DROP itself can hang on the same lock the original migration was waiting on — when the target restore has a backend stuck on the schema's namespace (e.g. a CREATE TABLE spinning at 100% CPU and ignoring SIGTERM, the exact failure mode that tripped the budget in the first place).

Result observed in prod: the budget fires at T+72min, the operator cancels the migration Job and starts cleanup, the cleanup's DROP queues behind the wedged backend, reconcile blocks on the network query, and the switchover never happens. We've traded one indefinite wedge for another.

Fix

Wrap the DROP cleanup in `tokio::time::timeout(60s)`. If it doesn't complete successfully:

- `Ok(Err(e))`: log the error and proceed.
- `Err(_)` (timeout fired): log the timeout and proceed.

In both cases the switchover still happens. The replica comes up with the leftover schemas still in the database — which is strictly better than not coming up at all. The next restore cycle starts with a fresh target PVC where the pre-migration prep can clear the leftovers normally.
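
For illustration, a minimal sketch of the three-way outcome handling described above. `tokio::time::timeout(..).await` yields a nested `Result<Result<(), E>, Elapsed>`; the function below is generic over both error types so it stands alone without tokio. `CleanupOutcome`, `classify_cleanup` and `proceed_with_switchover` are hypothetical names, not the operator's actual code.

```rust
use std::fmt::Display;

/// Hypothetical summary of what the bounded cleanup amounted to.
#[derive(Debug, PartialEq)]
enum CleanupOutcome {
    /// The DROP finished within the cap.
    Dropped,
    /// The DROP returned an error within the cap.
    Failed(String),
    /// The cap elapsed before the DROP returned.
    TimedOut,
}

/// Maps the nested result of a timeout-wrapped cleanup to an outcome.
/// With tokio, `T` would be `tokio::time::error::Elapsed`.
fn classify_cleanup<E: Display, T>(outcome: Result<Result<(), E>, T>) -> CleanupOutcome {
    match outcome {
        Ok(Ok(())) => CleanupOutcome::Dropped,
        Ok(Err(e)) => CleanupOutcome::Failed(e.to_string()),
        Err(_) => CleanupOutcome::TimedOut,
    }
}

/// Logs the non-success cases and always lets the switchover go ahead.
fn proceed_with_switchover(outcome: &CleanupOutcome) -> bool {
    match outcome {
        CleanupOutcome::Dropped => {}
        CleanupOutcome::Failed(msg) => eprintln!("schema cleanup failed, proceeding: {msg}"),
        CleanupOutcome::TimedOut => eprintln!("schema cleanup timed out, proceeding"),
    }
    true
}
```

The point of keeping the decision separate from the logging is that no branch can block the switchover: the return value is `true` on every path.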

Shape

- Extracted `drop_persistent_schemas_on_target` from the inline cleanup so the timeout wraps a single async future.
- 60-second cap. Generous enough that any cleanup that can complete still does; short enough that the switchover doesn't drag on noticeably when it can't.

Note on leaked server-side work

When `tokio::time::timeout` fires, the dropped future stops awaiting but the server-side query continues queued on the lock — this is a small leak. In practice the wedged backend that's holding the lock gets cleaned up by the next restore cycle's pod restart, so the leaked session goes with it. Worth knowing but not worth solving with a server-side cancel right now.
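
The leak can be shown with a plain-threads analogy. This is not the operator's code and does not use tokio: it swaps in `std::sync::mpsc::Receiver::recv_timeout` as the bounded wait, with a sleeping thread standing in for the server-side query. `run_bounded` and `demo_leak` are hypothetical names, and the durations are arbitrary.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs `work` on another thread and waits at most `cap` for its result.
/// Returns `None` if the wait is abandoned; the work itself is NOT cancelled.
fn run_bounded<T, F>(cap: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller has already given up.
        let _ = tx.send(work());
    });
    rx.recv_timeout(cap).ok()
}

/// Returns (what the caller saw, whether the abandoned work still finished).
fn demo_leak() -> (Option<()>, bool) {
    let finished = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&finished);
    let seen = run_bounded(Duration::from_millis(20), move || {
        thread::sleep(Duration::from_millis(200)); // stand-in for the slow query
        flag.store(true, Ordering::SeqCst);
    });
    // Give the abandoned work time to finish so the leak can be observed.
    thread::sleep(Duration::from_millis(400));
    (seen, finished.load(Ordering::SeqCst))
}
```

The caller stops waiting once the cap elapses, but nothing tells the worker to stop. In this analogy the abandoned work eventually finishes on its own; in the case above the query stays queued on the lock until something on the server side ends it.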

The timeout path added in v0.3.22 calls DROP SCHEMA … CASCADE on the new restore to clear stale persistent_schemas before switchover. But that DROP can hang indefinitely if a backend on the target has the schema's namespace locked, which is precisely the failure mode that tripped the budget in the first place (e.g. a CREATE TABLE spinning at 100% CPU and ignoring SIGTERM). The cleanup queued behind that lock, wedging the switchover the cleanup was meant to enable.

Wrap the DROP cleanup in tokio::time::timeout (60s). If it doesn't complete in time (either errors or times out), log it and proceed with the switchover anyway: leftover schemas in the target are strictly better than the replica never coming up. The next restore cycle will re-attempt migration with a fresh target PVC and (typically) clear the leftovers as part of pre-migration prep.

Restructures the cleanup into a separate drop_persistent_schemas_on_target helper so the timeout wraps a single async future.

passcod enabled auto-merge June 5, 2026 13:20

passcod merged commit 551c347 into main Jun 5, 2026
18 checks passed

passcod deleted the operator-schema-migration-timeout-cleanup-bounded branch June 5, 2026 13:26
