fix(operator): bound the schema-migration timeout cleanup to 60s #62

Merged: passcod merged 1 commit into main from operator-schema-migration-timeout-cleanup-bounded on Jun 5, 2026
@passcod passcod commented Jun 5, 2026

Member

🤖

Summary

Follow-up to #61. The timeout-skip path runs `DROP SCHEMA … CASCADE` to clear stale persistent schemas on the new restore before switchover, but the DROP itself can hang on the same lock the original migration was waiting on. This happens when the target restore has a backend stuck on the schema's namespace (e.g. a CREATE TABLE spinning at 100% CPU and ignoring SIGTERM), which is the exact failure mode that tripped the budget in the first place.

Result observed in prod: the budget fires at T+72min, the operator cancels the migration Job and starts cleanup, the cleanup's DROP queues behind the wedged backend, reconcile blocks on the network query, and the switchover never happens. We've traded an indefinite wedge in the migration for an equally indefinite wedge in the cleanup.

Fix

Wrap the DROP cleanup in `tokio::time::timeout(60s)`. If it doesn't complete successfully:

  • `Ok(Err(e))` (cleanup returned an error): log the error and proceed.
  • `Err(_)` (timeout fired): log the timeout and proceed.

In both cases the switchover still happens. The replica comes up with the leftover schemas still in the database — which is strictly better than not coming up at all. The next restore cycle starts with a fresh target PVC where the pre-migration prep can clear the leftovers normally.
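
To make the three result shapes concrete, here is a minimal sketch of the classification step. It is not the operator's actual code: `CleanupOutcome`, `classify_cleanup`, and `log_line` are hypothetical names, and the outer error type is left generic so the snippet compiles with the standard library alone. In the operator, that outer error would be the elapsed-timeout error returned by awaiting `tokio::time::timeout` around `drop_persistent_schemas_on_target`.

```rust
use std::fmt::Display;

/// Hypothetical summary of how the bounded cleanup attempt ended.
#[derive(Debug, PartialEq)]
enum CleanupOutcome {
    /// The DROP finished within the cap.
    Cleaned,
    /// The DROP returned an error before the cap was reached.
    Failed(String),
    /// The cap was reached while the DROP was still pending.
    TimedOut,
}

impl CleanupOutcome {
    /// Message to log; every variant is followed by the switchover.
    fn log_line(&self) -> String {
        match self {
            CleanupOutcome::Cleaned => "cleanup done, proceeding".to_string(),
            CleanupOutcome::Failed(e) => format!("cleanup failed ({e}), proceeding anyway"),
            CleanupOutcome::TimedOut => "cleanup timed out, proceeding anyway".to_string(),
        }
    }
}

/// Classifies the nested result: the outer layer comes from the timeout
/// wrapper, the inner layer from the cleanup itself.
/// `Elapsed` is generic here (an assumption of this sketch) so that the
/// example does not depend on tokio.
fn classify_cleanup<E: Display, Elapsed>(
    result: Result<Result<(), E>, Elapsed>,
) -> CleanupOutcome {
    match result {
        Ok(Ok(())) => CleanupOutcome::Cleaned,
        Ok(Err(e)) => CleanupOutcome::Failed(e.to_string()),
        Err(_) => CleanupOutcome::TimedOut,
    }
}
```

The point of the shape is that no variant short-circuits the reconcile: the caller logs `log_line()` and continues to the switchover whichever outcome it gets.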

Shape

  • Extracted `drop_persistent_schemas_on_target` from the inline cleanup so the timeout wraps a single async future.
  • 60-second cap. Generous enough that any cleanup that can complete still does; short enough that the switchover doesn't drag on noticeably when it can't.

Note on leaked server-side work

When `tokio::time::timeout` fires, the dropped future stops awaiting, but the server-side query stays queued on the lock, so the session is leaked. In practice the wedged backend holding the lock gets cleaned up by the next restore cycle's pod restart, and the leaked session goes with it. Worth knowing, but not worth solving with a server-side cancel right now.
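
The leak can be illustrated with a loose standard-library analogy. This sketch swaps in a plain thread and `mpsc::Receiver::recv_timeout` in place of tokio and a database connection, and `run_bounded` and `demo_leaked_work` are hypothetical names. It only shows that abandoning the wait does not stop the work being waited on. Unlike the real DROP, which stays queued until the wedged backend goes away, the stand-in work here eventually finishes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs `work` on its own thread and waits at most `limit` for it.
/// Returns whether it finished in time, plus the handle of the thread,
/// which keeps running even if the caller stopped waiting.
fn run_bounded<F>(work: F, limit: Duration) -> (bool, thread::JoinHandle<()>)
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        work();
        // The receiver may already be gone; ignore the send error.
        let _ = tx.send(());
    });
    let finished_in_time = rx.recv_timeout(limit).is_ok();
    (finished_in_time, handle)
}

/// Returns (finished in time, done when the wait was abandoned, done after join).
fn demo_leaked_work() -> (bool, bool, bool) {
    let done = Arc::new(AtomicBool::new(false));
    let done_in_worker = Arc::clone(&done);
    let (finished_in_time, handle) = run_bounded(
        move || {
            // Stands in for the server-side query that outlives the caller's wait.
            thread::sleep(Duration::from_millis(300));
            done_in_worker.store(true, Ordering::SeqCst);
        },
        Duration::from_millis(20),
    );
    let done_at_timeout = done.load(Ordering::SeqCst);
    // Joining is only for the demo; the operator does not wait for leaked work.
    handle.join().expect("worker thread panicked");
    let done_after_join = done.load(Ordering::SeqCst);
    (finished_in_time, done_at_timeout, done_after_join)
}
```

Calling `demo_leaked_work()` reports that the wait was abandoned while the work was still pending, and that the work carried on regardless.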

The timeout path added in v0.3.22 calls DROP SCHEMA … CASCADE on the new
restore to clear stale persistent_schemas before switchover. But that
DROP can hang indefinitely if a backend on the target has the schema's
namespace locked — which is precisely the failure mode that tripped
the budget in the first place (e.g. a CREATE TABLE spinning at 100%
CPU and ignoring SIGTERM). The cleanup queued behind that lock,
wedging the switchover the cleanup was meant to enable.

Wrap the DROP cleanup in tokio::time::timeout (60s). If it errors or
does not complete in time, log it and proceed with the switchover
anyway: leftover schemas in the target are strictly better than the
replica never coming up. The next restore cycle will re-attempt
migration with a fresh target PVC and (typically) clear the leftovers
as part of pre-migration prep.

Restructures the cleanup into a separate drop_persistent_schemas_on_target
helper so the timeout wraps a single async future.

@passcod passcod enabled auto-merge June 5, 2026 13:20
@passcod passcod merged commit 551c347 into main Jun 5, 2026
18 checks passed
@passcod passcod deleted the operator-schema-migration-timeout-cleanup-bounded branch June 5, 2026 13:26