fix(operator): bound the schema-migration timeout cleanup to 60s #62
Merged
The timeout path added in v0.3.22 calls DROP SCHEMA … CASCADE on the new restore to clear stale persistent_schemas before switchover. But that DROP can hang indefinitely if a backend on the target has the schema's namespace locked, which is precisely the failure mode that tripped the budget in the first place (e.g. a CREATE TABLE spinning at 100% CPU and ignoring SIGTERM). The cleanup queued behind that lock, wedging the switchover the cleanup was meant to enable.
Wrap the DROP cleanup in tokio::time::timeout (60s). If it doesn't complete in time (either errors or times out), log it and proceed with the switchover anyway: leftover schemas in the target are strictly better than the replica never coming up. The next restore cycle will re-attempt migration with a fresh target PVC and (typically) clear the leftovers as part of pre-migration prep.
Restructures the cleanup into a separate drop_persistent_schemas_on_target helper so the timeout wraps a single async future.
Summary
Follow-up to #61. The timeout-skip path runs `DROP SCHEMA … CASCADE` to clear stale persistent schemas on the new restore before switchover, but the DROP itself can hang on the same lock the original migration was waiting on. This happens when the target restore has a backend stuck on the schema's namespace (e.g. a `CREATE TABLE` spinning at 100% CPU and ignoring SIGTERM), which is the exact failure mode that tripped the budget in the first place.
Result observed in prod: the budget fires at T+72min, the operator cancels the migration Job and starts cleanup, the cleanup's DROP queues behind the wedged backend, reconcile blocks on the network query, and the switchover never happens. We've traded one indefinite wedge for another.
Fix
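Wrap the DROP cleanup in `tokio::time::timeout(60s)`. If it doesn't complete, either because the DROP returns an error or because the timeout fires, the operator logs the failure and moves on.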
In both cases the switchover still happens. The replica comes up with the leftover schemas still in the database — which is strictly better than not coming up at all. The next restore cycle starts with a fresh target PVC where the pre-migration prep can clear the leftovers normally.
Shape
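A minimal sketch of this control flow, not the operator's actual code. The real change wraps an async helper in `tokio::time::timeout`; to stay dependency-free, this version substitutes a worker thread and `recv_timeout` from the standard library. `CleanupOutcome`, `bounded_cleanup`, and `cleanup_then_switchover` are hypothetical names, and the closures stand in for the DROP helper and the switchover step.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical result type: the three ways the bounded cleanup can end.
#[derive(Debug, PartialEq)]
enum CleanupOutcome {
    Dropped,
    Failed(String),
    TimedOut,
}

/// Runs `cleanup` on a worker thread and waits at most `budget` for it.
/// Thread-based stand-in for wrapping one async future in a timeout.
fn bounded_cleanup<F>(budget: Duration, cleanup: F) -> CleanupOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails if the caller already gave up; that is fine.
        let _ = tx.send(cleanup());
    });
    match rx.recv_timeout(budget) {
        Ok(Ok(())) => CleanupOutcome::Dropped,
        Ok(Err(e)) => CleanupOutcome::Failed(e),
        Err(RecvTimeoutError::Timeout) => CleanupOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => {
            CleanupOutcome::Failed("cleanup worker exited without a result".to_string())
        }
    }
}

/// Attempts the cleanup, logs the outcome, and runs `switchover` regardless.
fn cleanup_then_switchover<F, S>(budget: Duration, cleanup: F, switchover: S) -> CleanupOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
    S: FnOnce(),
{
    let outcome = bounded_cleanup(budget, cleanup);
    match &outcome {
        CleanupOutcome::Dropped => println!("stale schemas dropped"),
        CleanupOutcome::Failed(e) => eprintln!("cleanup failed: {}; proceeding", e),
        CleanupOutcome::TimedOut => eprintln!("cleanup timed out after {:?}; proceeding", budget),
    }
    // The switchover is not gated on the cleanup outcome.
    switchover();
    outcome
}
```

Calling `cleanup_then_switchover` with a cleanup closure that sleeps longer than the budget yields `CleanupOutcome::TimedOut`, and the switchover closure still runs. Only the log line differs between the three outcomes.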
Note on leaked server-side work
When `tokio::time::timeout` fires, the dropped future stops awaiting but the server-side query continues queued on the lock — this is a small leak. In practice the wedged backend that's holding the lock gets cleaned up by the next restore cycle's pod restart, so the leaked session goes with it. Worth knowing but not worth solving with a server-side cancel right now.
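The leak can be illustrated with a small analogy, again using standard-library threads rather than tokio and a real database. The worker thread plays the role of the server-side query: the caller stops waiting once the budget elapses, but the work itself is not cancelled and finishes later. `give_up_after` and the `finished` flag are hypothetical names for this sketch only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Waits at most `budget` for work that takes `work_time`.
/// Returns true if the caller gave up before the work finished.
/// The worker sets `finished` whenever it eventually completes.
fn give_up_after(budget: Duration, work_time: Duration, finished: Arc<AtomicBool>) -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        // Stands in for the DROP still queued on the lock.
        thread::sleep(work_time);
        finished.store(true, Ordering::SeqCst);
        // The receiver may be gone already; ignore the send error.
        let _ = tx.send(());
    });
    // Giving up only stops the waiting; the worker keeps running.
    rx.recv_timeout(budget).is_err()
}
```

With a budget shorter than the work time, `give_up_after` returns `true` while `finished` is still `false`, and the flag flips to `true` only later. That mirrors the session that stays queued on the lock after the operator has moved on.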