fix(kotsadm): add wait-for-rqlite init container before schemahero migrations #5925
Draft
kriscoleman wants to merge 2 commits into
When kotsadm and rqlite restart simultaneously (e.g., during EC upgrades), schemahero-plan runs before rqlite accepts connections, causing CrashLoopBackOff with "tried all peers unsuccessfully".

Insert a wait-for-rqlite init container at position 0 in both the Deployment and StatefulSet that polls http://kotsadm-rqlite:4001/readyz until rqlite reports ready. This prevents the race between schemahero and rqlite startup.

Affects both KotsadmDeployment() and KotsadmStatefulSet() init container lists.

Closes replicated-collab/netbox-replicated#149
Ref: replicated-collab/netbox-replicated#148
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
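To illustrate the "position 0" ordering described in this commit, here is a minimal sketch in Go. It is not the PR's code: `Container` is a local stand-in for `corev1.Container` so the example compiles without the Kubernetes packages, `prependInitContainer` is a hypothetical helper name, and `hypothetical-later-step` is a placeholder for whatever init containers follow schemahero-plan.

```go
package main

import "fmt"

// Container is a local stand-in for corev1.Container (assumption: only the
// name matters for demonstrating ordering).
type Container struct {
	Name string
}

// prependInitContainer (hypothetical helper) returns a new slice with c at
// index 0, followed by the existing init containers in their original order.
// The caller's slice is not modified.
func prependInitContainer(c Container, existing []Container) []Container {
	out := make([]Container, 0, len(existing)+1)
	out = append(out, c)
	return append(out, existing...)
}

func main() {
	existing := []Container{
		{Name: "schemahero-plan"},
		{Name: "hypothetical-later-step"}, // placeholder, not from the PR
	}
	ordered := prependInitContainer(Container{Name: "wait-for-rqlite"}, existing)

	// Kubernetes runs init containers sequentially in list order, so index 0
	// must complete before schemahero-plan is started.
	for i, c := range ordered {
		fmt.Println(i, c.Name)
	}
}
```

Because init containers run one at a time in list order, placing the wait step first is enough to gate schemahero-plan; no change to schemahero-plan itself is needed.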
Looks like we have some CI checks here to look into.
Address review findings:
- Add 5-minute timeout to wait loop so rqlite failures surface as a clear init error rather than an indefinite hang
- Use kotsadm-migrations image (lighter, already pulled for schemahero)
- Replace private repo link with Shortcut story reference
- Add unit tests for waitForRqliteInitContainer() and verify init container ordering in KotsadmDeployment()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds a wait-for-rqlite init container at position 0 in the kotsadm Deployment and StatefulSet, before schemahero-plan. This prevents the race condition where schemahero tries to connect to rqlite before it's accepting connections, causing CrashLoopBackOff during EC upgrades.

Problem
When kotsadm and rqlite restart simultaneously (e.g., during Embedded Cluster upgrades), schemahero-plan runs before rqlite is ready. It exits non-zero immediately, and Kubernetes retries with CrashLoopBackOff (exponential backoff: 10s, 20s, 40s...). This self-heals eventually but adds unnecessary delay and confusing error messages.

Fix
Insert a lightweight init container that polls http://kotsadm-rqlite:4001/readyz in a loop before schemahero-plan starts. Uses the existing kotsadm image (already pulled). Applied to both KotsadmDeployment() and KotsadmStatefulSet() in pkg/kotsadm/objects/kotsadm_objects.go.

Validation
Reproduced and validated on CMX EC cluster:
Shortcut: sc-138103