
fix(kotsadm): add wait-for-rqlite init container before schemahero migrations#5925

Draft
kriscoleman wants to merge 2 commits into main from kriscoleman/fix-rqlite-readiness

Conversation

@kriscoleman
Member

@kriscoleman kriscoleman commented Jun 1, 2026

Summary

Adds a wait-for-rqlite init container at position 0 in the kotsadm Deployment and StatefulSet, ahead of schemahero-plan. This prevents a race in which schemahero tries to connect to rqlite before rqlite is accepting connections, which causes CrashLoopBackOff during Embedded Cluster (EC) upgrades.

Problem

When kotsadm and rqlite restart simultaneously (e.g., during Embedded Cluster upgrades), schemahero-plan runs before rqlite is ready. It exits non-zero immediately, and Kubernetes restarts it under CrashLoopBackOff with exponentially increasing delays (10s, 20s, 40s, ...). The pod eventually self-heals, but the backoff adds unnecessary delay and the failures produce confusing error messages.

Fix

Insert a lightweight init container that polls http://kotsadm-rqlite:4001/readyz in a loop before schemahero-plan starts. The container uses the existing kotsadm image, which is already pulled. The change is applied to both KotsadmDeployment() and KotsadmStatefulSet() in pkg/kotsadm/objects/kotsadm_objects.go.
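The ordering matters because Kubernetes runs init containers sequentially, and each must exit successfully before the next one starts. The sketch below illustrates the "position 0" idea only. `initContainer` is a simplified local stand-in for `corev1.Container` so the example has no Kubernetes dependencies, `waitForRqlite` and `prependInit` are hypothetical names, and the image strings are placeholders; it is not the code in kotsadm_objects.go.

```go
package main

import "fmt"

// initContainer is a simplified stand-in for corev1.Container, kept local so
// this sketch compiles without the Kubernetes API packages.
type initContainer struct {
	Name  string
	Image string
}

// waitForRqlite builds the wait container (hypothetical helper). The image
// is passed in so an already-pulled image can be reused.
func waitForRqlite(image string) initContainer {
	return initContainer{Name: "wait-for-rqlite", Image: image}
}

// prependInit returns a new slice with c at index 0 followed by the existing
// init containers, leaving the input slice unmodified.
func prependInit(c initContainer, existing []initContainer) []initContainer {
	out := make([]initContainer, 0, len(existing)+1)
	out = append(out, c)
	return append(out, existing...)
}

func main() {
	// Placeholder image names, for illustration only.
	existing := []initContainer{{Name: "schemahero-plan", Image: "migrations-image"}}
	for i, c := range prependInit(waitForRqlite("kotsadm-image"), existing) {
		fmt.Printf("%d: %s\n", i, c.Name)
	}
}
```

Building a fresh slice rather than appending in place avoids accidentally mutating an init container list shared between the Deployment and StatefulSet builders.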

Validation

Reproduced and validated on CMX EC cluster:

  • Before fix: schemahero-plan CrashLoopBackOff'd with 2-3 restarts
  • After fix: schemahero-plan ran immediately with 0 restarts

Shortcut: sc-138103

When kotsadm and rqlite restart simultaneously (e.g., during EC
upgrades), schemahero-plan runs before rqlite accepts connections,
causing CrashLoopBackOff with "tried all peers unsuccessfully".

Insert a wait-for-rqlite init container at position 0 in both the
Deployment and StatefulSet that polls http://kotsadm-rqlite:4001/readyz
until rqlite reports ready. This prevents the race between schemahero
and rqlite startup.

Affects both KotsadmDeployment() and KotsadmStatefulSet() init
container lists.

Closes replicated-collab/netbox-replicated#149
Ref: replicated-collab/netbox-replicated#148

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@kriscoleman
Member Author

Looks like we have some CI checks here to look into.
I would also like to try an RC of this against my reproduction from the support issue this stemmed from. I'd love to see it fix that reproduction as a user acceptance test.

Address review findings:
- Add 5-minute timeout to wait loop so rqlite failures surface as a
  clear init error rather than an indefinite hang
- Use kotsadm-migrations image (lighter, already pulled for schemahero)
- Replace private repo link with Shortcut story reference
- Add unit tests for waitForRqliteInitContainer() and verify init
  container ordering in KotsadmDeployment()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>