
Cache kubeconfigs on all raft members and fix race condition with re-engagement#1439

Merged
andrewstucki merged 3 commits into main from as/kubeconfig-raft-cache on Apr 14, 2026

Conversation

@andrewstucki
Contributor

@andrewstucki andrewstucki commented Apr 10, 2026

Note: this is a stacked PR that will be rebased against main once its base branch is merged

Problem

Kubeconfigs only exist in-memory on the raft leader. When a raft leader is elected, it fetches kubeconfigs from all peers via gRPC and holds them in memory. If that leader crashes, the new leader must reach every peer over gRPC before it can manage those clusters. If any peer's operator is also down (common during a rolling incident), the new leader cannot engage that cluster until the downed pod recovers -- a recovery dependency cycle.

Solution

Every raft member now runs a startupKubeconfigFetcher at startup that unconditionally fetches each peer's kubeconfig via gRPC and stores it as a Secret in its own local cluster (<kubeconfigName>-<peerName>). Failed peers are retried every 5 seconds. When a node becomes raft leader, it reads from the local Secret cache first and only falls back to a live gRPC call if the Secret is absent, writing the result back to the cache on success.

Additional fixes

Problem

Concurrent broadcaster notifications could be silently dropped. wrapStart spawned a goroutine that listened on the broadcaster channel and re-ran doEngage on each notification. If two notify() calls fired in rapid succession (e.g. two clusters added during the same bootstrap sweep), the second notification could be lost: the goroutine woke on the first close, replaced ch with the fresh open channel, ran doEngage, then re-read the channel — getting the already-replaced channel from the second notify(), which it would wait on indefinitely. One cluster would never be engaged after failover.

Solution

Extracted restartBroadcaster and the notification loop into broadcaster.go. The drain loop now snapshots the channel reference before calling fn. After fn returns, if the reference changed (a notify() fired during fn) it calls fn again, repeating until the reference stabilises. This guarantees no notification is dropped regardless of how many concurrent notify() calls overlap with an in-flight fn.

Tests

  • broadcaster_test.go — unit tests using testing/synctest to step goroutines deterministically through the concurrent-notification race.
  • TestIntegrationKubeconfigCaching — three envtest instances in bootstrap mode. Verifies every node writes a kubeconfig Secret for each peer, then kills the raft leader and verifies the new leader engages the former leader's cluster from the local Secret cache alone, with the former leader's gRPC transport offline.

@andrewstucki andrewstucki requested a review from hidalgopl as a code owner April 10, 2026 15:58
Base automatically changed from as/rpk-plugin-wiring to main April 14, 2026 18:20
@andrewstucki andrewstucki force-pushed the as/kubeconfig-raft-cache branch from d3b1a73 to a71835e on April 14, 2026 18:24
@andrewstucki andrewstucki merged commit d852c6f into main Apr 14, 2026
11 checks passed
@andrewstucki andrewstucki deleted the as/kubeconfig-raft-cache branch April 14, 2026 19:48
