
Cache kubeconfigs on all raft members and fix race condition with re-engagement#1439

Merged
andrewstucki merged 3 commits into main from as/kubeconfig-raft-cache on Apr 14, 2026

Conversation

@andrewstucki
Contributor

@andrewstucki andrewstucki commented Apr 10, 2026

Note: this is a stacked PR that will be rebased against main once its base branch is merged

Problem

Kubeconfigs only exist in-memory on the raft leader. When a raft leader is elected, it fetches kubeconfigs from all peers via gRPC and holds them in memory. If that leader crashes, the new leader must reach every peer over gRPC before it can manage those clusters. If any peer's operator is also down (common during a rolling incident), the new leader cannot engage that cluster until the downed pod recovers -- a recovery dependency cycle.

Solution

Every raft member now runs a startupKubeconfigFetcher at startup that unconditionally fetches each peer's kubeconfig via gRPC and stores it as a Secret in its own local cluster (<kubeconfigName>-<peerName>). Failed peers are retried every 5 seconds. When a node becomes raft leader, it reads from the local Secret cache first and only falls back to a live gRPC call if the Secret is absent, writing the result back to the cache on success.

Additional fixes

Problem

Concurrent broadcaster notifications could be silently dropped. wrapStart spawned a goroutine that listened on the broadcaster channel and re-ran doEngage on each notification. If two notify() calls fired in rapid succession (e.g. two clusters added during the same bootstrap sweep), the second notification could be lost: the goroutine woke on the first close, replaced ch with the fresh open channel, ran doEngage, then re-read the channel — getting the already-replaced channel from the second notify(), which it would wait on indefinitely. One cluster would never be engaged after failover.

Solution

Extracted restartBroadcaster and the notification loop into broadcaster.go. The drain loop now snapshots the channel reference before calling fn. After fn returns, if the reference changed (a notify() fired during fn) it calls fn again, repeating until the reference stabilises. This guarantees no notification is dropped regardless of how many concurrent notify() calls overlap with an in-flight fn.

Tests

  • broadcaster_test.go — unit tests using testing/synctest to step goroutines deterministically through the concurrent-notification race.
  • TestIntegrationKubeconfigCaching — three envtest instances in bootstrap mode. Verifies every node writes a kubeconfig Secret for each peer, then kills the raft leader and verifies the new leader engages the former leader's cluster from the local Secret cache alone, with the former leader's gRPC transport offline.

@andrewstucki andrewstucki requested a review from hidalgopl as a code owner April 10, 2026 15:58
Base automatically changed from as/rpk-plugin-wiring to main April 14, 2026 18:20
@andrewstucki andrewstucki force-pushed the as/kubeconfig-raft-cache branch from d3b1a73 to a71835e on April 14, 2026 18:24
@andrewstucki andrewstucki merged commit d852c6f into main Apr 14, 2026
11 checks passed
@andrewstucki andrewstucki deleted the as/kubeconfig-raft-cache branch April 14, 2026 19:48
