
Multicluster resiliency and documentation #1440

Open
andrewstucki wants to merge 1 commit into as/kubeconfig-raft-cache from as/operator-failure-modes-and-docs

Conversation

@andrewstucki
Contributor

Note: this is another stacked PR

Summary

Bugfix

  • Concurrent cluster engagement — each peer cluster's engagement (connecting, starting informers, registering watch handlers) now runs in its own goroutine. Previously a single slow or unreachable cluster could block all others and prevent the reconciler from starting. Engagements that fail are retried after ten seconds.

Changes

  • checkSpecConsistency no longer hard-errors on unreachable clusters — previously, if any peer cluster was unreachable during the spec consistency check, reconciliation was blocked entirely. Now unreachable clusters are skipped with a log warning; reconciliation continues against the clusters that are available. A new ClusterUnreachable condition reason (added to statuses.yaml and regenerated) distinguishes a partial check from confirmed drift or a genuine error. Confirmed drift on a reachable cluster still blocks reconciliation as before. This change mainly addresses failure modes where a region (Kubernetes cluster) becomes unavailable: rather than blocking reconciliation, we still do as much as we can in case remediation is underway on the affected cluster.

Additions

  • Architecture documentation (docs/multicluster-operator.md) — new doc covering leader election, cluster discovery, kubeconfig caching, controller startup sequence, StretchCluster reconciliation phases, all failure modes (undeployed peer, leader crash, worker node failure with PVC unbinder / Redpanda 26.1 ghost broker ejection, API server unreachable vs full infrastructure loss, spec drift), status condition reference, and mTLS bootstrapping via the rpk plugin including certificate rotation.

@andrewstucki force-pushed the as/operator-failure-modes-and-docs branch from 1af6fbd to 936cdb6 on April 10, 2026 18:02
Contributor

@david-yu left a comment


1. Missing failure mode: Quorum loss

The doc covers leader crash (new leader elected) and unreachable API server (reconciliation continues on reachable clusters), but doesn't address what happens when quorum is lost — e.g., 2-of-3 operator clusters are down simultaneously.

Users need to know:

  • Redpanda brokers keep running. Operator leader election only affects the operator's ability to reconcile. Existing broker pods, StatefulSets, Services, and Redpanda partition replication all continue functioning because they're managed by kubelet and the Kubernetes control plane, not the operator.
  • No new leader can be elected until a majority of operator nodes are reachable again.
  • No data loss occurs — this is purely a management-plane outage.

Suggest adding a "Quorum loss" section under Failure Modes that makes this explicit. This is probably the number-one question users will have.

2. "What keeps working without the operator" — missing section

Related to the above, a short section listing what continues to function when the operator has no leader would be very helpful:

  • Keeps working: broker pods, partition replication, client produce/consume, topic operations via admin API
  • Stops working: scaling, configuration changes, decommissioning, status updates, new cluster engagement

This is referenced implicitly in the API server unreachable section but deserves its own callout.

3. Leader crash mid-reconciliation: idempotency guarantee?

docs/multicluster-operator.md line 96 says the new leader "resumes normal reconciliation" but doesn't address what happens if the crash occurs between reconciliation phases. For example:

  • StatefulSet scaled down (phase 2) but broker not yet decommissioned (phase 3)
  • Bootstrap user synced to 2-of-3 clusters but not the third

Are all reconciliation steps idempotent and safe to re-execute? If yes, a single sentence confirming that would close the question: "All reconciliation steps are idempotent — the new leader re-runs the full reconciliation loop and converges to the correct state regardless of where the previous leader crashed."

4. checkSpecConsistency — unreachable cluster could mask drift

multicluster_controller.go lines 257-270: When a cluster is unreachable, it's added to the unreachable list and skipped. If the unreachable cluster has a drifted spec, this won't be detected until it comes back. The ClusterUnreachable condition signals this to the user, which is good.

However, consider: when the unreachable cluster recovers, the next reconcile will detect the drift and block reconciliation on all clusters (including the ones that were working fine). This could be surprising — a cluster coming back online causes a worse state than it being down. The doc should mention this recovery behavior explicitly so users aren't caught off guard.

5. Recovery timeline — rough estimates would help operators

The doc describes the mechanics but not the timing. Even rough order-of-magnitude would help:

  • K8s lease re-election: ~15s (default leaseDuration)
  • Raft re-election: seconds after new raft participant joins
  • Kubeconfig cache read + cluster engagement: seconds
  • Total failover time in the common case: under a minute

6. raft.go concurrent engage — context lifecycle

raft.go line 672-685: The doEngage goroutines receive cancelCtx from wrapStart. When leadership is lost, cancelCtx is cancelled. A few questions:

  • If r.Engage(ctx, name, cluster) is blocked on WaitForCacheSync for an unreachable cluster, does the context cancellation unblock it promptly? Or could the goroutine leak until the cache sync times out?
  • The 10-second retry schedules broadcaster.notify() which re-invokes doEngage with the same context. If the context is already cancelled, the retry is a no-op (good). But if leadership is regained quickly (same node), does a new cancelCtx get created? Want to confirm there's no stale goroutine accumulation across leadership transitions.

7. Status write-back to unreachable clusters

The doc says status is propagated to "every cluster" (line 83). When a cluster is unreachable:

  • Does status write-back to the unreachable cluster fail silently, or does it error and block the reconcile?
  • Could the reachable clusters end up with updated status while the unreachable one retains stale status? If so, when it recovers, the stale status could confuse users.

8. Minor: cert rotation window

docs/multicluster-operator.md line 167: Re-running bootstrap generates a new CA. This means there's a window during rotation where nodes have different CAs and can't authenticate. The doc should note whether the rotation should be done atomically across all clusters, or if there's a graceful rollover mechanism (e.g., the operator accepts both old and new CA during transition).

One raft group spans all participating clusters. Each cluster contributes exactly one participant — the pod holding the local lease, or the single pod if there is no local election. The raft group elects a single leader from among these participants. Only the raft leader starts the reconciliation controllers and begins processing `StretchCluster` resources.

The raft group requires a quorum of `⌊n/2⌋ + 1` nodes to elect a leader and continue operating. For a three-cluster deployment, two clusters must be reachable. For five clusters, three must be reachable.
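The quorum arithmetic above is small enough to state in code; `quorum` here is an illustrative helper, not a function from the codebase.

```go
package main

import "fmt"

// quorum returns the minimum number of reachable raft participants needed
// to elect a leader and continue operating: floor(n/2) + 1.
func quorum(participants int) int {
	return participants/2 + 1
}

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("%d participants: quorum is %d\n", n, quorum(n))
	}
	// → 3 participants: quorum is 2
	// → 4 participants: quorum is 3
	// → 5 participants: quorum is 3
}
```

Note that an even participant count buys no extra fault tolerance: four clusters tolerate one failure, the same as three.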

Contributor


The quorum requirement is well explained here. Consider adding what happens when quorum is lost — users will read this and immediately wonder: "if 2-of-3 clusters go down, does my Redpanda cluster stop?" The answer is no (brokers keep running, only operator reconciliation pauses), and stating that here would be very reassuring.

Contributor Author


That's valid and something I don't mind adding just as a one-liner -- namely that an operator crash/inoperability doesn't affect an otherwise quiesced cluster.

@andrewstucki
Contributor Author

I'll add something about how the operator being up doesn't actually affect individual broker availability, and maybe add a "general recovery timeline" guideline. The one other thing to potentially document:

Status write-back to unreachable clusters

So we'll attempt to write back statuses to all clusters; if one of them fails, we just do the typical retry/error propagation, but it shouldn't affect the other stretch cluster statuses. Eventually, on recovery, this all catches back up.
