Fix redis split-brain after pod-0 restart during failover
When redis-redis-0 (the bootstrap pod) is deleted during a failover,
it restarts and tries to contact sentinel to find the current master.
Three problems caused it to fall through to the bootstrap path and
start a new independent master, creating a split-brain:
1. Single-try timeout: if sentinel was momentarily unreachable (e.g.
the sentinel container on pod-0 itself was still starting), the
3-second timeout expired and pod-0 immediately bootstrapped.
2. Headless service DNS: with PublishNotReadyAddresses: true, the
headless service DNS can resolve to pod-0's own IP, so redis-cli
connects to its own uninitialized sentinel instead of a peer.
3. Stale master identity: even when contacting a peer sentinel, it
may still report the restarting pod as master (within the
down-after-milliseconds window before failover completes).
Fix by adding a wait_for_master() function in common.sh that:
- Contacts each peer pod individually by FQDN (skipping self)
- Uses the REPLICAS env var to iterate only over actual peers
- Retries up to 10 times (30s total) before allowing bootstrap
- Rejects answers where the peer still thinks we are master
- Returns immediately if no peers are reachable at all (e.g. first
deployment), avoiding unnecessary delay on initial bootstrap
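
The behaviour listed above can be sketched roughly as follows. This is
an illustrative outline, not the actual contents of common.sh: only
wait_for_master and REPLICAS come from this commit. The helper
query_sentinel, the variables HEADLESS_SVC, POD_IP, SENTINEL_PORT,
MASTER_NAME, MAX_TRIES and RETRY_SLEEP, and the assumption that
HOSTNAME holds the pod name (e.g. redis-redis-0) are all hypothetical.

```shell
# Ask one peer's sentinel for the current master address (hypothetical helper).
# Prints the master host on success and nothing if the peer is unreachable.
query_sentinel() {
    timeout 3 redis-cli -h "$1" -p "${SENTINEL_PORT:-26379}" \
        sentinel get-master-addr-by-name "${MASTER_NAME:-mymaster}" \
        2>/dev/null | head -n 1
}

# Print the master reported by a peer and return 0, or return 1 when
# bootstrap should be allowed.
wait_for_master() {
    self_index="${HOSTNAME##*-}"   # ordinal of this pod, e.g. 0
    base="${HOSTNAME%-*}"          # StatefulSet name, e.g. redis-redis
    self_fqdn="${HOSTNAME}.${HEADLESS_SVC}"
    max_tries="${MAX_TRIES:-10}"
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        reachable=0
        i=0
        while [ "$i" -lt "${REPLICAS:-3}" ]; do
            if [ "$i" != "$self_index" ]; then          # skip self
                peer="${base}-${i}.${HEADLESS_SVC}"
                master="$(query_sentinel "$peer")"
                if [ -n "$master" ]; then
                    reachable=1
                    # Reject stale answers that still name this pod as master.
                    if [ "$master" != "$POD_IP" ] && [ "$master" != "$self_fqdn" ]; then
                        echo "$master"
                        return 0
                    fi
                fi
            fi
            i=$((i + 1))
        done
        # No peer answered at all: likely a first deployment, do not wait.
        if [ "$reachable" -eq 0 ]; then
            return 1
        fi
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "${RETRY_SLEEP:-3}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

A caller would presumably join the printed master as a replica when the
function returns 0 and fall through to the bootstrap path only when it
returns 1.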
Also:
- Pass a REPLICAS env var from the StatefulSet spec so the script
knows the exact replica count
- Increase InitialDelaySeconds to 40s on all redis and sentinel
probes so Kubernetes doesn't kill the pod before the retry loop
completes
- Remove unused TCP probe variables that were never referenced by
the redis container
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>