Fix redis split-brain after pod-0 restart during failover

When redis-redis-0 (the bootstrap pod) is deleted during a failover,
it restarts and tries to contact sentinel to find the current master.
Three problems caused it to fall through to the bootstrap path and
start a new independent master, creating a split-brain:

1. Single-try timeout: if sentinel was momentarily unreachable (e.g.
   the sentinel container on pod-0 itself was still starting), the
   single 3-second attempt timed out and pod-0 immediately
   bootstrapped.
2. Headless service DNS: with PublishNotReadyAddresses: true, the
   headless service DNS can resolve to pod-0's own IP, so redis-cli
   connects to its own uninitialized sentinel instead of a peer.
3. Stale master identity: even when pod-0 reaches a peer sentinel,
   that peer may still report the restarting pod as master (within
   the down-after-milliseconds window, before failover completes).

Fix by adding a wait_for_master() function in common.sh that:

- Contacts each peer pod individually by FQDN (skipping self)
- Retries up to 10 times (30s total) before allowing bootstrap
- Rejects answers where the peer still thinks we are master
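
For illustration, the retry logic described above could look roughly
like the sketch below. It is not the code from common.sh:
query_sentinel, SENTINEL_PORT, MASTER_GROUP, MAX_TRIES and RETRY_DELAY
are hypothetical names, and it assumes sentinel announces hostnames so
that the reported master can be compared with our own FQDN.

```shell
# Hypothetical sketch; names and defaults are assumptions, not the real common.sh.
: "${SENTINEL_PORT:=26379}"
: "${MASTER_GROUP:=mymaster}"
: "${MAX_TRIES:=10}"
: "${RETRY_DELAY:=3}"

# Ask one sentinel for the current master; prints the master host (first reply line).
query_sentinel() {
  timeout 3 redis-cli -h "$1" -p "$SENTINEL_PORT" \
    sentinel get-master-addr-by-name "$MASTER_GROUP" 2>/dev/null | head -n 1
}

# wait_for_master SELF PEER...
# Prints the master host and returns 0, or returns 1 once all tries are used up.
wait_for_master() {
  local self="$1" try=1 peer master
  shift
  while [ "$try" -le "$MAX_TRIES" ]; do
    for peer in "$@"; do
      [ "$peer" = "$self" ] && continue     # never ask our own sentinel
      master="$(query_sentinel "$peer")" || continue
      [ -z "$master" ] && continue          # peer unreachable or gave no answer
      [ "$master" = "$self" ] && continue   # stale: peer still thinks we are master
      printf '%s\n' "$master"
      return 0
    done
    sleep "$RETRY_DELAY"
    try=$((try + 1))
  done
  return 1
}
```

A caller would pass its own FQDN first and then the peer FQDNs, and
fall through to the bootstrap path only when wait_for_master returns
non-zero.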

Also increase InitialDelaySeconds to 40s on all redis and sentinel
probes so Kubernetes doesn't kill the pod before the retry loop
completes, and remove unused TCP probe variables that were never
referenced by the redis container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>