This document describes the improvements made to handle nodes that temporarily miss heartbeats due to being under load, ensuring they can properly recover and rejoin the cluster either as a leader or follower.
When running under heavy load, nodes may temporarily fail to respond to heartbeats, leading to:
- Premature offline marking: Nodes marked offline after a single timeout
- Leadership churn: Rapid leadership changes causing instability
- Flapping: Nodes bouncing between online/offline states
- Panic on errors: Unwrap() calls causing crashes during transient failures
New Configuration Constants:
static MAX_TIMEOUT: i64 = 10_000; // 10 seconds before considering potentially offline
static OFFLINE_GRACE_CHECKS: u8 = 3; // Require 3 consecutive failures before marking offline
static ONLINE_GRACE_CHECKS: u8 = 2; // Require 2 consecutive successes before marking online
static MIN_STATE_CHANGE_INTERVAL: i64 = 5_000; // Minimum 5 seconds between state changesBenefits:
- Resilience: Tolerates temporary hiccups (up to 3 timeout checks × 500ms = ~1.5 seconds grace)
- Anti-flapping: Minimum 5 second interval prevents rapid state oscillation
- Smooth recovery: Requires 2 consecutive successful heartbeats before marking node back online
New HBStatus Fields:
struct HBStatus {
last_updated: i64,
online: bool,
consecutive_failures: u8, // Count of consecutive timeout checks
consecutive_successes: u8, // Count of consecutive successful checks
last_state_change: i64, // Timestamp of last state transition
}Behavior:
- Online → Offline: Tracks consecutive timeouts, only transitions after reaching threshold AND minimum interval
- Offline → Online: Tracks consecutive successful heartbeats, transitions after reaching threshold AND minimum interval
- Stable states: Resets counters when nodes are consistently responsive
All .unwrap() calls replaced with proper error handling:
Fixed Functions:
compose_client_member: Now returnsOption<ClientMember>instead of panickinggroup_leader_candidate_available: Logs errors instead of panickinggroup_leader_candidate_unavailable: Handles all failure cases gracefullynotify_for_member_*: Early returns with error logging on failures- Mutex lock failures: Gracefully handled with error logging
Result:
- No more panics during transient failures
- Clear error logs for debugging
- System continues operating even when individual operations fail
When leadership transfers (e.g., during reelection):
async fn transfer_leadership(&self) {
// Give all online members fresh timestamps
// Reset all failure/success counters
// Prevents immediate timeout after leadership change
}Benefits:
- New leader gets time to stabilize before checking heartbeats
- Prevents cascading failures during leadership transitions
- All members get a "fresh start" under new leadership
Timeline:
- Node A is leader and becomes overloaded
- Misses heartbeat at T+10s (consecutive_failures = 1)
- Misses heartbeat at T+10.5s (consecutive_failures = 2)
- Misses heartbeat at T+11s (consecutive_failures = 3)
- Now marked offline (after 3 consecutive failures)
- Leadership election: Node B becomes leader
- Node A recovers, starts sending heartbeats again
- Receives heartbeat at T+15s (consecutive_successes = 1)
- Receives heartbeat at T+15.5s (consecutive_successes = 2)
- Marked back online (after 2 consecutive successes AND 5s minimum interval)
- Node A becomes follower of Node B
Key Points:
- ~1.5 second tolerance before marking offline (3 × 500ms checks)
- Minimum 5 second offline period (anti-flapping protection)
- Node A does NOT automatically reclaim leadership (stability)
- Node A properly syncs as follower under Node B
Timeline:
- Node experiences single timeout (consecutive_failures = 1)
- Next heartbeat succeeds (consecutive_failures reset to 0)
- Node remains online - no state change
Key Points:
- Single hiccups don't trigger state changes
- Prevents unnecessary leadership elections
- Maintains cluster stability
Timeline:
- Node genuinely fails (hardware/crash)
- Consecutive failures accumulate: 1, 2, 3
- Marked offline after 3 checks
- Leadership transfers to healthy node
- Eventually removed from cluster if doesn't recover
Key Points:
- Real failures still detected quickly (~1.5 seconds)
- System continues with remaining healthy nodes
- No false positives from temporary load
During failure detection:
DEBUG: Member 12345 timeout check 1/3 (10500ms since last update, 2000ms since last state change)
DEBUG: Member 12345 timeout check 2/3 (11000ms since last update, 2500ms since last state change)
WARN: Marking member 12345 as offline after 3 consecutive timeout checks (11500ms since last update)
During recovery:
DEBUG: Member 12345 recovery check 1/2 (3000ms since last state change)
INFO: Marking member 12345 as back online after 2 consecutive successful checks
Error scenarios:
ERROR: Failed to change leader for group 789 to member 12345
ERROR: Failed to find online member for group 789 after member 12345 became unavailable
ERROR: Failed to compose client member 12345 for online notification
You can adjust these constants based on your needs:
- Increase
OFFLINE_GRACE_CHECKS: More tolerance for slow responses (longer detection time) - Decrease
OFFLINE_GRACE_CHECKS: Faster failure detection (less tolerance) - Increase
MIN_STATE_CHANGE_INTERVAL: More aggressive anti-flapping (longer recovery time) - Decrease
MIN_STATE_CHANGE_INTERVAL: Faster recovery (more risk of flapping) - Increase
MAX_TIMEOUT: More lenient heartbeat requirements - Decrease
MAX_TIMEOUT: Stricter heartbeat requirements
- Load testing: Verify nodes can recover under realistic load
- Network partition: Test with simulated network splits
- Chaos testing: Randomly kill/restart nodes to test recovery paths
- Long-running stability: Monitor for log growth and state flapping
All changes are backward compatible:
- Wire protocol unchanged
- State machine behavior unchanged (only timing/resilience improved)
- Existing clusters will benefit immediately upon upgrade
- Minimal CPU overhead: Simple counter increments
- Minimal memory overhead: 3 extra bytes per member (2 u8 counters + i64 timestamp)
- Reduced network churn: Fewer unnecessary state changes = less Raft log entries
- Improved stability: Less leadership churn = better overall performance
Potential future improvements:
- Configurable parameters: Make timeouts/thresholds runtime-configurable
- Adaptive timeouts: Adjust based on observed network latency
- Priority-based leader election: Prefer certain nodes as leaders
- Health scoring: Multi-factor health beyond just heartbeats
- Metrics export: Prometheus/OpenTelemetry integration for monitoring