Sentinel/Cluster failover event tracking with slowlog correlation #28

@KIvanow

Description

Track Sentinel and cluster failover events and correlate them with slowlog/COMMANDLOG spikes for post-incident analysis.

Problem

During failovers, latency spikes and command failures are common, but the cause isn't always obvious from metrics alone. Being able to see "a failover happened at 03:12, and here's the slowlog spike that preceded/followed it" would make post-incident analysis significantly easier.

Proposed Scope

  • Detect failover events by monitoring INFO replication role changes and CLUSTER INFO state transitions
  • Persist failover events with timestamps, old/new primary, and trigger reason where available
  • Correlate failover timestamps with existing slowlog and anomaly detection data
  • Surface in the UI timeline alongside existing anomaly events
  • Add failover.started and failover.completed webhook event types
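
To make the first two bullets concrete, here is a minimal sketch of role-change detection from two successive `INFO replication` snapshots. It relies only on fields Redis itself reports (`role`, `master_host`, `master_port`); the `FailoverEvent` shape and the `detect_role_change` helper are hypothetical names for illustration, not an existing API in this project. Polling, persistence, and the `CLUSTER INFO` side are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def parse_info_replication(raw: str) -> Dict[str, str]:
    """Parse the text returned by `INFO replication` into a key/value dict."""
    fields: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


@dataclass
class FailoverEvent:
    """Hypothetical record shape; the real schema is up to the implementation."""
    node: str                    # host:port of the node that was polled
    detected_at: float           # Unix seconds of the poll that saw the change
    old_role: str
    new_role: str
    new_primary: Optional[str]   # host:port, known only if the node is now a replica


def detect_role_change(
    node: str,
    previous: Dict[str, str],
    current: Dict[str, str],
    now: float,
) -> Optional[FailoverEvent]:
    """Return an event if the node's role differs between two snapshots."""
    old_role = previous.get("role")
    new_role = current.get("role")
    if old_role is None or new_role is None or old_role == new_role:
        return None
    new_primary = None
    if new_role == "slave":
        # A demoted node reports the primary it now replicates from.
        new_primary = f"{current.get('master_host')}:{current.get('master_port')}"
    return FailoverEvent(node, now, old_role, new_role, new_primary)
```

Because this compares polled snapshots, `detected_at` is only accurate to the polling interval, and a role change alone does not say why the failover happened. The trigger reason would have to come from another source where one is available.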

Prior Art / Context

Requested by the community: correlating failovers with slowlog spikes is a common post-incident debugging need for teams running Sentinel or Cluster topologies.

Related

  • Existing cluster topology visualization
  • Existing per-slot heatmaps and migration tracking
  • Anomaly detection correlator (could add a FAILOVER pattern)
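
As a rough idea of what a FAILOVER pattern could look like, the sketch below splits slowlog entries into those that preceded and those that followed a failover timestamp within a fixed window. The `SlowlogEntry` fields mirror what `SLOWLOG GET` returns (id, Unix timestamp, duration in microseconds, command); the `correlate` function and the 60-second default window are assumptions for illustration, not how the existing correlator works.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SlowlogEntry:
    id: int
    timestamp: int      # Unix seconds, as reported by SLOWLOG GET
    duration_us: int    # execution time in microseconds
    command: str


def correlate(
    failover_at: float,
    entries: List[SlowlogEntry],
    window_s: float = 60.0,   # assumed default, should be configurable
) -> Tuple[List[SlowlogEntry], List[SlowlogEntry]]:
    """Split entries within +/- window_s of a failover into (before, after)."""
    before: List[SlowlogEntry] = []
    after: List[SlowlogEntry] = []
    for entry in entries:
        delta = entry.timestamp - failover_at
        if -window_s <= delta < 0:
            before.append(entry)
        elif 0 <= delta <= window_s:
            after.append(entry)
        # Entries outside the window are ignored.
    return before, after
```

Since the slowlog is kept per node, entries from the old and the new primary would need to be merged before calling something like this, and clock skew between nodes may shift entries across the before/after boundary.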
