Skip to content

Commit 2042a47

Browse files
ysqyangyang.qiujdheyburn
authored
fix race condition between forget and failover (valkey-io#105)
## Summary Fix a race condition between `forgetStaleNodes` and Valkey's auto-failover that can permanently prevent a replica from being promoted after its primary dies. See valkey-io#103 for more context. ### The bug When a primary's deployment is deleted, the controller's `forgetStaleNodes` issues `CLUSTER FORGET` for the dead node from every surviving node. If this runs before Valkey's auto-failover election completes, it removes the dead primary from the other masters' node tables. Those masters can then no longer validate the replica's `FAILOVER_AUTH_REQUEST` (they don't recognize the dead node), so they never vote. The replica is permanently stuck as a slave, `findShardPrimary` never finds a primary for the shard, and the cluster enters an infinite loop of: ``` ERROR command failed: CLUSTER FORGET {"error": "Can't forget my master!"} DEBUG skipping replica; primary not ready yet DEBUG missing replicas, requeue.. ``` This is a timing-dependent race. The window is roughly 0.5–1 second between the `fail` flag being set and the failover election completing. It was reported by a user who hit it when deleting a primary deployment. ### The fix Before issuing `CLUSTER FORGET`, check whether any live node in the cluster still considers the failing node as its master (`HasReplicaOf`). If so, skip the FORGET — the replica needs the dead node in the other masters' node tables to complete the failover election. Once the failover completes and the replica is promoted, it no longer reports itself as a slave of the dead node, so the next reconcile will proceed with FORGET normally. ### Changes - **`internal/valkey/clusterstate.go`**: Add `HasReplicaOf(nodeId)` method on `ClusterState` that checks if any node's `CLUSTER NODES` self-report shows it as a replica of the given node ID. Add `MasterIdFromSelf()` helper on `NodeState` that extracts `fields[3]` (master ID) from the `myself` line. - **`internal/controller/valkeycluster_controller.go`**: Guard `forgetStaleNodes` with the `HasReplicaOf` check. When skipped, log `"skipping forget; failover pending for node"` at V(1). ### Why this is safe - **Dead replica (not a master):** No node claims a replica as its master → `HasReplicaOf` returns false → FORGET proceeds immediately. No behavior change. - **Both master and replica are dead:** The dead replica isn't in `state.Shards` (connection failed) → `HasReplicaOf` returns false → FORGET proceeds. Correct — no failover is possible anyway. - **Scale-down stale nodes:** Drained masters have no replicas left → `HasReplicaOf` returns false → FORGET proceeds. No behavior change. - **Failover permanently blocked for other reasons** (e.g., replica too far behind): `HasReplicaOf` returns true, FORGET is deferred. This is no worse than today where FORGET runs but the failover is also permanently blocked. With this fix, at least the failover has a chance if the blocking condition resolves. --------- Signed-off-by: yang.qiu <yang.qiu@reddit.com> Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com> Co-authored-by: yang.qiu <yang.qiu@reddit.com> Co-authored-by: Joseph Heyburn <jdheyburn@gmail.com>
1 parent 54ad6dd commit 2042a47

2 files changed

Lines changed: 51 additions & 8 deletions

File tree

internal/controller/valkeycluster_controller.go

Lines changed: 19 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -805,14 +805,25 @@ func (r *ValkeyClusterReconciler) forgetStaleNodes(ctx context.Context, cluster
805805
idx := slices.IndexFunc(nodes.Items, func(n valkeyiov1alpha1.ValkeyNode) bool {
806806
return n.Status.PodIP == failing.Address
807807
})
808-
if idx == -1 {
809-
log.V(1).Info("forget a failing node", "address", failing.Address, "Id", failing.Id)
810-
if err := node.Client.Do(ctx, node.Client.B().ClusterForget().NodeId(failing.Id).Build()).Error(); err != nil {
811-
log.Error(err, "command failed: CLUSTER FORGET")
812-
r.Recorder.Eventf(cluster, nil, corev1.EventTypeWarning, "NodeForgetFailed", "ForgetNode", "Failed to forget node: %v", err)
813-
} else {
814-
r.Recorder.Eventf(cluster, nil, corev1.EventTypeNormal, "StaleNodeForgotten", "ForgetNode", "Forgot stale node %v", failing.Address)
815-
}
808+
if idx != -1 {
809+
continue
810+
}
811+
// A live replica still considers this failing node its
812+
// primary. Forgetting it from the other primaries now would
813+
// remove it from their node tables and prevent them from
814+
// voting in the auto-failover election, permanently
815+
// blocking the replica's promotion.
816+
if state.HasReplicaOf(failing.Id) {
817+
log.V(1).Info("skipping forget; failover pending for node",
818+
"address", failing.Address, "Id", failing.Id)
819+
continue
820+
}
821+
log.V(1).Info("forget a failing node", "address", failing.Address, "Id", failing.Id)
822+
if err := node.Client.Do(ctx, node.Client.B().ClusterForget().NodeId(failing.Id).Build()).Error(); err != nil {
823+
log.Error(err, "command failed: CLUSTER FORGET")
824+
r.Recorder.Eventf(cluster, nil, corev1.EventTypeWarning, "NodeForgetFailed", "ForgetNode", "Failed to forget node: %v", err)
825+
} else {
826+
r.Recorder.Eventf(cluster, nil, corev1.EventTypeNormal, "StaleNodeForgotten", "ForgetNode", "Forgot stale node %v", failing.Address)
816827
}
817828
}
818829
}

internal/valkey/clusterstate.go

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -193,6 +193,38 @@ func (n *NodeState) IsReplicationInSync() bool {
193193
return n.Info["master_link_status"] == "up"
194194
}
195195

196+
// HasReplicaOf returns true if any live node in the cluster state reports
197+
// itself as a replica of the given node ID. This is used to prevent
198+
// CLUSTER FORGET from racing with auto-failover: forgetting a failed
199+
// primary from other primaries removes it from their node tables, which
200+
// prevents them from voting in the replica's failover election.
201+
func (s *ClusterState) HasReplicaOf(nodeId string) bool {
202+
for _, shard := range s.Shards {
203+
for _, node := range shard.Nodes {
204+
if node.PrimaryIdFromSelf() == nodeId {
205+
return true
206+
}
207+
}
208+
}
209+
return false
210+
}
211+
212+
// PrimaryIdFromSelf returns the primary node ID that this node reports as its
213+
// own primary in CLUSTER NODES (fields[3] of the "myself" line). Returns "-"
214+
// for primaries and the primary's node ID for replicas.
215+
func (n *NodeState) PrimaryIdFromSelf() string {
216+
for line := range strings.SplitSeq(n.ClusterNodes, "\n") {
217+
fields := strings.Fields(line)
218+
if len(fields) < 8 {
219+
continue
220+
}
221+
if strings.Contains(fields[2], "myself") {
222+
return fields[3]
223+
}
224+
}
225+
return ""
226+
}
227+
196228
// GetFailingNodes returns all known nodes that are failing.
197229
func (n *NodeState) GetFailingNodes() []NodeState {
198230
nodes := []NodeState{}

0 commit comments

Comments
 (0)