fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline

bootjp · bootjp · commit ad924ad03044 · 2026-05-08T17:10:49.000+09:00

verifyLeaderEngine() called engine.VerifyLeader with context.Background(),
so any caller without an upstream deadline blocked indefinitely on a
ReadIndex round-trip. Callers caught by even one transient stall never
timed out and never returned, so they accumulated permanently.

Production hit this on 2026-05-08: follower 192.168.0.214 lost its
network route (no route to host, ARP INCOMPLETE), the leader's
ReadIndex completion stalled intermittently, and verify-callers piled
up at roughly 9/sec without bound. After ~37 minutes the leader
(192.168.0.212) held 20,560 goroutines (20,478 in submitRead select,
oldest 39 minutes), CPU pinned at 1870% (Engine.run Ready loop walks
pendingReads O(N) per tick, so the queue feeds back on itself), and
host MemAvailable trended toward 0 until OOM. Each new leader after
failover re-entered the same death spiral.

Affected callers (all use the no-context variant):
- LeaderProxy.Commit / .Abort -- every Redis write hits this
- Coordinate.VerifyLeader / ShardedCoordinator.VerifyLeader[ForKey]
- adapter S3/SQS /healthz/leader handlers (Caddy probes)
- main_admin.go LeaderProbe (admin dashboard /admin/healthz/leader)
- adapter/sqs.go isVerifiedSQSLeader, adapter/s3.go isVerifiedS3Leader

Fix: cap the no-context path at 5s (matching leaderForwardTimeout). On
timeout, callers see context.DeadlineExceeded -- LeaderProxy falls back
to forwardWithRetry as it already does for any verify failure, healthz
handlers report not-leader, and the lock resolver skips this tick.

Self-review (5 lenses):
1. Data loss -- none. The fix only shortens a never-returning wait.
   verifyLeaderEngine is a freshness check, not a write path.
2. Concurrency -- the new ctx is local to each call (defer cancel),
   no shared state, no lock changes. Engine-side blocking semantics
   unchanged; we just stop waiting forever.
3. Performance -- positive. Removes the unbounded goroutine pile-up
   and the O(N) pendingReads walk it caused. No new allocations on
   the success path beyond the WithTimeout context.
4. Data consistency -- ReadIndex still completes when quorum heartbeats
   land within 5s. A timeout means the caller could not confirm
   leadership freshness, which the existing "fall through to forward"
   path already treats as a soft failure.
5. Test coverage -- kv/raft_engine_test.go pins the regression: a
   blockingLeaderView that holds VerifyLeader on its ctx must surface
   DeadlineExceeded within 2x verifyLeaderTimeout.

Test: go test -race -count=1 ./kv -- 9.3s, all green.

Future work (separate PRs): plumb real request contexts through
LeaderProxy.Commit/Abort and the healthz handlers so a client-side
deadline cascades naturally instead of relying on this fixed bound.
diff --git a/kv/raft_engine.go b/kv/raft_engine.go
@@ -2,12 +2,38 @@ package kv
 
 import (
 	"context"
+	"time"
 
 	"github.com/bootjp/elastickv/internal/monoclock"
 	"github.com/bootjp/elastickv/internal/raftengine"
 	"github.com/cockroachdb/errors"
 )
 
+// verifyLeaderTimeout caps how long the no-context verifyLeaderEngine path
+// is willing to wait for a ReadIndex round-trip.
+//
+// A previous version called engine.VerifyLeader with context.Background(),
+// so callers without an upstream deadline (LeaderProxy.Commit/Abort,
+// Coordinate.VerifyLeader, ShardedCoordinator.VerifyLeader[ForKey], and
+// the S3/SQS/admin /healthz/leader handlers) blocked indefinitely whenever
+// ReadIndex completion stalled — a single transient stall accumulated
+// callers permanently.
+//
+// Production hit this on 2026-05-08: a follower (192.168.0.214) lost its
+// network route mid-flight and the leader's ReadIndex completion stalled
+// intermittently. verifyLeaderEngine callers piled up at ~9/sec without
+// bound; after ~37 minutes the leader was holding 20,560 goroutines
+// (20,478 in submitRead select, oldest 39 minutes), CPU pinned at 1870%
+// (the Engine.run Ready loop walks pendingReads O(N) per tick, so the
+// queue feeds back on itself), and host MemAvailable trended toward 0
+// until OOM. The same pattern repeated on each new leader after failover.
+//
+// 5s matches leaderForwardTimeout: a verify that takes longer than a
+// single forward RPC is, by definition, useless as a freshness check,
+// and the proxy's verify-then-forward path stays within its 5s retry
+// budget.
+const verifyLeaderTimeout = 5 * time.Second
+
 func engineForGroup(g *ShardGroup) raftengine.Engine {
 	if g == nil {
 		return nil
@@ -41,7 +67,9 @@ func verifyLeaderEngineCtx(ctx context.Context, engine raftengine.LeaderView) er
 }
 
 func verifyLeaderEngine(engine raftengine.LeaderView) error {
-	return verifyLeaderEngineCtx(context.Background(), engine)
+	ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
+	defer cancel()
+	return verifyLeaderEngineCtx(ctx, engine)
 }
 
 func linearizableReadEngineCtx(ctx context.Context, engine raftengine.LeaderView) (uint64, error) {
diff --git a/kv/raft_engine_test.go b/kv/raft_engine_test.go
@@ -0,0 +1,54 @@
+package kv
+
+import (
+	"context"
+	stderrors "errors"
+	"testing"
+	"time"
+
+	"github.com/bootjp/elastickv/internal/raftengine"
+)
+
+// blockingLeaderView is a LeaderView whose VerifyLeader blocks until ctx is
+// cancelled, modelling the production pathology where ReadIndex stalls
+// because heartbeat acks fail to land. LinearizableRead blocks and returns
+// on cancel in the same way; State and Leader are stubbed just enough to
+// satisfy the callers under test.
+type blockingLeaderView struct{}
+
+func (blockingLeaderView) State() raftengine.State        { return raftengine.StateLeader }
+func (blockingLeaderView) Leader() raftengine.LeaderInfo  { return raftengine.LeaderInfo{ID: "self"} }
+func (blockingLeaderView) VerifyLeader(ctx context.Context) error {
+	<-ctx.Done()
+	return ctx.Err()
+}
+func (blockingLeaderView) LinearizableRead(ctx context.Context) (uint64, error) {
+	<-ctx.Done()
+	return 0, ctx.Err()
+}
+
+// TestVerifyLeaderEngine_BoundsBlockingReadIndex pins the regression: a
+// stalled ReadIndex returns only when its ctx fires, and callers used to
+// pass context.Background(), so the goroutine stayed pinned forever. Under
+// a 2026-05-08-style stall the call must now return within roughly
+// verifyLeaderTimeout, surfacing context.DeadlineExceeded.
+func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
+	t.Parallel()
+
+	start := time.Now()
+	err := verifyLeaderEngine(blockingLeaderView{})
+	elapsed := time.Since(start)
+
+	if err == nil {
+		t.Fatalf("verifyLeaderEngine(blocking) returned nil; expected DeadlineExceeded")
+	}
+	if !stderrors.Is(err, context.DeadlineExceeded) {
+		t.Fatalf("verifyLeaderEngine(blocking) err = %v; want DeadlineExceeded", err)
+	}
+	// Allow generous slack so a slow CI host does not flake; the point is
+	// not to assert a tight bound but to prove the call returns at all.
+	if elapsed > 2*verifyLeaderTimeout {
+		t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
+	}
+}
+