fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline (#745)

bootjp · web-flow · commit 5f12d5d686d0 · 2026-05-08T17:51:36.000+09:00
## Summary

`verifyLeaderEngine()` called `engine.VerifyLeader` with
`context.Background()`, so callers without an upstream context blocked
indefinitely on a ReadIndex round-trip. Callers that piled up during a
single transient stall never returned, so the backlog only grew. This
PR caps the no-context path at 5s.
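
As a standalone illustration of the pattern (not the repository code), the sketch below bounds a call that only returns when its context is done. `stalledVerify` and `verifyBounded` are hypothetical names, and the demo uses milliseconds rather than the 5s in this PR so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stalledVerify is a hypothetical stand-in for a VerifyLeader call during a
// ReadIndex stall: it returns only once ctx is done.
func stalledVerify(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// verifyBounded mirrors the shape of the fix: derive a deadline from
// context.Background() so the call cannot wait forever.
func verifyBounded(timeout time.Duration, verify func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return verify(ctx)
}

// boundedCallTimesOut reports whether a stalled verify surfaces
// context.DeadlineExceeded when bounded by timeoutMs milliseconds.
func boundedCallTimesOut(timeoutMs int) bool {
	err := verifyBounded(time.Duration(timeoutMs)*time.Millisecond, stalledVerify)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", boundedCallTimesOut(50))
}
```

With `context.Background()` passed straight through, the same `stalledVerify` call would never return.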

## Production incident — 2026-05-08

Follower 192.168.0.214 lost its network route (`no route to host`, ARP
`INCOMPLETE`). The leader's ReadIndex completion stalled intermittently
and verify-callers piled up at ~9/sec without bound.

After ~37 minutes the leader (192.168.0.212) showed:
- **20,560 goroutines**, 20,478 of them in `etcd.(*Engine).submitRead`
`[select, 35-39 minutes]`
- **CPU 1870%** (`Engine.run` Ready loop walks `pendingReads` O(N) per
tick → queue feeds back on itself)
- **Host MemAvailable** trending toward 0 → OOM
- Each new leader after failover re-entered the same death spiral

Mitigation: `docker restart elastickv` on 212 dropped it to 74% CPU /
163 MiB. 214 was hardware-rebooted and is REACHABLE again. This PR
prevents the next leader from re-entering the spiral.
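
A rough sanity check on the numbers above, assuming the arrival rate stayed near 9/sec for the whole ~37 minutes and that no blocked caller ever returned. The helper name `parkedCallers` is made up for this sketch. The same product also gives an approximate ceiling once every stalled caller gives up after 5s.

```go
package main

import "fmt"

// parkedCallers estimates how many callers are blocked at once when each one
// waits waitSeconds before returning and new ones arrive at arrivalsPerSec.
// With no timeout, waitSeconds is simply the length of the stall so far.
func parkedCallers(arrivalsPerSec, waitSeconds int) int {
	return arrivalsPerSec * waitSeconds
}

func main() {
	// Unbounded wait: ~37 minutes of arrivals all stay parked.
	fmt.Println(parkedCallers(9, 37*60)) // 19980
	// Bounded wait: each caller leaves after the 5s deadline.
	fmt.Println(parkedCallers(9, 5)) // 45
}
```

The first figure is in line with the 20,478 goroutines observed in `submitRead`. The second is only an estimate of the steady state under the same arrival rate, not a measured value.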

## Affected callers

All use the no-context `verifyLeaderEngine` variant:
- `kv/leader_proxy.go` — `LeaderProxy.Commit` / `.Abort` (every Redis
write)
- `kv/coordinator.go` — `Coordinate.VerifyLeader`
- `kv/sharded_coordinator.go` — `ShardedCoordinator.VerifyLeader` /
`VerifyLeaderForKey`
- `adapter/s3.go` — `isVerifiedS3Leader` / inline VerifyLeader at line
2291 (healthz)
- `adapter/sqs.go` — `isVerifiedSQSLeader` (healthz)
- `main_admin.go` — `LeaderProbe` callback for `/admin/healthz/leader`

## Failure mode on timeout

`context.DeadlineExceeded` surfaces to the caller. `LeaderProxy` falls
back to `forwardWithRetry` (the existing path for any verify failure).
Healthz handlers report 503 not-leader. Background loops (lock resolver,
HLC lease) skip this tick.
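
To make the healthz behaviour concrete, here is a minimal sketch of a handler that maps any verify failure, including a wrapped `context.DeadlineExceeded`, to 503. `leaderHealthz`, `statusFor`, and the two verify stubs are hypothetical; the real handlers in `adapter/` and `main_admin.go` may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// leaderHealthz is a hypothetical handler: verify stands in for the
// no-context leader check, which now returns within its deadline.
func leaderHealthz(verify func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if err := verify(); err != nil {
			// A timeout is treated like any other verify failure.
			http.Error(w, "not leader", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor runs the handler once against an in-memory recorder.
func statusFor(verify func() error) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/admin/healthz/leader", nil)
	leaderHealthz(verify)(rec, req)
	return rec.Code
}

// timedOutVerify simulates a verify that hit its deadline.
func timedOutVerify() error {
	return fmt.Errorf("verify leader: %w", context.DeadlineExceeded)
}

// okVerify simulates a successful leadership check.
func okVerify() error { return nil }

func main() {
	fmt.Println(statusFor(timedOutVerify), statusFor(okVerify)) // 503 200
}
```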

No new infinite loop: even when this node *is* the leader, a
verify-failure → forward path already exists in `LeaderProxy.Commit`;
that path is bounded by `leaderProxyRetryBudget = 5s` and
`maxForwardRetries = 3`.
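
The sketch below shows why a loop limited by both a time budget and an attempt count cannot spin forever. It is not the actual `forwardWithRetry` implementation: `forwardBounded` is a hypothetical helper, and treating the retry count as total attempts is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// forwardBounded is a hypothetical retry loop: it stops after maxAttempts
// tries or when the budget expires, whichever comes first, and returns the
// number of attempts made.
func forwardBounded(budget time.Duration, maxAttempts int, attempt func(context.Context) error) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	err := errors.New("no attempts allowed")
	for i := 1; i <= maxAttempts; i++ {
		if err = attempt(ctx); err == nil {
			return i, nil
		}
		if ctx.Err() != nil {
			// Budget exhausted: stop even if attempts remain.
			return i, ctx.Err()
		}
	}
	return maxAttempts, err
}

// attemptsWhenAlwaysFailing counts attempts when every try fails fast.
func attemptsWhenAlwaysFailing(maxAttempts int) int {
	n, _ := forwardBounded(time.Second, maxAttempts, func(context.Context) error {
		return errors.New("not leader")
	})
	return n
}

// attemptsWhenStalled counts attempts when the first try blocks until the
// budget (in milliseconds) runs out.
func attemptsWhenStalled(budgetMs, maxAttempts int) int {
	n, _ := forwardBounded(time.Duration(budgetMs)*time.Millisecond, maxAttempts, func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
	return n
}

func main() {
	fmt.Println(attemptsWhenAlwaysFailing(3)) // 3
	fmt.Println(attemptsWhenStalled(20, 3))   // 1
}
```

Fast failures are cut off by the attempt cap and slow ones by the budget, so the worst case is bounded either way.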

## Self-review (5 lenses)

1. **Data loss** — none. The fix only shortens a never-returning wait.
`verifyLeaderEngine` is a freshness check, not a write path.
Already-committed proposals are unaffected.
2. **Concurrency** — the new ctx is local to each call (`defer cancel`),
no shared state, no lock changes. Engine-side blocking semantics
unchanged; we just stop waiting forever.
3. **Performance** — net positive. Removes the unbounded goroutine
pile-up and the O(N) `pendingReads` walk it caused. No new allocations
on the success path beyond the `WithTimeout` context.
4. **Data consistency** — ReadIndex still completes when quorum
heartbeats land within 5s. A timeout means the caller could not confirm
leadership freshness, which the existing "fall through to forward" path
already treats as a soft failure.
5. **Test coverage** —
`kv/raft_engine_test.go::TestVerifyLeaderEngine_BoundsBlockingReadIndex`
pins the regression: a `blockingLeaderView` that holds `VerifyLeader` on
its ctx must surface `DeadlineExceeded` within `2 *
verifyLeaderTimeout`.

## Test plan

- [x] `go test -race -count=1 ./kv` — 9.3s, all green
- [x] New regression test
`TestVerifyLeaderEngine_BoundsBlockingReadIndex` covers the blocking
case
- [ ] Roll out to 192.168.0.x cluster after merge, watch CPU/Mem panel
for the next 4-6h to confirm no more OOM cascade

## Future work (separate PRs)

Plumb real request contexts through `LeaderProxy.Commit/Abort` and the
healthz handlers so client-side deadlines cascade naturally instead of
relying on this fixed bound. Today the Redis adapter's per-command
deadline doesn't reach `LeaderProxy`; the proxy interface takes
`[]*pb.Request` only.
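
One property that follow-up could rely on: `context.WithTimeout` never extends a parent deadline, so keeping the fixed bound as a fallback while deriving from the request context lets the shorter of the two win. The sketch below demonstrates this with a hypothetical helper, `usesParentDeadline`; it says nothing about what the eventual `LeaderProxy` signature will look like.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// usesParentDeadline derives a child context with a fallback timeout from a
// parent that already has its own deadline, and reports whether the child
// ends up with the parent's deadline. Both durations are in milliseconds.
func usesParentDeadline(parentMs, fallbackMs int) bool {
	parent, cancelParent := context.WithTimeout(context.Background(), time.Duration(parentMs)*time.Millisecond)
	defer cancelParent()

	// In the plumbed design this would be the request ctx plus the fixed bound.
	child, cancelChild := context.WithTimeout(parent, time.Duration(fallbackMs)*time.Millisecond)
	defer cancelChild()

	parentDeadline, _ := parent.Deadline()
	childDeadline, _ := child.Deadline()
	return childDeadline.Equal(parentDeadline)
}

func main() {
	// A client deadline shorter than the fallback wins.
	fmt.Println(usesParentDeadline(100, 5000)) // true
	// A client deadline longer than the fallback is capped by it.
	fmt.Println(usesParentDeadline(5000, 100)) // false
}
```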


## Summary by CodeRabbit

* **Bug Fixes**
  * Implemented timeout bounds for leader verification operations with a
    5-second limit to prevent indefinite blocking.

* **Tests**
  * Added test to verify leader verification properly handles timeout
    scenarios and completes within the expected timeframe under stalled
    conditions.

diff --git a/kv/raft_engine.go b/kv/raft_engine.go
@@ -2,12 +2,30 @@ package kv
 
 import (
 	"context"
+	"time"
 
 	"github.com/bootjp/elastickv/internal/monoclock"
 	"github.com/bootjp/elastickv/internal/raftengine"
 	"github.com/cockroachdb/errors"
 )
 
+// verifyLeaderTimeout caps how long the no-context verifyLeaderEngine path
+// is willing to wait for a ReadIndex round-trip. Without this bound,
+// callers that hold context.Background() (LeaderProxy.Commit/Abort,
+// Coordinate.VerifyLeader, ShardedCoordinator.VerifyLeader[ForKey], and
+// the S3/SQS/admin /healthz/leader handlers) blocked indefinitely whenever
+// ReadIndex completion stalled, and a single transient stall accumulated
+// callers permanently — Engine.run's Ready loop walks pendingReads O(N)
+// per tick, so the queue feeds back on itself once it grows.
+//
+// 5s matches leaderForwardTimeout: a verify that takes longer than a
+// single forward RPC is useless as a freshness check, and the proxy's
+// verify-then-forward path stays within its 5s retry budget.
+//
+// See PR #745 / incident 2026-05-08 for the goroutine-pile production
+// failure this prevents.
+const verifyLeaderTimeout = 5 * time.Second
+
 func engineForGroup(g *ShardGroup) raftengine.Engine {
 	if g == nil {
 		return nil
@@ -41,7 +59,9 @@ func verifyLeaderEngineCtx(ctx context.Context, engine raftengine.LeaderView) er
 }
 
 func verifyLeaderEngine(engine raftengine.LeaderView) error {
-	return verifyLeaderEngineCtx(context.Background(), engine)
+	ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
+	defer cancel()
+	return verifyLeaderEngineCtx(ctx, engine)
 }
 
 func linearizableReadEngineCtx(ctx context.Context, engine raftengine.LeaderView) (uint64, error) {
diff --git a/kv/raft_engine_test.go b/kv/raft_engine_test.go
@@ -0,0 +1,73 @@
+package kv
+
+import (
+	"context"
+	"testing"
+	"time"
+
+	"github.com/bootjp/elastickv/internal/raftengine"
+	"github.com/cockroachdb/errors"
+)
+
+// blockingLeaderView is a LeaderView whose VerifyLeader blocks until ctx is
+// cancelled, modelling the production pathology where ReadIndex stalls
+// because heartbeat acks fail to land. LinearizableRead likewise returns
+// on cancel; State / Leader are stubbed just enough to satisfy the
+// callers under test.
+type blockingLeaderView struct{}
+
+func (blockingLeaderView) State() raftengine.State       { return raftengine.StateLeader }
+func (blockingLeaderView) Leader() raftengine.LeaderInfo { return raftengine.LeaderInfo{ID: "self"} }
+func (blockingLeaderView) VerifyLeader(ctx context.Context) error {
+	<-ctx.Done()
+	return ctx.Err()
+}
+func (blockingLeaderView) LinearizableRead(ctx context.Context) (uint64, error) {
+	<-ctx.Done()
+	return 0, ctx.Err()
+}
+
+// TestVerifyLeaderEngine_BoundsBlockingReadIndex pins the regression: a
+// stalled ReadIndex returns only when the underlying ctx fires, and callers
+// used to pass context.Background(), so the goroutine stayed pinned forever.
+// After the 2026-05-08 incident this must complete within roughly
+// verifyLeaderTimeout, surfacing context.DeadlineExceeded.
+//
+// Skipped under -short because the whole point is to wait for the deadline
+// to fire; the no-skip path adds verifyLeaderTimeout (5s) to every default
+// `make test` run.
+func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
+	t.Parallel()
+	if testing.Short() {
+		t.Skip("skipping: blocks for verifyLeaderTimeout (5s)")
+	}
+
+	start := time.Now()
+	err := verifyLeaderEngine(blockingLeaderView{})
+	elapsed := time.Since(start)
+
+	if err == nil {
+		t.Fatalf("verifyLeaderEngine(blocking) returned nil; expected DeadlineExceeded")
+	}
+	if !errors.Is(err, context.DeadlineExceeded) {
+		t.Fatalf("verifyLeaderEngine(blocking) err = %v; want DeadlineExceeded", err)
+	}
+	// Lower bound: confirm the engine actually held the call until the
+	// deadline fired, not that some other error path returned
+	// immediately. Without this, a future regression that returned
+	// DeadlineExceeded before doing any work (e.g. a misplaced ctx
+	// check before the engine call) would silently pass.
+	//
+	// Tolerate a 200ms early-return slack so a slow CI scheduler that
+	// trips ctx.Done() a hair before the wall clock catches up does
+	// not flake.
+	const slack = 200 * time.Millisecond
+	if elapsed+slack < verifyLeaderTimeout {
+		t.Fatalf("verifyLeaderEngine(blocking) returned too early after %s; want >= %s (-%s slack)", elapsed, verifyLeaderTimeout, slack)
+	}
+	// Upper bound: prove the call returned at all. Generous so a slow
+	// CI host does not flake.
+	if elapsed > 2*verifyLeaderTimeout {
+		t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
+	}
+}