Skip to content

Commit ad924ad

Browse files
committed
fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline
verifyLeaderEngine() called engine.VerifyLeader with context.Background(), so any caller without an upstream context blocked indefinitely on a ReadIndex round-trip. A single transient stall accumulated callers permanently because they never timed out and never returned. Production hit this on 2026-05-08: follower 192.168.0.214 lost its network route (no route to host, ARP INCOMPLETE), the leader's ReadIndex completion stalled intermittently, and verify-callers piled up at roughly 9/sec without bound. After ~37 minutes the leader (192.168.0.212) held 20,560 goroutines (20,478 in submitRead select, oldest 39 minutes), CPU pinned at 1870% (Engine.run Ready loop walks pendingReads O(N) per tick, so the queue feeds back on itself), and host MemAvailable trended toward 0 until OOM. Each new leader after failover re-entered the same death spiral. Affected callers (all use the no-context variant): - LeaderProxy.Commit / .Abort -- every Redis write hits this - Coordinate.VerifyLeader / ShardedCoordinator.VerifyLeader[ForKey] - adapter S3/SQS /healthz/leader handlers (Caddy probes) - main_admin.go LeaderProbe (admin dashboard /admin/healthz/leader) - adapter/sqs.go isVerifiedSQSLeader, adapter/s3.go isVerifiedS3Leader Fix: cap the no-context path at 5s (matching leaderForwardTimeout). On timeout, callers see context.DeadlineExceeded -- LeaderProxy falls back to forwardWithRetry as it already does for any verify failure, healthz handlers report not-leader, and the lock resolver skips this tick. Self-review (5 lenses): 1. Data loss -- none. The fix only shortens a never-returning wait. verifyLeaderEngine is a freshness check, not a write path. 2. Concurrency -- the new ctx is local to each call (defer cancel), no shared state, no lock changes. Engine-side blocking semantics unchanged; we just stop waiting forever. 3. Performance -- positive. Removes the unbounded goroutine pile-up and the O(N) pendingReads walk it caused. No new allocations on the success path beyond the WithTimeout context. 4. Data consistency -- ReadIndex still completes when quorum heartbeats land within 5s. A timeout means the caller could not confirm leadership freshness, which the existing "fall through to forward" path already treats as a soft failure. 5. Test coverage -- kv/raft_engine_test.go pins the regression: a blockingLeaderView that holds VerifyLeader on its ctx must surface DeadlineExceeded within 2x verifyLeaderTimeout. Test: go test -race -count=1 ./kv -- 9.3s, all green. Future work (separate PRs): plumb real request contexts through LeaderProxy.Commit/Abort and the healthz handlers so a client-side deadline cascades naturally instead of relying on this fixed bound.
1 parent 303ebe2 commit ad924ad

2 files changed

Lines changed: 83 additions & 1 deletion

File tree

kv/raft_engine.go

Lines changed: 29 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,12 +2,38 @@ package kv
22

33
import (
44
"context"
5+
"time"
56

67
"github.com/bootjp/elastickv/internal/monoclock"
78
"github.com/bootjp/elastickv/internal/raftengine"
89
"github.com/cockroachdb/errors"
910
)
1011

12+
// verifyLeaderTimeout caps how long the no-context verifyLeaderEngine path
13+
// is willing to wait for a ReadIndex round-trip.
14+
//
15+
// A previous version called engine.VerifyLeader with context.Background(),
16+
// so callers without an upstream deadline (LeaderProxy.Commit/Abort,
17+
// Coordinate.VerifyLeader, ShardedCoordinator.VerifyLeader[ForKey], and
18+
// the S3/SQS/admin /healthz/leader handlers) blocked indefinitely whenever
19+
// ReadIndex completion stalled — a single transient stall accumulated
20+
// callers permanently.
21+
//
22+
// Production hit this on 2026-05-08: a follower (192.168.0.214) lost its
23+
// network route mid-flight and the leader's ReadIndex completion stalled
24+
// intermittently. verifyLeaderEngine callers piled up at ~9/sec without
25+
// bound; after ~37 minutes the leader was holding 20,560 goroutines
26+
// (20,478 in submitRead select, oldest 39 minutes), CPU pinned at 1870%
27+
// (the Engine.run Ready loop walks pendingReads O(N) per tick, so the
28+
// queue feeds back on itself), and host MemAvailable trended toward 0
29+
// until OOM. The same pattern repeated on each new leader after failover.
30+
//
31+
// 5s matches leaderForwardTimeout: a verify that takes longer than a
32+
// single forward RPC is, by definition, useless as a freshness check,
33+
// and the proxy's verify-then-forward path stays within its 5s retry
34+
// budget.
35+
const verifyLeaderTimeout = 5 * time.Second
36+
1137
func engineForGroup(g *ShardGroup) raftengine.Engine {
1238
if g == nil {
1339
return nil
@@ -41,7 +67,9 @@ func verifyLeaderEngineCtx(ctx context.Context, engine raftengine.LeaderView) er
4167
}
4268

4369
func verifyLeaderEngine(engine raftengine.LeaderView) error {
44-
return verifyLeaderEngineCtx(context.Background(), engine)
70+
ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
71+
defer cancel()
72+
return verifyLeaderEngineCtx(ctx, engine)
4573
}
4674

4775
func linearizableReadEngineCtx(ctx context.Context, engine raftengine.LeaderView) (uint64, error) {

kv/raft_engine_test.go

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
package kv
2+
3+
import (
4+
"context"
5+
stderrors "errors"
6+
"testing"
7+
"time"
8+
9+
"github.com/bootjp/elastickv/internal/raftengine"
10+
)
11+
12+
// blockingLeaderView is a LeaderView whose VerifyLeader blocks until ctx is
13+
// cancelled, modelling the production pathology where ReadIndex stalls
14+
// because heartbeat acks fail to land. LinearizableRead is similarly
15+
// well-behaved on cancel; State / Leader are stamped enough to satisfy the
16+
// callers under test.
17+
type blockingLeaderView struct{}
18+
19+
func (blockingLeaderView) State() raftengine.State { return raftengine.StateLeader }
20+
func (blockingLeaderView) Leader() raftengine.LeaderInfo { return raftengine.LeaderInfo{ID: "self"} }
21+
func (blockingLeaderView) VerifyLeader(ctx context.Context) error {
22+
<-ctx.Done()
23+
return ctx.Err()
24+
}
25+
func (blockingLeaderView) LinearizableRead(ctx context.Context) (uint64, error) {
26+
<-ctx.Done()
27+
return 0, ctx.Err()
28+
}
29+
30+
// TestVerifyLeaderEngine_BoundsBlockingReadIndex pins the regression: if a
31+
// stalled ReadIndex used to return only when the underlying ctx fired, but
32+
// callers passed context.Background(), the goroutine pinned forever. After
33+
// 2026-05-08-style stalls in production this must complete within roughly
34+
// verifyLeaderTimeout, surfacing context.DeadlineExceeded.
35+
func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
36+
t.Parallel()
37+
38+
start := time.Now()
39+
err := verifyLeaderEngine(blockingLeaderView{})
40+
elapsed := time.Since(start)
41+
42+
if err == nil {
43+
t.Fatalf("verifyLeaderEngine(blocking) returned nil; expected DeadlineExceeded")
44+
}
45+
if !stderrors.Is(err, context.DeadlineExceeded) {
46+
t.Fatalf("verifyLeaderEngine(blocking) err = %v; want DeadlineExceeded", err)
47+
}
48+
// Allow generous slack so a slow CI host does not flake; the point is
49+
// not to assert a tight bound but to prove the call returns at all.
50+
if elapsed > 2*verifyLeaderTimeout {
51+
t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
52+
}
53+
}
54+

0 commit comments

Comments
 (0)