Skip to content

Commit 5f12d5d

Browse files
authored
fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline (#745)
## Summary `verifyLeaderEngine()` called `engine.VerifyLeader` with `context.Background()`, so callers without an upstream context blocked indefinitely on a ReadIndex round-trip. A single transient stall accumulated callers permanently. This caps the no-context path at 5s. ## Production incident — 2026-05-08 Follower 192.168.0.214 lost its network route (`no route to host`, ARP `INCOMPLETE`). The leader's ReadIndex completion stalled intermittently and verify-callers piled up at ~9/sec without bound. After ~37 minutes the leader (192.168.0.212) showed: - **20,560 goroutines**, 20,478 of them in `etcd.(*Engine).submitRead` `[select, 35-39 minutes]` - **CPU 1870%** (`Engine.run` Ready loop walks `pendingReads` O(N) per tick → queue feeds back on itself) - **Host MemAvailable** trending toward 0 → OOM - Each new leader after failover re-entered the same death spiral Mitigation: `docker restart elastickv` on 212 dropped it to 74% CPU / 163 MiB. 214 was hardware-rebooted and is REACHABLE again. This PR prevents the next leader from re-entering the spiral. ## Affected callers All use the no-context `verifyLeaderEngine` variant: - `kv/leader_proxy.go` — `LeaderProxy.Commit` / `.Abort` (every Redis write) - `kv/coordinator.go` — `Coordinate.VerifyLeader` - `kv/sharded_coordinator.go` — `ShardedCoordinator.VerifyLeader` / `VerifyLeaderForKey` - `adapter/s3.go` — `isVerifiedS3Leader` / inline VerifyLeader at line 2291 (healthz) - `adapter/sqs.go` — `isVerifiedSQSLeader` (healthz) - `main_admin.go` — `LeaderProbe` callback for `/admin/healthz/leader` ## Failure mode on timeout `context.DeadlineExceeded` surfaces to the caller. `LeaderProxy` falls back to `forwardWithRetry` (the existing path for any verify failure). Healthz handlers report 503 not-leader. Background loops (lock resolver, HLC lease) skip this tick. No new infinite loop: even when this node *is* the leader, a verify-failure → forward path already exists in `LeaderProxy.Commit`; that path is bounded by `leaderProxyRetryBudget = 5s` and `maxForwardRetries = 3`. ## Self-review (5 lenses) 1. **Data loss** — none. The fix only shortens a never-returning wait. `verifyLeaderEngine` is a freshness check, not a write path. Already-committed proposals are unaffected. 2. **Concurrency** — the new ctx is local to each call (`defer cancel`), no shared state, no lock changes. Engine-side blocking semantics unchanged; we just stop waiting forever. 3. **Performance** — net positive. Removes the unbounded goroutine pile-up and the O(N) `pendingReads` walk it caused. No new allocations on the success path beyond the `WithTimeout` context. 4. **Data consistency** — ReadIndex still completes when quorum heartbeats land within 5s. A timeout means the caller could not confirm leadership freshness, which the existing "fall through to forward" path already treats as a soft failure. 5. **Test coverage** — `kv/raft_engine_test.go::TestVerifyLeaderEngine_BoundsBlockingReadIndex` pins the regression: a `blockingLeaderView` that holds `VerifyLeader` on its ctx must surface `DeadlineExceeded` within `2 * verifyLeaderTimeout`. ## Test plan - [x] `go test -race -count=1 ./kv` — 9.3s, all green - [x] New regression test `TestVerifyLeaderEngine_BoundsBlockingReadIndex` covers the blocking case - [ ] Roll out to 192.168.0.x cluster after merge, watch CPU/Mem panel for the next 4-6h to confirm no more OOM cascade ## Future work (separate PRs) Plumb real request contexts through `LeaderProxy.Commit/Abort` and the healthz handlers so client-side deadlines cascade naturally instead of relying on this fixed bound. Today the Redis adapter's per-command deadline doesn't reach `LeaderProxy`; the proxy interface takes `[]*pb.Request` only. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Implemented timeout bounds for leader verification operations with a 5-second limit to prevent indefinite blocking. * **Tests** * Added test to verify leader verification properly handles timeout scenarios and completes within the expected timeframe under stalled conditions. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
2 parents 008eb81 + a2ab245 commit 5f12d5d

2 files changed

Lines changed: 94 additions & 1 deletion

File tree

kv/raft_engine.go

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,12 +2,30 @@ package kv
22

33
import (
44
"context"
5+
"time"
56

67
"github.com/bootjp/elastickv/internal/monoclock"
78
"github.com/bootjp/elastickv/internal/raftengine"
89
"github.com/cockroachdb/errors"
910
)
1011

12+
// verifyLeaderTimeout caps how long the no-context verifyLeaderEngine path
13+
// is willing to wait for a ReadIndex round-trip. Without this bound,
14+
// callers that hold context.Background() (LeaderProxy.Commit/Abort,
15+
// Coordinate.VerifyLeader, ShardedCoordinator.VerifyLeader[ForKey], and
16+
// the S3/SQS/admin /healthz/leader handlers) blocked indefinitely whenever
17+
// ReadIndex completion stalled, and a single transient stall accumulated
18+
// callers permanently — Engine.run's Ready loop walks pendingReads O(N)
19+
// per tick, so the queue feeds back on itself once it grows.
20+
//
21+
// 5s matches leaderForwardTimeout: a verify that takes longer than a
22+
// single forward RPC is useless as a freshness check, and the proxy's
23+
// verify-then-forward path stays within its 5s retry budget.
24+
//
25+
// See PR #745 / incident 2026-05-08 for the goroutine-pile production
26+
// failure this prevents.
27+
const verifyLeaderTimeout = 5 * time.Second
28+
1129
func engineForGroup(g *ShardGroup) raftengine.Engine {
1230
if g == nil {
1331
return nil
@@ -41,7 +59,9 @@ func verifyLeaderEngineCtx(ctx context.Context, engine raftengine.LeaderView) er
4159
}
4260

4361
func verifyLeaderEngine(engine raftengine.LeaderView) error {
44-
return verifyLeaderEngineCtx(context.Background(), engine)
62+
ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
63+
defer cancel()
64+
return verifyLeaderEngineCtx(ctx, engine)
4565
}
4666

4767
func linearizableReadEngineCtx(ctx context.Context, engine raftengine.LeaderView) (uint64, error) {

kv/raft_engine_test.go

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
package kv
2+
3+
import (
4+
"context"
5+
"testing"
6+
"time"
7+
8+
"github.com/bootjp/elastickv/internal/raftengine"
9+
"github.com/cockroachdb/errors"
10+
)
11+
12+
// blockingLeaderView is a LeaderView whose VerifyLeader blocks until ctx is
13+
// cancelled, modelling the production pathology where ReadIndex stalls
14+
// because heartbeat acks fail to land. LinearizableRead is similarly
15+
// well-behaved on cancel; State / Leader are stamped enough to satisfy the
16+
// callers under test.
17+
type blockingLeaderView struct{}
18+
19+
func (blockingLeaderView) State() raftengine.State { return raftengine.StateLeader }
20+
func (blockingLeaderView) Leader() raftengine.LeaderInfo { return raftengine.LeaderInfo{ID: "self"} }
21+
func (blockingLeaderView) VerifyLeader(ctx context.Context) error {
22+
<-ctx.Done()
23+
return ctx.Err()
24+
}
25+
func (blockingLeaderView) LinearizableRead(ctx context.Context) (uint64, error) {
26+
<-ctx.Done()
27+
return 0, ctx.Err()
28+
}
29+
30+
// TestVerifyLeaderEngine_BoundsBlockingReadIndex pins the regression: if a
31+
// stalled ReadIndex used to return only when the underlying ctx fired, but
32+
// callers passed context.Background(), the goroutine pinned forever. After
33+
// the 2026-05-08 incident this must complete within roughly
34+
// verifyLeaderTimeout, surfacing context.DeadlineExceeded.
35+
//
36+
// Skipped under -short because the whole point is to wait for the deadline
37+
// to fire; the no-skip path adds verifyLeaderTimeout (5s) to every default
38+
// `make test` run.
39+
func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
40+
t.Parallel()
41+
if testing.Short() {
42+
t.Skip("skipping: blocks for verifyLeaderTimeout (5s)")
43+
}
44+
45+
start := time.Now()
46+
err := verifyLeaderEngine(blockingLeaderView{})
47+
elapsed := time.Since(start)
48+
49+
if err == nil {
50+
t.Fatalf("verifyLeaderEngine(blocking) returned nil; expected DeadlineExceeded")
51+
}
52+
if !errors.Is(err, context.DeadlineExceeded) {
53+
t.Fatalf("verifyLeaderEngine(blocking) err = %v; want DeadlineExceeded", err)
54+
}
55+
// Lower bound: confirm the engine actually held the call until the
56+
// deadline fired, not that some other error path returned
57+
// immediately. Without this, a future regression that returned
58+
// DeadlineExceeded before doing any work (e.g. a misplaced ctx
59+
// check before the engine call) would silently pass.
60+
//
61+
// Tolerate a 200ms early-return slack so a slow CI scheduler that
62+
// trips ctx.Done() a hair before the wall clock catches up does
63+
// not flake.
64+
const slack = 200 * time.Millisecond
65+
if elapsed+slack < verifyLeaderTimeout {
66+
t.Fatalf("verifyLeaderEngine(blocking) returned too early after %s; want >= %s (-%s slack)", elapsed, verifyLeaderTimeout, slack)
67+
}
68+
// Upper bound: prove the call returned at all. Generous so a slow
69+
// CI host does not flake.
70+
if elapsed > 2*verifyLeaderTimeout {
71+
t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
72+
}
73+
}

0 commit comments

Comments
 (0)