Commit 2bcd177
fix(raft): bump snapshot spool cap to 16 GiB + env override
The receive-side snapshot spool was hardcoded to 1 GiB
(maxSnapshotPayloadBytes). Production FSM snapshots of 1.35 GiB
exceeded that ceiling: snapshotSpool.Write returned
errSnapshotPayloadTooLarge mid-stream, the gRPC SendSnapshot stream
broke, and etcd raft retried the snapshot indefinitely because
every retry hit the same limit.
Followers got stuck at stale applied indices (the 213 follower in the
2026-05-08 incident never moved past applied=26,459,962, over a
million entries behind), the leader sustained ~100 MB/s outbound for
hours sending the same 1.35 GiB snapshot over and over, and host
disks saturated at 73-99% utilization. Each receive cycle re-created an
elastickv-etcd-snapshot-* spool file with a fresh random suffix,
so from the outside the loop showed up as in-progress filenames
that kept changing.
Fix:
- Default cap raised to 16 GiB (~12x the production-observed FSM
size) so ordinary data growth does not push snapshots back up
against it.
- The cap is now resolved at each spool creation from
ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES, so an operator can
raise it without rebuilding the binary if even the new default
ever proves insufficient.
- Each spool instance captures its own maxSize at construction
rather than reading a package-level var on every Write, so a
test or env change cannot alter the cap of an in-flight receive.
The cap still exists: its original intent, defending against a
misbehaving or compromised peer that streams unbounded data into
the spool dir, still holds. Only the magnitude is now realistic.
Self-review (5 lenses):
1. Data loss -- none. The cap was rejecting valid snapshots;
raising it lets receivers actually accept FSM transfers they
should already have been accepting. No persisted state changes.
2. Concurrency -- maxSize is captured at newSnapshotSpool time and
read-only thereafter. No new locks. The env resolver is plain
os.Getenv + ParseInt; no shared state.
3. Performance -- one Getenv + ParseInt per spool creation, i.e.
per received snapshot. Snapshots are infrequent (hours-scale on a
stable cluster), so the cost is negligible. The 16 GiB default does
NOT pre-allocate; the spool grows on disk only as bytes arrive.
4. Data consistency -- snapshot integrity unchanged. The fix only
widens the reception envelope; the same chunk-validation,
metadata, and final-flag handling apply.
5. Test coverage -- TestSnapshotSpool_DefaultCapAcceptsRealisticFSM
pins the regression by writing 1.5 GiB through Write
(skipped under -short to keep `make test` fast).
TestSnapshotSpool_OverrideViaEnv exercises a lowered-cap value
to confirm the env knob actually moves the cap and the
errSnapshotPayloadTooLarge sentinel still surfaces past it.
TestSnapshotSpool_OverrideInvalidFallsBack pins fail-soft on
malformed env input so a typo doesn't zero the cap.
Test:
go test -race -count=1 -short ./internal/raftengine/etcd
-- 11.4s, all green.
go test -race -count=1 \
-run TestSnapshotSpool_DefaultCapAcceptsRealisticFSM \
./internal/raftengine/etcd
-- 1.96s, green (1.5 GiB write succeeds).
Future work (separate PRs):
- snapshotSpool.Bytes() materializes the entire payload as []byte
for RawNode.Step. With 16 GiB allowed, this is a real OOM risk
on memory-constrained nodes. Streaming snapshot apply is the next
step: either RawNode.Step accepting an io.Reader, or an FSM-side
path that bypasses raftpb materialization.
- Make the leader respect a follower-advertised receive cap so a
cluster running mixed binaries can negotiate a safe value.
1 parent 5f12d5d commit 2bcd177
2 files changed
Lines changed: 124 additions & 14 deletions