Commit 6a5e493
fix(raft): bump snapshot spool cap to 16 GiB + env override (#746)
## Summary
Receive-side snapshot spool was hardcoded to 1 GiB. Production FSM
snapshots at 1.35 GiB exceeded that ceiling: `snapshotSpool.Write`
returned `errSnapshotPayloadTooLarge` mid-stream, the gRPC
`SendSnapshot` stream broke, and etcd raft retried the snapshot
indefinitely.
This PR raises the default to 16 GiB and adds an
`ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES` env override.
## Production incident — 2026-05-08
Two followers (192.168.0.211 and 192.168.0.213) fell behind the leader's
log during an earlier OOM cascade. The leader truncated past their match
indices, so catch-up required a full FSM snapshot. Each transfer
attempt:
1. Leader streams 1.35 GiB FSM via `streamFSMSnapshot` (no send-side
cap)
2. Receiver writes chunks into `snapshotSpool`
3. At ~1 GiB the spool returns `errSnapshotPayloadTooLarge`
4. Receive returns error → gRPC stream closed → leader sees EOF
5. etcd raft fires `Progress.PendingSnapshot` retry → loop
Symptoms observed:
- Follower 213 frozen at `applied=26,459,962` (over 1.16M entries
behind, never moved for 4+ hours)
- Leader 210 sustained ~100 MB/s outbound for hours
- Host disks at 73-99% util, ~125 MB/s sustained
- Container 211 receive dir contained `elastickv-etcd-snapshot-<random>`
files whose IDs changed every probe — visual confirmation of the
receive-then-discard loop
- Goroutine 1573 on leader stuck in `streamFSMSnapshot` →
`sendSnapshotChunk` → gRPC `writeQuota.get` (HTTP/2 flow-control),
waiting for receiver acks that never came because the receive had
already errored out
Cluster impact: 4/5 voters caught up was sufficient for write quorum, so
the cluster stayed up; but two followers were perpetually stale and the
leader's CPU + disk were burned on the futile retries.
## Fix
```go
const defaultMaxSnapshotPayloadBytes int64 = 16 << 30 // 16 GiB
```
- **16 GiB** is roughly 12× the largest FSM snapshot observed in
production (1.35 GiB), which leaves ample headroom for growth.
- **Per-spool capture**: `maxSize` is resolved at `newSnapshotSpool`
time and read-only thereafter, so a test (or future env flip) cannot
tear an in-flight receive.
- **`ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES`** env override for
operators on extreme-data deployments. Invalid values fall back to the
default with a `slog.Warn` (fail-soft so a typo doesn't zero the cap and
break every receive).
The cap still exists, so the defense against a misbehaving or compromised
peer streaming unbounded data into the spool dir survives, but the limit
now sits at a realistic magnitude.
## Self-review (5 lenses)
1. **Data loss** — none. The cap was rejecting valid snapshots; raising
it lets receivers accept FSM transfers they should already have been
accepting. No persisted state changes.
2. **Concurrency** — `maxSize` captured at construction, read-only
thereafter. No new locks. The env resolver is plain `os.Getenv` +
`ParseInt`; no shared state.
3. **Performance** — one `Getenv` + `ParseInt` per snapshot creation.
Snapshots are infrequent (hours-scale on a stable cluster), so
negligible. The 16 GiB default does NOT pre-allocate; the spool grows on
disk only as bytes arrive.
4. **Data consistency** — snapshot integrity unchanged. The fix only
widens the reception envelope; the same chunk-validation, metadata, and
final-flag handling apply.
5. **Test coverage**:
   - `TestSnapshotSpool_DefaultCapAcceptsRealisticFSM` writes 1.5 GiB
     through `Write` (skipped under `-short` to keep `make test` fast).
   - `TestSnapshotSpool_OverrideViaEnv` exercises a lowered-cap value to
     confirm the env knob actually moves the cap and the
     `errSnapshotPayloadTooLarge` sentinel still surfaces past it.
   - `TestSnapshotSpool_OverrideInvalidFallsBack` pins fail-soft on
     malformed env input so a typo doesn't zero the cap.
## Test plan
- [x] `go test -race -count=1 -short ./internal/raftengine/etcd` —
11.4s, all green
- [x] `go test -race -count=1 -run
TestSnapshotSpool_DefaultCapAcceptsRealisticFSM
./internal/raftengine/etcd` — 1.96s, green (1.5 GiB write succeeds)
- [ ] After merge: deploy to 192.168.0.x cluster, verify 213 receives a
fresh snapshot and `applied_index` advances to match the leader
## Follow-up (separate PRs)
- `snapshotSpool.Bytes()` materializes the entire payload as `[]byte`
for `RawNode.Step`. With 16 GiB allowed this is a real OOM risk on
memory-constrained nodes. Streaming snapshot apply (the FSM-side path
bypassing `raftpb` materialization) is the next step.
- Make the leader respect a follower-advertised receive cap so a cluster
running mixed binaries can negotiate a safe value.
- 211/213 formal recovery: now that this PR unblocks snapshot
completion, plan the operational steps to re-add 211 (currently stopped,
data wiped) via a Learner path.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Snapshot payload size limit is now configurable via
    `ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES` environment variable
    (default: 16 GiB).
  * Invalid environment values gracefully fall back to default
    configuration.
* **Bug Fixes**
  * Enhanced error messages when snapshots exceed limits, displaying
    requested size versus configured limit.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

2 files changed
Lines changed: 150 additions & 18 deletions