fix(raft): bump snapshot spool cap to 16 GiB + env override (#746)

bootjp · web-flow · commit 6a5e4935b509 · 2026-05-09T00:04:59.000+09:00
## Summary

Receive-side snapshot spool was hardcoded to 1 GiB. Production FSM
snapshots at 1.35 GiB exceeded that ceiling: `snapshotSpool.Write`
returned `errSnapshotPayloadTooLarge` mid-stream, the gRPC
`SendSnapshot` stream broke, and etcd raft retried the snapshot
indefinitely.

This PR raises the default to 16 GiB and adds an
`ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES` env override.

## Production incident — 2026-05-08

Two followers (192.168.0.211 and 192.168.0.213) fell behind the leader's
log during an earlier OOM cascade. The leader truncated past their match
indices, so catch-up required a full FSM snapshot. Each transfer
attempt:

1. Leader streams 1.35 GiB FSM via `streamFSMSnapshot` (no send-side
cap)
2. Receiver writes chunks into `snapshotSpool`
3. At ~1 GiB the spool returns `errSnapshotPayloadTooLarge`
4. Receive returns error → gRPC stream closed → leader sees EOF
5. etcd raft fires `Progress.PendingSnapshot` retry → loop
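
The loop above can be reproduced in miniature. The sketch below is not the project's code: `cappedSink`, `fillReader`, `attemptTransfer`, and `errTooLarge` are hypothetical stand-ins, and the sizes are scaled down (1350 units for the 1.35 GiB FSM, 1000 for the old 1 GiB cap). It only illustrates why retrying cannot help: each attempt starts from an empty spool and hits the same wall at the same offset.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errTooLarge is a hypothetical stand-in for errSnapshotPayloadTooLarge.
var errTooLarge = errors.New("snapshot payload exceeds limit")

// cappedSink is a hypothetical stand-in for snapshotSpool: it only counts
// bytes and rejects any write that would push the total past limit.
type cappedSink struct {
	size, limit int64
}

func (c *cappedSink) Write(p []byte) (int, error) {
	if int64(len(p)) > c.limit-c.size {
		return 0, errTooLarge
	}
	c.size += int64(len(p))
	return len(p), nil
}

// fillReader yields an endless stream of bytes; the content is irrelevant.
type fillReader struct{}

func (fillReader) Read(p []byte) (int, error) { return len(p), nil }

// attemptTransfer streams payload bytes into a fresh sink in fixed-size
// chunks, mirroring how every retry starts over with a new spool. It
// returns the bytes accepted before the attempt ended and the error, if any.
func attemptTransfer(payload, limit int64) (int64, error) {
	sink := &cappedSink{limit: limit}
	src := io.LimitReader(fillReader{}, payload)
	_, err := io.CopyBuffer(sink, src, make([]byte, 100))
	return sink.size, err
}

func main() {
	// Scaled-down units: 1350 stands for the 1.35 GiB FSM, 1000 for the old cap.
	for attempt := 1; attempt <= 3; attempt++ {
		n, err := attemptTransfer(1350, 1000)
		fmt.Printf("attempt %d: accepted=%d err=%v\n", attempt, n, err)
	}
	// With a cap above the payload size the same transfer completes.
	n, err := attemptTransfer(1350, 16000)
	fmt.Printf("raised cap: accepted=%d err=%v\n", n, err)
}
```

Every capped attempt accepts the same number of bytes and then fails, which matches the receive-then-discard pattern seen in the spool directory.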

Symptoms observed:
- Follower 213 frozen at `applied=26,459,962` (over 1.16M entries
behind, never moved for 4+ hours)
- Leader 210 sustained ~100 MB/s outbound for hours
- Host disks at 73-99% util, ~125 MB/s sustained
- Container 211 receive dir contained `elastickv-etcd-snapshot-<random>`
files whose IDs changed every probe — visual confirmation of the
receive-then-discard loop
- Goroutine 1573 on leader stuck in `streamFSMSnapshot` →
`sendSnapshotChunk` → gRPC `writeQuota.get` (HTTP/2 flow-control),
waiting for receiver acks that never came because the receive had
already errored out

Cluster impact: with 4/5 voters caught up, write quorum held and the
cluster stayed up, but two followers were perpetually stale and the
leader's CPU and disk were burned on futile retries.

## Fix

```go
const defaultMaxSnapshotPayloadBytes int64 = 16 << 30 // 16 GiB
```

- **16 GiB** is sized as ~12× the production-observed FSM size
(1.35 GiB), so data growth has ample headroom before the cap is hit again.
- **Per-spool capture**: `maxSize` is resolved at `newSnapshotSpool`
time and read-only thereafter, so a test (or future env flip) cannot
tear an in-flight receive.
- **`ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES`** env override for
operators on extreme-data deployments. Invalid values fall back to the
default with a `slog.Warn` (fail-soft so a typo doesn't zero the cap and
break every receive).

The cap still exists — defense against a misbehaving / compromised peer
streaming unbounded data into the spool dir survives — but at a
magnitude that is realistic.
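
As a quick check of the sizing, the sketch below works through the arithmetic. `gibToBytes` and `headroom` are hypothetical helper names, and the 32 GiB figure is only an example of a value an operator might choose. Because the resolver parses the variable with `strconv.ParseInt(v, 10, 64)`, the override has to be a plain base-10 byte count; a string such as `16GiB` would fail to parse and fall back to the default.

```go
package main

import "fmt"

// gibToBytes converts a whole number of GiB into bytes.
func gibToBytes(gib int64) int64 { return gib << 30 }

// headroom reports how many times observedBytes fits into capBytes.
func headroom(capBytes, observedBytes int64) float64 {
	return float64(capBytes) / float64(observedBytes)
}

func main() {
	// 1.35 GiB, the FSM size reported in the incident, truncated to whole bytes.
	observed := int64(1.35 * float64(gibToBytes(1)))

	fmt.Printf("old cap: %d bytes (%.2fx observed)\n",
		gibToBytes(1), headroom(gibToBytes(1), observed))
	fmt.Printf("new cap: %d bytes (%.2fx observed)\n",
		gibToBytes(16), headroom(gibToBytes(16), observed))

	// Example override value for a hypothetical 32 GiB cap.
	fmt.Printf("ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES=%d\n", gibToBytes(32))
}
```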

## Self-review (5 lenses)

1. **Data loss** — none. The cap was rejecting valid snapshots; raising
it lets receivers accept FSM transfers they should already have been
accepting. No persisted state changes.
2. **Concurrency** — `maxSize` captured at construction, read-only
thereafter. No new locks. The env resolver is plain `os.Getenv` +
`ParseInt`; no shared state.
3. **Performance** — one `Getenv` + `ParseInt` per snapshot creation.
Snapshots are infrequent (hours-scale on a stable cluster), so
negligible. The 16 GiB default does NOT pre-allocate; the spool grows on
disk only as bytes arrive.
4. **Data consistency** — snapshot integrity unchanged. The fix only
widens the reception envelope; the same chunk-validation, metadata, and
final-flag handling apply.
5. **Test coverage**:
- `TestSnapshotSpool_DefaultCapAcceptsRealisticFSM` writes 1.5 GiB
through `Write` (skipped under `-short` to keep `make test` fast).
- `TestSnapshotSpool_OverrideViaEnv` exercises a lowered-cap value to
confirm the env knob actually moves the cap and the
`errSnapshotPayloadTooLarge` sentinel still surfaces past it.
- `TestSnapshotSpool_OverrideInvalidFallsBack` pins fail-soft on
malformed env input so a typo doesn't zero the cap.

## Test plan

- [x] `go test -race -count=1 -short ./internal/raftengine/etcd` —
11.4s, all green
- [x] `go test -race -count=1 -run
TestSnapshotSpool_DefaultCapAcceptsRealisticFSM
./internal/raftengine/etcd` — 1.96s, green (1.5 GiB write succeeds)
- [ ] After merge: deploy to 192.168.0.x cluster, verify 213 receives a
fresh snapshot and `applied_index` advances to match the leader

## Follow-up (separate PRs)

- `snapshotSpool.Bytes()` materializes the entire payload as `[]byte`
for `RawNode.Step`. With 16 GiB allowed this is a real OOM risk on
memory-constrained nodes. Streaming snapshot apply (the FSM-side path
bypassing `raftpb` materialization) is the next step.
- Make the leader respect a follower-advertised receive cap so a cluster
running mixed binaries can negotiate a safe value.
- 211/213 formal recovery: now that this PR unblocks snapshot
completion, plan the operational steps to re-add 211 (currently stopped,
data wiped) via a Learner path.
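
To illustrate the first follow-up item, here is a minimal sketch of chunked consumption from an `io.Reader`, assuming the FSM could accept its snapshot piece by piece. `applyChunked`, `countChunks`, and the `apply` callback are hypothetical names; the real FSM restore interface is not shown in this PR and may look quite different. The point is only that peak memory is bounded by the chunk size rather than by the payload size.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// applyChunked reads r in pieces of at most chunkSize bytes and hands each
// piece to apply. It returns the total number of bytes applied. The buffer
// is reused, so apply must not retain the slice after returning.
func applyChunked(r io.Reader, chunkSize int, apply func([]byte) error) (int64, error) {
	if chunkSize <= 0 {
		return 0, fmt.Errorf("chunkSize must be positive, got %d", chunkSize)
	}
	buf := make([]byte, chunkSize)
	var total int64
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if aerr := apply(buf[:n]); aerr != nil {
				return total, aerr
			}
			total += int64(n)
		}
		// io.EOF means nothing was left; io.ErrUnexpectedEOF means the
		// final chunk was short. Both mark a normal end of stream here.
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// countChunks is a small driver that reports how many chunks were applied.
func countChunks(payload []byte, chunkSize int) (int, int64, error) {
	chunks := 0
	total, err := applyChunked(bytes.NewReader(payload), chunkSize, func(p []byte) error {
		chunks++
		return nil
	})
	return chunks, total, err
}

func main() {
	chunks, total, err := countChunks(make([]byte, 10), 4)
	fmt.Println(chunks, total, err)
}
```

In the real code path the reader would presumably come from the spool file rather than an in-memory slice, which is what would remove the need for `Bytes()` to allocate the whole payload.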


## Summary by CodeRabbit

* **New Features**
* Snapshot payload size limit is now configurable via
`ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES` environment variable
(default: 16 GiB).
* Invalid environment values gracefully fall back to default
configuration.

* **Bug Fixes**
* Enhanced error messages when snapshots exceed limits, displaying
requested size versus configured limit.

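
The diff below also changes the cap check in `snapshotSpool.Write` from an addition-based comparison to a subtraction-based one. The sketch here isolates that difference. `exceedsNaive` and `exceedsSafe` are hypothetical names, and the inputs are chosen to force the overflow, assuming `0 <= size <= limit` and a non-negative write length. Go wraps signed integer arithmetic on overflow at run time, so the naive sum can turn negative and slip under the limit.

```go
package main

import (
	"fmt"
	"math"
)

// exceedsNaive is the addition-based check: n+size can wrap around to a
// negative value when size is close to math.MaxInt64.
func exceedsNaive(n, size, limit int64) bool {
	return n+size > limit
}

// exceedsSafe is the subtraction-based check: with 0 <= size <= limit,
// limit-size stays in [0, limit] and cannot overflow.
func exceedsSafe(n, size, limit int64) bool {
	return n > limit-size
}

func main() {
	limit := int64(math.MaxInt64) // an extreme env override
	size := limit - 10            // spool is 10 bytes short of the cap
	n := int64(20)                // incoming chunk would exceed the cap

	fmt.Println("naive rejects:", exceedsNaive(n, size, limit))
	fmt.Println("safe rejects: ", exceedsSafe(n, size, limit))

	// At ordinary sizes the two checks agree.
	fmt.Println(exceedsNaive(1, 4096, 4096), exceedsSafe(1, 4096, 4096))
}
```

The two checks only diverge when the cap is set near the top of the `int64` range, which is reachable now that the cap comes from an environment variable.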
diff --git a/internal/raftengine/etcd/snapshot_spool.go b/internal/raftengine/etcd/snapshot_spool.go
@@ -2,40 +2,82 @@ package etcd
 
 import (
 	"io"
+	"log/slog"
 	"os"
 	"path/filepath"
+	"strconv"
+	"strings"
 
 	"github.com/cockroachdb/errors"
 )
 
-var (
-	// The current raftpb snapshot APIs still materialize payloads as []byte, so
-	// the prototype cannot stream snapshots end-to-end yet. Keep the payload on
-	// disk while assembling it and fail fast before unbounded growth.
-	maxSnapshotPayloadBytes int64 = 1 << 30 // 1 GiB
+// defaultMaxSnapshotPayloadBytes is the receive-side cap on a single snapshot
+// stream's spooled payload. Production hit a 1 GiB ceiling here that was
+// silently rejecting real-world FSM transfers (1.35 GiB+), so the receiver
+// returned errSnapshotPayloadTooLarge mid-stream, the gRPC stream broke,
+// and etcd raft retried — indefinitely, since each retry hit the same wall.
+// Followers stuck at stale applied indices, leader sustained ~100 MB/s
+// outbound, host disks saturated for hours.
+//
+// 16 GiB is sized as ~12× the production-observed FSM size so the limit
+// does not drift back into the runway as data grows. The cap still exists
+// so a misbehaving / compromised peer cannot stream unbounded data into
+// the spool dir; operators can raise it further via
+// ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES if a real FSM ever exceeds
+// even this default.
+const defaultMaxSnapshotPayloadBytes int64 = 16 << 30 // 16 GiB
 
-	errSnapshotPayloadTooLarge = errors.New("etcd raft snapshot payload exceeds limit")
-)
+const maxSnapshotPayloadBytesEnvVar = "ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES"
+
+// resolveMaxSnapshotPayloadBytes evaluates the env override once per spool
+// creation. Snapshots are infrequent enough that one Getenv + ParseInt per
+// spool is invisible in profiles, and resolving at construction means tests
+// that flip the env via t.Setenv don't have to mutate process-wide globals.
+func resolveMaxSnapshotPayloadBytes() int64 {
+	v := strings.TrimSpace(os.Getenv(maxSnapshotPayloadBytesEnvVar))
+	if v == "" {
+		return defaultMaxSnapshotPayloadBytes
+	}
+	n, err := strconv.ParseInt(v, 10, 64)
+	if err != nil || n <= 0 {
+		slog.Warn("invalid ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES; using default",
+			"value", v, "default_bytes", defaultMaxSnapshotPayloadBytes)
+		return defaultMaxSnapshotPayloadBytes
+	}
+	return n
+}
+
+var errSnapshotPayloadTooLarge = errors.New("etcd raft snapshot payload exceeds limit")
 
 const snapshotSpoolPattern = "elastickv-etcd-snapshot-*"
 
 type snapshotSpool struct {
-	file *os.File
-	path string
-	size int64
+	file    *os.File
+	path    string
+	size    int64
+	maxSize int64
 }
 
 func newSnapshotSpool(dir string) (*snapshotSpool, error) {
 	file, err := os.CreateTemp(dir, snapshotSpoolPattern)
 	if err != nil {
 		return nil, errors.WithStack(err)
 	}
-	return &snapshotSpool{file: file, path: file.Name()}, nil
+	return &snapshotSpool{
+		file:    file,
+		path:    file.Name(),
+		maxSize: resolveMaxSnapshotPayloadBytes(),
+	}, nil
 }
 
 func (s *snapshotSpool) Write(p []byte) (int, error) {
-	if int64(len(p))+s.size > maxSnapshotPayloadBytes {
-		return 0, errors.Wrapf(errSnapshotPayloadTooLarge, "%d > %d", int64(len(p))+s.size, maxSnapshotPayloadBytes)
+	// Subtraction-based comparison so the cap check stays correct even when
+	// s.maxSize is set to a value near math.MaxInt64 via the env override:
+	// `int64(len(p))+s.size > s.maxSize` would overflow into a negative number
+	// at large maxSize and let the write through. `int64(len(p)) > s.maxSize-s.size`
+	// stays in [0, maxSize] and rejects the same payloads correctly.
+	if int64(len(p)) > s.maxSize-s.size {
+		return 0, errors.Wrapf(errSnapshotPayloadTooLarge, "adding %d bytes to current %d would exceed limit %d", len(p), s.size, s.maxSize)
 	}
 	n, err := s.file.Write(p)
 	s.size += int64(n)
@@ -49,13 +91,17 @@ func (s *snapshotSpool) Bytes() ([]byte, error) {
 	if _, err := s.file.Seek(0, io.SeekStart); err != nil {
 		return nil, errors.WithStack(err)
 	}
-	// Read incrementally instead of sizing a buffer from s.size so malformed
-	// inputs stay bounded by maxSnapshotPayloadBytes and file-backed I/O.
-	data, err := io.ReadAll(s.file)
-	if err != nil {
+	// Pre-allocate from the bytes we have already accepted past Write's
+	// per-call cap check, instead of letting io.ReadAll grow the buffer
+	// from 512 bytes through repeated append reallocations (a 1.35 GiB
+	// receive would reallocate dozens of times and copy the running total
+	// each time). s.size is the truth-of-record for what's on disk because
+	// Write advances it by the byte count each os.File.Write call reports.
+	buf := make([]byte, s.size)
+	if _, err := io.ReadFull(s.file, buf); err != nil {
 		return nil, errors.WithStack(err)
 	}
-	return data, nil
+	return buf, nil
 }
 
 func (s *snapshotSpool) Reader() (io.Reader, error) {
diff --git a/internal/raftengine/etcd/snapshot_spool_test.go b/internal/raftengine/etcd/snapshot_spool_test.go
@@ -1,14 +1,100 @@
 package etcd
 
 import (
+	"bytes"
 	"fmt"
 	"os"
 	"path/filepath"
+	"strconv"
 	"testing"
 
+	"github.com/cockroachdb/errors"
 	"github.com/stretchr/testify/require"
 )
 
+// TestSnapshotSpool_DefaultCapAcceptsRealisticFSM pins the regression behind
+// the 2026-05-08 incident: with the prior 1 GiB hardcoded cap, any real-world
+// FSM (production observed 1.35 GiB) failed mid-stream with
+// errSnapshotPayloadTooLarge, breaking the gRPC snapshot stream and locking
+// the leader/follower into a retransmit loop. The default cap must accept at
+// least 1.5 GiB without env override.
+func TestSnapshotSpool_DefaultCapAcceptsRealisticFSM(t *testing.T) {
+	if testing.Short() {
+		t.Skip("skipping: writes 1.5 GiB to a temp file")
+	}
+	dir := t.TempDir()
+	spool, err := newSnapshotSpool(dir)
+	require.NoError(t, err)
+	t.Cleanup(func() { _ = spool.Close() })
+
+	// 1.5 GiB exceeds the legacy 1 GiB ceiling and matches realistic
+	// production FSM sizes within the same order of magnitude.
+	const target = int64(1536) << 20 // 1.5 GiB
+	const chunk = 8 << 20            // 8 MiB writes mirror the gRPC snapshot chunk size order
+	buf := bytes.Repeat([]byte{0xAB}, chunk)
+
+	var written int64
+	for written < target {
+		toWrite := chunk
+		if remaining := target - written; remaining < int64(chunk) {
+			toWrite = int(remaining)
+		}
+		n, err := spool.Write(buf[:toWrite])
+		require.NoError(t, err, "write at offset %d unexpectedly failed", written)
+		require.Equal(t, toWrite, n)
+		written += int64(n)
+	}
+	require.Equal(t, target, spool.size)
+
+	// Round-trip through the materialization path (the io.ReadAll →
+	// io.ReadFull refactor) to lock down behaviour at 1.5 GiB. The
+	// returned slice MUST match s.size exactly: a short read here would
+	// indicate the pre-allocation drifted out of sync with what Write
+	// actually persisted to the spool file.
+	got, err := spool.Bytes()
+	require.NoError(t, err)
+	require.Equal(t, int(target), len(got), "Bytes() returned %d, want %d", len(got), target)
+	// Spot-check first/last bytes match the 0xAB fill; full byte-equality
+	// would double the test's memory cost without adding signal.
+	require.Equal(t, byte(0xAB), got[0])
+	require.Equal(t, byte(0xAB), got[len(got)-1])
+}
+
+// TestSnapshotSpool_OverrideViaEnv confirms the env knob actually moves the
+// cap. Tests deliberately *lower* it (cheap to write past) instead of
+// raising — the upper-bound test above already proves a generous cap works.
+func TestSnapshotSpool_OverrideViaEnv(t *testing.T) {
+	const spoolCap = int64(4096)
+	t.Setenv(maxSnapshotPayloadBytesEnvVar, strconv.FormatInt(spoolCap, 10))
+
+	spool, err := newSnapshotSpool(t.TempDir())
+	require.NoError(t, err)
+	t.Cleanup(func() { _ = spool.Close() })
+
+	require.Equal(t, spoolCap, spool.maxSize)
+
+	// Write up to the cap — succeeds.
+	_, err = spool.Write(bytes.Repeat([]byte{0x01}, int(spoolCap)))
+	require.NoError(t, err)
+
+	// One byte past — fails with the documented sentinel so callers can
+	// errors.Is against errSnapshotPayloadTooLarge for telemetry.
+	_, err = spool.Write([]byte{0x02})
+	require.Error(t, err)
+	require.True(t, errors.Is(err, errSnapshotPayloadTooLarge), "got %v", err)
+}
+
+// TestSnapshotSpool_OverrideInvalidFallsBack pins the resolver's
+// fail-soft behaviour: a malformed env value must NOT zero the cap (which
+// would make every receive fail) — it falls back to the default.
+func TestSnapshotSpool_OverrideInvalidFallsBack(t *testing.T) {
+	t.Setenv(maxSnapshotPayloadBytesEnvVar, "not-a-number")
+	spool, err := newSnapshotSpool(t.TempDir())
+	require.NoError(t, err)
+	t.Cleanup(func() { _ = spool.Close() })
+	require.Equal(t, defaultMaxSnapshotPayloadBytes, spool.maxSize)
+}
+
 func TestCleanupStaleSnapshotSpools(t *testing.T) {
 	dir := t.TempDir()