Skip to content

Commit 2bcd177

Browse files
committed
fix(raft): bump snapshot spool cap to 16 GiB + env override
Receive-side snapshot spool was hardcoded to 1 GiB (maxSnapshotPayloadBytes). Production FSM snapshots at 1.35 GiB exceeded that ceiling: snapshotSpool.Write returned errSnapshotPayloadTooLarge mid-stream, the gRPC SendSnapshot stream broke, and etcd raft retried the snapshot indefinitely because each retry hit the same wall. Followers stuck at stale applied indices (the 213 follower in the 2026-05-08 incident never moved past applied=26,459,962 — over a million entries behind), the leader sustained ~100 MB/s outbound for hours sending the same 1.35 GiB snapshot over and over, and host disks saturated at 73-99% util. Each receive cycle re-created an elastickv-etcd-snapshot-* spool file with a fresh random suffix, making the loop visible from the outside as continuously-changing in-progress filenames. Fix: - Default cap raised to 16 GiB (~12x the production-observed FSM size) so it does not drift back into the runway as data grows. - Cap is now resolved per spool creation via ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES, so an operator can raise it without a binary rebuild if even the new default is ever insufficient. - Each spool instance captures its own maxSize at construction rather than reading a package-level var on every Write, so a test or env flip cannot tear an in-flight receive. The cap still exists -- defense against a misbehaving / compromised peer streaming unbounded data into the spool dir is the original intent, and that intent survives -- but the magnitude is now realistic. Self-review (5 lenses): 1. Data loss -- none. The cap was rejecting valid snapshots; raising it lets receivers actually accept FSM transfers they should already have been accepting. No persisted state changes. 2. Concurrency -- maxSize is captured at newSnapshotSpool time and read-only thereafter. No new locks. The env resolver is plain os.Getenv + ParseInt; no shared state. 3. Performance -- one Getenv + ParseInt per snapshot creation. Snapshots are infrequent (hours-scale on a stable cluster), so negligible. The 16 GiB default does NOT pre-allocate; the spool grows on disk only as bytes arrive. 4. Data consistency -- snapshot integrity unchanged. The fix only widens the reception envelope; the same chunk-validation, metadata, and final-flag handling apply. 5. Test coverage -- TestSnapshotSpool_DefaultCapAcceptsRealisticFSM pins the regression by writing 1.5 GiB through Write (skipped under -short to keep `make test` fast). TestSnapshotSpool_OverrideViaEnv exercises a lowered-cap value to confirm the env knob actually moves the cap and the errSnapshotPayloadTooLarge sentinel still surfaces past it. TestSnapshotSpool_OverrideInvalidFallsBack pins fail-soft on malformed env input so a typo doesn't zero the cap. Test: go test -race -count=1 -short ./internal/raftengine/etcd -- 11.4s, all green. go test -race -count=1 \ -run TestSnapshotSpool_DefaultCapAcceptsRealisticFSM \ ./internal/raftengine/etcd -- 1.96s, green (1.5 GiB write succeeds). Future work (separate PRs): - snapshotSpool.Bytes() materializes the entire payload as []byte for RawNode.Step. With 16 GiB allowed, this is a real OOM risk on memory-constrained nodes. Streaming snapshot apply (RawNode.Step accepts an io.Reader, or the FSM-side path bypasses raftpb materialization) is the next step. - Make the leader respect a follower-advertised receive cap so a cluster running mixed binaries can negotiate a safe value.
1 parent 5f12d5d commit 2bcd177

2 files changed

Lines changed: 124 additions & 14 deletions

File tree

internal/raftengine/etcd/snapshot_spool.go

Lines changed: 51 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -2,40 +2,77 @@ package etcd
22

33
import (
44
"io"
5+
"log/slog"
56
"os"
67
"path/filepath"
8+
"strconv"
9+
"strings"
710

811
"github.com/cockroachdb/errors"
912
)
1013

11-
var (
12-
// The current raftpb snapshot APIs still materialize payloads as []byte, so
13-
// the prototype cannot stream snapshots end-to-end yet. Keep the payload on
14-
// disk while assembling it and fail fast before unbounded growth.
15-
maxSnapshotPayloadBytes int64 = 1 << 30 // 1 GiB
14+
// defaultMaxSnapshotPayloadBytes is the receive-side cap on a single snapshot
15+
// stream's spooled payload. Production hit a 1 GiB ceiling here that was
16+
// silently rejecting real-world FSM transfers (1.35 GiB+), so the receiver
17+
// returned errSnapshotPayloadTooLarge mid-stream, the gRPC stream broke,
18+
// and etcd raft retried — indefinitely, since each retry hit the same wall.
19+
// Followers stuck at stale applied indices, leader sustained ~100 MB/s
20+
// outbound, host disks saturated for hours.
21+
//
22+
// 16 GiB is sized as ~12× the production-observed FSM size so the limit
23+
// does not drift back into the runway as data grows. The cap still exists
24+
// so a misbehaving / compromised peer cannot stream unbounded data into
25+
// the spool dir; operators can raise it further via
26+
// ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES if a real FSM ever exceeds
27+
// even this default.
28+
const defaultMaxSnapshotPayloadBytes int64 = 16 << 30 // 16 GiB
1629

17-
errSnapshotPayloadTooLarge = errors.New("etcd raft snapshot payload exceeds limit")
18-
)
30+
const maxSnapshotPayloadBytesEnvVar = "ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES"
31+
32+
// resolveMaxSnapshotPayloadBytes evaluates the env override once per spool
33+
// creation. Snapshots are infrequent enough that one Getenv + ParseInt per
34+
// spool is invisible in profiles, and resolving at construction means tests
35+
// that flip the env via t.Setenv don't have to mutate process-wide globals.
36+
func resolveMaxSnapshotPayloadBytes() int64 {
37+
v := strings.TrimSpace(os.Getenv(maxSnapshotPayloadBytesEnvVar))
38+
if v == "" {
39+
return defaultMaxSnapshotPayloadBytes
40+
}
41+
n, err := strconv.ParseInt(v, 10, 64)
42+
if err != nil || n <= 0 {
43+
slog.Warn("invalid ELASTICKV_RAFT_MAX_SNAPSHOT_PAYLOAD_BYTES; using default",
44+
"value", v, "default_bytes", defaultMaxSnapshotPayloadBytes)
45+
return defaultMaxSnapshotPayloadBytes
46+
}
47+
return n
48+
}
49+
50+
var errSnapshotPayloadTooLarge = errors.New("etcd raft snapshot payload exceeds limit")
1951

2052
const snapshotSpoolPattern = "elastickv-etcd-snapshot-*"
2153

2254
type snapshotSpool struct {
23-
file *os.File
24-
path string
25-
size int64
55+
file *os.File
56+
path string
57+
size int64
58+
maxSize int64
2659
}
2760

2861
func newSnapshotSpool(dir string) (*snapshotSpool, error) {
2962
file, err := os.CreateTemp(dir, snapshotSpoolPattern)
3063
if err != nil {
3164
return nil, errors.WithStack(err)
3265
}
33-
return &snapshotSpool{file: file, path: file.Name()}, nil
66+
return &snapshotSpool{
67+
file: file,
68+
path: file.Name(),
69+
maxSize: resolveMaxSnapshotPayloadBytes(),
70+
}, nil
3471
}
3572

3673
func (s *snapshotSpool) Write(p []byte) (int, error) {
37-
if int64(len(p))+s.size > maxSnapshotPayloadBytes {
38-
return 0, errors.Wrapf(errSnapshotPayloadTooLarge, "%d > %d", int64(len(p))+s.size, maxSnapshotPayloadBytes)
74+
if int64(len(p))+s.size > s.maxSize {
75+
return 0, errors.Wrapf(errSnapshotPayloadTooLarge, "%d > %d", int64(len(p))+s.size, s.maxSize)
3976
}
4077
n, err := s.file.Write(p)
4178
s.size += int64(n)
@@ -50,7 +87,7 @@ func (s *snapshotSpool) Bytes() ([]byte, error) {
5087
return nil, errors.WithStack(err)
5188
}
5289
// Read incrementally instead of sizing a buffer from s.size so malformed
53-
// inputs stay bounded by maxSnapshotPayloadBytes and file-backed I/O.
90+
// inputs stay bounded by s.maxSize and file-backed I/O.
5491
data, err := io.ReadAll(s.file)
5592
if err != nil {
5693
return nil, errors.WithStack(err)

internal/raftengine/etcd/snapshot_spool_test.go

Lines changed: 73 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,87 @@
11
package etcd
22

33
import (
4+
"bytes"
45
"fmt"
56
"os"
67
"path/filepath"
8+
"strconv"
79
"testing"
810

11+
"github.com/cockroachdb/errors"
912
"github.com/stretchr/testify/require"
1013
)
1114

15+
// TestSnapshotSpool_DefaultCapAcceptsRealisticFSM pins the regression behind
16+
// the 2026-05-08 incident: with the prior 1 GiB hardcoded cap, any real-world
17+
// FSM (production observed 1.35 GiB) failed mid-stream with
18+
// errSnapshotPayloadTooLarge, breaking the gRPC snapshot stream and locking
19+
// the leader/follower into a retransmit loop. The default cap must accept at
20+
// least 1.5 GiB without env override.
21+
func TestSnapshotSpool_DefaultCapAcceptsRealisticFSM(t *testing.T) {
22+
if testing.Short() {
23+
t.Skip("skipping: writes 1.5 GiB to a temp file")
24+
}
25+
dir := t.TempDir()
26+
spool, err := newSnapshotSpool(dir)
27+
require.NoError(t, err)
28+
t.Cleanup(func() { _ = spool.Close() })
29+
30+
// 1.5 GiB exceeds the legacy 1 GiB ceiling and matches realistic
31+
// production FSM sizes within the same order of magnitude.
32+
const target = int64(1536) << 20 // 1.5 GiB
33+
const chunk = 8 << 20 // 8 MiB writes mirror the gRPC snapshot chunk size order
34+
buf := bytes.Repeat([]byte{0xAB}, chunk)
35+
36+
var written int64
37+
for written < target {
38+
toWrite := chunk
39+
if remaining := target - written; remaining < int64(chunk) {
40+
toWrite = int(remaining)
41+
}
42+
n, err := spool.Write(buf[:toWrite])
43+
require.NoError(t, err, "write at offset %d unexpectedly failed", written)
44+
require.Equal(t, toWrite, n)
45+
written += int64(n)
46+
}
47+
require.Equal(t, target, spool.size)
48+
}
49+
50+
// TestSnapshotSpool_OverrideViaEnv confirms the env knob actually moves the
51+
// cap. Tests deliberately *lower* it (cheap to write past) instead of
52+
// raising — the upper-bound test above already proves a generous cap works.
53+
func TestSnapshotSpool_OverrideViaEnv(t *testing.T) {
54+
const cap = int64(4096)
55+
t.Setenv(maxSnapshotPayloadBytesEnvVar, strconv.FormatInt(cap, 10))
56+
57+
spool, err := newSnapshotSpool(t.TempDir())
58+
require.NoError(t, err)
59+
t.Cleanup(func() { _ = spool.Close() })
60+
61+
require.Equal(t, cap, spool.maxSize)
62+
63+
// Write up to the cap — succeeds.
64+
_, err = spool.Write(bytes.Repeat([]byte{0x01}, int(cap)))
65+
require.NoError(t, err)
66+
67+
// One byte past — fails with the documented sentinel so callers can
68+
// errors.Is against errSnapshotPayloadTooLarge for telemetry.
69+
_, err = spool.Write([]byte{0x02})
70+
require.Error(t, err)
71+
require.True(t, errors.Is(err, errSnapshotPayloadTooLarge), "got %v", err)
72+
}
73+
74+
// TestSnapshotSpool_OverrideInvalidFallsBack pins the resolver's
75+
// fail-soft behaviour: a malformed env value must NOT zero the cap (which
76+
// would make every receive fail) — it falls back to the default.
77+
func TestSnapshotSpool_OverrideInvalidFallsBack(t *testing.T) {
78+
t.Setenv(maxSnapshotPayloadBytesEnvVar, "not-a-number")
79+
spool, err := newSnapshotSpool(t.TempDir())
80+
require.NoError(t, err)
81+
t.Cleanup(func() { _ = spool.Close() })
82+
require.Equal(t, defaultMaxSnapshotPayloadBytes, spool.maxSize)
83+
}
84+
1285
func TestCleanupStaleSnapshotSpools(t *testing.T) {
1386
dir := t.TempDir()
1487

0 commit comments

Comments
 (0)