Skip to content

backup: snapshot_reader (Phase 0a foundation for snapshot-decode binary)#792

Open
bootjp wants to merge 1 commit into
mainfrom
backup/snapshot-reader
Open

backup: snapshot_reader (Phase 0a foundation for snapshot-decode binary)#792
bootjp wants to merge 1 commit into
mainfrom
backup/snapshot-reader

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 19, 2026

Summary

Adds internal/backup/snapshot_reader.go: the .fsm file parser
that underpins the cmd/elastickv-snapshot-decode binary (design
docs/design/2026_04_29_proposed_snapshot_logical_decoder.md lines
440-491). This is the foundational step toward Phase 0a's
operator-facing decoder; the dispatcher + cmd driver that consume
this reader come in a follow-up PR.

Mirrors the native Pebble snapshot format produced by
store/lsm_store.go::pebbleSnapshotMagic + restoreBatchLoopInto:

[8 bytes]   magic "EKVPBBL1"
[8 bytes]   lastCommitTS (LittleEndian uint64)
repeated:
  [8 bytes]   keyLen (LittleEndian uint64)
  [keyLen]    encoded key  = <userKey><invTS(8 BE)>
  [8 bytes]   valLen (LittleEndian uint64)
  [valLen]    encoded value = <flags(1)><expireAt(8 LE)><body>

Per-entry decoding strips:

  • The 8-byte inverted-TS suffix → recovers commit_ts via XOR
  • The 9-byte value header → surfaces tombstone, encryption_state, expireAt

API surface: ReadSnapshot(io.Reader, callback) + SnapshotHeader +
SnapshotEntry + sentinel errors. Callback-based so callers can
stream multi-GB snapshots without buffering; emitted byte slices
alias scratch buffers, so callers must bytes.Clone for retention.

Fail-closed contract (matches design §7.1)

  • ErrSnapshotBadMagic — first 8 bytes not "EKVPBBL1" (operator
    pointed the binary at an MVCC streaming snapshot, a tar, etc.)
  • ErrSnapshotTruncated — EOF mid-entry. A clean inter-entry EOF
    is the normal terminator and is NOT an error.
  • ErrSnapshotShortKey — encoded key < 8 bytes
  • ErrSnapshotShortValue — encoded value < 9 bytes
  • ErrSnapshotEncryptedReserved — reserved encryption_state
    bits (0b10, 0b11) or bits 3-7 set — fail-closed trip-wire
    parallel to store.ErrEncryptedValueReservedState
  • ErrSnapshotEncryptedEntryencState=0b01. Phase 0a does
    not link the decryption keyring; Stage 8 of the encryption
    rollout will add a keyring-aware variant.

Self-review (5 lenses)

  1. Data loss — tombstone flag surfaced verbatim (callers
    choose to suppress for backup output or include for
    diagnostic dumps). EOF distinguishes clean terminator from
    truncation.
  2. Concurrency / distributed — single-pass io.Reader; no
    shared state, no goroutines.
  3. Performance — per-entry slices alias a fixed 64 KiB key
    buffer + a growable value buffer. For a 10M-entry snapshot
    this keeps allocs O(1) outside the value-grow path. bufio
    amortises read syscalls.
  4. Data consistencyencryption_state is decoded into the
    identical 4-way switch (cleartext / encrypted / 0b10 /
    0b11) that store/lsm_store.go::decodeValue uses, so a
    future live-side enum extension fails closed here too.
  5. Test coverage — 17 table-driven tests in
    snapshot_reader_test.go:
    • header-only / single-entry round-trip
    • tombstone flag surfaced
    • multi-entry order preservation
    • bad magic / truncated header / truncated entry
    • short-key / short-value
    • encrypted entry rejection
    • all four reserved-encState cases
    • callback-error propagation
    • MVCC TS XOR-inversion across uint64 boundary
    • empty-value body

Caller audit (per /loop standing instruction)

This PR adds a NEW public API surface (ReadSnapshot,
SnapshotEntry, SnapshotHeader, PebbleSnapshotMagic, the
Err* sentinels). No prior callers exist. Verified via
grep -rn "ReadSnapshot|SnapshotEntry|SnapshotHeader"
across the tree — all matches are inside the two new files. The
follow-up PR (cmd driver + dispatcher) is this API's first consumer.

Test plan

  • go test -race ./internal/backup/ → ok
  • golangci-lint run ./internal/backup/... → 0 issues
  • go build ./... → ok
  • go vet ./internal/backup/... → ok

Adds internal/backup/snapshot_reader.go: the .fsm file parser that
underpins the cmd/elastickv-snapshot-decode binary (design
docs/design/2026_04_29_proposed_snapshot_logical_decoder.md
lines 440-491). Mirrors the native Pebble snapshot format produced
by store/lsm_store.go::pebbleSnapshotMagic + restoreBatchLoopInto.

File shape consumed:

  [8 bytes]   magic "EKVPBBL1"
  [8 bytes]   lastCommitTS (LittleEndian uint64)
  repeated:
    [8 bytes]   keyLen (LittleEndian uint64)
    [keyLen]    encoded key  = <userKey><invTS(8 BE)>
    [8 bytes]   valLen (LittleEndian uint64)
    [valLen]    encoded value = <flags(1)><expireAt(8 LE)><body>

Per-entry decoding strips:
- The 8-byte inverted-TS suffix from the key (recovers commit_ts
  via XOR-inversion across the math.MaxUint64 boundary).
- The 9-byte value header from the value (surfaces tombstone flag,
  encryption_state, expireAt).

API surface (ReadSnapshot + SnapshotHeader + SnapshotEntry +
sentinel errors). Callback-based so callers can stream
multi-GB snapshots without buffering the whole file in memory; the
emitted SnapshotEntry's byte slices alias scratch buffers and the
caller must bytes.Clone for retention.

Fail-closed contract (matches design 7.1):
- ErrSnapshotBadMagic    - first 8 bytes not "EKVPBBL1" (operator
                           pointed at the wrong file)
- ErrSnapshotTruncated   - EOF mid-entry (a clean inter-entry EOF
                           is the normal terminator and is NOT
                           an error)
- ErrSnapshotShortKey    - encoded key < 8 bytes (no room for TS)
- ErrSnapshotShortValue  - encoded value < 9 bytes (no room for
                           value header)
- ErrSnapshotEncryptedReserved - reserved encryption_state bits
                           (0b10, 0b11) or bits 3-7 set; same
                           fail-closed trip-wire as
                           store.ErrEncryptedValueReservedState
- ErrSnapshotEncryptedEntry - encState=0b01 (encrypted).
                           Phase 0a does not link the decryption
                           keyring; Stage 8 of the encryption
                           rollout adds a keyring-aware variant.

Self-review:
1. Data loss   - tombstone flag surfaced verbatim so callers can
   choose to suppress (default Phase 0a backup behavior) or
   include (a future diagnostic dump might want both). EOF
   distinction between clean terminator and truncation prevents
   silently dropping a partial entry.
2. Concurrency - reader is single-pass over an io.Reader; no
   shared state, no goroutines.
3. Performance - per-entry slices alias a fixed 64 KiB key buffer
   and a growable value buffer. For a 10M-entry snapshot this
   keeps allocs O(1) outside the value-grow boundary. bufio
   wrapper amortises read syscalls.
4. Consistency - encryption_state bits are decoded into the
   identical 4-way switch (cleartext / encrypted / 0b10 / 0b11)
   that store/lsm_store.go::decodeValue uses, so a future
   live-side enum extension fails closed here too.
5. Coverage - 17 table-driven tests in snapshot_reader_test.go
   covering: header-only, single-entry round-trip, tombstone
   flag, multi-entry order preservation, bad magic, truncated
   header, truncated entry, short-key, short-value, encrypted
   entry, all four reserved-encState cases, callback-error
   propagation, MVCC TS XOR-inversion across uint64 boundary,
   empty-value body.

Caller audit (per /loop standing instruction): this PR adds a
NEW public API surface (ReadSnapshot, SnapshotEntry,
SnapshotHeader, PebbleSnapshotMagic, the Err* sentinels). No
prior callers exist. Verified via
'grep -rn "ReadSnapshot|SnapshotEntry|SnapshotHeader" --include=*.go'
- all matches are inside internal/backup/snapshot_reader{,test}.go.
The follow-up PR (cmd/elastickv-snapshot-decode driver +
dispatcher) will be this API's first consumer.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 19, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 41 minutes and 11 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34cfcdff-93f7-48af-8f69-57b2433f97a0

📥 Commits

Reviewing files that changed from the base of the PR and between 8c0d07f and 42fb7ba.

📒 Files selected for processing (2)
  • internal/backup/snapshot_reader.go
  • internal/backup/snapshot_reader_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch backup/snapshot-reader

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Claude finished @bootjp's task in 2m 53s —— View job


Code Review — internal/backup/snapshot_reader.go

Overall: This is clean, well-documented code and the format constants are correctly mirrored from store/lsm_store.go. I found one real bug (wrong sentinel returned on a specific truncation path), one robustness gap (unbounded allocation), and a few smaller issues.


  • Read snapshot_reader.go
  • Read snapshot_reader_test.go
  • Cross-reference with store/lsm_store.go (format source of truth)
  • Review: correctness, safety, performance, consistency
  • Post findings

Bug: discarded eof bool in readOneEntry (lines 198–200)

// current code
vLen, _, err := readEntryLen(r)
if err != nil {
    if cockroachdberr.Is(err, io.EOF) { // ← dead code
        return false, cockroachdberr.WithStack(ErrSnapshotTruncated)
    }
    return false, err
}

readEntryLen never returns io.EOF as an error — it converts a clean EOF at the start of an 8-byte field to (0, true, nil). By discarding the eof bool with _, the clean-EOF case (snapshot ends exactly at the value-length field) falls through with vLen=0. Then readExactGrow succeeds reading 0 bytes, and decodeSnapshotEntry(key, []byte{}) returns ErrSnapshotShortValue instead of the correct ErrSnapshotTruncated.

Fix:

vLen, vEOF, err := readEntryLen(r)
if err != nil {
    return false, err
}
if vEOF {
    return false, cockroachdberr.WithStack(ErrSnapshotTruncated)
}

Missing test: TestReadSnapshot_RejectsTruncatedEntry covers truncation at the key-length field, but there is no test for a snapshot that ends cleanly right after the key bytes (i.e., at the value-length field boundary). This specific case would reproduce the bug above.

Fix this →


Robustness: No length bounds check before make([]byte, n)

readExactGrow (line 276) calls make([]byte, n) where n is a raw uint64 from the file. A corrupt snapshot with kLen = 0xFFFFFFFFFFFFFFFF causes OOM or a runtime panic.

The live-side restoreFieldLenInt in lsm_store.go validates against maxPebbleEncodedKeySize and maxSnapshotValueSize before making any allocation. The offline reader needs similar guards:

const (
    maxSnapshotKeyLen   = maxSnapshotKeySize + snapshotTSSize    // mirror live constant
    maxSnapshotValueLen = maxSnapshotValueSize + snapshotValueHeaderSize + encEnvelopeOverhead
)

func readExactGrow(r io.Reader, dst []byte, n uint64, maxN uint64) ([]byte, error) {
    if n > maxN {
        return nil, cockroachdberr.Newf("backup: field length %d exceeds max %d", n, maxN)
    }
    // ...
}

For a Phase 0a offline tool the blast radius is process-local, but a panicking decoder is worse than a clean error. Fix this →


Naming: typo in snapshotEncStateCleartx

snapshotEncStateCleartx (tx suffix) vs encStateCleartext in the live store. The truncated suffix doesn't carry its weight — Cleartext is the obvious word. The test file uses this constant directly, so a future reader would need to mentally map the typo.


Minor: double WithStack on key-read truncation

readExact already wraps ErrSnapshotTruncated with WithStack. Then readOneEntry line 189 wraps again:

key, err := readExact(r, keyScratch[:0], kLen)
if err != nil {
    return false, cockroachdberr.WithStack(err) // ← redundant
}

The redundant WithStack adds a spurious frame pointing to the caller rather than the actual error site. Propagate err directly.


API design note: SnapshotHeader not surfaced for empty snapshots

SnapshotHeader is only delivered via the per-entry callback, so for a snapshot with zero entries the caller never receives LastCommitTS. The PR description and test both acknowledge this, but the design note at lines 440-491 says the header should be recorded in MANIFEST.json — if the backup library sees an empty shard snapshot it silently drops the commit-horizon. Consider returning SnapshotHeader as a direct return value from ReadSnapshot:

func ReadSnapshot(r io.Reader, fn func(SnapshotEntry) error) (SnapshotHeader, error)

Format constants — all correct ✓

Constant store/lsm_store.go snapshot_reader.go
timestamp suffix size timestampSize = 8 snapshotTSSize = 8
value header size valueHeaderSize = 9 snapshotValueHeaderSize = 9
tombstone mask 0b0000_0001 0b0000_0001
encState mask/shift 0b0000_0110 / 1 0b0000_0110 / 1
reserved high-bit mask 0b1111_1000 0b1111_1000
magic bytes {'E','K','V','P','B','B','L','1'} identical ✓

Key encoding (userKey || BigEndian(^ts)) and value encoding (flags(1) || expireAt(8 LE) || body) match fillEncodedKey / fillEncodedValue exactly.


Summary

Severity Issue
Bug Discarded eof bool → wrong sentinel (ErrSnapshotShortValue vs ErrSnapshotTruncated) + dead io.EOF branch + missing test
Robustness Unbounded allocation in readExactGrow on corrupt length field
Minor Typo snapshotEncStateCleartxsnapshotEncStateCleartext
Minor Double WithStack on key-read error path
Design SnapshotHeader invisible for empty snapshots

The format compatibility and the fail-closed encryption/reserved-state logic are solid. The bug above is the only one that changes which sentinel callers see, and it only manifests on a specific truncation pattern not covered by the existing tests.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42fb7ba3db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if err != nil {
return false, cockroachdberr.WithStack(err)
}
vLen, _, err := readEntryLen(r)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return truncation when value length is missing

When a file ends immediately after the key bytes, readEntryLen returns (0, true, nil), but this call discards the eof flag and proceeds as if the entry had a zero-length value. That makes this mid-entry truncation surface as ErrSnapshotShortValue rather than the advertised ErrSnapshotTruncated, so callers cannot reliably distinguish a cut-off snapshot from a malformed encoded value. Please check the eof result here and return ErrSnapshotTruncated for this case.

Useful? React with 👍 / 👎.

Comment on lines +277 to +278
if uint64(cap(dst)) < n {
dst = make([]byte, n)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bound snapshot length prefixes before allocating

For a corrupt or adversarial snapshot with a huge key/value length prefix, this path allocates make([]byte, n) directly from the untrusted uint64 length. That can panic with len out of range or OOM the operator decoder instead of failing closed; the live restore path guards the same fields with maxPebbleEncodedKeySize / maxSnapshotValueSize before allocation. Please validate kLen and vLen against equivalent bounds before calling this helper.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a native Pebble snapshot reader and associated tests to decode entries while stripping MVCC metadata. The reviewer identified several critical issues: a lack of bounds checking on key and value lengths that could lead to resource exhaustion or panics, improper handling of truncated snapshots when reading value lengths, and an inefficient use of reflection in the header parsing logic.

valBuf *[]byte,
fn func(SnapshotHeader, SnapshotEntry) error,
) (bool, error) {
kLen, eof, err := readEntryLen(r)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The reader lacks bounds checking for kLen and vLen. Since these lengths are read directly from the input stream as uint64, a malformed or malicious snapshot file could specify extremely large lengths, leading to massive memory allocations and potential Out-Of-Memory (OOM) crashes in readExact and readExactGrow. Additionally, on 32-bit systems, converting a large uint64 to a slice index or length (which are int) will cause a panic.

It is recommended to enforce maximum size limits for keys and values, similar to the limits used in store/lsm_store.go (e.g., maxPebbleEncodedKeySize and maxSnapshotValueSize).

References
  1. When deserializing data with a size header, pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation and the entire data must be materialized in memory.

Comment on lines +191 to +200
vLen, _, err := readEntryLen(r)
if err != nil {
// A clean EOF here means the snapshot truncated between
// the key bytes and the value-length field — not the
// same as a clean inter-entry EOF.
if cockroachdberr.Is(err, io.EOF) {
return false, cockroachdberr.WithStack(ErrSnapshotTruncated)
}
return false, err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The readOneEntry function ignores the eof return value from readEntryLen when reading the value length (vLen). If the snapshot is truncated exactly before the vLen field, readEntryLen returns (0, true, nil). In this case, err is nil, and the code proceeds to call readExactGrow with vLen=0, which succeeds. Subsequently, decodeSnapshotEntry will fail with ErrSnapshotShortValue because the value header is missing. This should instead return ErrSnapshotTruncated to accurately reflect the state of the stream.

Additionally, the check cockroachdberr.Is(err, io.EOF) is redundant because readEntryLen already handles io.EOF and io.ErrUnexpectedEOF, returning either (0, true, nil) or ErrSnapshotTruncated.

	vLen, vEof, err := readEntryLen(r)
	if err != nil {
		return false, err
	}
	if vEof {
		return false, cockroachdberr.WithStack(ErrSnapshotTruncated)
	}

Comment on lines +226 to +229
var ts uint64
if err := binary.Read(r, binary.LittleEndian, &ts); err != nil {
return SnapshotHeader{}, cockroachdberr.WithStack(err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using binary.Read with a pointer to a scalar value is inefficient in Go because it uses reflection. For performance-sensitive code like a snapshot reader, it is better to read into a small byte array and use binary.LittleEndian.Uint64 directly, as done in readEntryLen.

Suggested change
var ts uint64
if err := binary.Read(r, binary.LittleEndian, &ts); err != nil {
return SnapshotHeader{}, cockroachdberr.WithStack(err)
}
var tsBuf [8]byte
if _, err := io.ReadFull(r, tsBuf[:]); err != nil {
return SnapshotHeader{}, cockroachdberr.WithStack(err)
}
ts := binary.LittleEndian.Uint64(tsBuf[:])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant