Commit 9cac82a

docs/design: propose Phase 0b snapshot logical encoder (#823)
## Summary

Doc-first PR for **Phase 0b** of the snapshot logical backup work (CLAUDE.md requires a `*_proposed_*` design doc to land before any wire-format implementation). Two commits:

1. **Promote** `snapshot_logical_decoder` `proposed`→`partial` (Phase 0a fully shipped: PRs #790/#791/#792/#806/#810) in a dedicated commit per the `docs/design/README` lifecycle convention.
2. **Propose** `2026_05_25_proposed_snapshot_logical_encoder.md`: `cmd/elastickv-snapshot-encode`, the inverse of the Phase 0a decoder.

## Why a separate design doc

The encoder is not a mechanical mirror of the decoder. Three decisions arise only on the encode side, each a wire-format decision the parent doc left at sketch level:

- **Internal-index reconstruction** (the load-bearing decision): the decoder *drops* every re-derivable internal index (Redis TTL scan index, DynamoDB GSI rows, SQS vis/dedup/group/by-age side records, per-scope generation counters). A *loadable* `.fsm` must contain them or the restored node serves wrong results (TTL'd keys never expire, GSI queries return nothing). The encoder must rebuild the full internal keyspace, mirroring the live adapter index builders (duplicated into `internal/backup` behind the same offline-tool boundary + staleness-review discipline already used for the snapshot-reader constants).
- **MVCC re-encoding**: the directory tree carries no per-key `commit_ts` (decode discards it), so the encoder stamps every key with `invTS = ^last_commit_ts` from `MANIFEST.json`. This keeps every restored row at or below the HLC ceiling seeded from the snapshot header.
- **No CRC32C footer**: the parent doc's format sketch showed a trailing CRC; that framing is the *MVCC streaming-restore* path, not the native EKVPBBL1 snapshot the decoder reads and the encoder must emit. Authoritative target format pinned down against `store/snapshot_pebble.go` `WriteTo` + `internal/backup/snapshot_reader.go` `ReadSnapshot`.

## Contents

- Authoritative `.fsm` target format (sorted entries, size caps, cleartext-only).
- Per-adapter reverse-encoder breakdown (Redis / DynamoDB / S3 / SQS), route-for-route against `internal/backup/decode.go`.
- Directory-level round-trip self-test (`dir -> encode -> .fsm -> decode -> dir'`, exact; reverse `.fsm`-byte-identical is explicitly a non-goal).
- Version/format gate + `ENCODE_INFO.json` provenance (`cluster_id`, key-format version).
- Two **decision gates** flagged for review during implementation: GSI derivation and SQS side-record derivation (full reconstruction vs. lazy-rebuild fallback).
- Per-adapter milestone plan mirroring Phase 0a.

## Risk

Docs only: no code, no behavior change.

## Self-review (5 lenses)

- **Data loss**: none (docs only). The doc itself hardens the *future* encoder's data-loss surface (index reconstruction, fail-closed on oversize entries, round-trip gate before finalize).
- **Concurrency/distributed**: n/a (offline tool design).
- **Performance**: doc notes the in-memory-sort memory bound and defers an external-sort follow-up.
- **Consistency**: MVCC re-encoding section keeps restored rows at/below the HLC ceiling; documents why per-key `commit_ts` loss is invisible to the round-trip.
- **Test coverage**: P0/P1/P2 test plan enumerated; per-adapter cross-check vs. live index builders required.

## Test plan

- [ ] Design review of the two decision gates (GSI / SQS side records).
- [ ] Confirm the MVCC re-encoding `last_commit_ts` stamping is acceptable for the restore HLC-ceiling seeding path.
## Summary by CodeRabbit

* **Documentation**
  * Updated snapshot decoder design documentation; Phase 0a marked complete with Phase 0b encoder boundaries clarified
  * Added snapshot encoder design specification with requirements for data reconstruction and validation
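
The MVCC re-encoding decision in the summary (`invTS = ^last_commit_ts`) can be illustrated with a minimal Go sketch. It assumes the HLC layout given in the `--last-commit-ts` flag comment in the diff (48-bit physical part in the high bits, 16-bit logical counter in the low bits). The helper names `packHLC` and `invertTS` are hypothetical and are not taken from the elastickv code base; how the store embeds `invTS` in a key is not shown here.

```go
package main

import "fmt"

// packHLC builds a 64-bit HLC value from a 48-bit physical part and a
// 16-bit logical counter. The bit layout is an assumption read off the
// flag comment "48-bit phys || 16-bit logical".
func packHLC(physical uint64, logical uint16) uint64 {
	return (physical&(1<<48-1))<<16 | uint64(logical)
}

// invertTS returns the bitwise complement, i.e. invTS = ^ts.
func invertTS(ts uint64) uint64 { return ^ts }

func main() {
	older := packHLC(1000, 0)
	newer := packHLC(1000, 1)

	// The complement reverses unsigned ordering: a later timestamp
	// yields a smaller inverted value.
	fmt.Println(newer > older)
	fmt.Println(invertTS(newer) < invertTS(older))

	// The complement is its own inverse, so the original timestamp
	// can be recovered from invTS.
	fmt.Println(invertTS(invertTS(newer)) == newer)
}
```

Because every key receives the same `last_commit_ts`, all restored rows share one inverted timestamp, which is consistent with the summary's claim that no restored row sits above the HLC ceiling seeded from the snapshot header.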
2 parents 24e5538 + 074a8c4 commit 9cac82a

2 files changed

Lines changed: 487 additions & 8 deletions

docs/design/2026_04_29_proposed_snapshot_logical_decoder.md renamed to docs/design/2026_04_29_partial_snapshot_logical_decoder.md

Lines changed: 26 additions & 8 deletions
@@ -1,9 +1,17 @@
 # Snapshot ↔ Logical-Format Decoder (Phase 0)
 
-Status: Proposed
+Status: Partial
 Author: bootjp
 Date: 2026-04-29
 
+> **Lifecycle (2026-05-25):** Phase 0a (decoder) has fully shipped —
+> `internal/backup/` + `cmd/elastickv-snapshot-decode` (PRs #790,
+> #791, #792, #806, #810). Phase 0b (encoder) is specified in detail
+> in `2026_05_25_proposed_snapshot_logical_encoder.md` and is not yet
+> implemented. Phase 0c (operator integration) is open. This doc
+> remains the format owner; the encoder doc owns the reverse-direction
+> wire-format reconstruction.
+
 ## Background
 
 The existing FSM snapshot path (`store/snapshot_pebble.go`, see
@@ -14,9 +22,17 @@ entire keyspace as a single opaque stream:
 [magic "EKVPBBL1" :8]
 [lastCommitTS :8]
 ([keyLen:8][key][valLen:8][val])*
-[CRC32C footer :4]
 ```
 
+The native Pebble snapshot stream has **no trailing checksum** — it
+terminates on a clean EOF at the start of a key-length field
+(`store/snapshot_pebble.go` `WriteTo`, `internal/backup/snapshot_reader.go`
+`ReadSnapshot`). A CRC32C footer exists only on the *MVCC streaming
+restore* path (`store/lsm_store.go` `readStreamingMVCCRestoreHeader`),
+which is a different framing the decoder/encoder do not touch. See
+`2026_05_25_proposed_snapshot_logical_encoder.md` §"Why a separate
+design doc" item 3.
+
 Snapshots are taken automatically every `defaultSnapshotEvery = 10000`
 log entries (`internal/raftengine/etcd/engine.go:92`) and stored under
 `{dataDir}/fsm-snap/<index>.fsm`. They are crash-consistent by
@@ -458,7 +474,7 @@ elastickv-snapshot-decode \
 Pipeline:
 
 ```text
-open .fsm # verifies CRC32C footer
+open .fsm # verifies EKVPBBL1 magic
 parse EKVPBBL1 magic + lastCommitTS
 stream ([keyLen:8][key][valLen:8][val])* entries:
 dispatch by leading prefix:
@@ -495,7 +511,7 @@ current.
 elastickv-snapshot-encode \
 --input <directory-root> \
 --output <fsm-file> \
-[--last-commit-ts <unix-ms>]
+[--last-commit-ts <hlc-uint64>] # 64-bit HLC: 48-bit phys || 16-bit logical, not Unix-ms
 ```
 
 Pipeline:
@@ -505,12 +521,14 @@ read MANIFEST.json (refuse on unknown major format_version)
 walk per-adapter subtrees:
 DynamoDB → emit !ddb|meta|table| then !ddb|item| KV pairs
 S3 → emit !s3|bucket|meta| then !s3|obj|head| then !s3|blob| pairs
-(split assembled bodies into chunks at the same chunk-size
-the live cluster uses; chunk_size from MANIFEST.json)
+(split assembled bodies into chunks at the canonical
+s3ChunkSize from adapter/s3.go — the public sidecar does
+not carry per-chunk sizes; reassembly is sequential by
+chunkNo so object bytes are identical regardless)
 Redis → emit per-type wide-column or simple keys
 SQS → emit !sqs|queue|meta| then !sqs|msg|data| pairs
 verify the resulting key-set has no duplicates
-write EKVPBBL1 header + lastCommitTS + sorted KV stream + CRC32C footer
+write EKVPBBL1 header + lastCommitTS + sorted KV stream (no checksum footer)
 ```
 
 Output is a valid `.fsm` file in the same wire format the live FSM
@@ -683,7 +701,7 @@ bespoke parser, the format has failed its goal.
 | `TestS3SidecarSuffixCollision` | A user S3 object key ending in `.elastickv-meta.json` is rejected without `--rename-collisions`; with the flag the rename is recorded |
 | `TestEncodeDecodeRoundTrip` | Encoded `.fsm` decodes back to a directory tree byte-identical to the original (wall-time fields excluded) |
 | `TestManifestVersionGate` | Decoder refuses inputs with `format_version > current_major`; same-major-newer-minor allowed; older-major refused with a clear message |
-| `TestDecoderRejectsTruncatedFSM` | A `.fsm` whose CRC32C footer fails verification is rejected with a typed error before any record is emitted |
+| `TestDecoderRejectsTruncatedFSM` | A `.fsm` truncated mid-entry (or carrying a bad magic header) is rejected with a typed error (`ErrSnapshotTruncated` / `ErrSnapshotBadMagic`) before a partial record is emitted — the native format has no checksum footer, so corruption is caught at the magic header and the per-entry length-prefix EOF boundary |
 
 ### P1
 