@@ -307,9 +307,77 @@ ciphertext from §4.1, the FSM snapshot stream is ciphertext by
307307construction. No additional wrapping is required at the snapshot
308308layer.
309309
310- The snapshot file header (the ` pebbleSnapshotMagic ` 8-byte magic plus
311- ` lastCommitTS ` ) stays cleartext so a snapshot reader can identify the
312- format before attempting decryption. This is metadata, not user data.
310+ #### Snapshot header format evolution (versioned)
311+
312+ Today's ` kv/snapshot.go ` writes a fixed 16-byte header — 8-byte
313+ ` pebbleSnapshotMagic ` + 8-byte ` lastCommitTS ` (HLC ceiling). To
314+ carry the §7.1 Phase 2 ` raft_envelope_cutover_index ` (and to leave
315+ room for later metadata) without breaking restore on existing
316+ clusters, the header is versioned ** explicitly** rather than
317+ silently lengthened. Layout going forward:
318+
319+ ``` text
320+ +------------+--------+--------+----------+----------+----------+
321+ | magic(8) | ver(1) | len(2) | ceiling | cutover | reserved |
322+ | EKVPBBL1 | 0x02 | 0x40 | 8B | 8B | ... |
323+ +------------+--------+--------+----------+----------+----------+
324+ ```
325+
326+ - ` magic ` — unchanged 8-byte ` pebbleSnapshotMagic ` (` EKVPBBL1 ` )
327+ so a v1 reader still recognises the stream.
328+ - ` ver ` — header format version. ` 0x01 ` is the implicit v1
329+ format that has no ` ver ` /` len ` fields and is exactly 16 bytes
330+ total (legacy detection: read 9 bytes, if byte 8 looks like
331+ the high byte of a plausible HLC ` ceiling ` rather than a
332+ small version number, fall back to v1 reader). ` 0x02 ` is the
333+ versioned format above.
334+ - ` len ` — uint16 little-endian total payload length ** after**
335+ the ` len ` field itself, so a v3+ reader can skip unknown
336+ trailing fields without a forward-compat fault.
337+ - ` ceiling ` — same 8-byte HLC ceiling as v1; preserves the
338+ existing field semantics so the v2 path is a strict superset.
339+ - ` cutover ` — ` raft_envelope_cutover_index ` as a fixed
340+ big-endian uint64. ` 0 ` means "Phase 2 not yet enabled at
341+ snapshot time", which is the correct value for a snapshot
342+ taken in Phase 0 or Phase 1.
343+ - ` reserved ` — zero-padded for alignment / future fields. Any
344+ v2 reader that sees non-zero bytes here errors with
345+ ` ErrSnapshotHeaderReserved ` rather than silently ignoring
346+ them.
347+
348+ Read path (` ReadSnapshotHeader ` ):
349+
350+ 1 . Read the 8-byte magic. Mismatch → ` ErrSnapshotMagic ` .
351+ 2 . Peek the next byte. If it falls in the documented v1
352+ ` ceiling ` high-byte range (i.e., the implicit-v1 detection
353+ above) treat the stream as v1, parse the remaining 8 bytes
354+ as ` ceiling ` , and return ` cutover = 0 ` .
355+ 3 . Otherwise read it as ` ver ` . If ` ver == 0x02 ` , read ` len `
356+ and exactly that many bytes; parse ` ceiling ` , ` cutover ` .
357+ Trailing reserved bytes are validated as zero.
358+ 4 . If ` ver ` is unknown (` > 0x02 ` from a future writer), refuse
359+ the snapshot with ` ErrSnapshotHeaderVersion ` — restoring a
360+ forward-version snapshot under an older binary is ** never**
361+ safe because we cannot prove the unknown extension is not
362+ semantically required.
363+
364+ Write path (` WriteSnapshotHeader ` ):
365+
366+ - Snapshots produced in Phase 0 / Phase 1 with no Phase-2
367+ cluster flag observed locally write ** v1** for byte-for-byte
368+ compatibility with existing readers.
369+ - Snapshots produced after the local node has applied the
370+ ` enable-raft-envelope ` entry (i.e., the sidecar's
371+ ` raft_envelope_cutover_index != 0 ` ) write ** v2** .
372+ - Once ` enable-raft-envelope ` has been observed once,
373+ subsequent snapshots stay on v2 forever; the writer never
374+ silently downgrades.
375+
376+ This evolution rule keeps every existing v1 snapshot file
377+ restorable under the new build with no migration step, makes
378+ v2 the steady state once Phase 2 is enabled, and pushes
379+ forward-compat decisions to a hard error rather than silent
380+ data corruption.
313381
314382### 4.5 Distribution catalog and HLC ceiling entries
315383
@@ -600,26 +668,59 @@ retiring DEK.
600668
601669### 5.5 Sidecar / Raft-log reconciliation
602670
603- The DEK rotation flow in §5.2 says the FSM apply persists the new
604- wrapped DEK into ` keys.json ` on every node. That apply is two
605- operations on the local node — a Pebble write of the rotation log
606- entry, then a sidecar file rewrite — and they are ** not** atomic
607- with respect to crash. A node that crashes between the two
608- restarts with the rotation entry committed in its Raft log but the
609- sidecar still on the previous generation, which would make a
610- freshly-active DEK unloadable on that node.
611-
612- To detect and repair this:
613-
614- 1 . The sidecar carries ` raft_applied_index ` (§5.1), updated in the
615- same ` os.WriteFile ` + ` os.Rename ` that persists the new wrapped
616- DEKs.
671+ The DEK rotation flow in §5.2 (and the format-cutover entries in
672+ §7.1) says the FSM apply persists new encryption state — wrapped
673+ DEKs and the ` raft_envelope_cutover_index ` — into ` keys.json ` on
674+ every node. That apply is two operations on the local node — a
675+ Pebble write of the log entry, then a sidecar file rewrite — and
676+ they are ** not** atomic with respect to crash. A node that
677+ crashes between the two restarts with the entry committed in its
678+ Raft log but the sidecar still on the previous generation, which
679+ would make a freshly-active DEK unloadable on that node, or
680+ worse, would make the node decode post-cutover Raft entries with
681+ the wrong format.
682+
683+ The set of "encryption-relevant" Raft entries that the lag check
684+ must cover is therefore broader than just rotation:
685+
686+ - ` rotate-dek ` — adds a new wrapped DEK to the sidecar.
687+ - ` rewrap-deks ` — re-wraps existing DEKs under the current KEK
688+ (full-set, per §5.5 step 3); also the auto-recovery proposal
689+ the leader uses to bring drifted nodes back in sync.
690+ - ` enable-storage-envelope ` — flips the §7.1 Phase 1 cluster
691+ flag and persists ` storage_envelope_active = true ` into the
692+ sidecar.
693+ - ` enable-raft-envelope ` — flips the §7.1 Phase 2 cluster flag
694+ AND persists the ` raft_envelope_cutover_index ` into the
695+ sidecar. This entry is the highest-stakes of the four:
696+ missing it on restart causes the FSM to decode every
697+ post-cutover entry through the legacy first-byte tag space
698+ (treating an opaque raft envelope as if it were an
699+ ` 0x00 ` /` 0x01 ` /` 0x02 ` tagged operation), which is a
700+ guaranteed mis-dispatch leading to apply failure or, worse,
701+ silent wrong-answer on ` 0x02 ` -shaped envelope bytes.
702+ - ` retire-dek ` — drops a DEK from the sidecar's ` keys ` map.
703+
704+ Any entry that mutates ` keys.json ` content is in this set. The
705+ implementation should derive the set from a single
706+ ` isEncryptionRelevant(entry) ` predicate so a future encryption
707+ admin command cannot quietly slip past the lag check.
708+
709+ To detect and repair sidecar-vs-log divergence:
710+
711+ 1 . The sidecar carries ` raft_applied_index ` (§5.1), updated in
712+ the same ` os.WriteFile ` + ` os.Rename ` that persists every
713+ encryption-relevant entry above.
6177142 . On startup, the encryption package reads the sidecar's
618715 ` raft_applied_index ` , then reads the raftengine's persisted
619716 applied index. If the raftengine has progressed past the
620- sidecar's index AND the gap covers any rotation entries, the
621- node refuses to start with ` ErrSidecarBehindRaftLog ` ,
622- pointing at the operator runbook.
717+ sidecar's index AND ** the gap covers any encryption-relevant
718+ entry** (per the list above — not just rotation), the node
719+ refuses to start with ` ErrSidecarBehindRaftLog ` , pointing at
720+ the operator runbook. Restricting the check to rotations
721+ only would let a missed ` enable-raft-envelope ` slip past
722+ and cause silent mis-dispatch on the next post-cutover
723+ entry.
6237243 . Recovery rewraps ** every unretired DEK** , not just the active
624725 one. A node that missed multiple rotations needs * every*
625726 intermediate ` key_id ` to decrypt MVCC history that still
0 commit comments