@@ -419,16 +419,45 @@ bytes consumed match `len` exactly.
419419
420420Read path (` ReadSnapshotHeader ` ):
421421
422- 1 . Read the 8-byte magic.
423- 2 . If it equals ` hlcSnapshotMagic ` (` EKVTHLC1 ` ): parse the next
424- 8 bytes as ` ceiling ` , return ` cutover = 0 ` .
425- 3 . If it equals ` hlcSnapshotMagicV2 ` (` EKVTHLC2 ` ): read ` len ` ,
426- then exactly ` len ` payload bytes; parse ` ceiling ` from the
427- first 8, ` cutover ` from the next 8, ignore any trailing
428- bytes (forward-compat within v2).
429- 4 . Otherwise refuse the snapshot with
430- ` ErrSnapshotHeaderUnknownMagic ` . Restoring a snapshot whose
431- format we cannot identify is never safe.
422+ 1 . Peek the 8-byte magic ** without consuming it from the
423+ underlying stream** (` bufio.Reader.Peek ` or equivalent), so
424+ the bytes can be replayed back into the payload reader if
425+ they turn out not to be a magic.
426+ 2 . If it equals ` hlcSnapshotMagicV2 ` (` EKVTHLC2 ` ): consume the
427+ 8 magic bytes, read ` len ` , then exactly ` len ` payload
428+ bytes; parse ` ceiling ` from the first 8, ` cutover ` from the
429+ next 8, ignore any trailing bytes (forward-compat within v2).
430+ 3 . Else if it equals ` hlcSnapshotMagic ` (` EKVTHLC1 ` ): consume
431+ the 8 magic bytes, parse the next 8 bytes as ` ceiling ` ,
432+ return ` cutover = 0 ` .
433+ 4 . ** Else — headerless legacy snapshot.** Today's
434+ ` kv/fsm.go::Restore ` already supports pre-HLC snapshots
435+ that have no header at all and treat the entire stream as
436+ inner store payload (` TestFSMSnapshotRestoreOldFormat ` /
437+ ` TestFSMSnapshotRestoreSmallLegacy ` in ` kv/snapshot_test.go `
438+ are the regression tests). Preserve that compatibility:
439+ leave the peeked 8 bytes in the underlying stream (do not
440+ consume them), return ` ceiling = 0 ` and ` cutover = 0 ` , and
441+ pass the entire byte sequence — including the 8 peeked
442+ bytes — to the inner store-payload restore path. The
443+ restore path is responsible for whatever framing the
444+ legacy snapshot used internally.
445+
446+ This case is ** not** an error and ** must not** return
447+ ` ErrSnapshotHeaderUnknownMagic ` . The error variant is
448+ reserved for future-version snapshots whose magic is in
449+ the ` EKVTHLC* ` family but with a version byte we do not
450+ recognise (e.g., a hypothetical ` EKVTHLC3 ` written by a
451+ newer build); those are genuinely unsafe to restore under
452+ the current binary.
453+
454+ Concretely the discriminator for "headerless vs.
455+ future-incompatible" is whether the 8 peeked bytes start
456+ with the 7-byte prefix ` "EKVTHLC" ` . If they do and the
457+ version byte is unknown → ` ErrSnapshotHeaderUnknownMagic ` .
458+ If they do not match the prefix at all → headerless legacy
459+ fallback per step 4. The existing v1 regression tests must
460+ continue to pass under the new reader.
432461
433462Write path (` WriteSnapshotHeader ` ):
434463
@@ -501,10 +530,18 @@ Two-tier hierarchy:
501530 No default. If ` --encryption-enabled ` is set without a KEK source,
502531 the process refuses to start.
503532
504- - ** DEK (Data Encryption Key).** 32-byte AES key generated locally
505- with ` crypto/rand ` . Two DEKs are issued in v1: ` dek_storage ` (used
506- by §4.1) and ` dek_raft ` (used by §4.2). Both are wrapped by the KEK
507- and persisted in a sidecar file:
533+ - ** DEK (Data Encryption Key).** 32-byte AES key. Two DEKs are
534+ issued in v1: ` dek_storage ` (used by §4.1) and ` dek_raft `
535+ (used by §4.2). DEKs are ** generated on the leader with
536+ ` crypto/rand ` and committed via Raft** — never independently
537+ on each node — so every replica observes the same ` key_id `
538+ for the same DEK at the same log index. The bootstrap and
539+ rotation flow that enforces this is in §5.6; locally
540+ generating a DEK without going through Raft would let
541+ different replicas pick different ` key_id ` s for what is
542+ conceptually the same key, breaking
543+ snapshot/catch-up decryption. Each node persists the
544+ KEK-wrapped DEK in a sidecar file:
508545
509546 ``` text
510547 <dataDir>/encryption/keys.json
@@ -836,6 +873,68 @@ section only protects against the narrower failure of "DEK was
836873rotated cluster-wide but this one node missed the sidecar
837874rewrite."
838875
876+ ### 5.6 Initial DEK bootstrap
877+
878+ Both rotation (§5.2) and the writer registry (§4.1) assume that
879+ every replica observes the same `(active_storage_dek_id,
880+ active_raft_dek_id)` tuple at every Raft log index. If each
881+ node generated its initial DEKs locally with ` crypto/rand ` at
882+ process start, three replicas would pick three different
883+ ` key_id ` s for what should be one cluster-wide active DEK.
884+ Storage encryption would then start producing values whose
885+ ciphertext is decryptable only on the writing node; snapshot
886+ streaming and follower catch-up would fail with
887+ ` unknown_key_id ` the moment the first encrypted write left the
888+ writing node. Bootstrap is therefore ** never** a local-only
889+ operation — the first DEKs are always created via Raft, the
890+ same way as every subsequent rotation:
891+
892+ 1 . ** Trigger.** The bootstrap is implicit in
893+ ` enable-storage-envelope ` (§7.1 Phase 1). The admin command,
894+ on the leader, checks whether the sidecar's ` active.storage `
895+ and ` active.raft ` fields are unset (` key_id == 0 ` ). If so, it
896+ first proposes a ` bootstrap-encryption ` Raft entry containing
897+ freshly-generated ` dek_storage ` + ` dek_raft ` (CSPRNG via
898+ ` crypto/rand ` , wrapped under the current KEK), then proposes
899+ the actual ` enable-storage-envelope ` flag entry. The two
900+ entries are pipelined but not atomic; the order is fixed
901+ so the flag entry is always preceded by a bootstrap that
902+ has already populated every node's sidecar.
903+ 2 . ** Idempotency.** ` bootstrap-encryption ` is ** rejected** by
904+ FSM apply if the sidecar already has an active storage DEK
905+ (the leader's pre-check above is a fast path; the FSM check
906+ is the load-bearing one). This prevents an operator who
907+ accidentally re-runs ` enable-storage-envelope ` from
908+ replacing the existing DEKs with a fresh pair and orphaning
909+ every value already encrypted under the old DEKs.
910+ 3 . ** Phase 2 reuse.** ` enable-raft-envelope ` does ** not**
911+ re-bootstrap — it relies on ` dek_raft ` being already
912+ present from step 1. If the operator somehow skips Phase 1
913+ and tries Phase 2 directly, the leader-side pre-check
914+ refuses with ` ErrEncryptionNotBootstrapped ` .
915+ 4 . ** Writer registry interaction.** A node performs its first
916+ ` RegisterEncryptionWriter ` (§4.1) ** after** it has
917+ observed both the bootstrap entry and the
918+ ` enable-storage-envelope ` flag entry in its applied log,
919+ not at process start. Until both are present the storage
920+ layer writes cleartext (` encryption_state = 0b00 ` ) and the
921+ registry is untouched. This prevents a brand-new node from
922+ trying to register against a DEK that does not yet exist
923+ cluster-wide.
924+ 5 . ** Sidecar reconciliation interaction.** The
925+ ` bootstrap-encryption ` entry is ` isEncryptionRelevant() `
926+ per §5.5, so the sidecar lag check covers it; a node that
927+ crashes between Raft apply and sidecar fsync of the
928+ bootstrap entry refuses to start with
929+ ` ErrSidecarBehindRaftLog ` and recovers via the same
930+ automatic-leader-rewrap or manual ` resync-sidecar ` paths.
931+
932+ The asymmetry with §5.2 rotation is deliberate. Rotation
933+ appends a * new* DEK to a non-empty ` keys ` map (and shifts
934+ ` active ` ); bootstrap creates the * first* DEK in an empty map.
935+ They share the FSM apply path but the leader-side
936+ pre-conditions and idempotency rules are different.
937+
839938---
840939
841940## 6. Implementation plan
@@ -1134,12 +1233,23 @@ flag becomes a *capability* assertion, not a *behaviour* trigger.
11341233
113512344 . Operator runs ` elastickv-admin encryption enable-storage-envelope ` .
11361235 The leader rechecks the membership-snapshot capability gate
1137- from step 3 and then proposes a single Raft entry with the
1138- cluster-wide flag ` storage_envelope_active = true ` . This
1139- entry's ` Data []byte ` is the ** legacy framing** (i.e., the
1140- existing ` kv/fsm.go ` first-byte tag space; see Phase 2's note
1141- below) so every replica — even one that somehow missed the
1142- capability advert — can still decode and apply it.
1236+ from step 3, then performs the ** DEK bootstrap if necessary**
1237+ per §5.6: if the leader's sidecar shows ` active.storage == 0 `
1238+ / ` active.raft == 0 ` , it first proposes a
1239+ ` bootstrap-encryption ` Raft entry that creates the initial
1240+ ` (dek_storage, dek_raft) ` pair and commits them to every
1241+ node's sidecar through the standard FSM apply path. Only
1242+ after that bootstrap entry has been applied does the leader
1243+ propose the actual ` enable-storage-envelope ` flag entry.
1244+ This sequencing is mandatory — flipping the flag without a
1245+ committed cluster-wide DEK pair would let each replica
1246+ encrypt under its own locally-generated ` key_id ` and break
1247+ snapshot/catch-up decryption. The flag entry's ` Data []byte `
1248+ is the ** legacy framing** (i.e., the existing ` kv/fsm.go `
1249+ first-byte tag space; see Phase 2's note below) so every
1250+ replica — even one that somehow missed the capability
1251+ advert — can still decode and apply it. The bootstrap entry
1252+ is similarly in legacy framing for the same reason.
114312535 . From the apply index of that flag entry onward, every storage
11441254 layer Put writes ` encryption_state = 0b01 ` and the §4.1
11451255 envelope. MVCC versions written before that index keep
@@ -1400,6 +1510,11 @@ The process refuses to start if any of the following hold:
14001510 that joined mid-Phase-2 from a snapshot taken before Phase 2
14011511 was enabled, then later replayed Raft entries that crossed
14021512 the cutover; resolved by ` encryption resync-sidecar ` .
1513+ - ` enable-raft-envelope ` was issued before the §5.6
1514+ ` bootstrap-encryption ` entry committed (` ErrEncryptionNotBootstrapped ` ).
1515+ Operator must run ` enable-storage-envelope ` first; the
1516+ bootstrap is implicit in that command and creates both the
1517+ storage and raft DEKs in a single Raft entry.
14031518
14041519Each refusal logs a single, unambiguous error pointing at the
14051520relevant flag and runbook section.
0 commit comments