Skip to content

Commit 86a2e33

Browse files
committed
docs(encryption): address PR707 round-9 codex P1x2 (headerless legacy snapshot + DEK bootstrap)
P1 (codex on 51ffb79 line 431): §4.4 v2 read path was hard-failing any snapshot whose first 8 bytes were not EKVTHLC1/EKVTHLC2 with ErrSnapshotHeaderUnknownMagic, but kvFSM.Restore today supports headerless legacy snapshots that must not be rejected (covered by TestFSMSnapshotRestoreOldFormat / TestFSMSnapshotRestoreSmallLegacy in kv/snapshot_test.go). Add a fourth read-path branch: if magic does not match either EKVTHLC1 or EKVTHLC2, leave the peeked 8 bytes in the underlying stream (do not consume) and fall through to the inner store-payload restore path with ceiling=cutover=0. The error variant is now reserved for genuinely future-incompatible snapshots whose magic starts with the EKVTHLC prefix but carries an unknown version byte. P1 (codex on 51ffb79 line 507): DEKs were specified as locally generated with crypto/rand on every node, which would let three replicas pick three different key_ids for the conceptually same active DEK. The first encrypted write would then be undecryptable on every other node (unknown_key_id), breaking snapshot transfer and follower catch-up. Add §5.6 explicit bootstrap protocol: the leader generates the initial (dek_storage, dek_raft) pair and proposes a bootstrap-encryption Raft entry that commits them via FSM apply to every node before enable-storage-envelope can succeed. The cutover command in §7.1 Phase 1 step 4 is updated to perform the bootstrap implicitly when active.storage / active.raft are unset, with a fixed ordering (bootstrap entry committed first, then the flag entry). enable-raft-envelope refuses with ErrEncryptionNotBootstrapped if no bootstrap has been observed.
1 parent 51ffb79 commit 86a2e33

1 file changed

Lines changed: 135 additions & 20 deletions

File tree

docs/design/2026_04_29_proposed_data_at_rest_encryption.md

Lines changed: 135 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -419,16 +419,45 @@ bytes consumed match `len` exactly.
419419

420420
Read path (`ReadSnapshotHeader`):
421421

422-
1. Read the 8-byte magic.
423-
2. If it equals `hlcSnapshotMagic` (`EKVTHLC1`): parse the next
424-
8 bytes as `ceiling`, return `cutover = 0`.
425-
3. If it equals `hlcSnapshotMagicV2` (`EKVTHLC2`): read `len`,
426-
then exactly `len` payload bytes; parse `ceiling` from the
427-
first 8, `cutover` from the next 8, ignore any trailing
428-
bytes (forward-compat within v2).
429-
4. Otherwise refuse the snapshot with
430-
`ErrSnapshotHeaderUnknownMagic`. Restoring a snapshot whose
431-
format we cannot identify is never safe.
422+
1. Peek the 8-byte magic **without consuming it from the
423+
underlying stream** (`bufio.Reader.Peek` or equivalent), so
424+
the bytes can be replayed back into the payload reader if
425+
they turn out not to be a magic.
426+
2. If it equals `hlcSnapshotMagicV2` (`EKVTHLC2`): consume the
427+
8 magic bytes, read `len`, then exactly `len` payload
428+
bytes; parse `ceiling` from the first 8, `cutover` from the
429+
next 8, ignore any trailing bytes (forward-compat within v2).
430+
3. Else if it equals `hlcSnapshotMagic` (`EKVTHLC1`): consume
431+
the 8 magic bytes, parse the next 8 bytes as `ceiling`,
432+
return `cutover = 0`.
433+
4. **Else — headerless legacy snapshot.** Today's
434+
`kv/fsm.go::Restore` already supports pre-HLC snapshots
435+
that have no header at all and treat the entire stream as
436+
inner store payload (`TestFSMSnapshotRestoreOldFormat` /
437+
`TestFSMSnapshotRestoreSmallLegacy` in `kv/snapshot_test.go`
438+
are the regression tests). Preserve that compatibility:
439+
leave the peeked 8 bytes in the underlying stream (do not
440+
consume them), return `ceiling = 0` and `cutover = 0`, and
441+
pass the entire byte sequence — including the 8 peeked
442+
bytes — to the inner store-payload restore path. The
443+
restore path is responsible for whatever framing the
444+
legacy snapshot used internally.
445+
446+
This case is **not** an error and **must not** return
447+
`ErrSnapshotHeaderUnknownMagic`. The error variant is
448+
reserved for future-version snapshots whose magic is in
449+
the `EKVTHLC*` family but with a version byte we do not
450+
recognise (e.g., a hypothetical `EKVTHLC3` written by a
451+
newer build); those are genuinely unsafe to restore under
452+
the current binary.
453+
454+
Concretely the discriminator for "headerless vs.
455+
future-incompatible" is whether the 8 peeked bytes start
456+
with the 7-byte prefix `"EKVTHLC"`. If they do and the
457+
version byte is unknown → `ErrSnapshotHeaderUnknownMagic`.
458+
If they do not match the prefix at all → headerless legacy
459+
fallback per step 4. The existing v1 regression tests must
460+
continue to pass under the new reader.
432461

433462
Write path (`WriteSnapshotHeader`):
434463

@@ -501,10 +530,18 @@ Two-tier hierarchy:
501530
No default. If `--encryption-enabled` is set without a KEK source,
502531
the process refuses to start.
503532

504-
- **DEK (Data Encryption Key).** 32-byte AES key generated locally
505-
with `crypto/rand`. Two DEKs are issued in v1: `dek_storage` (used
506-
by §4.1) and `dek_raft` (used by §4.2). Both are wrapped by the KEK
507-
and persisted in a sidecar file:
533+
- **DEK (Data Encryption Key).** 32-byte AES key. Two DEKs are
534+
issued in v1: `dek_storage` (used by §4.1) and `dek_raft`
535+
(used by §4.2). DEKs are **generated on the leader with
536+
`crypto/rand` and committed via Raft** — never independently
537+
on each node — so every replica observes the same `key_id`
538+
for the same DEK at the same log index. The bootstrap and
539+
rotation flow that enforces this is in §5.6; locally
540+
generating a DEK without going through Raft would let
541+
different replicas pick different `key_id`s for what is
542+
conceptually the same key, breaking
543+
snapshot/catch-up decryption. Each node persists the
544+
KEK-wrapped DEK in a sidecar file:
508545

509546
```text
510547
<dataDir>/encryption/keys.json
@@ -836,6 +873,68 @@ section only protects against the narrower failure of "DEK was
836873
rotated cluster-wide but this one node missed the sidecar
837874
rewrite."
838875

876+
### 5.6 Initial DEK bootstrap
877+
878+
Both rotation (§5.2) and the writer registry (§4.1) assume that
879+
every replica observes the same `(active_storage_dek_id,
880+
active_raft_dek_id)` tuple at every Raft log index. If each
881+
node generated its initial DEKs locally with `crypto/rand` at
882+
process start, three replicas would pick three different
883+
`key_id`s for what should be one cluster-wide active DEK.
884+
Storage encryption would then start producing values whose
885+
ciphertext is decryptable only on the writing node; snapshot
886+
streaming and follower catch-up would fail with
887+
`unknown_key_id` the moment the first encrypted write left the
888+
writing node. Bootstrap is therefore **never** a local-only
889+
operation — the first DEKs are always created via Raft, the
890+
same way as every subsequent rotation:
891+
892+
1. **Trigger.** The bootstrap is implicit in
893+
`enable-storage-envelope` (§7.1 Phase 1). The admin command,
894+
on the leader, checks whether the sidecar's `active.storage`
895+
and `active.raft` fields are unset (`key_id == 0`). If so, it
896+
first proposes a `bootstrap-encryption` Raft entry containing
897+
freshly-generated `dek_storage` + `dek_raft` (CSPRNG via
898+
`crypto/rand`, wrapped under the current KEK), then proposes
899+
the actual `enable-storage-envelope` flag entry. The two
900+
entries are pipelined but not atomic; the order is fixed
901+
so the flag entry is always preceded by a bootstrap that
902+
has already populated every node's sidecar.
903+
2. **Idempotency.** `bootstrap-encryption` is **rejected** by
904+
FSM apply if the sidecar already has an active storage DEK
905+
(the leader's pre-check above is a fast path; the FSM check
906+
is the load-bearing one). This prevents an operator who
907+
accidentally re-runs `enable-storage-envelope` from
908+
replacing the existing DEKs with a fresh pair and orphaning
909+
every value already encrypted under the old DEKs.
910+
3. **Phase 2 reuse.** `enable-raft-envelope` does **not**
911+
re-bootstrap — it relies on `dek_raft` being already
912+
present from step 1. If the operator somehow skips Phase 1
913+
and tries Phase 2 directly, the leader-side pre-check
914+
refuses with `ErrEncryptionNotBootstrapped`.
915+
4. **Writer registry interaction.** A node performs its first
916+
`RegisterEncryptionWriter` (§4.1) **after** it has
917+
observed both the bootstrap entry and the
918+
`enable-storage-envelope` flag entry in its applied log,
919+
not at process start. Until both are present the storage
920+
layer writes cleartext (`encryption_state = 0b00`) and the
921+
registry is untouched. This prevents a brand-new node from
922+
trying to register against a DEK that does not yet exist
923+
cluster-wide.
924+
5. **Sidecar reconciliation interaction.** The
925+
`bootstrap-encryption` entry is `isEncryptionRelevant()`
926+
per §5.5, so the sidecar lag check covers it; a node that
927+
crashes between Raft apply and sidecar fsync of the
928+
bootstrap entry refuses to start with
929+
`ErrSidecarBehindRaftLog` and recovers via the same
930+
automatic-leader-rewrap or manual `resync-sidecar` paths.
931+
932+
The asymmetry with §5.2 rotation is deliberate. Rotation
933+
appends a *new* DEK to a non-empty `keys` map (and shifts
934+
`active`); bootstrap creates the *first* DEK in an empty map.
935+
They share the FSM apply path but the leader-side
936+
pre-conditions and idempotency rules are different.
937+
839938
---
840939

841940
## 6. Implementation plan
@@ -1134,12 +1233,23 @@ flag becomes a *capability* assertion, not a *behaviour* trigger.
11341233

11351234
4. Operator runs `elastickv-admin encryption enable-storage-envelope`.
11361235
The leader rechecks the membership-snapshot capability gate
1137-
from step 3 and then proposes a single Raft entry with the
1138-
cluster-wide flag `storage_envelope_active = true`. This
1139-
entry's `Data []byte` is the **legacy framing** (i.e., the
1140-
existing `kv/fsm.go` first-byte tag space; see Phase 2's note
1141-
below) so every replica — even one that somehow missed the
1142-
capability advert — can still decode and apply it.
1236+
from step 3, then performs the **DEK bootstrap if necessary**
1237+
per §5.6: if the leader's sidecar shows `active.storage == 0`
1238+
/ `active.raft == 0`, it first proposes a
1239+
`bootstrap-encryption` Raft entry that creates the initial
1240+
`(dek_storage, dek_raft)` pair and commits them to every
1241+
node's sidecar through the standard FSM apply path. Only
1242+
after that bootstrap entry has been applied does the leader
1243+
propose the actual `enable-storage-envelope` flag entry.
1244+
This sequencing is mandatory — flipping the flag without a
1245+
committed cluster-wide DEK pair would let each replica
1246+
encrypt under its own locally-generated `key_id` and break
1247+
snapshot/catch-up decryption. The flag entry's `Data []byte`
1248+
is the **legacy framing** (i.e., the existing `kv/fsm.go`
1249+
first-byte tag space; see Phase 2's note below) so every
1250+
replica — even one that somehow missed the capability
1251+
advert — can still decode and apply it. The bootstrap entry
1252+
is similarly in legacy framing for the same reason.
11431253
5. From the apply index of that flag entry onward, every storage
11441254
layer Put writes `encryption_state = 0b01` and the §4.1
11451255
envelope. MVCC versions written before that index keep
@@ -1400,6 +1510,11 @@ The process refuses to start if any of the following hold:
14001510
that joined mid-Phase-2 from a snapshot taken before Phase 2
14011511
was enabled, then later replayed Raft entries that crossed
14021512
the cutover; resolved by `encryption resync-sidecar`.
1513+
- `enable-raft-envelope` was issued before the §5.6
1514+
`bootstrap-encryption` entry committed (`ErrEncryptionNotBootstrapped`).
1515+
Operator must run `enable-storage-envelope` first; the
1516+
bootstrap is implicit in that command and creates both the
1517+
storage and raft DEKs in a single Raft entry.
14031518

14041519
Each refusal logs a single, unambiguous error pointing at the
14051520
relevant flag and runbook section.

0 commit comments

Comments
 (0)