Skip to content

Commit e887e9f

Browse files
committed
docs(encryption): address PR707 round-6 (codex P1 + coderabbit Major)
P1 (codex on 2768cb3 line 621): §5.5 startup lag check only flagged rotation entries in the gap, so a missed enable-raft-envelope cutover would silently slip past and the FSM would decode post-cutover entries through the legacy first-byte tag space. Generalise the lag check to cover every encryption-relevant entry: rotate-dek, rewrap-deks, enable-storage-envelope, enable-raft-envelope, retire-dek. Implementation derives the set from a single isEncryptionRelevant predicate so future commands cannot bypass the check. Major (coderabbitai on 2768cb3 line 1000): §7.1 Phase 2 step 9 said snapshots carry the cutover index but kv/snapshot.go currently writes a fixed 16-byte header with no version field. Specify a versioned header in §4.4: magic||ver||len||ceiling||cutover||reserved with implicit-v1 fallback for legacy snapshots, v2 only emitted after enable-raft-envelope has been observed locally, ErrSnapshotHeaderVersion on forward-unknown versions, and zero-validation on reserved bytes to prevent silent forward-compat slips.
1 parent 2768cb3 commit e887e9f

1 file changed

Lines changed: 121 additions & 20 deletions

File tree

docs/design/2026_04_29_proposed_data_at_rest_encryption.md

Lines changed: 121 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -307,9 +307,77 @@ ciphertext from §4.1, the FSM snapshot stream is ciphertext by
307307
construction. No additional wrapping is required at the snapshot
308308
layer.
309309

310-
The snapshot file header (the `pebbleSnapshotMagic` 8-byte magic plus
311-
`lastCommitTS`) stays cleartext so a snapshot reader can identify the
312-
format before attempting decryption. This is metadata, not user data.
310+
#### Snapshot header format evolution (versioned)
311+
312+
Today's `kv/snapshot.go` writes a fixed 16-byte header — 8-byte
313+
`pebbleSnapshotMagic` + 8-byte `lastCommitTS` (HLC ceiling). To
314+
carry the §7.1 Phase 2 `raft_envelope_cutover_index` (and to leave
315+
room for later metadata) without breaking restore on existing
316+
clusters, the header is versioned **explicitly** rather than
317+
silently lengthened. Layout going forward:
318+
319+
```text
320+
+------------+--------+--------+----------+----------+----------+
321+
| magic(8) | ver(1) | len(2) | ceiling | cutover | reserved |
322+
| EKVPBBL1 | 0x02 | 0x40 | 8B | 8B | ... |
323+
+------------+--------+--------+----------+----------+----------+
324+
```
325+
326+
- `magic` — unchanged 8-byte `pebbleSnapshotMagic` (`EKVPBBL1`)
327+
so a v1 reader still recognises the stream.
328+
- `ver` — header format version. `0x01` is the implicit v1
329+
format that has no `ver`/`len` fields and is exactly 16 bytes
330+
total (legacy detection: read 9 bytes, if byte 8 looks like
331+
the high byte of a plausible HLC `ceiling` rather than a
332+
small version number, fall back to v1 reader). `0x02` is the
333+
versioned format above.
334+
- `len` — uint16 little-endian total payload length **after**
335+
the `len` field itself, so a v3+ reader can skip unknown
336+
trailing fields without a forward-compat fault.
337+
- `ceiling` — same 8-byte HLC ceiling as v1; preserves the
338+
existing field semantics so the v2 path is a strict superset.
339+
- `cutover``raft_envelope_cutover_index` as a fixed
340+
big-endian uint64. `0` means "Phase 2 not yet enabled at
341+
snapshot time", which is the correct value for a snapshot
342+
taken in Phase 0 or Phase 1.
343+
- `reserved` — zero-padded for alignment / future fields. Any
344+
v2 reader that sees non-zero bytes here errors with
345+
`ErrSnapshotHeaderReserved` rather than silently ignoring
346+
them.
347+
348+
Read path (`ReadSnapshotHeader`):
349+
350+
1. Read the 8-byte magic. Mismatch → `ErrSnapshotMagic`.
351+
2. Peek the next byte. If it falls in the documented v1
352+
`ceiling` high-byte range (i.e., the implicit-v1 detection
353+
above) treat the stream as v1, parse the remaining 8 bytes
354+
as `ceiling`, and return `cutover = 0`.
355+
3. Otherwise read it as `ver`. If `ver == 0x02`, read `len`
356+
and exactly that many bytes; parse `ceiling`, `cutover`.
357+
Trailing reserved bytes are validated as zero.
358+
4. If `ver` is unknown (`> 0x02` from a future writer), refuse
359+
the snapshot with `ErrSnapshotHeaderVersion` — restoring a
360+
forward-version snapshot under an older binary is **never**
361+
safe because we cannot prove the unknown extension is not
362+
semantically required.
363+
364+
Write path (`WriteSnapshotHeader`):
365+
366+
- Snapshots produced in Phase 0 / Phase 1 with no Phase-2
367+
cluster flag observed locally write **v1** for byte-for-byte
368+
compatibility with existing readers.
369+
- Snapshots produced after the local node has applied the
370+
`enable-raft-envelope` entry (i.e., the sidecar's
371+
`raft_envelope_cutover_index != 0`) write **v2**.
372+
- Once `enable-raft-envelope` has been observed once,
373+
subsequent snapshots stay on v2 forever; the writer never
374+
silently downgrades.
375+
376+
This evolution rule keeps every existing v1 snapshot file
377+
restorable under the new build with no migration step, makes
378+
v2 the steady state once Phase 2 is enabled, and pushes
379+
forward-compat decisions to a hard error rather than silent
380+
data corruption.
313381

314382
### 4.5 Distribution catalog and HLC ceiling entries
315383

@@ -600,26 +668,59 @@ retiring DEK.
600668

601669
### 5.5 Sidecar / Raft-log reconciliation
602670

603-
The DEK rotation flow in §5.2 says the FSM apply persists the new
604-
wrapped DEK into `keys.json` on every node. That apply is two
605-
operations on the local node — a Pebble write of the rotation log
606-
entry, then a sidecar file rewrite — and they are **not** atomic
607-
with respect to crash. A node that crashes between the two
608-
restarts with the rotation entry committed in its Raft log but the
609-
sidecar still on the previous generation, which would make a
610-
freshly-active DEK unloadable on that node.
611-
612-
To detect and repair this:
613-
614-
1. The sidecar carries `raft_applied_index` (§5.1), updated in the
615-
same `os.WriteFile` + `os.Rename` that persists the new wrapped
616-
DEKs.
671+
The DEK rotation flow in §5.2 (and the format-cutover entries in
672+
§7.1) says the FSM apply persists new encryption state — wrapped
673+
DEKs and the `raft_envelope_cutover_index` — into `keys.json` on
674+
every node. That apply is two operations on the local node — a
675+
Pebble write of the log entry, then a sidecar file rewrite — and
676+
they are **not** atomic with respect to crash. A node that
677+
crashes between the two restarts with the entry committed in its
678+
Raft log but the sidecar still on the previous generation, which
679+
would make a freshly-active DEK unloadable on that node, or
680+
worse, would make the node decode post-cutover Raft entries with
681+
the wrong format.
682+
683+
The set of "encryption-relevant" Raft entries that the lag check
684+
must cover is therefore broader than just rotation:
685+
686+
- `rotate-dek` — adds a new wrapped DEK to the sidecar.
687+
- `rewrap-deks` — re-wraps existing DEKs under the current KEK
688+
(full-set, per §5.5 step 3); also the auto-recovery proposal
689+
the leader uses to bring drifted nodes back in sync.
690+
- `enable-storage-envelope` — flips the §7.1 Phase 1 cluster
691+
flag and persists `storage_envelope_active = true` into the
692+
sidecar.
693+
- `enable-raft-envelope` — flips the §7.1 Phase 2 cluster flag
694+
AND persists the `raft_envelope_cutover_index` into the
695+
sidecar. This entry is the highest-stakes of the four:
696+
missing it on restart causes the FSM to decode every
697+
post-cutover entry through the legacy first-byte tag space
698+
(treating an opaque raft envelope as if it were an
699+
`0x00`/`0x01`/`0x02` tagged operation), which is a
700+
guaranteed mis-dispatch leading to apply failure or, worse,
701+
silent wrong-answer on `0x02`-shaped envelope bytes.
702+
- `retire-dek` — drops a DEK from the sidecar's `keys` map.
703+
704+
Any entry that mutates `keys.json` content is in this set. The
705+
implementation should derive the set from a single
706+
`isEncryptionRelevant(entry)` predicate so a future encryption
707+
admin command cannot quietly slip past the lag check.
708+
709+
To detect and repair sidecar-vs-log divergence:
710+
711+
1. The sidecar carries `raft_applied_index` (§5.1), updated in
712+
the same `os.WriteFile` + `os.Rename` that persists every
713+
encryption-relevant entry above.
617714
2. On startup, the encryption package reads the sidecar's
618715
`raft_applied_index`, then reads the raftengine's persisted
619716
applied index. If the raftengine has progressed past the
620-
sidecar's index AND the gap covers any rotation entries, the
621-
node refuses to start with `ErrSidecarBehindRaftLog`,
622-
pointing at the operator runbook.
717+
sidecar's index AND **the gap covers any encryption-relevant
718+
entry** (per the list above — not just rotation), the node
719+
refuses to start with `ErrSidecarBehindRaftLog`, pointing at
720+
the operator runbook. Restricting the check to rotations
721+
only would let a missed `enable-raft-envelope` slip past
722+
and cause silent mis-dispatch on the next post-cutover
723+
entry.
623724
3. Recovery rewraps **every unretired DEK**, not just the active
624725
one. A node that missed multiple rotations needs *every*
625726
intermediate `key_id` to decrypt MVCC history that still

0 commit comments

Comments
 (0)