Skip to content

Commit 51ffb79

Browse files
committed
docs(encryption): address PR707 round-8 (codex P1x2 + claude[bot] r6 critical/medium)
P1 (codex on 0a82420 line 190): nonce uniqueness check covered only voting members. Learners apply writes too, and historical members may have written under the still-current DEK -- with local_epoch/write_count restarting from low values, a colliding new node would reuse (key, nonce) pairs (catastrophic AES-GCM). Add a Raft-replicated writer registry !encryption|writers|<dek_id>|<uint16(node_id)> tracking full_node_id; first write under a DEK requires a RegisterEncryptionWriter proposal that rejects collisions cluster-wide and lifetime-wide. Retiring a DEK drops its registry slice. Startup membership check stays as a fast pre-check. P1 (codex on 0a82420 line 332) + claude[bot] r6 #1/#5/#7: snapshot v1/v2 detection used a value-range heuristic on the post-magic byte that could misclassify v1 streams (HLC ceiling high-byte can legitimately be small). Replace with distinct-magic discrimination: v1 keeps the existing hlcSnapshotMagic (EKVTHLC1), v2 uses new hlcSnapshotMagicV2 (EKVTHLC2). Read path is bytes.Equal on the leading 8 bytes. Drop the unexplained reserved-bytes block and the bogus len=0x40 (now len=0x10 for the two defined fields). Critical (claude[bot] r6 #1): §4.4 named the wrong magic constant -- pebbleSnapshotMagic=EKVPBBL1 belongs to store/snapshot_pebble.go, but the snapshot carrying lastCommitTS lives in kv/snapshot.go and uses hlcSnapshotMagic=EKVTHLC1. Corrected throughout §4.4 (an implementation following the doc would have failed v1 detection). Medium (claude[bot] r6 #4): §5.5 step 3 manual recovery said "rotation and rewrap-deks entries" but the lag check above already defines isEncryptionRelevant() covering five entry types. Use the same predicate in the recovery description so a missed enable-storage-envelope or enable-raft-envelope is replayed.
1 parent 0a82420 commit 51ffb79

1 file changed

Lines changed: 137 additions & 60 deletions

File tree

docs/design/2026_04_29_proposed_data_at_rest_encryption.md

Lines changed: 137 additions & 60 deletions
Original file line numberDiff line numberDiff line change
@@ -184,17 +184,68 @@ Envelope format (single byte stream, stored as the Pebble value):
184184
`internal/raftengine/etcd/peers.go::DeriveNodeID(--raftId)`,
185185
which today returns the FNV-64a hash of the raft-id string
186186
as a `uint64`. Truncation rule: `node_id := uint16(DeriveNodeID(raftID))`.
187-
On every process start the encryption package reads the
188-
current Raft membership snapshot and asserts that **no two
189-
voting members share the same lower-16-bit value of their
190-
derived node ID** — refuses to start with
191-
`ErrNodeIDCollision` otherwise (see §9.1). The check is at
192-
most O(num_voting_members²) comparisons (single-digit
193-
cluster sizes today); the cost is paid once at boot. For
194-
elastickv's current `"n1"`/`"n2"`/`"n3"` ID convention the
195-
16-bit values are well-distributed and collisions are
196-
practically impossible, but the assertion makes the nonce
197-
uniqueness invariant unconditional rather than statistical.
187+
188+
**Cluster-wide, lifetime-wide uniqueness via a Raft-replicated
189+
writer registry.** Checking only the current voting
190+
membership at boot is not enough — learners also apply
191+
writes (so they encrypt under the storage DEK), and
192+
membership churn during a DEK's lifetime can let a new
193+
node share a truncated `node_id` with a former member that
194+
already wrote under the same DEK. With `local_epoch` and
195+
`write_count` both restarting from low values on the new
196+
node, that overlap reuses `(DEK, nonce)` pairs the former
197+
member already produced — catastrophic AES-GCM failure.
198+
199+
To make the uniqueness invariant unconditional, the
200+
encryption package maintains a Raft-replicated **writer
201+
registry** keyed per DEK:
202+
203+
```text
204+
!encryption|writers|<dek_id>|<uint16(node_id)> →
205+
{ full_node_id: uint64,
206+
first_seen_local_epoch: uint16,
207+
last_seen_local_epoch: uint16 }
208+
```
209+
210+
Before a node performs **its first write under a DEK**
211+
(whether at process start under an existing DEK, or
212+
immediately after a rotation that created a new active
213+
DEK), it proposes a `RegisterEncryptionWriter(dek_id,
214+
full_node_id, local_epoch)` Raft entry. FSM apply checks:
215+
216+
1. If no entry exists at the registry key → insert,
217+
allowing the registration. The node may now write under
218+
this DEK.
219+
2. If an entry exists with the **same** `full_node_id` →
220+
this is a re-registration (post-restart). Bump
221+
`last_seen_local_epoch` and allow. The node was already
222+
a legitimate writer.
223+
3. If an entry exists with a **different** `full_node_id`
224+
→ reject the registration with `ErrNodeIDCollision`.
225+
The proposing node refuses to start; the operator must
226+
choose a different `--raftId` whose hash truncates
227+
differently before the node can join the cluster as a
228+
writer.
229+
230+
Retiring a DEK (per §5.4 retirement criterion) drops the
231+
entire `!encryption|writers|<dek_id>|*` slice in the same
232+
Raft entry that retires the DEK, so the registry never
233+
grows unboundedly. The check covers voters, learners,
234+
historical members, and any future replica that ever
235+
applies a write under any unretired DEK.
236+
237+
The startup membership-snapshot check is retained as a
238+
fast pre-check (one O(num_members²) comparison at boot to
239+
catch obvious operator mistakes before the first
240+
`RegisterEncryptionWriter` round-trip), but the registry
241+
is the load-bearing mechanism for the safety guarantee.
242+
243+
For elastickv's current `"n1"`/`"n2"`/`"n3"` ID convention
244+
the 16-bit values are well-distributed and registry
245+
collisions are practically impossible — the registry's
246+
role is to make the impossibility audit-able and to fail
247+
safely in the rare event that ID strings happen to
248+
collide.
198249
- `local_epoch` — 16-bit per-DEK process-load counter,
199250
persisted in the local sidecar (§5.1). Incremented and
200251
fsync'd **before** the new process performs any encryption,
@@ -310,56 +361,74 @@ layer.
310361
#### Snapshot header format evolution (versioned)
311362
312363
Today's `kv/snapshot.go` writes a fixed 16-byte header — 8-byte
313-
`pebbleSnapshotMagic` + 8-byte `lastCommitTS` (HLC ceiling). To
314-
carry the §7.1 Phase 2 `raft_envelope_cutover_index` (and to leave
315-
room for later metadata) without breaking restore on existing
316-
clusters, the header is versioned **explicitly** rather than
317-
silently lengthened. Layout going forward:
364+
`hlcSnapshotMagic` (`EKVTHLC1`) followed by an 8-byte
365+
`lastCommitTS` (HLC ceiling). To carry the §7.1 Phase 2
366+
`raft_envelope_cutover_index` (and to leave room for later
367+
metadata) without breaking restore on existing clusters, the
368+
header is versioned via a **distinct magic per version** rather
369+
than a value-range heuristic on a shared magic. Discrimination is
370+
then unambiguous: read the 8 magic bytes, switch on the exact
371+
match. No "is this small value a ceiling-high-byte or a version
372+
number?" guesswork — that heuristic was retracted because HLC
373+
ceilings can legitimately have a low high-byte (e.g. an early-
374+
epoch cluster), which would misclassify v1 streams as v2.
375+
376+
Layout going forward:
318377
319378
```text
320-
+------------+--------+--------+----------+----------+----------+
321-
| magic(8) | ver(1) | len(2) | ceiling | cutover | reserved |
322-
| EKVPBBL1 | 0x02 | 0x40 | 8B | 8B | ... |
323-
+------------+--------+--------+----------+----------+----------+
379+
v1 (legacy, unchanged):
380+
+------------+----------+
381+
| magic(8) | ceiling |
382+
| EKVTHLC1 | 8B |
383+
+------------+----------+
384+
385+
v2 (Phase-2-aware):
386+
+------------+--------+----------+----------+
387+
| magic(8) | len(2) | ceiling | cutover |
388+
| EKVTHLC2 | 0x10 | 8B | 8B |
389+
+------------+--------+----------+----------+
324390
```
325391

326-
- `magic` — unchanged 8-byte `pebbleSnapshotMagic` (`EKVPBBL1`)
327-
so a v1 reader still recognises the stream.
328-
- `ver` — header format version. `0x01` is the implicit v1
329-
format that has no `ver`/`len` fields and is exactly 16 bytes
330-
total (legacy detection: read 9 bytes, if byte 8 looks like
331-
the high byte of a plausible HLC `ceiling` rather than a
332-
small version number, fall back to v1 reader). `0x02` is the
333-
versioned format above.
334-
- `len` — uint16 little-endian total payload length **after**
335-
the `len` field itself, so a v3+ reader can skip unknown
336-
trailing fields without a forward-compat fault.
392+
- v1 magic remains the byte-for-byte
393+
`hlcSnapshotMagic = EKVTHLC1` defined in `kv/snapshot.go`, so
394+
every existing snapshot file restores under the new build with
395+
no migration step.
396+
- v2 introduces a new constant `hlcSnapshotMagicV2 = EKVTHLC2`
397+
(last byte bumped from `'1'` to `'2'`). The two magics share
398+
no overlap, so a `bytes.Equal` of the leading 8 bytes
399+
unambiguously selects the format.
400+
- `len` is uint16 big-endian, total payload length **after** the
401+
`len` field itself. v2 ships with the two defined fields
402+
(`ceiling`, `cutover`) for `len = 0x0010` (16 bytes); a future
403+
v2 writer may extend the payload by appending fields and
404+
bumping `len`. Older v2 readers ignore any bytes past the
405+
fields they understand (i.e., they read the first 16 bytes
406+
of payload and skip the rest using `len`). This is the
407+
forward-compat hatch within v2.
337408
- `ceiling` — same 8-byte HLC ceiling as v1; preserves the
338409
existing field semantics so the v2 path is a strict superset.
339-
- `cutover``raft_envelope_cutover_index` as a fixed
340-
big-endian uint64. `0` means "Phase 2 not yet enabled at
341-
snapshot time", which is the correct value for a snapshot
342-
taken in Phase 0 or Phase 1.
343-
- `reserved` — zero-padded for alignment / future fields. Any
344-
v2 reader that sees non-zero bytes here errors with
345-
`ErrSnapshotHeaderReserved` rather than silently ignoring
346-
them.
410+
- `cutover``raft_envelope_cutover_index` as a big-endian
411+
uint64. `0` means "Phase 2 not yet enabled at snapshot time",
412+
which is the correct value for a snapshot taken in Phase 0
413+
or Phase 1.
414+
415+
There is no `reserved`-bytes field; the earlier draft's
416+
`len = 0x40` (64 bytes) was a documentation error with no
417+
matching field count. The implementation must validate that the
418+
bytes consumed match `len` exactly.
347419

348420
Read path (`ReadSnapshotHeader`):
349421

350-
1. Read the 8-byte magic. Mismatch → `ErrSnapshotMagic`.
351-
2. Peek the next byte. If it falls in the documented v1
352-
`ceiling` high-byte range (i.e., the implicit-v1 detection
353-
above) treat the stream as v1, parse the remaining 8 bytes
354-
as `ceiling`, and return `cutover = 0`.
355-
3. Otherwise read it as `ver`. If `ver == 0x02`, read `len`
356-
and exactly that many bytes; parse `ceiling`, `cutover`.
357-
Trailing reserved bytes are validated as zero.
358-
4. If `ver` is unknown (`> 0x02` from a future writer), refuse
359-
the snapshot with `ErrSnapshotHeaderVersion` — restoring a
360-
forward-version snapshot under an older binary is **never**
361-
safe because we cannot prove the unknown extension is not
362-
semantically required.
422+
1. Read the 8-byte magic.
423+
2. If it equals `hlcSnapshotMagic` (`EKVTHLC1`): parse the next
424+
8 bytes as `ceiling`, return `cutover = 0`.
425+
3. If it equals `hlcSnapshotMagicV2` (`EKVTHLC2`): read `len`,
426+
then exactly `len` payload bytes; parse `ceiling` from the
427+
first 8, `cutover` from the next 8, ignore any trailing
428+
bytes (forward-compat within v2).
429+
4. Otherwise refuse the snapshot with
430+
`ErrSnapshotHeaderUnknownMagic`. Restoring a snapshot whose
431+
format we cannot identify is never safe.
363432

364433
Write path (`WriteSnapshotHeader`):
365434

@@ -376,8 +445,9 @@ Write path (`WriteSnapshotHeader`):
376445
This evolution rule keeps every existing v1 snapshot file
377446
restorable under the new build with no migration step, makes
378447
v2 the steady state once Phase 2 is enabled, and pushes
379-
forward-compat decisions to a hard error rather than silent
380-
data corruption.
448+
forward-compat decisions to a hard error on unknown magic
449+
rather than silent data corruption from a misclassified
450+
heuristic.
381451

382452
### 4.5 Distribution catalog and HLC ceiling entries
383453

@@ -736,12 +806,19 @@ To detect and repair sidecar-vs-log divergence:
736806
sidecar to the full set in one apply.
737807
- **Manual on a stuck follower** — operator runs
738808
`elastickv-admin encryption resync-sidecar`. The command
739-
replays *all* rotation and `rewrap-deks` entries between
740-
the sidecar's `raft_applied_index` and the FSM's applied
741-
index into the local sidecar (not just the most recent
742-
one), then exits. Replaying only the active DEK would
743-
leave intermediate `key_id`s missing and silently break
744-
historical reads on that node — explicitly rejected.
809+
replays **every entry that satisfies the
810+
`isEncryptionRelevant()` predicate** (the same set used by
811+
the §5.5 lag check above: `rotate-dek`, `rewrap-deks`,
812+
`enable-storage-envelope`, `enable-raft-envelope`,
813+
`retire-dek`) between the sidecar's `raft_applied_index`
814+
and the FSM's applied index into the local sidecar (not
815+
just the most recent rotation, and not just rotation /
816+
rewrap entries). Restricting the replay to a subset would
817+
leave intermediate `key_id`s missing OR leave the
818+
`raft_envelope_cutover_index` unset — silently breaking
819+
historical reads or causing post-cutover Raft entries to
820+
be decoded through the legacy first-byte tag space.
821+
Explicitly rejected.
745822
Refusing to start until recovery completes is deliberate:
746823
silently serving with an incomplete sidecar would let the
747824
node write under one `key_id` while failing to decrypt

0 commit comments

Comments
 (0)