@@ -184,17 +184,68 @@ Envelope format (single byte stream, stored as the Pebble value):
184184 ` internal/raftengine/etcd/peers.go::DeriveNodeID(--raftId) ` ,
185185 which today returns the FNV-64a hash of the raft-id string
186186 as a ` uint64 ` . Truncation rule: ` node_id := uint16(DeriveNodeID(raftID)) ` .
187- On every process start the encryption package reads the
188- current Raft membership snapshot and asserts that ** no two
189- voting members share the same lower-16-bit value of their
190- derived node ID** — refuses to start with
191- ` ErrNodeIDCollision ` otherwise (see §9.1). The check is at
192- most O(num_voting_members²) comparisons (single-digit
193- cluster sizes today); the cost is paid once at boot. For
194- elastickv's current ` "n1" ` /` "n2" ` /` "n3" ` ID convention the
195- 16-bit values are well-distributed and collisions are
196- practically impossible, but the assertion makes the nonce
197- uniqueness invariant unconditional rather than statistical.
187+
188+ ** Cluster-wide, lifetime-wide uniqueness via a Raft-replicated
189+ writer registry.** Checking only the current voting
190+ membership at boot is not enough — learners also apply
191+ writes (so they encrypt under the storage DEK), and
192+ membership churn during a DEK's lifetime can let a new
193+ node share a truncated ` node_id ` with a former member that
194+ already wrote under the same DEK. With ` local_epoch ` and
195+ ` write_count ` both restarting from low values on the new
196+ node, that overlap reuses ` (DEK, nonce) ` pairs the former
197+ member already produced — catastrophic AES-GCM failure.
198+
199+ To make the uniqueness invariant unconditional, the
200+ encryption package maintains a Raft-replicated ** writer
201+ registry** keyed per DEK:
202+
203+ ``` text
204+ !encryption|writers|<dek_id>|<uint16(node_id)> →
205+ { full_node_id: uint64,
206+ first_seen_local_epoch: uint16,
207+ last_seen_local_epoch: uint16 }
208+ ```
209+
210+ Before a node performs **its first write under a DEK**
211+ (whether at process start under an existing DEK, or
212+ immediately after a rotation that created a new active
213+ DEK), it proposes a `RegisterEncryptionWriter(dek_id,
214+ full_node_id, local_epoch)` Raft entry. FSM apply checks:
215+
216+ 1. If no entry exists at the registry key → insert,
217+ allowing the registration. The node may now write under
218+ this DEK.
219+ 2. If an entry exists with the **same** `full_node_id` →
220+ this is a re-registration (post-restart). Bump
221+ `last_seen_local_epoch` and allow. The node was already
222+ a legitimate writer.
223+ 3. If an entry exists with a **different** `full_node_id`
224+ → reject the registration with `ErrNodeIDCollision`.
225+ The proposing node refuses to start; the operator must
226+ choose a different `--raftId` whose hash truncates
227+ differently before the node can join the cluster as a
228+ writer.
229+
230+ Retiring a DEK (per §5.4 retirement criterion) drops the
231+ entire `!encryption|writers|<dek_id>|*` slice in the same
232+ Raft entry that retires the DEK, so the registry never
233+ grows unboundedly. The check covers voters, learners,
234+ historical members, and any future replica that ever
235+ applies a write under any unretired DEK.
236+
237+ The startup membership-snapshot check is retained as a
238+ fast pre-check (one O(num_members²) comparison at boot to
239+ catch obvious operator mistakes before the first
240+ `RegisterEncryptionWriter` round-trip), but the registry
241+ is the load-bearing mechanism for the safety guarantee.
242+
243+ For elastickv's current `"n1"`/`"n2"`/`"n3"` ID convention
244+ the 16-bit values are well-distributed and registry
245+ collisions are practically impossible — the registry's
246+ role is to make the impossibility audit-able and to fail
247+ safely in the rare event that ID strings happen to
248+ collide.
198249 - `local_epoch` — 16-bit per-DEK process-load counter,
199250 persisted in the local sidecar (§5.1). Incremented and
200251 fsync'd **before** the new process performs any encryption,
@@ -310,56 +361,74 @@ layer.
310361#### Snapshot header format evolution (versioned)
311362
312363Today's `kv/snapshot.go` writes a fixed 16-byte header — 8-byte
313- ` pebbleSnapshotMagic ` + 8-byte ` lastCommitTS ` (HLC ceiling). To
314- carry the §7.1 Phase 2 ` raft_envelope_cutover_index ` (and to leave
315- room for later metadata) without breaking restore on existing
316- clusters, the header is versioned ** explicitly** rather than
317- silently lengthened. Layout going forward:
364+ `hlcSnapshotMagic` (`EKVTHLC1`) followed by an 8-byte
365+ `lastCommitTS` (HLC ceiling). To carry the §7.1 Phase 2
366+ `raft_envelope_cutover_index` (and to leave room for later
367+ metadata) without breaking restore on existing clusters, the
368+ header is versioned via a **distinct magic per version** rather
369+ than a value-range heuristic on a shared magic. Discrimination is
370+ then unambiguous: read the 8 magic bytes, switch on the exact
371+ match. No "is this small value a ceiling-high-byte or a version
372+ number?" guesswork — that heuristic was retracted because HLC
373+ ceilings can legitimately have a low high-byte (e.g. an early-
374+ epoch cluster), which would misclassify v1 streams as v2.
375+
376+ Layout going forward:
318377
319378```text
320- +------------+--------+--------+----------+----------+----------+
321- | magic(8) | ver(1) | len(2) | ceiling | cutover | reserved |
322- | EKVPBBL1 | 0x02 | 0x40 | 8B | 8B | ... |
323- +------------+--------+--------+----------+----------+----------+
379+ v1 (legacy, unchanged):
380+ +------------+----------+
381+ | magic(8) | ceiling |
382+ | EKVTHLC1 | 8B |
383+ +------------+----------+
384+
385+ v2 (Phase-2-aware):
386+ +------------+--------+----------+----------+
387+ | magic(8) | len(2) | ceiling | cutover |
388+ | EKVTHLC2 | 0x10 | 8B | 8B |
389+ +------------+--------+----------+----------+
324390```
325391
326- - ` magic ` — unchanged 8-byte ` pebbleSnapshotMagic ` (` EKVPBBL1 ` )
327- so a v1 reader still recognises the stream.
328- - ` ver ` — header format version. ` 0x01 ` is the implicit v1
329- format that has no ` ver ` /` len ` fields and is exactly 16 bytes
330- total (legacy detection: read 9 bytes, if byte 8 looks like
331- the high byte of a plausible HLC ` ceiling ` rather than a
332- small version number, fall back to v1 reader). ` 0x02 ` is the
333- versioned format above.
334- - ` len ` — uint16 little-endian total payload length ** after**
335- the ` len ` field itself, so a v3+ reader can skip unknown
336- trailing fields without a forward-compat fault.
392+ - v1 magic remains the byte-for-byte
393+ ` hlcSnapshotMagic = EKVTHLC1 ` defined in ` kv/snapshot.go ` , so
394+ every existing snapshot file restores under the new build with
395+ no migration step.
396+ - v2 introduces a new constant ` hlcSnapshotMagicV2 = EKVTHLC2 `
397+ (last byte bumped from ` '1' ` to ` '2' ` ). The two magics share
398+ no overlap, so a ` bytes.Equal ` of the leading 8 bytes
399+ unambiguously selects the format.
400+ - ` len ` is uint16 big-endian, total payload length ** after** the
401+ ` len ` field itself. v2 ships with the two defined fields
402+ (` ceiling ` , ` cutover ` ) for ` len = 0x0010 ` (16 bytes); a future
403+ v2 writer may extend the payload by appending fields and
404+ bumping ` len ` . Older v2 readers ignore any bytes past the
405+ fields they understand (i.e., they read the first 16 bytes
406+ of payload and skip the rest using ` len ` ). This is the
407+ forward-compat hatch within v2.
337408- ` ceiling ` — same 8-byte HLC ceiling as v1; preserves the
338409 existing field semantics so the v2 path is a strict superset.
339- - ` cutover ` — ` raft_envelope_cutover_index ` as a fixed
340- big-endian uint64. ` 0 ` means "Phase 2 not yet enabled at
341- snapshot time", which is the correct value for a snapshot
342- taken in Phase 0 or Phase 1.
343- - ` reserved ` — zero-padded for alignment / future fields. Any
344- v2 reader that sees non-zero bytes here errors with
345- ` ErrSnapshotHeaderReserved ` rather than silently ignoring
346- them.
410+ - ` cutover ` — ` raft_envelope_cutover_index ` as a big-endian
411+ uint64. ` 0 ` means "Phase 2 not yet enabled at snapshot time",
412+ which is the correct value for a snapshot taken in Phase 0
413+ or Phase 1.
414+
415+ There is no ` reserved ` -bytes field; the earlier draft's
416+ ` len = 0x40 ` (64 bytes) was a documentation error with no
417+ matching field count. The implementation must validate that the
418+ bytes consumed match ` len ` exactly.
347419
348420Read path (` ReadSnapshotHeader ` ):
349421
350- 1 . Read the 8-byte magic. Mismatch → ` ErrSnapshotMagic ` .
351- 2 . Peek the next byte. If it falls in the documented v1
352- ` ceiling ` high-byte range (i.e., the implicit-v1 detection
353- above) treat the stream as v1, parse the remaining 8 bytes
354- as ` ceiling ` , and return ` cutover = 0 ` .
355- 3 . Otherwise read it as ` ver ` . If ` ver == 0x02 ` , read ` len `
356- and exactly that many bytes; parse ` ceiling ` , ` cutover ` .
357- Trailing reserved bytes are validated as zero.
358- 4 . If ` ver ` is unknown (` > 0x02 ` from a future writer), refuse
359- the snapshot with ` ErrSnapshotHeaderVersion ` — restoring a
360- forward-version snapshot under an older binary is ** never**
361- safe because we cannot prove the unknown extension is not
362- semantically required.
422+ 1 . Read the 8-byte magic.
423+ 2 . If it equals ` hlcSnapshotMagic ` (` EKVTHLC1 ` ): parse the next
424+ 8 bytes as ` ceiling ` , return ` cutover = 0 ` .
425+ 3 . If it equals ` hlcSnapshotMagicV2 ` (` EKVTHLC2 ` ): read ` len ` ,
426+ then exactly ` len ` payload bytes; parse ` ceiling ` from the
427+ first 8, ` cutover ` from the next 8, ignore any trailing
428+ bytes (forward-compat within v2).
429+ 4 . Otherwise refuse the snapshot with
430+ ` ErrSnapshotHeaderUnknownMagic ` . Restoring a snapshot whose
431+ format we cannot identify is never safe.
363432
364433Write path (` WriteSnapshotHeader ` ):
365434
@@ -376,8 +445,9 @@ Write path (`WriteSnapshotHeader`):
376445This evolution rule keeps every existing v1 snapshot file
377446restorable under the new build with no migration step, makes
378447v2 the steady state once Phase 2 is enabled, and pushes
379- forward-compat decisions to a hard error rather than silent
380- data corruption.
448+ forward-compat decisions to a hard error on unknown magic
449+ rather than silent data corruption from a misclassified
450+ heuristic.
381451
382452### 4.5 Distribution catalog and HLC ceiling entries
383453
@@ -736,12 +806,19 @@ To detect and repair sidecar-vs-log divergence:
736806 sidecar to the full set in one apply.
737807 - ** Manual on a stuck follower** — operator runs
738808 ` elastickv-admin encryption resync-sidecar ` . The command
739- replays * all* rotation and ` rewrap-deks ` entries between
740- the sidecar's ` raft_applied_index ` and the FSM's applied
741- index into the local sidecar (not just the most recent
742- one), then exits. Replaying only the active DEK would
743- leave intermediate ` key_id ` s missing and silently break
744- historical reads on that node — explicitly rejected.
809+ replays ** every entry that satisfies the
810+ ` isEncryptionRelevant() ` predicate** (the same set used by
811+ the §5.5 lag check above: ` rotate-dek ` , ` rewrap-deks ` ,
812+ ` enable-storage-envelope ` , ` enable-raft-envelope ` ,
813+ ` retire-dek ` ) between the sidecar's ` raft_applied_index `
814+ and the FSM's applied index into the local sidecar (not
815+ just the most recent rotation, and not just rotation /
816+ rewrap entries). Restricting the replay to a subset would
817+ leave intermediate ` key_id ` s missing OR leave the
818+ ` raft_envelope_cutover_index ` unset — silently breaking
819+ historical reads or causing post-cutover Raft entries to
820+ be decoded through the legacy first-byte tag space.
821+ Explicitly rejected.
745822 Refusing to start until recovery completes is deliberate:
746823 silently serving with an incomplete sidecar would let the
747824 node write under one ` key_id ` while failing to decrypt
0 commit comments