@@ -180,10 +180,21 @@ Envelope format (single byte stream, stored as the Pebble value):
180180 +-------------+----------------+----------------+
181181 ```
182182
183- - ` node_id ` — the 16-bit Raft member ID assigned at cluster
184- bootstrap. Never reused for a different node within the
185- cluster's lifetime; verified at process start against the
186- membership snapshot.
183+ - ` node_id ` — the lower 16 bits of
184+ ` internal/raftengine/etcd/peers.go::DeriveNodeID(--raftId) ` ,
185+ which today returns the FNV-64a hash of the raft-id string
186+ as a ` uint64 ` . Truncation rule: ` node_id := uint16(DeriveNodeID(raftID)) ` .
187+ On every process start the encryption package reads the
188+ current Raft membership snapshot and asserts that ** no two
189+ voting members share the same lower-16-bit value of their
190+ derived node ID** — refuses to start with
191+ ` ErrNodeIDCollision ` otherwise (see §9.1). The check is at
192+ most O(num_voting_members²) comparisons (single-digit
193+ cluster sizes today); the cost is paid once at boot. For
194+ elastickv's current ` "n1" ` /` "n2" ` /` "n3" ` ID convention the
195+ 16-bit values are well-distributed and collisions are
196+ practically impossible, but the assertion makes the nonce
197+ uniqueness invariant unconditional rather than statistical.
187198 - ` local_epoch ` — 16-bit per-DEK process-load counter,
188199 persisted in the local sidecar (§5.1). Incremented and
189200 fsync'd ** before** the new process performs any encryption,
@@ -193,7 +204,16 @@ Envelope format (single byte stream, stored as the Pebble value):
193204 triggers (§5.2) reset it to 0 with the new DEK. The
194205 encryption package logs a warning when `local_epoch >
195206 0xff00` so an operator has a 256-restart cushion to rotate
196- before wrap.
207+ before wrap; if the operator ignores all 256 warnings and the
208+ epoch reaches ` 0xFFFF ` , the process ** refuses to start**
209+ with ` ErrLocalEpochExhausted ` (see §9.1). The hard stop is
210+ load-bearing: a wrap to 0 would re-issue the nonce
211+ ` node_id ‖ 0 ‖ 0 ` for the current DEK's first write of the
212+ new process load, exactly matching the very first nonce ever
213+ issued under that DEK — a catastrophic AES-GCM nonce reuse
214+ (key recovery + plaintext XOR). DEK rotation (§5.2) is the
215+ only recovery; it allocates a new DEK with a fresh
216+ ` local_epoch = 0 ` .
197217 - ` write_count ` — 64-bit ` atomic.Uint64 ` , incremented per
198218 write within the current process. Resets to 0 on process
199219 start (the ` local_epoch ` bump is what makes that safe).
@@ -407,14 +427,22 @@ Two-tier hierarchy:
407427 written under a DEK whose wrap is then lost, the sidecar write
408428 protocol is:
409429
410- 1 . Write the new contents to ` <dataDir>/encryption/keys.json.tmp ` .
411- 2 . ` file.Sync() ` on the temp file (fsync the data + metadata).
430+ 1 . Build the new sidecar contents in memory, including
431+ `raft_applied_index = <log index of the rotation entry being
432+ applied right now>` — this is part of the same JSON payload
433+ as the new wrapped DEKs, written in the same step 2 below.
434+ 2 . Write the new contents to ` <dataDir>/encryption/keys.json.tmp ` ,
435+ then ` file.Sync() ` (fsync the data + metadata).
412436 3 . ` os.Rename ` to ` keys.json ` .
413437 4 . ` dir.Sync() ` on ` <dataDir>/encryption/ ` (fsync the directory
414438 entry so the rename is durable).
415439 5 . Only after step 4 does the FSM acknowledge the rotation entry
416- as applied (and update ` raft_applied_index ` in memory; that
417- value is then persisted on the next sidecar write).
440+ as applied. The in-memory ` raft_applied_index ` matches the
441+ value that has just been durably written; ** no deferred-write
442+ window** in which the on-disk sidecar carries new DEKs but a
443+ stale index. This eliminates the spurious-` ErrSidecarBehindRaftLog `
444+ race that an "update on next write" scheme would create on a
445+ crash between steps 4 and the next rotation.
418446
419447 Skipping step 2 or 4 turns a power loss into permanent data loss
420448 for any value written under the new DEK; the §10 self-review
@@ -643,9 +671,17 @@ rewrite."
643671- ` kek/ ` subpackage — pluggable KEK providers (` file.go ` , ` awskms.go ` ,
644672 ` gcpkms.go ` , ` vault.go ` , ` env.go ` ). Each implements
645673 ` Wrap(dek []byte) (wrapped []byte, error) ` and `Unwrap(wrapped
646- [ ] byte) (dek [ ] byte, error)`.
647- - ` sidecar.go ` — atomic read/write of ` keys.json ` (write to
648- ` keys.json.tmp ` + ` os.Rename ` ).
674+ [ ] byte) (dek [ ] byte, error)` . The ` env.go` provider must call
675+ ` os.Unsetenv("ELASTICKV_KEK_BASE64") ` immediately after reading
676+ and decoding the value to narrow the window during which the
677+ raw KEK material is visible via ` /proc/<pid>/environ ` /
678+ container env inspection. (The recommendation in §5.1 stands:
679+ ` env.go ` is for tests and CI only, not production.)
680+ - ` sidecar.go ` — read/write of ` keys.json ` per the ** §5.1
681+ crash-durable write protocol** (write tmp → ` file.Sync() ` →
682+ ` os.Rename ` → ` dir.Sync() ` → ack). Skipping any of these
683+ fsyncs is a data-loss-class bug per §10 lens 1; do not
684+ shortcut to a plain ` os.Rename ` .
649685
650686### 6.2 Hooks into the storage layer
651687
@@ -666,11 +702,23 @@ coordinator does not pre-encrypt; the FSM does not bypass this path.
666702 (` [tombstone(1)] [expireAt(8)] [user_value...] ` , see
667703 ` encodeValue ` ) is extended so the leading byte encodes both the
668704 tombstone bit (bit 0) and the encryption-state bits (bits 1–2),
669- matching the 2-bit field in §7.1. Existing builds use only bit 0,
670- so an old build reading a value written by an encryption-enabled
671- build sees an unrecognised tombstone byte and refuses to serve it
672- rather than returning ciphertext as cleartext. ` valueHeaderSize `
673- stays at 9 bytes; no on-disk layout shift.
705+ matching the 2-bit field in §7.1. ` valueHeaderSize ` stays at 9
706+ bytes; no on-disk layout shift. Note: today's ` decodeValue `
707+ reads ` tombstone := data[0] != 0 ` , so an encrypted live value
708+ (` data[0] = 0b00000010 ` ) seen by an unmodified pre-encryption
709+ build would silently appear as a tombstone (key-not-found),
710+ not an error. The 3-phase rollout in §7.1 prevents this from
711+ ever happening on a real cluster — encrypted writes only begin
712+ after every replica has applied the
713+ ` enable-storage-envelope ` cluster-flag entry, which by
714+ precondition requires every voting member to be on a binary
715+ that knows the new bit layout. The §9.1 startup refusal "data
716+ dir contains a sidecar but ` --encryption-enabled ` is not set"
717+ is the second backstop. The implementation PR must add a
718+ comment to ` encodeValue ` / ` decodeValue ` pinning bit 0 =
719+ tombstone, bits 1–2 = ` encryption_state ` , bits 3–7 reserved,
720+ so future bit additions don't accidentally clobber the
721+ layout.
674722- ` store/snapshot_pebble.go ` — no change. It iterates Pebble values
675723 byte-for-byte, which are already ciphertext (with the
676724 ` encryption_state ` bit travelling in the value header).
@@ -859,13 +907,29 @@ flag becomes a *capability* assertion, not a *behaviour* trigger.
859907 cleartext Raft entries (raft envelope is gated on a separate
860908 flag in Phase 2).
8619092 . Each upgraded node advertises ` encryption_capable = true ` in
862- its periodic Raft heartbeat metadata (a new field on the
863- peer-metadata payload added by ` peer_metadata.go ` ). The flag
864- is purely informational at this point.
865- 3 . When the membership snapshot of every Raft group shows
866- ` encryption_capable = true ` for every voting member, Phase 0
867- is complete. ` encryption status ` reports the capability
868- coverage so the operator can wait for it before moving on.
910+ the ** peer-metadata extension** that today already piggy-backs
911+ on the etcd-raft heartbeat path
912+ (` internal/raftengine/etcd/peer_metadata.go ` ). A new
913+ ` encryption_capable bool ` field is added to the existing
914+ ` PeerMetadata ` proto; old binaries omit the field, which
915+ protobuf decodes as the zero value (` false ` ), so unscanned
916+ non-upgraded peers default to ` not capable ` without any
917+ special handling. The flag is purely informational at this
918+ point.
919+ 3 . The Phase-1 and Phase-2 cutover commands
920+ (` enable-storage-envelope ` , ` enable-raft-envelope ` ) ** do not**
921+ read cached ` encryption_capable ` state from the local
922+ keystore. Instead they force a fresh poll: the leader sends a
923+ heartbeat to every voting member of every Raft group and
924+ blocks until it has received a heartbeat response carrying
925+ ` encryption_capable = true ` from every member, all within one
926+ election timeout (configurable; default the existing
927+ ` RaftElectionTimeout ` ). If any member is unreachable or
928+ reports ` false ` within the window, the command returns
929+ ` ErrCapabilityCheckFailed ` with the offending node IDs and
930+ the operator must re-run after fixing them. Stale cached
931+ capability state therefore cannot trigger a premature
932+ cutover.
869933
870934** Phase 1 — Storage envelope cluster-flag flip.**
871935
@@ -903,14 +967,33 @@ flag becomes a *capability* assertion, not a *behaviour* trigger.
903967 Replicas dispatch on the ** Raft log index** of the entry
904968 relative to the persisted ` raft_envelope_cutover_index ` (the
905969 apply index of the flag entry, recorded in the local sidecar
906- on apply). Entries below the cutover index are decoded via
907- the existing first-byte tag space (` 0x00 ` single-request,
908- ` 0x01 ` batch, ` 0x02 ` HLC-lease, etc.); entries at-or-above
909- the cutover index are unwrapped through the raft envelope
910- first. This explicitly avoids re-using ` kv/fsm.go ` 's existing
911- first-byte tag values for the "is this an envelope?"
912- discriminator — that re-use would make pre-cutover batch
913- entries (` 0x01 ` ) ambiguous during WAL replay.
970+ on apply). Concretely, the FSM apply path becomes:
971+
972+ ``` text
973+ Apply(index, data):
974+ if index >= raft_envelope_cutover_index:
975+ data = raftDEK.Unwrap(data) // strip §4.2 envelope
976+ // existing dispatch on data[0], unchanged from today:
977+ switch data[0] {
978+ case raftEncodeSingle (0x00): decodeRaftRequests(data)
979+ case raftEncodeBatch (0x01): decodeRaftRequests(data)
980+ case raftEncodeHLCLease (0x02): applyHLCLease(data)
981+ }
982+ ```
983+
984+ The unwrap happens ** before** the existing
985+ ` kv/fsm.go::Apply ` dispatch on ` raftEncodeSingle ` /
986+ ` raftEncodeBatch ` / ` raftEncodeHLCLease ` , so HLC ceiling
987+ proposals (§4.5) are transparently decrypted then routed
988+ through the existing HLC path with no special-casing in the
989+ HLC handler. Crucially, this dispatch scheme ** does not
990+ reuse any of the existing ` kv/fsm.go ` first-byte tag
991+ values** for the encryption discriminator — an earlier
992+ draft proposed ` 0x00 ` /` 0x01 ` framing tags but those collide
993+ with ` raftEncodeSingle ` / ` raftEncodeBatch ` and would make
994+ every pre-cutover batch entry (` 0x01 ` ) look like a raft
995+ envelope to the FSM. Index gating sidesteps that collision
996+ by construction.
9149979 . Snapshots taken during Phase 2 carry the cutover index in
915998 their metadata header so a fresh follower joining
916999 mid-Phase-2 reconstructs the same dispatch boundary on
@@ -1058,6 +1141,16 @@ The process refuses to start if any of the following hold:
10581141- The encryption package's startup membership check fails to
10591142 resolve a stable 16-bit ` node_id ` (the §4.1 nonce construction
10601143 needs one; without it nonce uniqueness cannot be guaranteed).
1144+ - Two voting members of any Raft group share the same
1145+ ` uint16(DeriveNodeID(raftID)) ` (` ErrNodeIDCollision ` ). One of
1146+ them must be re-bootstrapped under a different ` --raftId ` whose
1147+ hash truncates differently before encryption can be enabled on
1148+ the cluster.
1149+ - Any active DEK in the sidecar has ` local_epoch == 0xFFFF `
1150+ (` ErrLocalEpochExhausted ` ). The DEK must be rotated (§5.2)
1151+ before the process can resume; without rotation the next
1152+ ` local_epoch ` would wrap to 0 and re-issue a nonce that has
1153+ already been used under the same DEK.
10611154- The local sidecar's ` raft_envelope_cutover_index ` disagrees
10621155 with the value carried in the most-recent ingested snapshot
10631156 header (` ErrEnvelopeCutoverDivergence ` ). This catches a node
0 commit comments