@@ -169,16 +169,45 @@ Envelope format (single byte stream, stored as the Pebble value):
169169- ` key_id ` — 32-bit identifier of the DEK that produced this
170170 ciphertext. Required so a rotated DEK can decrypt entries written
171171 before the rotation (see §5.2).
172- - ` nonce ` — 12-byte AES-GCM nonce. To stay clear of the NIST SP
173- 800-38D limit on randomly-generated 96-bit nonces (§5.2 derives
174- the budget), elastickv issues nonces from a ** per-DEK 96-bit
175- counter** rather than a fresh ` crypto/rand.Reader ` draw per write.
176- The counter's high 32 bits are a per-process random prefix
177- (re-drawn whenever the DEK is loaded, so two processes that share
178- a DEK never collide), and the low 64 bits are an ` atomic.Uint64 `
179- incremented per write. This makes nonce reuse impossible within
180- the lifetime of a (DEK, process-load) pair and removes the
181- birthday-bound budget entirely.
172+ - `nonce` — 12-byte AES-GCM nonce, structured as **three
173+ deterministic fields** to eliminate the birthday bound entirely.
174+ No bits of the nonce are random; nonce uniqueness is by
175+ construction across `(node, process_load, write)`:
176+
177+ ``` text
178+ +-------------+----------------+----------------+
179+ | node_id(2B) | local_epoch(2B)| write_count(8B)|
180+ +-------------+----------------+----------------+
181+ ```
182+
183+ - ` node_id ` — the 16-bit Raft member ID assigned at cluster
184+ bootstrap. Never reused for a different node within the
185+ cluster's lifetime; verified at process start against the
186+ membership snapshot.
187+ - ` local_epoch ` — 16-bit per-DEK process-load counter,
188+ persisted in the local sidecar (§5.1). Incremented and
189+ fsync'd **before** the new process performs any encryption,
190+ so a crash between increment and the first write still leaves
191+ the on-disk counter ahead of any nonce ever used. Wraps at
192+ 65,536 process restarts per node per DEK lifetime; rotation
193+ triggers (§5.2) reset it to 0 with the new DEK. The
194+ encryption package logs a warning when `local_epoch >
195+ 0xff00` so an operator has a cushion of roughly 255 restarts
196+ in which to rotate before wrap.
197+ - ` write_count ` — 64-bit ` atomic.Uint64 ` , incremented per
198+ write within the current process. Resets to 0 on process
199+ start (the ` local_epoch ` bump is what makes that safe).
200+ Wrap is unreachable in any realistic deployment (2⁶⁴ ≈
201+ 1.8 × 10¹⁹ writes per process-load).
202+
203+ An earlier draft used a 32-bit CSPRNG-drawn prefix for the
204+ high 32 bits of the nonce; that has been retracted because a
205+ random 32-bit value across N process loads carries an N²/2³³
206+ birthday-collision probability (~ 1 in 10⁷ at N=30 over a
207+ multi-year DEK lifetime), and a single nonce collision under
208+ one DEK is a catastrophic AES-GCM failure (authentication-key
209+ recovery plus the XOR of two plaintexts). The deterministic
210+ construction above has zero collision probability under its preconditions.
182211- ` ciphertext ` — AES-256-GCM(plaintext, nonce, AAD = envelope_version
183212 ‖ flag ‖ key_id). AAD binds the ciphertext to the ** entire**
184213 envelope header — including the compression flag — so a
@@ -332,9 +361,11 @@ Two-tier hierarchy:
332361 "active" : { "storage" : 305419896 , "raft" : 2596069104 },
333362 "keys" : {
334363 "305419896" : { "purpose" : " storage" , "wrapped" : " <base64>" ,
335- "created" : " 2026-04-29T10:00:00Z" },
364+ "created" : " 2026-04-29T10:00:00Z" ,
365+ "local_epoch" : 7 },
336366 "2596069104" : { "purpose" : " raft" , "wrapped" : " <base64>" ,
337- "created" : " 2026-04-29T10:00:00Z" }
367+ "created" : " 2026-04-29T10:00:00Z" ,
368+ "local_epoch" : 7 }
338369 }
339370 }
340371 ```
@@ -352,6 +383,35 @@ Two-tier hierarchy:
352383 recent rotation entry that has been persisted into this
353384 sidecar. It is the load-bearing field for the
354385 sidecar/log-index reconciliation protocol in §5.5.
386+ - ` local_epoch ` is the per-DEK process-load counter consumed by
387+ the §4.1 nonce construction. Bumped and durably persisted on
388+ every process start before any encryption happens with that
389+ DEK; reset to 0 when the DEK is created.
390+
391+ **Crash-durable write protocol.** `os.Rename` is atomic for
392+ visibility but not crash-durable on its own — a power loss after
393+ the rename can roll back the file via the file system's metadata
394+ journal, leaving stale wrapped DEKs on disk while the rotation's
395+ Raft entry is already committed. To avoid stranding ciphertext
396+ written under a DEK whose wrap is then lost, the sidecar write
397+ protocol is:
398+
399+ 1. Write the new contents to `<dataDir>/encryption/keys.json.tmp`.
400+ 2. `file.Sync()` on the temp file (fsync the data + metadata).
401+ 3. `os.Rename` to `keys.json`.
402+ 4. `dir.Sync()` on `<dataDir>/encryption/` (fsync the directory
403+ entry so the rename is durable).
404+ 5. Only after step 4 does the FSM acknowledge the rotation entry
405+ as applied (and update `raft_applied_index` in memory; that
406+ value is then persisted on the next sidecar write).
407+
408+ Skipping step 2 or 4 turns a power loss into permanent data loss
409+ for any value written under the new DEK; the §10 self-review
410+ treats sidecar non-durability as a data-loss-class bug. On
411+ filesystems that lack ` dir.Sync() ` semantics (NFS, some FUSE
412+ mounts) the encryption package refuses to start with
413+ ` ErrUnsupportedFilesystem ` rather than silently degrading the
414+ durability guarantee.
355415
356416 The sidecar is ** safe to leak** : every entry in ` keys ` is wrapped
357417 by the KEK. Without the KEK, the file unwraps to nothing.
@@ -666,6 +726,11 @@ elastickv-admin encryption rewrap-deks
666726elastickv-admin encryption rewrite --rate=10MiB/s
667727elastickv-admin encryption retire-dek --key-id=<uint32>
668728elastickv-admin encryption resync-sidecar # §5.5 follower repair
729+ elastickv-admin encryption enable-raft-envelope
730+ # §7.1 Phase 2 cutover;
731+ # refuses unless every
732+ # voting member reports
733+ # encryption_capable
669734elastickv-admin encryption disable # refuses; documents the
670735 # dump-and-reload path
671736elastickv-admin backup verify --backup-dir=...
@@ -719,34 +784,93 @@ envelope) is impossible by construction — the header layout is
719784fixed and the encryption state is read before the value bytes are
720785ever interpreted.
721786
722- #### Rolling enablement
787+ #### Rolling enablement: two phases, not one
788+
789+ The §4.1 storage envelope and the §4.2 raft envelope have very
790+ different rollout constraints:
791+
792+ - **§4.1 (storage envelope)** is per-version, dispatched on the
793+ per-MVCC-version ` encryption_state ` bit. A node that has not yet
794+ been upgraded continues to serve old cleartext versions; only
795+ upgraded nodes write encrypted versions. Mixed-mode is safe.
796+ - **§4.2 (raft envelope)** wraps **every** committed Raft entry's
797+ `Data []byte`. **Every replica must be able to decrypt every
798+ committed entry** because Raft apply is deterministic and runs
799+ on every node — there is no "skip this entry on this node"
800+ escape hatch. If an upgraded leader proposes an encrypted
801+ entry while a non-upgraded follower is still in the cluster,
802+ the follower's ` Apply ` fails for every such entry. The
803+ follower stops making progress, drifts behind on commit index,
804+ and eventually triggers leader-side throttling.
805+
806+ This rules out enabling §4.2 at the same instant as §4.1. The
807+ rollout is therefore explicitly two-phase, controlled by separate
808+ flags / Raft-replicated cluster state:
809+
810+ **Phase 1 — Storage envelope only.**
723811
724- 1 . Operator provisions the KEK in their KMS.
812+ 1 . Operator provisions the KEK in the KMS.
7258132 . Operator restarts each node with ` --encryption-enabled --kekUri=... ` .
726- Restart is rolling; mixed-mode clusters are supported because
727- each MVCC version carries its own ` encryption_state ` bit and
728- reads dispatch on that bit, not on the value bytes.
729- 3 . New writes from the upgraded node are encrypted (`encryption_state
730- = 0b01` ). Old MVCC versions retain ` encryption_state = 0b00` and
731- continue to be returned as cleartext to the storage layer's
732- decryption shim, which short-circuits when the bit is ` 0b00 ` .
733- 4 . Operator runs ` elastickv-admin encryption rewrite ` to walk the
734- MVCC versions and re-encrypt every ` encryption_state = 0b00 `
735- version in place (per §5.4: same-` commit_ts ` rewrite, MVCC
736- history preserved, rate-limited, resumable). The cursor lives at
737- ` !encryption|rewrite|cursor|cleartext ` .
738- 5 . When ` encryption status --verify ` reports zero `encryption_state
739- = 0b00` versions across every node (NOT counting tombstones)
740- AND ` minRetainedTS ` has advanced past the youngest cleartext
741- ` commit_ts ` , the migration is complete and the cleartext code
742- path can be retired in a follow-up release.
814+ Restart is rolling; mixed-mode is safe under §4.1 because each
815+ MVCC version carries its own ` encryption_state ` bit and reads
816+ dispatch on that bit, not on the value bytes.
817+ 3 . Upgraded nodes write encrypted MVCC versions (`encryption_state
818+ = 0b01` ). Old MVCC versions retain ` encryption_state = 0b00`
819+ and continue to be returned as cleartext.
820+ 4 . Raft proposals during Phase 1 carry ** cleartext** ` Data []byte `
821+ — i.e., §4.2 is not yet active. Lookup keys and operation tags
822+ in the WAL are still cleartext during this window.
823+ 5 . Operator runs ` elastickv-admin encryption rewrite ` to convert
824+ all ` encryption_state = 0b00 ` MVCC versions in place
825+ (per §5.4). When ` encryption status --verify ` reports zero
826+ cleartext versions across every node AND ` minRetainedTS ` has
827+ advanced past the youngest cleartext ` commit_ts ` , Phase 1 is
828+ complete.
829+
830+ **Phase 2 — Raft envelope.**
831+
832+ 6 . Operator runs ` elastickv-admin encryption enable-raft-envelope ` .
833+ The admin client RPCs into the leader, which:
834+ - Verifies via the route catalog / membership snapshot that
835+ ** every** voting member of every Raft group has reported
836+ ` encryption_capable = true ` in its periodic heartbeat
837+ metadata (a new field). Refuses to proceed otherwise.
838+ - Proposes a single Raft entry with the cluster-wide flag
839+ `raft_envelope_active = true`. This entry itself is sent
840+ **cleartext** so that any non-upgraded replica can still
841+ apply it (defensive: none should exist after the capability
842+ check above, but cleartext is kept as a safety net).
843+ 7 . From the apply index of that entry onward, every leader
844+ wraps new proposal ` Data []byte ` with the raft DEK (§4.2).
845+ Replicas dispatch on a 1-byte format tag at the start of the
846+ ` Data ` payload — ` 0x00 ` = cleartext (pre-flag entry, found
847+ only in WAL/snapshot history), ` 0x01 ` = raft envelope. The
848+ discriminator lives ** at the start of the application-level
849+ payload** , never as the first byte of an arbitrary user
850+ value, so the §7.1 byte-collision argument does not apply
851+ here: every value the storage layer sees is opaque bytes
852+ from a known framing, not raw user input.
853+ 8 . Snapshots taken during Phase 2 carry the new flag in their
854+ metadata header so a fresh follower joining mid-Phase-2 sees
855+ the right framing for every entry it will ever receive.
856+
857+ #### Why §4.2 cannot be turned off again
858+
859+ Once ` raft_envelope_active = true ` has been committed, the WAL on
860+ every node interleaves cleartext (pre-flag) and raft-envelope
861+ (post-flag) entries. Disabling §4.2 would require rewriting WAL
862+ entries, which etcd raft does not support. The flag is therefore
863+ one-way; the only way back is dump-and-reload (§7.2).
743864
744865#### Compatibility with snapshot streaming during migration
745866
746- A leader streaming a Pebble snapshot to a new follower mid-migration
747- will ship a mix of ` encryption_state = 0b00 ` and ` 0b01 ` versions.
748- Followers ingest both correctly because the MVCC metadata travels
749- with each version. No special handling at the snapshot layer.
867+ A leader streaming a Pebble snapshot to a new follower mid-Phase-1
868+ ships a mix of ` encryption_state = 0b00 ` and ` 0b01 ` MVCC versions;
869+ the receiving follower ingests both correctly because the
870+ metadata travels per-version. A snapshot taken in Phase 2 ships
871+ only encryption-state-` 0b01 ` MVCC versions plus a Phase-2-flagged
872+ header so the receiver knows to expect raft envelopes from there
873+ on.
750874
751875### 7.2 Why we will not support encrypted → cleartext
752876
@@ -866,6 +990,12 @@ The process refuses to start if any of the following hold:
866990 persisted applied index AND the gap covers any rotation entries
867991 (see §5.5: ` ErrSidecarBehindRaftLog ` , recovered via
868992 ` encryption resync-sidecar ` ).
993+ - The sidecar is on a filesystem whose ` dir.Sync() ` does not
994+ guarantee crash durability of the rename (` ErrUnsupportedFilesystem ` ,
995+ see §5.1's crash-durable write protocol).
996+ - The encryption package's startup membership check fails to
997+ resolve a stable 16-bit ` node_id ` (the §4.1 nonce construction
998+ needs one; without it nonce uniqueness cannot be guaranteed).
869999
8701000Each refusal logs a single, unambiguous error pointing at the
8711001relevant flag and runbook section.
@@ -924,10 +1054,20 @@ eventual code change.
9241054 value bytes — see §7.1 on why a leading byte is unsafe), so a
9251055 half-migrated database stays fully readable and a legacy value
9261056 that happens to start with ` 0x01 ` cannot be misclassified as an
927- envelope. The largest risk is a buggy ` compress-then-encrypt `
928- path that mis-frames the compressed payload; mitigation is a
929- round-trip property test in ` store/ ` using ` pgregory.net/rapid `
930- over arbitrary byte slices.
1057+ envelope. Two specific data-loss-class failure modes are
1058+ addressed by hard preconditions rather than by recovery code:
1059+ (a) sidecar non-durability — the §5.1 write protocol fsyncs
1060+ the file *and* the parent directory before the rotation is
1061+ acknowledged, so a power loss cannot strand ciphertext under a
1062+ wrap that has rolled back; (b) AES-GCM nonce reuse — the §4.1
1063+ nonce is built from ` node_id ‖ local_epoch ‖ write_count ` ,
1064+ each field deterministic, the epoch persisted-and-fsynced
1065+ before any encryption, so even a crash-restart loop cannot
1066+ cause two writes under one DEK to share a nonce. The remaining
1067+ open risk is a buggy ` compress-then-encrypt ` path that
1068+ mis-frames the compressed payload; mitigation is a round-trip
1069+ property test in ` store/ ` using ` pgregory.net/rapid ` over
1070+ arbitrary byte slices.
9311071
93210722 . ** Concurrency / distributed failures.** DEK rotation goes through
9331073 Raft so every replica observes the new DEK at the same log index.
@@ -941,7 +1081,14 @@ eventual code change.
9411081 unaffected because the bytes are already ciphertext; the
9421082 receiving node's keystore must contain the relevant DEK before
9431083 it can read the ingested data, which is guaranteed by the Raft
944- ordering of the rotation entry.
1084+ ordering of the rotation entry. The two-phase rollout in §7.1 is
1085+ the load-bearing piece for "node not yet upgraded" cases:
1086+ Phase 1 keeps Raft proposals cleartext so a non-upgraded
1087+ follower can still apply, and Phase 2 only flips on after a
1088+ membership-snapshot check confirms every voting member is
1089+ encryption-capable. Skipping the membership check would let
1090+ one upgraded leader produce raft envelopes that lock every
1091+ non-upgraded follower out of apply.
9451092
94610933 . ** Performance.** AES-NI puts encryption CPU below the existing
9471094 FSM apply CPU. The compress pass recovers most of Pebble's lost