@@ -358,6 +358,8 @@ Two-tier hierarchy:
 {
   "version": 1,
   "raft_applied_index": 184273,
+  "storage_envelope_active": true,
+  "raft_envelope_cutover_index": 184201,
   "active": { "storage": 305419896, "raft": 2596069104 },
   "keys": {
     "305419896": { "purpose": "storage", "wrapped": "<base64>",
@@ -370,6 +372,15 @@ Two-tier hierarchy:
 }
 ```
 
+- `storage_envelope_active` mirrors the §7.1 Phase 1 cluster
+  flag once the FSM has seen the `enable-storage-envelope`
+  entry.
+- `raft_envelope_cutover_index` records the apply index at
+  which §4.2 became active for this cluster. `0` means
+  Phase 2 has not started; any non-zero value is the
+  dispatch boundary used by FSM apply and WAL replay to
+  decide which Raft entries are envelopes.
+
 - **`key_id` is a 32-bit unsigned integer**, the same value that
   appears in the §4.1 envelope `key_id` field. It is generated by
   a CSPRNG draw on the leader at the moment a new DEK is created;
@@ -581,16 +592,32 @@ To detect and repair this:
    sidecar's index AND the gap covers any rotation entries, the
    node refuses to start with `ErrSidecarBehindRaftLog`,
    pointing at the operator runbook.
-3. Recovery is **automatic on the leader** (it re-proposes a
-   "rewrap current active DEK" entry that brings every node's
-   sidecar back in sync) and **manual on a stuck follower**
-   (operator runs `elastickv-admin encryption resync-sidecar`
-   which replays the rotation entries from the Raft log into the
-   local sidecar, then exits). Refusing to start is deliberate:
-   silently serving with a stale sidecar would let the node
-   write under an old `key_id` while peers write under a new one,
-   which is exactly the split-brain key state DEK rotation is
-   designed to prevent.
+3. Recovery rewraps **every unretired DEK**, not just the active
+   one. A node that missed multiple rotations needs *every*
+   intermediate `key_id` to decrypt MVCC history that still
+   carries those IDs (per §5.4 retirement criterion #4: a DEK
+   stays loaded until both the rewrite cursor has passed the
+   keyspace AND `minRetainedTS` has advanced past the youngest
+   `commit_ts` written under that DEK). Recovery comes in two
+   shapes:
+   - **Automatic on the leader** — the leader's keystore already
+     holds every unretired DEK in memory. It re-proposes a
+     `rewrap-deks` entry that serialises **all** unretired
+     wrapped DEKs (active + retiring), bringing every node's
+     sidecar to the full set in one apply.
+   - **Manual on a stuck follower** — operator runs
+     `elastickv-admin encryption resync-sidecar`. The command
+     replays *all* rotation and `rewrap-deks` entries between
+     the sidecar's `raft_applied_index` and the FSM's applied
+     index into the local sidecar (not just the most recent
+     one), then exits. Replaying only the active DEK would
+     leave intermediate `key_id`s missing and silently break
+     historical reads on that node — explicitly rejected.
+   Refusing to start until recovery completes is deliberate:
+   silently serving with an incomplete sidecar would let the
+   node write under one `key_id` while failing to decrypt
+   historical reads under another, which is exactly the
+   split-brain key state DEK rotation is designed to prevent.
 4. The reverse case — sidecar ahead of the raft log — cannot
    happen because the sidecar is only written from inside an FSM
    apply; an apply implies the entry is already in the Raft log.
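The follower-repair rule in item 3 (replay every key-bearing entry in the gap, not only the latest one) can be sketched as follows. This is a minimal illustration, not the project's implementation: `EntryKind`, `LogEntry`, `Sidecar`, and `Resync` are hypothetical names, and the `key_id -> wrapped DEK` map is an assumed shape for what rotation and `rewrap-deks` entries carry.

```go
package main

import "fmt"

// EntryKind is a hypothetical tag; in this sketch only rotation and
// rewrap-deks entries carry wrapped DEKs.
type EntryKind int

const (
	KindOther EntryKind = iota
	KindRotation
	KindRewrapDEKs
)

// LogEntry is a simplified, assumed view of a committed Raft entry.
type LogEntry struct {
	Index   uint64
	Kind    EntryKind
	Wrapped map[uint32]string // key_id -> wrapped DEK (assumed shape)
}

// Sidecar holds only the fields this sketch needs.
type Sidecar struct {
	RaftAppliedIndex uint64
	Keys             map[uint32]string
}

// Resync replays every key-bearing entry whose index lies in
// (sc.RaftAppliedIndex, fsmApplied] into the sidecar and returns the
// number of entries replayed. Intermediate key_ids are kept, so
// history written under them stays decryptable.
func Resync(sc *Sidecar, log []LogEntry, fsmApplied uint64) int {
	from := sc.RaftAppliedIndex
	replayed := 0
	for _, e := range log {
		if e.Index <= from || e.Index > fsmApplied {
			continue // outside the gap being repaired
		}
		if e.Kind != KindRotation && e.Kind != KindRewrapDEKs {
			continue // entry does not touch the key set
		}
		for id, wrapped := range e.Wrapped {
			sc.Keys[id] = wrapped
		}
		replayed++
	}
	if fsmApplied > from {
		sc.RaftAppliedIndex = fsmApplied
	}
	return replayed
}

func main() {
	sc := &Sidecar{RaftAppliedIndex: 10, Keys: map[uint32]string{1: "w1"}}
	log := []LogEntry{
		{Index: 11, Kind: KindRotation, Wrapped: map[uint32]string{2: "w2"}},
		{Index: 12, Kind: KindOther},
		{Index: 13, Kind: KindRotation, Wrapped: map[uint32]string{3: "w3"}},
	}
	n := Resync(sc, log, 13)
	fmt.Println(n, len(sc.Keys), sc.RaftAppliedIndex) // prints: 2 3 13
}
```

After the call the sidecar holds `key_id`s 1, 2, and 3; a variant that applied only the entry at index 13 would leave `key_id` 2 missing, which is the failure mode item 3 rejects.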
@@ -726,11 +753,16 @@ elastickv-admin encryption rewrap-deks
 elastickv-admin encryption rewrite --rate=10MiB/s
 elastickv-admin encryption retire-dek --key-id=<uint32>
 elastickv-admin encryption resync-sidecar     # §5.5 follower repair
-elastickv-admin encryption enable-raft-envelope
-                                              # §7.1 Phase 2 cutover;
+elastickv-admin encryption enable-storage-envelope
+                                              # §7.1 Phase 1 cutover;
                                               # refuses unless every
                                               # voting member reports
                                               # encryption_capable
+elastickv-admin encryption enable-raft-envelope
+                                              # §7.1 Phase 2 cutover;
+                                              # same capability gate;
+                                              # records cutover index
+                                              # in the sidecar
 elastickv-admin encryption disable            # refuses; documents the
                                               # dump-and-reload path
 elastickv-admin backup verify --backup-dir=...
@@ -784,75 +816,105 @@ envelope) is impossible by construction — the header layout is
 fixed and the encryption state is read before the value bytes are
 ever interpreted.
 
-#### Rolling enablement: two phases, not one
-
-The §4.1 storage envelope and the §4.2 raft envelope have very
-different rollout constraints:
-
-- **§4.1 (storage envelope)** is per-version, dispatched on the
-  per-MVCC-version `encryption_state` bit. A node that has not yet
-  been upgraded continues to serve old cleartext versions; only
-  upgraded nodes write encrypted versions. Mixed-mode is safe.
-- **§4.2 (raft envelope)** wraps **every** committed Raft entry's
-  `Data []byte`. **Every replica must be able to decrypt every
-  committed entry** because Raft apply is deterministic and runs
-  on every node — there is no "skip this entry on this node"
-  escape hatch. If an upgraded leader proposes an encrypted
-  entry while a non-upgraded follower is still in the cluster,
-  the follower's `Apply` fails for every such entry. The
-  follower stops making progress, drifts behind on commit index,
-  and eventually triggers leader-side throttling.
-
-This rules out enabling §4.2 at the same instant as §4.1. The
-rollout is therefore explicitly two-phase, controlled by separate
-flags / Raft-replicated cluster state:
-
-**Phase 1 — Storage envelope only.**
-
-1. Operator provisions the KEK in the KMS.
-2. Operator restarts each node with `--encryption-enabled --kekUri=...`.
-   Restart is rolling; mixed-mode is safe under §4.1 because each
-   MVCC version carries its own `encryption_state` bit and reads
-   dispatch on that bit, not on the value bytes.
-3. Upgraded nodes write encrypted MVCC versions (`encryption_state
-   = 0b01`). Old MVCC versions retain `encryption_state = 0b00`
-   and continue to be returned as cleartext.
-4. Raft proposals during Phase 1 carry **cleartext** `Data []byte`
-   — i.e., §4.2 is not yet active. Lookup keys and operation tags
-   in the WAL are still cleartext during this window.
-5. Operator runs `elastickv-admin encryption rewrite` to convert
-   all `encryption_state = 0b00` MVCC versions in place
-   (per §5.4). When `encryption status --verify` reports zero
-   cleartext versions across every node AND `minRetainedTS` has
-   advanced past the youngest cleartext `commit_ts`, Phase 1 is
-   complete.
-
-**Phase 2 — Raft envelope.**
-
-6. Operator runs `elastickv-admin encryption enable-raft-envelope`.
-   The admin client RPCs into the leader, which:
-   - Verifies via the route catalog / membership snapshot that
-     **every** voting member of every Raft group has reported
-     `encryption_capable = true` in its periodic heartbeat
-     metadata (a new field). Refuses to proceed otherwise.
-   - Proposes a single Raft entry with the cluster-wide flag
-     `raft_envelope_active = true`. This entry itself is sent
-     **cleartext** so non-upgraded replicas (defensive: should
-     not exist after step 6's check, but treated as a safety
-     net) can still apply it.
-7. From the apply index of that entry onward, every leader
+#### Rolling enablement: three phases, not one
+
+Both the §4.1 storage envelope and the §4.2 raft envelope create
+mixed-binary safety problems that have to be ruled out **before**
+any encrypted byte hits disk. The rollout is therefore explicitly
+three-phase:
+
+**Why an "obvious" rolling restart with `--encryption-enabled` is
+unsafe.** A naive plan — flip the flag node-by-node and rely on
+the per-version `encryption_state` bit — is broken at two layers:
+
+- *At the storage layer:* the `lsm_store.go` value header today is
+  9 bytes `[tombstone(1)] [expireAt(8)]`, and `decodeValue` reads
+  the first byte as `tombstone := data[0] != 0`. The §6.2 plan
+  packs `encryption_state` into bits 1–2 of that same byte. An
+  encrypted live value (`tombstone=0`, `encryption_state=0b01`)
+  becomes `data[0] = 0b00000010 = 2`. A node still on the old
+  binary that ingests this value via Raft replication or snapshot
+  catch-up reads `data[0] != 0` → "tombstone" → drops the value
+  from reads. Silent data loss, not a clean refusal.
+- *At the Raft layer:* §4.2 wraps the entire `Data []byte` per
+  Raft entry. Apply is deterministic across replicas; a
+  non-upgraded follower without the KEK fails every
+  raft-envelope entry on apply, drifts behind, and is eventually
+  removed.
+
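The storage-layer collision above can be reproduced in a few lines. The sketch below is illustrative only: it assumes the tombstone flag sits in bit 0 (consistent with `data[0] = 0b00000010` for an encrypted live value), and the helper names `legacyIsTombstone`, `packFlags`, `maskedIsTombstone`, and `readEncState` are hypothetical rather than taken from `lsm_store.go`.

```go
package main

import "fmt"

const (
	tombstoneBit  = 0b0000_0001 // bit 0: assumed position of the tombstone flag
	encStateShift = 1           // encryption_state occupies bits 1-2
	encStateMask  = 0b0000_0110
)

// legacyIsTombstone mirrors the old check quoted in the text:
// any non-zero first byte is treated as a tombstone.
func legacyIsTombstone(data []byte) bool { return data[0] != 0 }

// packFlags builds the first header byte under the assumed new layout.
func packFlags(tombstone bool, encState uint8) byte {
	var b byte
	if tombstone {
		b |= tombstoneBit
	}
	b |= (encState << encStateShift) & encStateMask
	return b
}

// maskedIsTombstone reads only bit 0, which is what an
// encryption-capable binary would need to do.
func maskedIsTombstone(data []byte) bool { return data[0]&tombstoneBit != 0 }

// readEncState extracts the two encryption_state bits.
func readEncState(data []byte) uint8 {
	return (data[0] & encStateMask) >> encStateShift
}

func main() {
	header := make([]byte, 9) // [flags(1)][expireAt(8)]
	header[0] = packFlags(false, 0b01)
	fmt.Println(header[0], legacyIsTombstone(header),
		maskedIsTombstone(header), readEncState(header))
	// prints: 2 true false 1
}
```

The same byte is a live encrypted value to the masked reader and a tombstone to the legacy reader, which is why the text calls the failure silent data loss rather than a clean refusal.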
+Both problems share one fix: **gate the encryption-active state on
+a Raft-replicated cluster flag, not on a per-node startup flag.**
+The cluster-flag flip happens only after a membership-snapshot
+check confirms every voting member is running an
+encryption-capable binary. The per-node `--encryption-enabled`
+flag becomes a *capability* assertion, not a *behaviour* trigger.
+
+**Phase 0 — Capability rollout.**
+
+0. Operator provisions the KEK in the KMS.
+1. Operator restarts each node with `--encryption-enabled
+   --kekUri=...`. Rolling restart is safe in this phase because
+   the new binary still writes cleartext values (storage
+   envelope is gated on the cluster flag below) and proposes
+   cleartext Raft entries (raft envelope is gated on a separate
+   flag in Phase 2).
+2. Each upgraded node advertises `encryption_capable = true` in
+   its periodic Raft heartbeat metadata (a new field on the
+   peer-metadata payload added by `peer_metadata.go`). The flag
+   is purely informational at this point.
+3. When the membership snapshot of every Raft group shows
+   `encryption_capable = true` for every voting member, Phase 0
+   is complete. `encryption status` reports the capability
+   coverage so the operator can wait for it before moving on.
+
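The Phase 0 completion test in step 3 amounts to a scan over the membership snapshot of every Raft group. A minimal sketch, assuming a hypothetical `Member` record and a `MissingCapability` helper (neither name comes from the source), and treating non-voting members as outside the gate because the text only requires voting members:

```go
package main

import (
	"fmt"
	"sort"
)

// Member is a hypothetical view of one entry in a group's
// membership snapshot.
type Member struct {
	ID                string
	Voter             bool
	EncryptionCapable bool
}

// MissingCapability returns the sorted, de-duplicated IDs of voting
// members that have not reported encryption_capable = true. An empty
// result means the gate passes.
func MissingCapability(groups [][]Member) []string {
	seen := map[string]bool{}
	for _, members := range groups {
		for _, m := range members {
			if m.Voter && !m.EncryptionCapable {
				seen[m.ID] = true
			}
		}
	}
	missing := make([]string, 0, len(seen))
	for id := range seen {
		missing = append(missing, id)
	}
	sort.Strings(missing)
	return missing
}

func main() {
	groups := [][]Member{
		{
			{ID: "n1", Voter: true, EncryptionCapable: true},
			{ID: "n2", Voter: true, EncryptionCapable: false},
		},
		{
			{ID: "n3", Voter: false, EncryptionCapable: false},
		},
	}
	missing := MissingCapability(groups)
	fmt.Println(len(missing) == 0, missing) // prints: false [n2]
}
```

Returning the offending IDs rather than a bare boolean lets a status command show which nodes still block the flip.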
+**Phase 1 — Storage envelope cluster-flag flip.**
+
+4. Operator runs `elastickv-admin encryption enable-storage-envelope`.
+   The leader rechecks the membership-snapshot capability gate
+   from step 3 and then proposes a single Raft entry with the
+   cluster-wide flag `storage_envelope_active = true`. This
+   entry's `Data []byte` is the **legacy framing** (i.e., the
+   existing `kv/fsm.go` first-byte tag space; see Phase 2's note
+   below) so every replica — even one that somehow missed the
+   capability advert — can still decode and apply it.
+5. From the apply index of that flag entry onward, every storage
+   layer Put writes `encryption_state = 0b01` and the §4.1
+   envelope. MVCC versions written before that index keep
+   `encryption_state = 0b00` (cleartext) and are read back through
+   the dispatch path; mixing within a single key is safe per §5.4.
+6. Operator runs `elastickv-admin encryption rewrite` to convert
+   the remaining cleartext MVCC versions in place (per §5.4).
+   Phase 1 is complete when `encryption status --verify` reports
+   zero `encryption_state = 0b00` versions across every node
+   (excluding tombstones) AND `minRetainedTS` has advanced past
+   the youngest cleartext `commit_ts`.
+
+**Phase 2 — Raft envelope cluster-flag flip.**
+
+7. Operator runs `elastickv-admin encryption enable-raft-envelope`.
+   The leader rechecks the same capability gate, then proposes a
+   single Raft entry with the cluster-wide flag
+   `raft_envelope_active = true`. This entry itself uses the
+   legacy first-byte tag space so it remains decodable by any
+   replica that has applied entries up to it.
+8. From the apply index of that flag entry onward, every leader
    wraps new proposal `Data []byte` with the raft DEK (§4.2).
-   Replicas dispatch on a 1-byte format tag at the start of the
-   `Data` payload — `0x00` = cleartext (pre-flag entry, found
-   only in WAL/snapshot history), `0x01` = raft envelope. The
-   discriminator lives **at the start of the application-level
-   payload**, never as the first byte of an arbitrary user
-   value, so the §7.1 byte-collision argument does not apply
-   here: every value the storage layer sees is opaque bytes
-   from a known framing, not raw user input.
-8. Snapshots taken during Phase 2 carry the new flag in their
-   metadata header so a fresh follower joining mid-Phase-2 sees
-   the right framing for every entry it will ever receive.
+   **There is no in-band format tag in the proposal payload.**
+   Replicas dispatch on the **Raft log index** of the entry
+   relative to the persisted `raft_envelope_cutover_index` (the
+   apply index of the flag entry, recorded in the local sidecar
+   on apply). Entries below the cutover index are decoded via
+   the existing first-byte tag space (`0x00` single-request,
+   `0x01` batch, `0x02` HLC-lease, etc.); entries at-or-above
+   the cutover index are unwrapped through the raft envelope
+   first. This explicitly avoids re-using `kv/fsm.go`'s existing
+   first-byte tag values for the "is this an envelope?"
+   discriminator — that re-use would make pre-cutover batch
+   entries (`0x01`) ambiguous during WAL replay.
+9. Snapshots taken during Phase 2 carry the cutover index in
+   their metadata header so a fresh follower joining
+   mid-Phase-2 reconstructs the same dispatch boundary on
+   ingest.
 
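The index-based dispatch in step 8 can be sketched as a pure function of the entry's log index and the persisted cutover index. `Framing` and `FramingFor` are hypothetical names. The sketch follows the "at-or-above" wording literally; the text also says the flag entry itself uses legacy framing, so an implementation may need to treat the entry at exactly the cutover index specially, which this excerpt does not spell out.

```go
package main

import "fmt"

// Framing says how an entry's Data payload should be decoded.
type Framing int

const (
	FramingLegacy   Framing = iota // existing first-byte tag space
	FramingEnvelope                // unwrap the raft envelope first
)

// FramingFor picks the framing from the log index alone; it never
// inspects payload bytes. A cutoverIndex of 0 means Phase 2 has not
// started, so every entry is legacy.
func FramingFor(entryIndex, cutoverIndex uint64) Framing {
	if cutoverIndex == 0 || entryIndex < cutoverIndex {
		return FramingLegacy
	}
	return FramingEnvelope
}

func main() {
	const cutover = 184201 // value from the sidecar example
	for _, idx := range []uint64{184200, 184201, 184273} {
		fmt.Println(idx, FramingFor(idx, cutover) == FramingEnvelope)
	}
	// With no cutover recorded, nothing is treated as an envelope.
	fmt.Println(FramingFor(184273, 0) == FramingEnvelope)
}
```

Because the decision never looks at `Data[0]`, a pre-cutover batch entry whose first byte is `0x01` cannot be mistaken for an envelope during WAL replay.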
 #### Why §4.2 cannot be turned off again
 
@@ -996,6 +1058,12 @@ The process refuses to start if any of the following hold:
 - The encryption package's startup membership check fails to
   resolve a stable 16-bit `node_id` (the §4.1 nonce construction
   needs one; without it nonce uniqueness cannot be guaranteed).
+- The local sidecar's `raft_envelope_cutover_index` disagrees
+  with the value carried in the most-recent ingested snapshot
+  header (`ErrEnvelopeCutoverDivergence`). This catches a node
+  that joined mid-Phase-2 from a snapshot taken before Phase 2
+  was enabled, then later replayed Raft entries that crossed
+  the cutover; resolved by `encryption resync-sidecar`.
 
 Each refusal logs a single, unambiguous error pointing at the
 relevant flag and runbook section.
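The cutover-divergence refusal added above reduces to comparing two integers at startup. A minimal sketch: the error name `ErrEnvelopeCutoverDivergence` comes from the text, while the `CheckCutover` function, the error message, and the use of a nil pointer to mean "no snapshot ingested yet" are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrEnvelopeCutoverDivergence is named in the text; the message
// string here is illustrative.
var ErrEnvelopeCutoverDivergence = errors.New(
	"encryption: sidecar cutover index disagrees with snapshot header")

// CheckCutover compares the sidecar's raft_envelope_cutover_index
// with the value from the most recently ingested snapshot header.
// A nil snapshotCutover means no snapshot has been ingested, in
// which case there is nothing to compare.
func CheckCutover(sidecarCutover uint64, snapshotCutover *uint64) error {
	if snapshotCutover == nil {
		return nil
	}
	if *snapshotCutover != sidecarCutover {
		return fmt.Errorf("%w: sidecar=%d snapshot=%d",
			ErrEnvelopeCutoverDivergence, sidecarCutover, *snapshotCutover)
	}
	return nil
}

func main() {
	snap := uint64(0) // snapshot taken before Phase 2 was enabled
	err := CheckCutover(184201, &snap)
	fmt.Println(errors.Is(err, ErrEnvelopeCutoverDivergence)) // prints: true
}
```

Wrapping with `%w` keeps the sentinel matchable through `errors.Is` while still logging both values.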
@@ -1081,14 +1149,23 @@ eventual code change.
    unaffected because the bytes are already ciphertext; the
    receiving node's keystore must contain the relevant DEK before
    it can read the ingested data, which is guaranteed by the Raft
-   ordering of the rotation entry. The two-phase rollout in §7.1 is
-   the load-bearing piece for "node not yet upgraded" cases:
-   Phase 1 keeps Raft proposals cleartext so a non-upgraded
-   follower can still apply, and Phase 2 only flips on after a
-   membership-snapshot check confirms every voting member is
-   encryption-capable. Skipping the membership check would let
-   one upgraded leader produce raft envelopes that lock every
-   non-upgraded follower out of apply.
+   ordering of the rotation entry. The three-phase rollout in §7.1
+   is the load-bearing piece for "node not yet upgraded" cases:
+   Phase 0 (capability rollout) gets every node onto the new
+   binary while still writing cleartext so the storage-layer
+   `data[0] != 0` tombstone test on old code can never see a
+   value with `encryption_state` bits set; Phase 1 (storage
+   cluster-flag) only flips after a membership-snapshot check
+   confirms every voting member is `encryption_capable`; Phase 2
+   (raft cluster-flag) reuses the same gate. The membership
+   check is **not** advisory — skipping it would let one
+   upgraded leader produce raft envelopes that lock every
+   non-upgraded follower out of apply, or produce
+   `encryption_state = 0b01` values that an old follower
+   silently drops as tombstones. Sidecar recovery in §5.5 is
+   correspondingly load-bearing for the "missed multiple
+   rotations" case: replaying only the active DEK would strand
+   intermediate `key_id`s and silently break historical reads.
 
 3. **Performance.** AES-NI puts encryption CPU below the existing
    FSM apply CPU. The compress pass recovers most of Pebble's lost