Commit f34d9b7

docs(encryption): address PR707 round-4 P1x3 (3-phase rollout, full-DEK resync, index-gated framing)
P1 (codex on 1b8613f line 645): old-binary tombstone test data[0] != 0 reads encrypted live values (encryption_state in bits 1-2 of byte 0) as tombstones and silently drops them. Convert §7.1 from 2-phase to 3-phase: Phase 0 = capability rollout (new binary writes cleartext, advertises encryption_capable in heartbeats); Phase 1 = enable-storage-envelope cluster-flag flip after membership-snapshot capability check; Phase 2 = enable-raft-envelope under the same gate. Per-node --encryption-enabled is now a capability assertion, not a behaviour trigger.

P1 (codex on 1b8613f line 586): §5.5 recovery rewrapped only the active DEK, leaving intermediate unretired DEKs missing on a node that missed multiple rotations and silently breaking historical reads. Recovery now rewraps every unretired DEK (active + retiring) in one apply; resync-sidecar replays all rotation/rewrap-deks entries, not just the most recent.

P1 (codex on 1b8613f line 847): Phase 2 first-byte framing tag (0x00 cleartext / 0x01 envelope) collided with kv/fsm.go existing tag space (0x00 single, 0x01 batch, 0x02 HLC-lease) and would misroute pre-cutover batch entries during WAL replay. Replace with raft-log-index gating: sidecar persists raft_envelope_cutover_index; entries below it use the legacy first-byte tags, entries at-or-above unwrap through the raft envelope. Snapshot headers carry the cutover index so a follower joining mid-Phase-2 reconstructs the same boundary.

Sidecar example expanded with storage_envelope_active and raft_envelope_cutover_index. §9.1 startup refusal list adds ErrEnvelopeCutoverDivergence. §10 self-review lens 2 updated to call out the old-binary tombstone path and the full-DEK resync requirement as load-bearing. §6.6 admin gains enable-storage-envelope.
1 parent 1b8613f commit f34d9b7

1 file changed

Lines changed: 165 additions & 88 deletions

docs/design/2026_04_29_proposed_data_at_rest_encryption.md

Original file line number | Diff line number | Diff line change
@@ -358,6 +358,8 @@ Two-tier hierarchy:
358358
{
359359
"version": 1,
360360
"raft_applied_index": 184273,
361+
"storage_envelope_active": true,
362+
"raft_envelope_cutover_index": 184201,
361363
"active": { "storage": 305419896, "raft": 2596069104 },
362364
"keys": {
363365
"305419896": { "purpose": "storage", "wrapped": "<base64>",
@@ -370,6 +372,15 @@ Two-tier hierarchy:
370372
}
371373
```
372374

375+
- `storage_envelope_active` mirrors the §7.1 Phase 1 cluster
376+
flag once the FSM has seen the `enable-storage-envelope`
377+
entry.
378+
- `raft_envelope_cutover_index` records the apply index at
379+
which §4.2 became active for this cluster. `0` means
380+
Phase 2 has not started; any non-zero value is the
381+
dispatch boundary used by FSM apply and WAL replay to
382+
decide which Raft entries are envelopes.
383+
373384
- **`key_id` is a 32-bit unsigned integer**, the same value that
374385
appears in the §4.1 envelope `key_id` field. It is generated by
375386
a CSPRNG draw on the leader at the moment a new DEK is created;
@@ -581,16 +592,32 @@ To detect and repair this:
581592
sidecar's index AND the gap covers any rotation entries, the
582593
node refuses to start with `ErrSidecarBehindRaftLog`,
583594
pointing at the operator runbook.
584-
3. Recovery is **automatic on the leader** (it re-proposes a
585-
"rewrap current active DEK" entry that brings every node's
586-
sidecar back in sync) and **manual on a stuck follower**
587-
(operator runs `elastickv-admin encryption resync-sidecar`
588-
which replays the rotation entries from the Raft log into the
589-
local sidecar, then exits). Refusing to start is deliberate:
590-
silently serving with a stale sidecar would let the node
591-
write under an old `key_id` while peers write under a new one,
592-
which is exactly the split-brain key state DEK rotation is
593-
designed to prevent.
595+
3. Recovery rewraps **every unretired DEK**, not just the active
596+
one. A node that missed multiple rotations needs *every*
597+
intermediate `key_id` to decrypt MVCC history that still
598+
carries those IDs (per §5.4 retirement criterion #4: a DEK
599+
stays loaded until both the rewrite cursor has passed the
600+
keyspace AND `minRetainedTS` has advanced past the youngest
601+
`commit_ts` written under that DEK). Recovery comes in two
602+
shapes:
603+
- **Automatic on the leader** — the leader's keystore already
604+
holds every unretired DEK in memory. It re-proposes a
605+
`rewrap-deks` entry that serialises **all** unretired
606+
wrapped DEKs (active + retiring), bringing every node's
607+
sidecar to the full set in one apply.
608+
- **Manual on a stuck follower** — operator runs
609+
`elastickv-admin encryption resync-sidecar`. The command
610+
replays *all* rotation and `rewrap-deks` entries between
611+
the sidecar's `raft_applied_index` and the FSM's applied
612+
index into the local sidecar (not just the most recent
613+
one), then exits. Replaying only the active DEK would
614+
leave intermediate `key_id`s missing and silently break
615+
historical reads on that node — explicitly rejected.
616+
Refusing to start until recovery completes is deliberate:
617+
silently serving with an incomplete sidecar would let the
618+
node write under one `key_id` while failing to decrypt
619+
historical reads under another, which is exactly the
620+
split-brain key state DEK rotation is designed to prevent.
594621
4. The reverse case — sidecar ahead of the raft log — cannot
595622
happen because the sidecar is only written from inside an FSM
596623
apply; an apply implies the entry is already in the Raft log.
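
The follower-side replay in item 3 can be sketched as follows. This is a minimal illustration, not the elastickv implementation: `EntryKind`, `LogEntry`, `Sidecar`, and `Resync` are hypothetical names, the log is assumed to be already decoded and in ascending index order, and advancing `RaftAppliedIndex` to the FSM's applied index after the replay is an assumption of the sketch.

```go
package main

import "fmt"

// EntryKind is a hypothetical tag for the Raft entries relevant to resync.
type EntryKind int

const (
	KindOther      EntryKind = iota
	KindRotation             // introduces a new wrapped DEK
	KindRewrapDEKs           // carries every unretired wrapped DEK
)

// LogEntry is a hypothetical, already-decoded view of one Raft log entry.
type LogEntry struct {
	Index   uint64
	Kind    EntryKind
	Wrapped map[uint32]string // key_id -> wrapped DEK (base64)
}

// Sidecar holds only the fields this sketch needs.
type Sidecar struct {
	RaftAppliedIndex uint64
	Keys             map[uint32]string
}

// Resync replays every rotation and rewrap-deks entry in the range
// (sidecar applied index, fsmApplied], not just the most recent one,
// so intermediate key_ids are not left missing.
func Resync(sc *Sidecar, entries []LogEntry, fsmApplied uint64) {
	from := sc.RaftAppliedIndex
	for _, e := range entries {
		if e.Index <= from || e.Index > fsmApplied {
			continue // outside the gap being repaired
		}
		if e.Kind != KindRotation && e.Kind != KindRewrapDEKs {
			continue // not a key-management entry
		}
		for id, wrapped := range e.Wrapped {
			sc.Keys[id] = wrapped
		}
	}
	if fsmApplied > from {
		sc.RaftAppliedIndex = fsmApplied // assumption: sidecar catches up to the FSM
	}
}

func main() {
	sc := &Sidecar{RaftAppliedIndex: 10, Keys: map[uint32]string{1: "w1"}}
	entries := []LogEntry{
		{Index: 12, Kind: KindRotation, Wrapped: map[uint32]string{2: "w2"}},
		{Index: 14, Kind: KindOther},
		{Index: 15, Kind: KindRotation, Wrapped: map[uint32]string{3: "w3"}},
	}
	Resync(sc, entries, 16)
	fmt.Println(len(sc.Keys), sc.RaftAppliedIndex)
}
```

In this example the node missed two rotations; after the replay the sidecar holds `key_id`s 1, 2, and 3, whereas replaying only the latest entry would have skipped `key_id` 2.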
@@ -726,11 +753,16 @@ elastickv-admin encryption rewrap-deks
726753
elastickv-admin encryption rewrite --rate=10MiB/s
727754
elastickv-admin encryption retire-dek --key-id=<uint32>
728755
elastickv-admin encryption resync-sidecar # §5.5 follower repair
729-
elastickv-admin encryption enable-raft-envelope
730-
# §7.1 Phase 2 cutover;
756+
elastickv-admin encryption enable-storage-envelope
757+
# §7.1 Phase 1 cutover;
731758
# refuses unless every
732759
# voting member reports
733760
# encryption_capable
761+
elastickv-admin encryption enable-raft-envelope
762+
# §7.1 Phase 2 cutover;
763+
# same capability gate;
764+
# records cutover index
765+
# in the sidecar
734766
elastickv-admin encryption disable # refuses; documents the
735767
# dump-and-reload path
736768
elastickv-admin backup verify --backup-dir=...
@@ -784,75 +816,105 @@ envelope) is impossible by construction — the header layout is
784816
fixed and the encryption state is read before the value bytes are
785817
ever interpreted.
786818

787-
#### Rolling enablement: two phases, not one
788-
789-
The §4.1 storage envelope and the §4.2 raft envelope have very
790-
different rollout constraints:
791-
792-
- **§4.1 (storage envelope)** is per-version, dispatched on the
793-
per-MVCC-version `encryption_state` bit. A node that has not yet
794-
been upgraded continues to serve old cleartext versions; only
795-
upgraded nodes write encrypted versions. Mixed-mode is safe.
796-
- **§4.2 (raft envelope)** wraps **every** committed Raft entry's
797-
`Data []byte`. **Every replica must be able to decrypt every
798-
committed entry** because Raft apply is deterministic and runs
799-
on every node — there is no "skip this entry on this node"
800-
escape hatch. If an upgraded leader proposes an encrypted
801-
entry while a non-upgraded follower is still in the cluster,
802-
the follower's `Apply` fails for every such entry. The
803-
follower stops making progress, drifts behind on commit index,
804-
and eventually triggers leader-side throttling.
805-
806-
This rules out enabling §4.2 at the same instant as §4.1. The
807-
rollout is therefore explicitly two-phase, controlled by separate
808-
flags / Raft-replicated cluster state:
809-
810-
**Phase 1 — Storage envelope only.**
811-
812-
1. Operator provisions the KEK in the KMS.
813-
2. Operator restarts each node with `--encryption-enabled --kekUri=...`.
814-
Restart is rolling; mixed-mode is safe under §4.1 because each
815-
MVCC version carries its own `encryption_state` bit and reads
816-
dispatch on that bit, not on the value bytes.
817-
3. Upgraded nodes write encrypted MVCC versions (`encryption_state
818-
= 0b01`). Old MVCC versions retain `encryption_state = 0b00`
819-
and continue to be returned as cleartext.
820-
4. Raft proposals during Phase 1 carry **cleartext** `Data []byte`
821-
— i.e., §4.2 is not yet active. Lookup keys and operation tags
822-
in the WAL are still cleartext during this window.
823-
5. Operator runs `elastickv-admin encryption rewrite` to convert
824-
all `encryption_state = 0b00` MVCC versions in place
825-
(per §5.4). When `encryption status --verify` reports zero
826-
cleartext versions across every node AND `minRetainedTS` has
827-
advanced past the youngest cleartext `commit_ts`, Phase 1 is
828-
complete.
829-
830-
**Phase 2 — Raft envelope.**
831-
832-
6. Operator runs `elastickv-admin encryption enable-raft-envelope`.
833-
The admin client RPCs into the leader, which:
834-
- Verifies via the route catalog / membership snapshot that
835-
**every** voting member of every Raft group has reported
836-
`encryption_capable = true` in its periodic heartbeat
837-
metadata (a new field). Refuses to proceed otherwise.
838-
- Proposes a single Raft entry with the cluster-wide flag
839-
`raft_envelope_active = true`. This entry itself is sent
840-
**cleartext** so non-upgraded replicas (defensive: should
841-
not exist after step 6's check, but treated as a safety
842-
net) can still apply it.
843-
7. From the apply index of that entry onward, every leader
819+
#### Rolling enablement: three phases, not one
820+
821+
Both the §4.1 storage envelope and the §4.2 raft envelope create
822+
mixed-binary safety problems that have to be ruled out **before**
823+
any encrypted byte hits disk. The rollout is therefore explicitly
824+
three-phase:
825+
826+
**Why an "obvious" rolling restart with `--encryption-enabled` is
827+
unsafe.** A naive plan — flip the flag node-by-node and rely on
828+
the per-version `encryption_state` bit — is broken at two layers:
829+
830+
- *At the storage layer:* the `lsm_store.go` value header today is
831+
9 bytes `[tombstone(1)] [expireAt(8)]`, and `decodeValue` reads
832+
the first byte as `tombstone := data[0] != 0`. The §6.2 plan
833+
packs `encryption_state` into bits 1–2 of that same byte. An
834+
encrypted live value (`tombstone=0`, `encryption_state=0b01`)
835+
becomes `data[0] = 0b00000010 = 2`. A node still on the old
836+
binary that ingests this value via Raft replication or snapshot
837+
catch-up reads `data[0] != 0` → "tombstone" → drops the value
838+
from reads. Silent data loss, not a clean refusal.
839+
- *At the Raft layer:* §4.2 wraps the entire `Data []byte` per
840+
Raft entry. Apply is deterministic across replicas; a
841+
non-upgraded follower without the KEK fails every
842+
raft-envelope entry on apply, drifts behind, and is eventually
843+
removed.
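
The storage-layer misread is easy to reproduce in isolation. The sketch below assumes the layout stated above (bit 0 = tombstone, bits 1-2 = `encryption_state`, 9-byte header); `oldIsTombstone` and `newDecodeFlags` are illustrative stand-ins, not the actual `decodeValue` from `lsm_store.go`.

```go
package main

import "fmt"

// Bit layout assumed from the description above.
const (
	tombstoneBit         = 0x01
	encryptionStateMask  = 0x06
	encryptionStateShift = 1
)

// oldIsTombstone mirrors the legacy test: any non-zero first byte
// is treated as a tombstone.
func oldIsTombstone(data []byte) bool {
	return data[0] != 0
}

// newDecodeFlags masks each field separately, which is what an
// encryption-capable binary has to do.
func newDecodeFlags(data []byte) (tombstone bool, encState byte) {
	tombstone = data[0]&tombstoneBit != 0
	encState = (data[0] & encryptionStateMask) >> encryptionStateShift
	return tombstone, encState
}

func main() {
	header := make([]byte, 9)                // [flags(1)] [expireAt(8)]
	header[0] = 0b01 << encryptionStateShift // live + encrypted = 0b00000010

	fmt.Println(oldIsTombstone(header)) // true: old binary drops the value
	ts, st := newDecodeFlags(header)
	fmt.Println(ts, st) // false 1: new binary sees a live encrypted value
}
```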
844+
845+
Both problems share one fix: **gate the encryption-active state on
846+
a Raft-replicated cluster flag, not on a per-node startup flag.**
847+
The cluster-flag flip happens only after a membership-snapshot
848+
check confirms every voting member is running an
849+
encryption-capable binary. The per-node `--encryption-enabled`
850+
flag becomes a *capability* assertion, not a *behaviour* trigger.
851+
852+
**Phase 0 — Capability rollout.**
853+
854+
0. Operator provisions the KEK in the KMS.
855+
1. Operator restarts each node with `--encryption-enabled
856+
--kekUri=...`. Rolling restart is safe in this phase because
857+
the new binary still writes cleartext values (storage
858+
envelope is gated on the cluster flag below) and proposes
859+
cleartext Raft entries (raft envelope is gated on a separate
860+
flag in Phase 2).
861+
2. Each upgraded node advertises `encryption_capable = true` in
862+
its periodic Raft heartbeat metadata (a new field on the
863+
peer-metadata payload added by `peer_metadata.go`). The flag
864+
is purely informational at this point.
865+
3. When the membership snapshot of every Raft group shows
866+
`encryption_capable = true` for every voting member, Phase 0
867+
is complete. `encryption status` reports the capability
868+
coverage so the operator can wait for it before moving on.
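
The Phase 0 completion check (and the gate rechecked in Phases 1 and 2) reduces to a scan over membership snapshots. The following is a hedged sketch: `Member`, `GroupSnapshot`, and `MissingCapability` are hypothetical names, and the real peer-metadata payload may be shaped differently.

```go
package main

import "fmt"

// Member is a hypothetical view of one peer in a membership snapshot.
type Member struct {
	ID                string
	Voter             bool
	EncryptionCapable bool // advertised via heartbeat metadata
}

// GroupSnapshot is a hypothetical membership snapshot of one Raft group.
type GroupSnapshot struct {
	GroupID uint64
	Members []Member
}

// MissingCapability lists every voting member that has not advertised
// encryption_capable. An empty result means the gate passes.
// Non-voting members are ignored, matching the "every voting member" rule.
func MissingCapability(groups []GroupSnapshot) []string {
	var missing []string
	for _, g := range groups {
		for _, m := range g.Members {
			if m.Voter && !m.EncryptionCapable {
				missing = append(missing, fmt.Sprintf("group %d: %s", g.GroupID, m.ID))
			}
		}
	}
	return missing
}

func main() {
	groups := []GroupSnapshot{
		{GroupID: 1, Members: []Member{{"n1", true, true}, {"n2", true, true}}},
		{GroupID: 2, Members: []Member{{"n1", true, true}, {"n3", true, false}}},
	}
	fmt.Println(MissingCapability(groups))
}
```

Returning the list of offending members, rather than a bare boolean, gives `encryption status` something concrete to report while the operator waits.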
869+
870+
**Phase 1 — Storage envelope cluster-flag flip.**
871+
872+
4. Operator runs `elastickv-admin encryption enable-storage-envelope`.
873+
The leader rechecks the membership-snapshot capability gate
874+
from step 3 and then proposes a single Raft entry with the
875+
cluster-wide flag `storage_envelope_active = true`. This
876+
entry's `Data []byte` is the **legacy framing** (i.e., the
877+
existing `kv/fsm.go` first-byte tag space; see Phase 2's note
878+
below) so every replica — even one that somehow missed the
879+
capability advert — can still decode and apply it.
880+
5. From the apply index of that flag entry onward, every storage
881+
layer Put writes `encryption_state = 0b01` and the §4.1
882+
envelope. MVCC versions written before that index keep
883+
`encryption_state = 0b00` (cleartext) and are read back through
884+
the dispatch path; mixing within a single key is safe per §5.4.
885+
6. Operator runs `elastickv-admin encryption rewrite` to convert
886+
the remaining cleartext MVCC versions in place (per §5.4).
887+
Phase 1 is complete when `encryption status --verify` reports
888+
zero `encryption_state = 0b00` versions across every node
889+
(excluding tombstones) AND `minRetainedTS` has advanced past
890+
the youngest cleartext `commit_ts`.
891+
892+
**Phase 2 — Raft envelope cluster-flag flip.**
893+
894+
7. Operator runs `elastickv-admin encryption enable-raft-envelope`.
895+
The leader rechecks the same capability gate, then proposes a
896+
single Raft entry with the cluster-wide flag
897+
`raft_envelope_active = true`. This entry itself uses the
898+
legacy first-byte tag space so it remains decodable by any
899+
replica that has applied entries up to it.
900+
8. From the apply index of that flag entry onward, every leader
844901
wraps new proposal `Data []byte` with the raft DEK (§4.2).
845-
Replicas dispatch on a 1-byte format tag at the start of the
846-
`Data` payload — `0x00` = cleartext (pre-flag entry, found
847-
only in WAL/snapshot history), `0x01` = raft envelope. The
848-
discriminator lives **at the start of the application-level
849-
payload**, never as the first byte of an arbitrary user
850-
value, so the §7.1 byte-collision argument does not apply
851-
here: every value the storage layer sees is opaque bytes
852-
from a known framing, not raw user input.
853-
8. Snapshots taken during Phase 2 carry the new flag in their
854-
metadata header so a fresh follower joining mid-Phase-2 sees
855-
the right framing for every entry it will ever receive.
902+
**There is no in-band format tag in the proposal payload.**
903+
Replicas dispatch on the **Raft log index** of the entry
904+
relative to the persisted `raft_envelope_cutover_index` (the
905+
apply index of the flag entry, recorded in the local sidecar
906+
on apply). Entries below the cutover index are decoded via
907+
the existing first-byte tag space (`0x00` single-request,
908+
`0x01` batch, `0x02` HLC-lease, etc.); entries at-or-above
909+
the cutover index are unwrapped through the raft envelope
910+
first. This explicitly avoids re-using `kv/fsm.go`'s existing
911+
first-byte tag values for the "is this an envelope?"
912+
discriminator — that re-use would make pre-cutover batch
913+
entries (`0x01`) ambiguous during WAL replay.
914+
9. Snapshots taken during Phase 2 carry the cutover index in
915+
their metadata header so a fresh follower joining
916+
mid-Phase-2 reconstructs the same dispatch boundary on
917+
ingest.
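
Index-gated dispatch from step 8 can be sketched as below. `Framing`, `ClassifyEntry`, and `LegacyKind` are hypothetical names. The sketch follows the "at-or-above" wording literally; how the legacy-framed flag entry itself is treated at exactly the cutover index is not spelled out above and is not modelled here.

```go
package main

import "fmt"

// Framing says how an entry's Data must be decoded.
type Framing int

const (
	FramingLegacy       Framing = iota // existing first-byte tag space
	FramingRaftEnvelope                // unwrap through the raft envelope first
)

// ClassifyEntry decides purely from the Raft log index.
// cutover == 0 means Phase 2 has not started.
func ClassifyEntry(index, cutover uint64) Framing {
	if cutover != 0 && index >= cutover {
		return FramingRaftEnvelope
	}
	return FramingLegacy
}

// LegacyKind names the legacy tags listed in the text.
func LegacyKind(data []byte) string {
	if len(data) == 0 {
		return "empty"
	}
	switch data[0] {
	case 0x00:
		return "single-request"
	case 0x01:
		return "batch"
	case 0x02:
		return "HLC-lease"
	default:
		return "other"
	}
}

func main() {
	const cutover = 184201
	batch := []byte{0x01, 0xAA}

	// A pre-cutover batch entry stays a batch entry during WAL replay,
	// even though its first byte is 0x01.
	if ClassifyEntry(184200, cutover) == FramingLegacy {
		fmt.Println("legacy:", LegacyKind(batch))
	}
	if ClassifyEntry(184201, cutover) == FramingRaftEnvelope {
		fmt.Println("envelope: unwrap, then decode the inner payload")
	}
}
```

Because the decision never inspects the payload, no first-byte value needs to be reserved for the envelope.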
856918

857919
#### Why §4.2 cannot be turned off again
858920

@@ -996,6 +1058,12 @@ The process refuses to start if any of the following hold:
9961058
- The encryption package's startup membership check fails to
9971059
resolve a stable 16-bit `node_id` (the §4.1 nonce construction
9981060
needs one; without it nonce uniqueness cannot be guaranteed).
1061+
- The local sidecar's `raft_envelope_cutover_index` disagrees
1062+
with the value carried in the most-recent ingested snapshot
1063+
header (`ErrEnvelopeCutoverDivergence`). This catches a node
1064+
that joined mid-Phase-2 from a snapshot taken before Phase 2
1065+
was enabled, then later replayed Raft entries that crossed
1066+
the cutover; resolved by `encryption resync-sidecar`.
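
A minimal sketch of the divergence check, under stated assumptions: only the error name `ErrEnvelopeCutoverDivergence` comes from the text; the message string, the `CheckCutover` and `IsDivergence` helpers, and the "no snapshot ingested yet" case are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrEnvelopeCutoverDivergence is named in the refusal list; the
// message text here is illustrative.
var ErrEnvelopeCutoverDivergence = errors.New("raft envelope cutover index diverges from snapshot header")

// CheckCutover compares the sidecar's cutover index with the one in the
// most recently ingested snapshot header. haveSnapshot is false when the
// node has never ingested a snapshot (assumed to pass the check).
func CheckCutover(sidecarCutover, snapshotCutover uint64, haveSnapshot bool) error {
	if !haveSnapshot {
		return nil
	}
	if sidecarCutover != snapshotCutover {
		return fmt.Errorf("%w: sidecar=%d snapshot=%d",
			ErrEnvelopeCutoverDivergence, sidecarCutover, snapshotCutover)
	}
	return nil
}

// IsDivergence reports whether err wraps ErrEnvelopeCutoverDivergence.
func IsDivergence(err error) bool {
	return errors.Is(err, ErrEnvelopeCutoverDivergence)
}

func main() {
	// Snapshot taken before Phase 2 (cutover 0), sidecar later crossed it.
	err := CheckCutover(184201, 0, true)
	fmt.Println(IsDivergence(err), err)
}
```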
9991067

10001068
Each refusal logs a single, unambiguous error pointing at the
10011069
relevant flag and runbook section.
@@ -1081,14 +1149,23 @@ eventual code change.
10811149
unaffected because the bytes are already ciphertext; the
10821150
receiving node's keystore must contain the relevant DEK before
10831151
it can read the ingested data, which is guaranteed by the Raft
1084-
ordering of the rotation entry. The two-phase rollout in §7.1 is
1085-
the load-bearing piece for "node not yet upgraded" cases:
1086-
Phase 1 keeps Raft proposals cleartext so a non-upgraded
1087-
follower can still apply, and Phase 2 only flips on after a
1088-
membership-snapshot check confirms every voting member is
1089-
encryption-capable. Skipping the membership check would let
1090-
one upgraded leader produce raft envelopes that lock every
1091-
non-upgraded follower out of apply.
1152+
ordering of the rotation entry. The three-phase rollout in §7.1
1153+
is the load-bearing piece for "node not yet upgraded" cases:
1154+
Phase 0 (capability rollout) gets every node onto the new
1155+
binary while still writing cleartext so the storage-layer
1156+
`data[0] != 0` tombstone test on old code can never see a
1157+
value with `encryption_state` bits set; Phase 1 (storage
1158+
cluster-flag) only flips after a membership-snapshot check
1159+
confirms every voting member is `encryption_capable`; Phase 2
1160+
(raft cluster-flag) reuses the same gate. The membership
1161+
check is **not** advisory — skipping it would let one
1162+
upgraded leader produce raft envelopes that lock every
1163+
non-upgraded follower out of apply, or produce
1164+
`encryption_state = 0b01` values that an old follower
1165+
silently drops as tombstones. Sidecar recovery in §5.5 is
1166+
correspondingly load-bearing for the "missed multiple
1167+
rotations" case: replaying only the active DEK would strand
1168+
intermediate `key_id`s and silently break historical reads.
10921169

10931170
3. **Performance.** AES-NI puts encryption CPU below the existing
10941171
FSM apply CPU. The compress pass recovers most of Pebble's lost
