Commit 1b8613f

docs(encryption): address PR707 round-3 P1 + P2 (2-phase rollout, deterministic nonce, sidecar fsync)
P1 (codex line 728): §4.2 raft envelope rollout was unsafe in a rolling restart -- a non-upgraded follower without KEK cannot decrypt entries proposed by an upgraded leader, breaking deterministic apply. Split rollout into Phase 1 (storage envelope only, raft Data stays cleartext) and Phase 2 (enable-raft-envelope admin command, gated on every voting member reporting encryption_capable). Phase 2 is one-way because the WAL would interleave cleartext and raft-envelope entries.

P1 (codex line 179): 32-bit random nonce prefix had a non-zero birthday-collision probability across process loads (~1e-7 over multi-year DEK lifetime) and a single AES-GCM nonce reuse is catastrophic. Replace with deterministic node_id (16b) || local_epoch (16b) || write_count (64b). local_epoch is persisted in the sidecar and fsync-bumped before any encryption, removing collision probability entirely.

P2 (codex line 561): os.Rename was atomic for visibility but not crash-durable. Specify the fsync-file -> rename -> fsync-parent-dir protocol; refuse to start on filesystems that cannot guarantee dir.Sync semantics (NFS, some FUSE).

Self-review §10 lens 1 + lens 2 updated to call out sidecar non-durability and nonce reuse as data-loss-class hard preconditions, and the membership-snapshot check as the load-bearing piece for Phase 2 safety. §9.1 startup refusal list extended with ErrUnsupportedFilesystem and missing node_id. §6.6 admin commands gain enable-raft-envelope.
1 parent ff6a91a commit 1b8613f

1 file changed

Lines changed: 187 additions & 40 deletions

File tree

docs/design/2026_04_29_proposed_data_at_rest_encryption.md

@@ -169,16 +169,45 @@ Envelope format (single byte stream, stored as the Pebble value):
169169
- `key_id` — 32-bit identifier of the DEK that produced this
170170
ciphertext. Required so a rotated DEK can decrypt entries written
171171
before the rotation (see §5.2).
172-
- `nonce` — 12-byte AES-GCM nonce. To stay clear of the NIST SP
173-
800-38D limit on randomly-generated 96-bit nonces (§5.2 derives
174-
the budget), elastickv issues nonces from a **per-DEK 96-bit
175-
counter** rather than a fresh `crypto/rand.Reader` draw per write.
176-
The counter's high 32 bits are a per-process random prefix
177-
(re-drawn whenever the DEK is loaded, so two processes that share
178-
a DEK never collide), and the low 64 bits are an `atomic.Uint64`
179-
incremented per write. This makes nonce reuse impossible within
180-
the lifetime of a (DEK, process-load) pair and removes the
181-
birthday-bound budget entirely.
172+
- `nonce` — 12-byte AES-GCM nonce, structured as **three
173+
deterministic fields** to eliminate the birthday bound entirely.
174+
No bits of the nonce are random; nonce uniqueness is by
175+
construction across `(node, process_load, write)`:
176+
177+
```text
178+
+-------------+----------------+----------------+
179+
| node_id(2B) | local_epoch(2B)| write_count(8B)|
180+
+-------------+----------------+----------------+
181+
```
182+
183+
- `node_id` — the 16-bit Raft member ID assigned at cluster
184+
bootstrap. Never reused for a different node within the
185+
cluster's lifetime; verified at process start against the
186+
membership snapshot.
187+
- `local_epoch` — 16-bit per-DEK process-load counter,
188+
persisted in the local sidecar (§5.1). Incremented and
189+
fsync'd **before** the new process performs any encryption,
190+
so a crash between increment and the first write still leaves
191+
the on-disk counter ahead of any nonce ever used. Wraps at
192+
65,536 process restarts per node per DEK lifetime; rotation
193+
triggers (§5.2) reset it to 0 with the new DEK. The
194+
encryption package logs a warning when `local_epoch >
195+
0xff00` so an operator has a 256-restart cushion to rotate
196+
before wrap.
197+
- `write_count` — 64-bit `atomic.Uint64`, incremented per
198+
write within the current process. Resets to 0 on process
199+
start (the `local_epoch` bump is what makes that safe).
200+
Wrap is unreachable in any realistic deployment (2⁶⁴ ≈
201+
1.8 × 10¹⁹ writes per process-load).
202+
203+
An earlier draft used a 32-bit CSPRNG-drawn prefix for the
204+
high half of the nonce; that has been retracted because a
205+
random 32-bit value across N process loads carries an N²/2³³
206+
birthday-collision probability (~1 in 10⁷ at N=30 over a
207+
multi-year DEK lifetime), and a single nonce collision under
208+
one DEK is a catastrophic AES-GCM failure (GHASH authentication-key recovery + XOR
209+
of two plaintexts). The deterministic construction above has
210+
zero collision probability under its preconditions.
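
The three-field layout above can be sketched in Go as follows. This is a minimal illustration, not the elastickv implementation: the `nonceSource` type and its `next` method are hypothetical names, and big-endian field encoding is an assumption, since the design text fixes field widths but not byte order.

```go
package main

import (
	"encoding/binary"
	"sync/atomic"
)

// nonceSource is a hypothetical helper (not named in the design doc) that
// issues nonces for one (DEK, process-load) pair.
type nonceSource struct {
	nodeID     uint16        // 16-bit Raft member ID
	localEpoch uint16        // persisted and fsync'd before first use
	writeCount atomic.Uint64 // starts at 0 on every process start
}

// next returns node_id(2B) || local_epoch(2B) || write_count(8B).
// Big-endian encoding is an assumption made for this sketch.
func (s *nonceSource) next() [12]byte {
	var n [12]byte
	c := s.writeCount.Add(1) - 1 // the first nonce carries write_count = 0
	binary.BigEndian.PutUint16(n[0:2], s.nodeID)
	binary.BigEndian.PutUint16(n[2:4], s.localEpoch)
	binary.BigEndian.PutUint64(n[4:12], c)
	return n
}
```

Because `writeCount.Add` is atomic, concurrent writers in one process each observe a distinct counter value; uniqueness across restarts then rests entirely on `localEpoch` having been bumped durably first.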
182211
- `ciphertext` — AES-256-GCM(plaintext, nonce, AAD = envelope_version
183212
‖ flag ‖ key_id). AAD binds the ciphertext to the **entire**
184213
envelope header — including the compression flag — so a
@@ -332,9 +361,11 @@ Two-tier hierarchy:
332361
"active": { "storage": 305419896, "raft": 2596069104 },
333362
"keys": {
334363
"305419896": { "purpose": "storage", "wrapped": "<base64>",
335-
"created": "2026-04-29T10:00:00Z" },
364+
"created": "2026-04-29T10:00:00Z",
365+
"local_epoch": 7 },
336366
"2596069104": { "purpose": "raft", "wrapped": "<base64>",
337-
"created": "2026-04-29T10:00:00Z" }
367+
"created": "2026-04-29T10:00:00Z",
368+
"local_epoch": 7 }
338369
}
339370
}
340371
```
@@ -352,6 +383,35 @@ Two-tier hierarchy:
352383
recent rotation entry that has been persisted into this
353384
sidecar. It is the load-bearing field for the
354385
sidecar/log-index reconciliation protocol in §5.5.
386+
- `local_epoch` is the per-DEK process-load counter consumed by
387+
the §4.1 nonce construction. Bumped and durably persisted on
388+
every process start before any encryption happens with that
389+
DEK; reset to 0 when the DEK is created.
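
A minimal Go sketch of the startup bump described in this bullet. The function name `nextLocalEpoch` and the choice to return an error at `0xffff` instead of wrapping are assumptions for illustration; the design only states that the counter wraps at 65,536 restarts and that a warning is logged above `0xff00`. The caller is expected to persist the returned value durably before any encryption.

```go
package main

import "errors"

// epochWarnThreshold mirrors the 0xff00 warning threshold from §4.1.
const epochWarnThreshold uint16 = 0xff00

// errEpochExhausted is a hypothetical error; refusing to wrap is an
// assumption of this sketch, not a documented behaviour.
var errEpochExhausted = errors.New("local_epoch exhausted: rotate the DEK before restarting")

// nextLocalEpoch computes the epoch for the new process load from the value
// last persisted in the sidecar. warn reports whether the operator should be
// told to rotate the DEK soon.
func nextLocalEpoch(persisted uint16) (next uint16, warn bool, err error) {
	if persisted == 0xffff {
		return 0, false, errEpochExhausted
	}
	next = persisted + 1
	return next, next > epochWarnThreshold, nil
}
```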
390+
391+
**Crash-durable write protocol.** `os.Rename` is atomic for
392+
visibility but not crash-durable on its own — a power loss after
393+
the rename can roll back the file via the file system's metadata
394+
journal, leaving stale wrapped DEKs on disk while the rotation's
395+
Raft entry is already committed. To avoid stranding ciphertext
396+
written under a DEK whose wrap is then lost, the sidecar write
397+
protocol is:
398+
399+
1. Write the new contents to `<dataDir>/encryption/keys.json.tmp`.
400+
2. `file.Sync()` on the temp file (fsync the data + metadata).
401+
3. `os.Rename` to `keys.json`.
402+
4. `dir.Sync()` on `<dataDir>/encryption/` (fsync the directory
403+
entry so the rename is durable).
404+
5. Only after step 4 does the FSM acknowledge the rotation entry
405+
as applied (and update `raft_applied_index` in memory; that
406+
value is then persisted on the next sidecar write).
407+
408+
Skipping step 2 or 4 turns a power loss into permanent data loss
409+
for any value written under the new DEK; the §10 self-review
410+
treats sidecar non-durability as a data-loss-class bug. On
411+
filesystems that lack `dir.Sync()` semantics (NFS, some FUSE
412+
mounts) the encryption package refuses to start with
413+
`ErrUnsupportedFilesystem` rather than silently degrading the
414+
durability guarantee.
355415

356416
The sidecar is **safe to leak**: every entry in `keys` is wrapped
357417
by the KEK. Without the KEK, the file unwraps to nothing.
@@ -666,6 +726,11 @@ elastickv-admin encryption rewrap-deks
666726
elastickv-admin encryption rewrite --rate=10MiB/s
667727
elastickv-admin encryption retire-dek --key-id=<uint32>
668728
elastickv-admin encryption resync-sidecar # §5.5 follower repair
729+
elastickv-admin encryption enable-raft-envelope
730+
# §7.1 Phase 2 cutover;
731+
# refuses unless every
732+
# voting member reports
733+
# encryption_capable
669734
elastickv-admin encryption disable # refuses; documents the
670735
# dump-and-reload path
671736
elastickv-admin backup verify --backup-dir=...
@@ -719,34 +784,93 @@ envelope) is impossible by construction — the header layout is
719784
fixed and the encryption state is read before the value bytes are
720785
ever interpreted.
721786

722-
#### Rolling enablement
787+
#### Rolling enablement: two phases, not one
788+
789+
The §4.1 storage envelope and the §4.2 raft envelope have very
790+
different rollout constraints:
791+
792+
- **§4.1 (storage envelope)** is per-version, dispatched on the
793+
per-MVCC-version `encryption_state` bit. A node that has not yet
794+
been upgraded continues to serve old cleartext versions; only
795+
upgraded nodes write encrypted versions. Mixed-mode is safe.
796+
- **§4.2 (raft envelope)** wraps **every** committed Raft entry's
797+
`Data []byte`. **Every replica must be able to decrypt every
798+
committed entry** because Raft apply is deterministic and runs
799+
on every node — there is no "skip this entry on this node"
800+
escape hatch. If an upgraded leader proposes an encrypted
801+
entry while a non-upgraded follower is still in the cluster,
802+
the follower's `Apply` fails for every such entry. The
803+
follower stops making progress, drifts behind on commit index,
804+
and eventually triggers leader-side throttling.
805+
806+
This rules out enabling §4.2 at the same instant as §4.1. The
807+
rollout is therefore explicitly two-phase, controlled by separate
808+
flags / Raft-replicated cluster state:
809+
810+
**Phase 1 — Storage envelope only.**
723811

724-
1. Operator provisions the KEK in their KMS.
812+
1. Operator provisions the KEK in the KMS.
725813
2. Operator restarts each node with `--encryption-enabled --kekUri=...`.
726-
Restart is rolling; mixed-mode clusters are supported because
727-
each MVCC version carries its own `encryption_state` bit and
728-
reads dispatch on that bit, not on the value bytes.
729-
3. New writes from the upgraded node are encrypted (`encryption_state
730-
= 0b01`). Old MVCC versions retain `encryption_state = 0b00` and
731-
continue to be returned as cleartext to the storage layer's
732-
decryption shim, which short-circuits when the bit is `0b00`.
733-
4. Operator runs `elastickv-admin encryption rewrite` to walk the
734-
MVCC versions and re-encrypt every `encryption_state = 0b00`
735-
version in place (per §5.4: same-`commit_ts` rewrite, MVCC
736-
history preserved, rate-limited, resumable). The cursor lives at
737-
`!encryption|rewrite|cursor|cleartext`.
738-
5. When `encryption status --verify` reports zero `encryption_state
739-
= 0b00` versions across every node (NOT counting tombstones)
740-
AND `minRetainedTS` has advanced past the youngest cleartext
741-
`commit_ts`, the migration is complete and the cleartext code
742-
path can be retired in a follow-up release.
814+
Restart is rolling; mixed-mode is safe under §4.1 because each
815+
MVCC version carries its own `encryption_state` bit and reads
816+
dispatch on that bit, not on the value bytes.
817+
3. Upgraded nodes write encrypted MVCC versions (`encryption_state
818+
= 0b01`). Old MVCC versions retain `encryption_state = 0b00`
819+
and continue to be returned as cleartext.
820+
4. Raft proposals during Phase 1 carry **cleartext** `Data []byte`
821+
— i.e., §4.2 is not yet active. Lookup keys and operation tags
822+
in the WAL are still cleartext during this window.
823+
5. Operator runs `elastickv-admin encryption rewrite` to convert
824+
all `encryption_state = 0b00` MVCC versions in place
825+
(per §5.4). When `encryption status --verify` reports zero
826+
cleartext versions across every node AND `minRetainedTS` has
827+
advanced past the youngest cleartext `commit_ts`, Phase 1 is
828+
complete.
829+
830+
**Phase 2 — Raft envelope.**
831+
832+
6. Operator runs `elastickv-admin encryption enable-raft-envelope`.
833+
The admin client RPCs into the leader, which:
834+
- Verifies via the route catalog / membership snapshot that
835+
**every** voting member of every Raft group has reported
836+
`encryption_capable = true` in its periodic heartbeat
837+
metadata (a new field). Refuses to proceed otherwise.
838+
- Proposes a single Raft entry with the cluster-wide flag
839+
`raft_envelope_active = true`. This entry itself is sent
840+
**cleartext** so non-upgraded replicas (defensive: should
841+
not exist after the membership check above, but treated as a safety
842+
net) can still apply it.
843+
7. From the apply index of that entry onward, every leader
844+
wraps new proposal `Data []byte` with the raft DEK (§4.2).
845+
Replicas dispatch on a 1-byte format tag at the start of the
846+
`Data` payload — `0x00` = cleartext (pre-flag entry, found
847+
only in WAL/snapshot history), `0x01` = raft envelope. The
848+
discriminator lives **at the start of the application-level
849+
payload**, never as the first byte of an arbitrary user
850+
value, so the §7.1 byte-collision argument does not apply
851+
here: every value the storage layer sees is opaque bytes
852+
from a known framing, not raw user input.
853+
8. Snapshots taken during Phase 2 carry the new flag in their
854+
metadata header so a fresh follower joining mid-Phase-2 sees
855+
the right framing for every entry it will ever receive.
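
The Phase 2 gate and the format-tag dispatch from steps 6 and 7 can be sketched in Go as follows. All identifiers here (`allEncryptionCapable`, `unwrapProposal`, the `decrypt` callback) are hypothetical, and the sketch assumes every payload it sees carries the 1-byte tag; how pre-flag entries acquire that tag is not specified in this section.

```go
package main

import "errors"

const (
	tagCleartext    byte = 0x00 // pre-flag entry from WAL/snapshot history
	tagRaftEnvelope byte = 0x01 // payload wrapped with the raft DEK
)

var errUnknownTag = errors.New("unknown raft payload format tag")

// allEncryptionCapable mirrors the step 6 precondition: every voting member
// must have reported encryption_capable = true. The map key is a member ID.
func allEncryptionCapable(voters map[uint16]bool) bool {
	if len(voters) == 0 {
		return false // no membership information: refuse to proceed
	}
	for _, capable := range voters {
		if !capable {
			return false
		}
	}
	return true
}

// unwrapProposal dispatches on the leading format tag. decrypt stands in
// for the raft-envelope decryption routine, which is out of scope here.
func unwrapProposal(data []byte, decrypt func([]byte) ([]byte, error)) ([]byte, error) {
	if len(data) == 0 {
		return nil, errors.New("empty raft payload")
	}
	switch data[0] {
	case tagCleartext:
		return data[1:], nil
	case tagRaftEnvelope:
		return decrypt(data[1:])
	default:
		return nil, errUnknownTag
	}
}
```

Rejecting unknown tags outright, instead of falling back to cleartext, keeps apply deterministic: every replica either decodes an entry the same way or fails loudly.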
856+
857+
#### Why §4.2 cannot be turned off again
858+
859+
Once `raft_envelope_active = true` has been committed, the WAL on
860+
every node interleaves cleartext (pre-flag) and raft-envelope
861+
(post-flag) entries. Disabling §4.2 would require rewriting WAL
862+
entries, which etcd raft does not support. The flag is therefore
863+
one-way; the only way back is dump-and-reload (§7.2).
743864

744865
#### Compatibility with snapshot streaming during migration
745866

746-
A leader streaming a Pebble snapshot to a new follower mid-migration
747-
will ship a mix of `encryption_state = 0b00` and `0b01` versions.
748-
Followers ingest both correctly because the MVCC metadata travels
749-
with each version. No special handling at the snapshot layer.
867+
A leader streaming a Pebble snapshot to a new follower mid-Phase-1
868+
ships a mix of `encryption_state = 0b00` and `0b01` MVCC versions;
869+
the receiving follower ingests both correctly because the
870+
metadata travels per-version. A snapshot taken in Phase 2 ships
871+
only encryption-state-`0b01` MVCC versions plus a Phase-2-flagged
872+
header so the receiver knows to expect raft envelopes from there
873+
on.
750874

751875
### 7.2 Why we will not support encrypted → cleartext
752876

@@ -866,6 +990,12 @@ The process refuses to start if any of the following hold:
866990
persisted applied index AND the gap covers any rotation entries
867991
(see §5.5: `ErrSidecarBehindRaftLog`, recovered via
868992
`encryption resync-sidecar`).
993+
- The sidecar is on a filesystem whose `dir.Sync()` does not
994+
guarantee crash durability of the rename (`ErrUnsupportedFilesystem`,
995+
see §5.1's crash-durable write protocol).
996+
- The encryption package's startup membership check fails to
997+
resolve a stable 16-bit `node_id` (the §4.1 nonce construction
998+
needs one; without it nonce uniqueness cannot be guaranteed).
869999

8701000
Each refusal logs a single, unambiguous error pointing at the
8711001
relevant flag and runbook section.
@@ -924,10 +1054,20 @@ eventual code change.
9241054
value bytes — see §7.1 on why a leading byte is unsafe), so a
9251055
half-migrated database stays fully readable and a legacy value
9261056
that happens to start with `0x01` cannot be misclassified as an
927-
envelope. The largest risk is a buggy `compress-then-encrypt`
928-
path that mis-frames the compressed payload; mitigation is a
929-
round-trip property test in `store/` using `pgregory.net/rapid`
930-
over arbitrary byte slices.
1057+
envelope. Two specific data-loss-class failure modes are
1058+
addressed by hard preconditions rather than by recovery code:
1059+
(a) sidecar non-durability — the §5.1 write protocol fsyncs
1060+
the file *and* the parent directory before the rotation is
1061+
acknowledged, so a power loss cannot strand ciphertext under a
1062+
wrap that has rolled back; (b) AES-GCM nonce reuse — the §4.1
1063+
nonce is built from `node_id ‖ local_epoch ‖ write_count`,
1064+
each field deterministic, the epoch persisted-and-fsynced
1065+
before any encryption, so even a crash-restart loop cannot
1066+
cause two writes under one DEK to share a nonce. The remaining
1067+
open risk is a buggy `compress-then-encrypt` path that
1068+
mis-frames the compressed payload; mitigation is a round-trip
1069+
property test in `store/` using `pgregory.net/rapid` over
1070+
arbitrary byte slices.
9311071

9321072
2. **Concurrency / distributed failures.** DEK rotation goes through
9331073
Raft so every replica observes the new DEK at the same log index.
@@ -941,7 +1081,14 @@ eventual code change.
9411081
unaffected because the bytes are already ciphertext; the
9421082
receiving node's keystore must contain the relevant DEK before
9431083
it can read the ingested data, which is guaranteed by the Raft
944-
ordering of the rotation entry.
1084+
ordering of the rotation entry. The two-phase rollout in §7.1 is
1085+
the load-bearing piece for "node not yet upgraded" cases:
1086+
Phase 1 keeps Raft proposals cleartext so a non-upgraded
1087+
follower can still apply, and Phase 2 only flips on after a
1088+
membership-snapshot check confirms every voting member is
1089+
encryption-capable. Skipping the membership check would let
1090+
one upgraded leader produce raft envelopes that lock every
1091+
non-upgraded follower out of apply.
9451092

9461093
3. **Performance.** AES-NI puts encryption CPU below the existing
9471094
FSM apply CPU. The compress pass recovers most of Pebble's lost
