Skip to content

Commit 2768cb3

Browse files
committed
docs(encryption): address PR707 round-5 (claude[bot] r3+r4 leftovers)
High (claude[bot] r4 #2): local_epoch wrap to 0 re-issues nonce node_id||0||0 of the same DEK first write -- catastrophic AES-GCM nonce reuse. Add ErrLocalEpochExhausted to §9.1 startup refusal at local_epoch == 0xFFFF; only DEK rotation can recover. High (claude[bot] r4 #3): node_id was described as 16-bit but DeriveNodeID() returns uint64 FNV-64a. Specify the truncation rule (lower 16 bits) and the startup ErrNodeIDCollision check that asserts no two voting members share the same uint16 truncated value. Medium (claude[bot] r4 #4): §5.1 step 5 said raft_applied_index is persisted on the next sidecar write, contradicting §5.5 which expects the same write. Reword: the index is part of the same JSON payload as the new wrapped DEKs, eliminating any deferred-write race that would spuriously fire ErrSidecarBehindRaftLog. Medium (claude[bot] r4 #5): encryption_capable heartbeat field path was unspecified. Pin it to the existing peer_metadata.go extension; cutover commands force a fresh poll within one election timeout rather than reading cached state. Critical reinforcement (claude[bot] r4 #1): explicitly cross-reference kv/fsm.go raftEncodeSingle/raftEncodeBatch/raftEncodeHLCLease constants in §7.1 Phase 2 and pseudocode the exact FSM dispatch order so the implementor cannot accidentally re-use the colliding tag bytes. Index-gating remains the chosen mechanism (already in round-4 commit). Medium (claude[bot] r3 #3): §6.1 sidecar.go entry now references the §5.1 crash-durable write protocol explicitly so an implementor reading §6.1 in isolation cannot omit the fsync. Medium (claude[bot] r3 #4): §6.2 corrected the misleading "old build refuses to serve" wording -- old decodeValue actually returns a silent tombstone, but the 3-phase rollout makes that path unreachable on a real cluster. Bit-layout comment requirement added. Minor (claude[bot] r3 #5): env KEK provider must os.Unsetenv("ELASTICKV_KEK_BASE64") immediately after read.
1 parent f34d9b7 commit 2768cb3

1 file changed

Lines changed: 125 additions & 32 deletions

File tree

docs/design/2026_04_29_proposed_data_at_rest_encryption.md

Lines changed: 125 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -180,10 +180,21 @@ Envelope format (single byte stream, stored as the Pebble value):
180180
+-------------+----------------+----------------+
181181
```
182182

183-
- `node_id` — the 16-bit Raft member ID assigned at cluster
184-
bootstrap. Never reused for a different node within the
185-
cluster's lifetime; verified at process start against the
186-
membership snapshot.
183+
- `node_id` — the lower 16 bits of
184+
`internal/raftengine/etcd/peers.go::DeriveNodeID(--raftId)`,
185+
which today returns the FNV-64a hash of the raft-id string
186+
as a `uint64`. Truncation rule: `node_id := uint16(DeriveNodeID(raftID))`.
187+
On every process start the encryption package reads the
188+
current Raft membership snapshot and asserts that **no two
189+
voting members share the same lower-16-bit value of their
190+
derived node ID** — refuses to start with
191+
`ErrNodeIDCollision` otherwise (see §9.1). The check is at
192+
most O(num_voting_members²) comparisons (single-digit
193+
cluster sizes today); the cost is paid once at boot. For
194+
elastickv's current `"n1"`/`"n2"`/`"n3"` ID convention the
195+
16-bit values are well-distributed and collisions are
196+
practically impossible, but the assertion makes the nonce
197+
uniqueness invariant unconditional rather than statistical.
187198
- `local_epoch` — 16-bit per-DEK process-load counter,
188199
persisted in the local sidecar (§5.1). Incremented and
189200
fsync'd **before** the new process performs any encryption,
@@ -193,7 +204,16 @@ Envelope format (single byte stream, stored as the Pebble value):
193204
triggers (§5.2) reset it to 0 with the new DEK. The
194205
encryption package logs a warning when `local_epoch >
195206
0xff00` so an operator has a 256-restart cushion to rotate
196-
before wrap.
207+
before wrap; if the operator ignores all 256 warnings and the
208+
epoch reaches `0xFFFF`, the process **refuses to start**
209+
with `ErrLocalEpochExhausted` (see §9.1). The hard stop is
210+
load-bearing: a wrap to 0 would re-issue the nonce
211+
`node_id ‖ 0 ‖ 0` for the current DEK's first write of the
212+
new process load, exactly matching the very first nonce ever
213+
issued under that DEK — a catastrophic AES-GCM nonce reuse
214+
(key recovery + plaintext XOR). DEK rotation (§5.2) is the
215+
only recovery; it allocates a new DEK with a fresh
216+
`local_epoch = 0`.
197217
- `write_count` — 64-bit `atomic.Uint64`, incremented per
198218
write within the current process. Resets to 0 on process
199219
start (the `local_epoch` bump is what makes that safe).
@@ -407,14 +427,22 @@ Two-tier hierarchy:
407427
written under a DEK whose wrap is then lost, the sidecar write
408428
protocol is:
409429

410-
1. Write the new contents to `<dataDir>/encryption/keys.json.tmp`.
411-
2. `file.Sync()` on the temp file (fsync the data + metadata).
430+
1. Build the new sidecar contents in memory, including
431+
`raft_applied_index = <log index of the rotation entry being
432+
applied right now>` — this is part of the same JSON payload
433+
as the new wrapped DEKs, written in the same step 2 below.
434+
2. Write the new contents to `<dataDir>/encryption/keys.json.tmp`,
435+
then `file.Sync()` (fsync the data + metadata).
412436
3. `os.Rename` to `keys.json`.
413437
4. `dir.Sync()` on `<dataDir>/encryption/` (fsync the directory
414438
entry so the rename is durable).
415439
5. Only after step 4 does the FSM acknowledge the rotation entry
416-
as applied (and update `raft_applied_index` in memory; that
417-
value is then persisted on the next sidecar write).
440+
as applied. The in-memory `raft_applied_index` matches the
441+
value that has just been durably written; **no deferred-write
442+
window** in which the on-disk sidecar carries new DEKs but a
443+
stale index. This eliminates the spurious-`ErrSidecarBehindRaftLog`
444+
race that an "update on next write" scheme would create on a
445+
crash between steps 4 and the next rotation.
418446

419447
Skipping step 2 or 4 turns a power loss into permanent data loss
420448
for any value written under the new DEK; the §10 self-review
@@ -643,9 +671,17 @@ rewrite."
643671
- `kek/` subpackage — pluggable KEK providers (`file.go`, `awskms.go`,
644672
`gcpkms.go`, `vault.go`, `env.go`). Each implements
645673
`Wrap(dek []byte) (wrapped []byte, error)` and `Unwrap(wrapped
646-
[]byte) (dek []byte, error)`.
647-
- `sidecar.go` — atomic read/write of `keys.json` (write to
648-
`keys.json.tmp` + `os.Rename`).
674+
[]byte) (dek []byte, error)`. The `env.go` provider must call
675+
`os.Unsetenv("ELASTICKV_KEK_BASE64")` immediately after reading
676+
and decoding the value to narrow the window during which the
677+
raw KEK material is visible via `/proc/<pid>/environ` /
678+
container env inspection. (The recommendation in §5.1 stands:
679+
`env.go` is for tests and CI only, not production.)
680+
- `sidecar.go` — read/write of `keys.json` per the **§5.1
681+
crash-durable write protocol** (write tmp → `file.Sync()`
682+
`os.Rename``dir.Sync()` → ack). Skipping any of these
683+
fsyncs is a data-loss-class bug per §10 lens 1; do not
684+
shortcut to a plain `os.Rename`.
649685

650686
### 6.2 Hooks into the storage layer
651687

@@ -666,11 +702,23 @@ coordinator does not pre-encrypt; the FSM does not bypass this path.
666702
(`[tombstone(1)] [expireAt(8)] [user_value...]`, see
667703
`encodeValue`) is extended so the leading byte encodes both the
668704
tombstone bit (bit 0) and the encryption-state bits (bits 1–2),
669-
matching the 2-bit field in §7.1. Existing builds use only bit 0,
670-
so an old build reading a value written by an encryption-enabled
671-
build sees an unrecognised tombstone byte and refuses to serve it
672-
rather than returning ciphertext as cleartext. `valueHeaderSize`
673-
stays at 9 bytes; no on-disk layout shift.
705+
matching the 2-bit field in §7.1. `valueHeaderSize` stays at 9
706+
bytes; no on-disk layout shift. Note: today's `decodeValue`
707+
reads `tombstone := data[0] != 0`, so an encrypted live value
708+
(`data[0] = 0b00000010`) seen by an unmodified pre-encryption
709+
build would silently appear as a tombstone (key-not-found),
710+
not an error. The 3-phase rollout in §7.1 prevents this from
711+
ever happening on a real cluster — encrypted writes only begin
712+
after every replica has applied the
713+
`enable-storage-envelope` cluster-flag entry, which by
714+
precondition requires every voting member to be on a binary
715+
that knows the new bit layout. The §9.1 startup refusal "data
716+
dir contains a sidecar but `--encryption-enabled` is not set"
717+
is the second backstop. The implementation PR must add a
718+
comment to `encodeValue` / `decodeValue` pinning bit 0 =
719+
tombstone, bits 1–2 = `encryption_state`, bits 3–7 reserved,
720+
so future bit additions don't accidentally clobber the
721+
layout.
674722
- `store/snapshot_pebble.go` — no change. It iterates Pebble values
675723
byte-for-byte, which are already ciphertext (with the
676724
`encryption_state` bit travelling in the value header).
@@ -859,13 +907,29 @@ flag becomes a *capability* assertion, not a *behaviour* trigger.
859907
cleartext Raft entries (raft envelope is gated on a separate
860908
flag in Phase 2).
861909
2. Each upgraded node advertises `encryption_capable = true` in
862-
its periodic Raft heartbeat metadata (a new field on the
863-
peer-metadata payload added by `peer_metadata.go`). The flag
864-
is purely informational at this point.
865-
3. When the membership snapshot of every Raft group shows
866-
`encryption_capable = true` for every voting member, Phase 0
867-
is complete. `encryption status` reports the capability
868-
coverage so the operator can wait for it before moving on.
910+
the **peer-metadata extension** that today already piggy-backs
911+
on the etcd-raft heartbeat path
912+
(`internal/raftengine/etcd/peer_metadata.go`). A new
913+
`encryption_capable bool` field is added to the existing
914+
`PeerMetadata` proto; old binaries omit the field, which
915+
protobuf decodes as the zero value (`false`), so unscanned
916+
non-upgraded peers default to `not capable` without any
917+
special handling. The flag is purely informational at this
918+
point.
919+
3. The Phase-1 and Phase-2 cutover commands
920+
(`enable-storage-envelope`, `enable-raft-envelope`) **do not**
921+
read cached `encryption_capable` state from the local
922+
keystore. Instead they force a fresh poll: the leader sends a
923+
heartbeat to every voting member of every Raft group and
924+
blocks until it has received a heartbeat response carrying
925+
`encryption_capable = true` from every member, all within one
926+
election timeout (configurable; default the existing
927+
`RaftElectionTimeout`). If any member is unreachable or
928+
reports `false` within the window, the command returns
929+
`ErrCapabilityCheckFailed` with the offending node IDs and
930+
the operator must re-run after fixing them. Stale cached
931+
capability state therefore cannot trigger a premature
932+
cutover.
869933

870934
**Phase 1 — Storage envelope cluster-flag flip.**
871935

@@ -903,14 +967,33 @@ flag becomes a *capability* assertion, not a *behaviour* trigger.
903967
Replicas dispatch on the **Raft log index** of the entry
904968
relative to the persisted `raft_envelope_cutover_index` (the
905969
apply index of the flag entry, recorded in the local sidecar
906-
on apply). Entries below the cutover index are decoded via
907-
the existing first-byte tag space (`0x00` single-request,
908-
`0x01` batch, `0x02` HLC-lease, etc.); entries at-or-above
909-
the cutover index are unwrapped through the raft envelope
910-
first. This explicitly avoids re-using `kv/fsm.go`'s existing
911-
first-byte tag values for the "is this an envelope?"
912-
discriminator — that re-use would make pre-cutover batch
913-
entries (`0x01`) ambiguous during WAL replay.
970+
on apply). Concretely, the FSM apply path becomes:
971+
972+
```text
973+
Apply(index, data):
974+
if index >= raft_envelope_cutover_index:
975+
data = raftDEK.Unwrap(data) // strip §4.2 envelope
976+
// existing dispatch on data[0], unchanged from today:
977+
switch data[0] {
978+
case raftEncodeSingle (0x00): decodeRaftRequests(data)
979+
case raftEncodeBatch (0x01): decodeRaftRequests(data)
980+
case raftEncodeHLCLease (0x02): applyHLCLease(data)
981+
}
982+
```
983+
984+
The unwrap happens **before** the existing
985+
`kv/fsm.go::Apply` dispatch on `raftEncodeSingle` /
986+
`raftEncodeBatch` / `raftEncodeHLCLease`, so HLC ceiling
987+
proposals (§4.5) are transparently decrypted then routed
988+
through the existing HLC path with no special-casing in the
989+
HLC handler. Crucially, this dispatch scheme **does not
990+
reuse any of the existing `kv/fsm.go` first-byte tag
991+
values** for the encryption discriminator — an earlier
992+
draft proposed `0x00`/`0x01` framing tags but those collide
993+
with `raftEncodeSingle` / `raftEncodeBatch` and would make
994+
every pre-cutover batch entry (`0x01`) look like a raft
995+
envelope to the FSM. Index gating sidesteps that collision
996+
by construction.
914997
9. Snapshots taken during Phase 2 carry the cutover index in
915998
their metadata header so a fresh follower joining
916999
mid-Phase-2 reconstructs the same dispatch boundary on
@@ -1058,6 +1141,16 @@ The process refuses to start if any of the following hold:
10581141
- The encryption package's startup membership check fails to
10591142
resolve a stable 16-bit `node_id` (the §4.1 nonce construction
10601143
needs one; without it nonce uniqueness cannot be guaranteed).
1144+
- Two voting members of any Raft group share the same
1145+
`uint16(DeriveNodeID(raftID))` (`ErrNodeIDCollision`). One of
1146+
them must be re-bootstrapped under a different `--raftId` whose
1147+
hash truncates differently before encryption can be enabled on
1148+
the cluster.
1149+
- Any active DEK in the sidecar has `local_epoch == 0xFFFF`
1150+
(`ErrLocalEpochExhausted`). The DEK must be rotated (§5.2)
1151+
before the process can resume; without rotation the next
1152+
`local_epoch` would wrap to 0 and re-issue a nonce that has
1153+
already been used under the same DEK.
10611154
- The local sidecar's `raft_envelope_cutover_index` disagrees
10621155
with the value carried in the most-recent ingested snapshot
10631156
header (`ErrEnvelopeCutoverDivergence`). This catches a node

0 commit comments

Comments
 (0)