Commit 2dda2cb
docs: propose data-at-rest encryption design (#707)
## Summary
- Add `docs/design/2026_04_29_proposed_data_at_rest_encryption.md`
(Status: Proposed) covering encryption-at-rest for elastickv.
- Threat model: protect against disk theft, decommissioned-drive
recovery, backup leak, and Raft WAL/snapshot leak. Without the
externally-held key material, the persisted state cannot be decrypted.
- Encryption boundary: per-value AES-256-GCM envelope at the storage
layer + the same envelope wrapping every Raft proposal `Data []byte`.
Keeps the same ciphertext flowing through Raft → WAL → Pebble SST → FSM
snapshot, so no surface holds cleartext values.
- Key hierarchy: external KEK (AWS KMS / GCP KMS / Vault / file) wraps
DEKs; the data dir only holds wrapped DEKs in `encryption/keys.json`.
DEK rotation is operator-driven via Raft so every replica observes the
new key at the same log index.
- Migration: rolling restart with an envelope-version byte (`0x00`
cleartext, `0x01` encrypted) plus a rate-limited rewrite job. Reverse
migration is intentionally unsupported; returning to cleartext requires
a dump-and-reload.
- Self-review per CLAUDE.md (data loss / concurrency / performance /
consistency / test coverage) included; Jepsen Redis + DynamoDB suites
against an encrypted 3-node cluster are the implementation acceptance
gate.
Follows the design-doc-first workflow in CLAUDE.md — implementation PRs
will land after review of this proposal.
## Test plan
- [ ] Doc-only change; no code or tests in this PR.
- [ ] Reviewer: confirm filename / header follow `docs/design/README.md`
conventions.
- [ ] Reviewer: confirm threat model scope (in/out) matches the
operational stance you want.
- [ ] Reviewer: weigh open question §11.5 (interaction with Lua commit
batching) before the implementation PR is opened.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
  * Added comprehensive design proposal for data‑at‑rest encryption:
    per‑value ciphertext everywhere, encrypted replication envelopes,
    external key management and DEK lifecycle, cluster-wide nonce/uniqueness
    and refusal conditions, crash‑durable key sidecar and startup
    validation, MVCC encryption state handling, multi‑phase rollout and
    admin commands, snapshot/joiner semantics, observability constraints,
    performance/Jepsen gates, and a full test plan.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

1 file changed: 2407 additions & 0 deletions