Skip to content

Commit ef6a58e

Browse files
authored
docs(encryption): Stage 7a proposed - process-start registration + coordinator first-write gate (#835)
## Summary Stage 6 is complete (6D-6c-3b merged). This is the **design doc for Stage 7a** — the process-start half of the deferred registration-before-first-write nonce gate (§4.1 "Process start under an existing DEK"). **Doc-only PR by design.** The coordinator first-write gate touches a hot, central path with genuine open questions, so I'm landing the design for review *before* implementing rather than bundling a speculative coordinator change. ## What 7a proposes - After `BumpLocalEpoch` advances the §4.1 `local_epoch` on a fresh load with an active storage envelope, register the bumped epoch in the writer registry **before this node originates any encrypted write**. - **Skip** the propose when the registry slot already records the current epoch (bootstrap cohort / no-op restart — avoids the `ErrLocalEpochRollback` self-brick). - **Defer** in Phase 0 (sidecar present, cutover not yet fired — registry keys don't exist cluster-wide). - A coordinator **first-write barrier** blocks self-originated encrypted writes (not reads, not cleartext writes) until the registration commits, bounded by the client `ctx`. ## Why registration must precede self-originated encrypted writes FSM apply runs `encryptForKey` → `nonceFactory.Next()` on **every** node, so each node encrypts with its own `node_id || local_epoch || write_count`. The registry is the cluster-wide ledger the §4.1 `ErrLocalEpochRollback` / `ErrNodeIDCollision` guards rely on; registering the bumped epoch keeps the ledger ahead of the nonces this load emits. ## Open questions for review (§3.2 / §6) 1. **Coordinator gate injection point** — `ShardedCoordinator.Dispatch` vs. a narrower seam; confirm reads / txn-reads stay ungated. 2. **Barrier representation** — `chan struct{}` closed on commit (ctx-aware `select`) vs. `atomic.Bool` + condvar. 3. **Sync vs async startup propose** — block startup until commit, vs. arm the barrier and let only the first encrypted write wait (leaning async so reads serve immediately). Reviewer input on #1 especially would shape the implementation PR. ## Scope - **In (7a):** startup registration-intent decision (defer/skip/propose), coordinator first-write barrier, follower→leader forwarding of the startup propose, unit tests. - **Out:** 7b (post-rotation re-registration), 7c (ConfChange-time registration). ## Test plan - [x] Doc-only; `golangci-lint run .` clean (no code changes) - [ ] Implementation + tests land in the follow-up 7a PR once the design direction is confirmed <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added design documentation for Stage 7a process-start operations and coordinator gating mechanisms. <!-- review_stack_entry_start --> [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/835?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
2 parents cedabd9 + 11595ca commit ef6a58e

1 file changed

Lines changed: 292 additions & 0 deletions

File tree

Lines changed: 292 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,292 @@
1+
# Stage 7a — process-start writer registration + coordinator first-write gate
2+
3+
| Field | Value |
4+
|---|---|
5+
| Status | proposed |
6+
| Date | 2026-05-26 |
7+
| Parent design | [`2026_04_29_partial_data_at_rest_encryption.md`](2026_04_29_partial_data_at_rest_encryption.md) (§4.1 writer registry, §5.2 process-start path) |
8+
| Builds on | 6A (`ApplyRegistration` §4.1 case 1–4 dispatch), 5B (`RegisterEncryptionWriter` RPC + leader Proposer wiring), 6C-3 (`ErrLocalEpochRollback`), 6D-6c-2 (`BumpLocalEpoch`, `DeterministicNonceFactory`, keystore hydration) |
9+
| Slice of | Stage 7 (writer registry + deterministic nonce). 7b (post-rotation re-registration) and 7c (ConfChange-time registration) follow. |
10+
11+
## 0. Why this slice exists
12+
13+
6D-6c-2 wired the production storage write path: on a fresh process
14+
load with an active storage DEK, `BumpLocalEpoch` advances and fsyncs
15+
the §4.1 `local_epoch`, and the `DeterministicNonceFactory` is pinned
16+
to that bumped epoch. But the **writer registry** — the Raft-replicated
17+
record that makes the 16-bit `node_id` uniqueness invariant
18+
unconditional (§4.1) — is not updated on restart. The deferred gap,
19+
called out in the 6D-6c-2 partial doc and the Stage-6D parent doc:
20+
21+
> the multi-node-churn registration-before-first-write nonce gate is a
22+
> Stage 7 item.
23+
24+
7a closes the **process-start** half of that gap (§4.1 "Process start
25+
under an existing DEK"): a node that restarts under an already-active
26+
storage envelope must register its bumped `local_epoch` in the writer
27+
registry **before it originates any encrypted write**, and must skip
28+
the registration when its registry slot already records the current
29+
epoch (the bootstrap cohort / no-op-restart case). 7b and 7c cover the
30+
post-rotation and ConfChange-time paths.
31+
32+
## 1. The §4.1 / §5.2 contract (process-start path)
33+
34+
Restating the authoritative parent-doc contract that 7a implements:
35+
36+
1. The trigger fires **only** on a fresh process load whose
37+
`BumpLocalEpoch` persisted a value **strictly greater** than the
38+
registry's current `last_seen_local_epoch` for this
39+
`(full_node_id, active_storage_dek_id)`.
40+
2. In that case the node proposes
41+
`RegisterEncryptionWriter(dek_id, full_node_id, local_epoch)`
42+
through the default group's leader **before the coordinator accepts
43+
any client write that would land as an encrypted entry**.
44+
3. It does **NOT** fire on the bootstrap cohort's first observation of
45+
the `enable-storage-envelope` cutover — those nodes were already
46+
registered by the §5.6 bootstrap batch at the same `local_epoch`, so
47+
a duplicate registration with an unchanged epoch would be rejected
48+
by §4.1 case 3 (`ErrLocalEpochRollback`) and brick a healthy
49+
rollout. The implementation checks the registry on coordinator
50+
startup and **skips the propose** if its slot already contains the
51+
current `local_epoch` (§4.1 case 2-idempotent territory).
52+
4. A node booting in **Phase 0** (sidecar present, but
53+
`StorageEnvelopeActive == false`) defers registration until both the
54+
bootstrap and the cutover entries commit — the registry keys do not
55+
yet exist cluster-wide. FSM apply may still run on leader-proposed
56+
entries during this window (decrypted with the sidecar DEKs the node
57+
already holds); only **self-originated** encrypted writes are gated.
58+
59+
## 2. Why registration must precede self-originated encrypted writes
60+
61+
FSM apply runs `kvFSM.Apply → store.ApplyMutationsRaft → encryptForKey
62+
→ nonceFactory.Next()` on **every** node, so each node encrypts the
63+
committed value with **its own** `node_id ‖ local_epoch ‖ write_count`
64+
nonce. The registry is the cluster-wide ledger of which `node_id` has
65+
written under which DEK at which epoch; the §4.1 `ErrLocalEpochRollback`
66+
/ `ErrNodeIDCollision` guards rely on it. Registering the bumped epoch
67+
before this node originates writes under the active DEK keeps the
68+
ledger ahead of the nonces this load will emit, so a future
69+
stale-disk restore (epoch ≤ registry) is detected rather than silently
70+
recycling nonces.
71+
72+
## 3. Proposed implementation
73+
74+
### 3.1 Registration decision (startup, post-`buildShardGroups`)
75+
76+
The bumped epoch comes from 6D-6c-2's `prepareStorageNonceEpoch` (run
77+
inside `buildEncryptionWriteWiring`, before `buildShardGroups`). The
78+
**registration-intent decision**, however, **cannot** live there:
79+
80+
> **Timing constraint (review finding #1).** The intent decision reads
81+
> the writer registry, which is backed by the default group's Pebble
82+
> `MVCCStore`. That store is not open until `buildShardGroups` returns.
83+
> So the intent decision is a **post-`buildShardGroups` step**, wired
84+
> into the same `run()` phase as `chainEncryptionStartupGuard` (which
85+
> already runs after the engines/stores open), **not** into
86+
> `buildEncryptionWriteWiring`. The epoch bump (write-path nonce
87+
> safety) stays pre-`buildShardGroups`; the registry intent (control
88+
> plane) runs after.
89+
90+
**Ordering assumption (review #1).** This intent logic runs *after*
91+
the §9.1 `CheckLocalEpochRollback` startup guard has passed (it runs in
92+
the same post-engine phase and halts the process on
93+
`ErrLocalEpochRollback`). Consequently `epoch < registry.last_seen`
94+
(rollback) and the no-row-post-cutover case are **unreachable** here —
95+
the guard already rejected them. A **missing** registry row is treated
96+
as `last_seen = 0`, which falls into the `< epoch` "propose" branch
97+
(the legitimate freshly-joined writer whose row the guard tolerated
98+
pre-cutover, or the first registration under a new DEK). The intent
99+
therefore only ever sees `==` (skip) or `<` (propose).
100+
101+
Intent:
102+
103+
```text
104+
if !StorageEnvelopeActive -> defer (Phase 0; §1.4)
105+
if registry.last_seen(node,DEK) == epoch -> skip (already registered; §1.3)
106+
if registry.last_seen(node,DEK) < epoch -> propose RegisterEncryptionWriter
107+
(the §1.1/1.2 trigger;
108+
missing row == last_seen 0)
109+
```
110+
111+
The registry read is local (Pebble, no Raft round-trip — the registry
112+
is replicated state already present on this node post-bootstrap),
113+
reusing the same `RegistryKey(dek_id, NodeID16(full_node_id))` lookup
114+
`ApplyRegistration` / `GuardLocalEpochRollback` use.
115+
116+
**Registry handle access (review finding #2).** `buildShardGroups`
117+
already constructs a `WriterRegistryStore` per shard via
118+
`store.WriterRegistryFor(st)` and hands it to each `Applier`. 7a
119+
threads the **default group's** `WriterRegistryStore` out of
120+
`buildShardGroups` (returned alongside the runtimes, keyed by
121+
`cfg.defaultGroup`) so the post-`buildShardGroups` intent step reads
122+
the same handle the FSM's `ApplyRegistration` mutates — rather than
123+
adding a new accessor that reaches into the runtime/applier internals.
124+
This keeps the registry's single owner (the default-group store) and
125+
avoids a second `WriterRegistryFor` over the same Pebble dir.
126+
127+
### 3.2 Coordinator first-write gate
128+
129+
The coordinator gains a **registration barrier**: a one-shot readiness
130+
signal that blocks the first client write which would land encrypted
131+
until the pending `RegisterEncryptionWriter` proposal has committed.
132+
133+
**Barrier representation — `chan struct{}`, three states (review #5).**
134+
The barrier field on `ShardedCoordinator` is a `chan struct{}` with an
135+
explicit three-state semantic:
136+
137+
| State | Meaning | Write behavior |
138+
|---|---|---|
139+
| `nil` | no pending registration (Phase 0, skip case, or encryption off) | ungated |
140+
| open (non-nil, not closed) | registration proposed, not yet committed | mutating encrypted writes block on it |
141+
| closed | registration committed | ungated (fast path: non-blocking `select` branch) |
142+
143+
The channel MUST be created (open) **before** the barrier reference is
144+
visible to any write-dispatch goroutine — i.e., armed in the
145+
post-`buildShardGroups` phase (§3.1) before gRPC starts serving. A
146+
`nil` channel blocks forever on receive (would deadlock rather than
147+
ctx-cancel), so `nil` is reserved exclusively for the never-armed
148+
ungated case and the dispatch path checks `barrier == nil` first.
149+
150+
**Gate injection — `ShardedCoordinator.Dispatch` (review #4).** The
151+
reads (`LinearizableRead`, `LeaseRead`, `VerifyLeader`) are separate
152+
`Coordinator` methods that never reach `Dispatch`, so they are
153+
structurally ungated. Inside `Dispatch`, the gate applies only to
154+
mutating encrypted writes. Go has no `closed()` builtin, so the
155+
open-vs-closed distinction is a non-blocking `select` after the `nil`
156+
check (review #2):
157+
158+
```go
159+
if barrier != nil {
160+
select {
161+
case <-barrier:
162+
// closed → registration committed → pass through (fast path)
163+
default:
164+
// open → registration pending; gate only mutating encrypted writes
165+
if cache.StorageEnvelopeActive() && cache.activeKeyID != 0 && hasMutatingElems(reqs) {
166+
select {
167+
case <-barrier: // registration committed while we waited
168+
case <-ctx.Done():
169+
return ctx.Err() // fail-closed: never emit an unregistered nonce
170+
}
171+
}
172+
}
173+
}
174+
```
175+
176+
`hasMutatingElems` is required (not just `IsTxn`): a read-only
177+
transaction carrying only `ReadKeys` (no PUT/DEL elements) must stay
178+
ungated even though it flows through `dispatchTxn`. Fail-closed: a
179+
write that cannot register within its `ctx` returns the ctx error
180+
rather than emitting an unregistered nonce.
181+
182+
**Multi-shard note (review #6).** The barrier is node-global; the
183+
registration targets the default group. So in a multi-shard
184+
deployment a write to group 2 blocks until the group-1 registration
185+
Raft round-trip commits. This is correct — nonce uniqueness is
186+
DEK-wide, not shard-scoped — and only the first encrypted write pays
187+
the wait.
188+
189+
**Arming + async commit (review #3).** §3.1's "propose" branch arms
190+
the barrier (creates the open channel) and starts a goroutine that
191+
proposes `RegisterEncryptionWriter` and waits for its apply (the same
192+
applied-index wait the admin RPC uses), then `close()`s the barrier.
193+
That goroutine MUST `select` on **both** the apply signal **and** the
194+
process run-context: on run-ctx cancellation (clean shutdown before
195+
commit) it returns **without** closing the barrier, leaving any
196+
in-flight encrypted writes blocked until they hit their own ctx
197+
deadline / the gRPC server drains — never closing the barrier on an
198+
uncommitted registration. No goroutine leak: it always exits on one of
199+
the two signals.
200+
201+
**Transient propose failures (review #3, follow-on).** A propose that
202+
*starts* but fails mid-flight — leader change, proposal timeout,
203+
`ErrProposalDropped` — is retried inside the goroutine with bounded
204+
backoff, against the same run-context (so retries stop on shutdown).
205+
The barrier stays open across retries, so encrypted writes keep
206+
waiting (fail-closed) rather than proceeding unregistered. The retry
207+
loop only exits by (a) a committed registration → `close()` the
208+
barrier, or (b) run-ctx cancellation → return without closing. There
209+
is no terminal "give up and close" path: closing the barrier without a
210+
committed registration would let this node emit unregistered nonces,
211+
which is the exact hazard 7a exists to prevent.
212+
213+
**Snapshot/compaction interaction (review #8).** If the registration
214+
entry is compacted into a snapshot before the goroutine observes its
215+
index, the engine's `AppliedIndex()` is already ≥ the snapshot index
216+
and the snapshot application updated the registry row, so the
217+
apply-wait completes correctly — same reasoning as the 6C-2b scanner.
218+
219+
### 3.3 Leader routing for the propose
220+
221+
`RegisterEncryptionWriter` proposes a §11.3 0x03 entry through the
222+
default group's leader (the existing 5B Proposer path). On a follower,
223+
the startup propose forwards to the current leader (same path the admin
224+
RPC uses). If no leader is available yet at startup, the barrier stays
225+
**open** (pending — the channel is non-nil and not yet closed, so
226+
mutating encrypted writes block per the §3.2 table; "closed" is
227+
reserved for the committed/ungated state) and the first encrypted
228+
write blocks until leadership settles — bounded by the client's `ctx`
229+
deadline (fail-closed: a write that cannot register does not get to
230+
emit an unregistered nonce).
231+
232+
## 4. Scope
233+
234+
### In scope (7a)
235+
- Startup registration-intent decision (defer / skip / propose).
236+
- Coordinator first-write barrier gating self-originated encrypted
237+
writes on registration commit.
238+
- Follower→leader forwarding of the startup registration propose.
239+
- Unit tests: the three intent branches; barrier blocks-then-releases;
240+
reads + non-encrypted writes never gated; Phase-0 defer; ctx-cancel
241+
while blocked.
242+
243+
### Out of scope
244+
- **7b** — post-rotation re-registration (register against the new DEK
245+
on a `rotate-dek` apply, same barrier mechanism).
246+
- **7c** — ConfChange-time registration (leader pairs a
247+
`RegisterEncryptionWriter` with a learner/voter add before the
248+
conf-change commits; `ErrPeerNotEncryptionRegistered`).
249+
- Snapshot of registry rows across compaction (covered by the §5.5
250+
resync path already shipped).
251+
252+
## 5. Self-review checklist (for the implementation PR)
253+
254+
- **Data loss / consistency** — registration is additive (a 0x03 entry
255+
+ a registry-row insert/advance); no user data path changes. The
256+
gate only *delays* encrypted writes, never drops them.
257+
- **Concurrency** — the barrier is read on every encrypted write; once
258+
closed the fast path is a non-blocking `select` branch on the closed
259+
channel (O(1), no extra synchronization), and ctx-aware while
260+
pending. No deadlock if leadership never settles — ctx bound. The
261+
`nil` (never-armed) case is checked first so it is never received on.
262+
- **Performance** — fast path after first write is a non-blocking
263+
`select` on the closed channel (O(1)); the registry read + propose
264+
happen once per process load.
265+
- **Data consistency** — registry monotonicity (§4.1 case 2/3)
266+
preserved; skip-at-equal-epoch avoids the `ErrLocalEpochRollback`
267+
self-brick.
268+
- **Test coverage** — intent branches, barrier blocking/release,
269+
ungated reads, Phase-0 defer, ctx-cancel.
270+
271+
## 6. Resolved design decisions (PR #835 round 1)
272+
273+
All three original open questions were settled by the round-1 review:
274+
275+
1. **Gate injection point**`ShardedCoordinator.Dispatch`, the single
276+
adapter entry point. Reads (`LinearizableRead` / `LeaseRead` /
277+
`VerifyLeader`) are separate methods, structurally ungated. Gate
278+
condition gated on `hasMutatingElems(reqs)` (PUT/DEL/DEL_PREFIX), so
279+
read-only transactions stay ungated (§3.2).
280+
2. **Barrier representation**`chan struct{}` with the explicit
281+
nil / open / closed three-state semantic (§3.2). Idiomatic,
282+
ctx-aware, no extra synchronization primitives.
283+
3. **Sync vs async propose** → asynchronous: arm the barrier and let
284+
only the first encrypted write wait, so reads serve immediately. The
285+
propose goroutine selects on both the apply signal and the process
286+
run-context to avoid a leak / a barrier-close on an uncommitted
287+
registration (§3.2, review #3).
288+
289+
Remaining items to resolve **within the implementation PR** (not
290+
deferred further): the exact `hasMutatingElems` predicate over the
291+
`OperationGroup` shape, and the default-group `WriterRegistryStore`
292+
threading out of `buildShardGroups` (§3.1).

0 commit comments

Comments
 (0)