|
| 1 | +# Stage 7a — process-start writer registration + coordinator first-write gate |
| 2 | + |
| 3 | +| Field | Value | |
| 4 | +|---|---| |
| 5 | +| Status | proposed | |
| 6 | +| Date | 2026-05-26 | |
| 7 | +| Parent design | [`2026_04_29_partial_data_at_rest_encryption.md`](2026_04_29_partial_data_at_rest_encryption.md) (§4.1 writer registry, §5.2 process-start path) | |
| 8 | +| Builds on | 6A (`ApplyRegistration` §4.1 case 1–4 dispatch), 5B (`RegisterEncryptionWriter` RPC + leader Proposer wiring), 6C-3 (`ErrLocalEpochRollback`), 6D-6c-2 (`BumpLocalEpoch`, `DeterministicNonceFactory`, keystore hydration) | |
| 9 | +| Slice of | Stage 7 (writer registry + deterministic nonce). 7b (post-rotation re-registration) and 7c (ConfChange-time registration) follow. | |
| 10 | + |
| 11 | +## 0. Why this slice exists |
| 12 | + |
| 13 | +6D-6c-2 wired the production storage write path: on a fresh process |
| 14 | +load with an active storage DEK, `BumpLocalEpoch` advances and fsyncs |
| 15 | +the §4.1 `local_epoch`, and the `DeterministicNonceFactory` is pinned |
| 16 | +to that bumped epoch. But the **writer registry** — the Raft-replicated |
| 17 | +record that makes the 16-bit `node_id` uniqueness invariant |
| 18 | +unconditional (§4.1) — is not updated on restart. The deferred gap, |
| 19 | +called out in the 6D-6c-2 partial doc and the Stage-6D parent doc: |
| 20 | + |
| 21 | +> the multi-node-churn registration-before-first-write nonce gate is a |
| 22 | +> Stage 7 item. |
| 23 | +
|
| 24 | +7a closes the **process-start** half of that gap (§4.1 "Process start |
| 25 | +under an existing DEK"): a node that restarts under an already-active |
| 26 | +storage envelope must register its bumped `local_epoch` in the writer |
| 27 | +registry **before it originates any encrypted write**, and must skip |
| 28 | +the registration when its registry slot already records the current |
| 29 | +epoch (the bootstrap cohort / no-op-restart case). 7b and 7c cover the |
| 30 | +post-rotation and ConfChange-time paths. |
| 31 | + |
| 32 | +## 1. The §4.1 / §5.2 contract (process-start path) |
| 33 | + |
| 34 | +Restating the authoritative parent-doc contract that 7a implements: |
| 35 | + |
| 36 | +1. The trigger fires **only** on a fresh process load whose |
| 37 | + `BumpLocalEpoch` persisted a value **strictly greater** than the |
| 38 | + registry's current `last_seen_local_epoch` for this |
| 39 | + `(full_node_id, active_storage_dek_id)`. |
| 40 | +2. In that case the node proposes |
| 41 | + `RegisterEncryptionWriter(dek_id, full_node_id, local_epoch)` |
| 42 | + through the default group's leader **before the coordinator accepts |
| 43 | + any client write that would land as an encrypted entry**. |
| 44 | +3. It does **NOT** fire on the bootstrap cohort's first observation of |
| 45 | + the `enable-storage-envelope` cutover — those nodes were already |
| 46 | + registered by the §5.6 bootstrap batch at the same `local_epoch`, so |
| 47 | + a duplicate registration with an unchanged epoch would be rejected |
| 48 | + by §4.1 case 3 (`ErrLocalEpochRollback`) and brick a healthy |
| 49 | + rollout. The implementation checks the registry on coordinator |
| 50 | + startup and **skips the propose** if its slot already contains the |
| 51 | + current `local_epoch` (§4.1 case 2-idempotent territory). |
| 52 | +4. A node booting in **Phase 0** (sidecar present, but |
| 53 | + `StorageEnvelopeActive == false`) defers registration until both the |
| 54 | + bootstrap and the cutover entries commit — the registry keys do not |
| 55 | + yet exist cluster-wide. FSM apply may still run on leader-proposed |
| 56 | + entries during this window (decrypted with the sidecar DEKs the node |
| 57 | + already holds); only **self-originated** encrypted writes are gated. |
| 58 | + |
| 59 | +## 2. Why registration must precede self-originated encrypted writes |
| 60 | + |
| 61 | +FSM apply runs `kvFSM.Apply → store.ApplyMutationsRaft → encryptForKey |
| 62 | +→ nonceFactory.Next()` on **every** node, so each node encrypts the |
| 63 | +committed value with **its own** `node_id ‖ local_epoch ‖ write_count` |
| 64 | +nonce. The registry is the cluster-wide ledger of which `node_id` has |
| 65 | +written under which DEK at which epoch; the §4.1 `ErrLocalEpochRollback` |
| 66 | +/ `ErrNodeIDCollision` guards rely on it. Registering the bumped epoch |
| 67 | +before this node originates writes under the active DEK keeps the |
| 68 | +ledger ahead of the nonces this load will emit, so a future |
| 69 | +stale-disk restore (epoch ≤ registry) is detected rather than silently |
| 70 | +recycling nonces. |
| 71 | + |
| 72 | +## 3. Proposed implementation |
| 73 | + |
| 74 | +### 3.1 Registration decision (startup, post-`buildShardGroups`) |
| 75 | + |
| 76 | +The bumped epoch comes from 6D-6c-2's `prepareStorageNonceEpoch` (run |
| 77 | +inside `buildEncryptionWriteWiring`, before `buildShardGroups`). The |
| 78 | +**registration-intent decision**, however, **cannot** live there: |
| 79 | + |
| 80 | +> **Timing constraint (review finding #1).** The intent decision reads |
| 81 | +> the writer registry, which is backed by the default group's Pebble |
| 82 | +> `MVCCStore`. That store is not open until `buildShardGroups` returns. |
| 83 | +> So the intent decision is a **post-`buildShardGroups` step**, wired |
| 84 | +> into the same `run()` phase as `chainEncryptionStartupGuard` (which |
| 85 | +> already runs after the engines/stores open), **not** into |
| 86 | +> `buildEncryptionWriteWiring`. The epoch bump (write-path nonce |
| 87 | +> safety) stays pre-`buildShardGroups`; the registry intent (control |
| 88 | +> plane) runs after. |
| 89 | +
|
| 90 | +**Ordering assumption (review #1).** This intent logic runs *after* |
| 91 | +the §9.1 `CheckLocalEpochRollback` startup guard has passed (it runs in |
| 92 | +the same post-engine phase and halts the process on |
| 93 | +`ErrLocalEpochRollback`). Consequently `epoch < registry.last_seen` |
| 94 | +(rollback) and the no-row-post-cutover case are **unreachable** here — |
| 95 | +the guard already rejected them. A **missing** registry row is treated |
| 96 | +as `last_seen = 0`, which falls into the `< epoch` "propose" branch |
| 97 | +(the legitimate freshly-joined writer whose row the guard tolerated |
| 98 | +pre-cutover, or the first registration under a new DEK). The intent |
| 99 | +therefore only ever sees `==` (skip) or `<` (propose). |
| 100 | + |
| 101 | +Intent: |
| 102 | + |
| 103 | +```text |
| 104 | +if !StorageEnvelopeActive -> defer (Phase 0; §1.4) |
| 105 | +if registry.last_seen(node,DEK) == epoch -> skip (already registered; §1.3) |
| 106 | +if registry.last_seen(node,DEK) < epoch -> propose RegisterEncryptionWriter |
| 107 | + (the §1.1/1.2 trigger; |
| 108 | + missing row == last_seen 0) |
| 109 | +``` |
| 110 | + |
| 111 | +The registry read is local (Pebble, no Raft round-trip — the registry |
| 112 | +is replicated state already present on this node post-bootstrap), |
| 113 | +reusing the same `RegistryKey(dek_id, NodeID16(full_node_id))` lookup |
| 114 | +`ApplyRegistration` / `GuardLocalEpochRollback` use. |
| 115 | + |
| 116 | +**Registry handle access (review finding #2).** `buildShardGroups` |
| 117 | +already constructs a `WriterRegistryStore` per shard via |
| 118 | +`store.WriterRegistryFor(st)` and hands it to each `Applier`. 7a |
| 119 | +threads the **default group's** `WriterRegistryStore` out of |
| 120 | +`buildShardGroups` (returned alongside the runtimes, keyed by |
| 121 | +`cfg.defaultGroup`) so the post-`buildShardGroups` intent step reads |
| 122 | +the same handle the FSM's `ApplyRegistration` mutates — rather than |
| 123 | +adding a new accessor that reaches into the runtime/applier internals. |
| 124 | +This keeps the registry's single owner (the default-group store) and |
| 125 | +avoids a second `WriterRegistryFor` over the same Pebble dir. |
| 126 | + |
| 127 | +### 3.2 Coordinator first-write gate |
| 128 | + |
| 129 | +The coordinator gains a **registration barrier**: a one-shot readiness |
| 130 | +signal that blocks the first client write which would land encrypted |
| 131 | +until the pending `RegisterEncryptionWriter` proposal has committed. |
| 132 | + |
| 133 | +**Barrier representation — `chan struct{}`, three states (review #5).** |
| 134 | +The barrier field on `ShardedCoordinator` is a `chan struct{}` with an |
| 135 | +explicit three-state semantic: |
| 136 | + |
| 137 | +| State | Meaning | Write behavior | |
| 138 | +|---|---|---| |
| 139 | +| `nil` | no pending registration (Phase 0, skip case, or encryption off) | ungated | |
| 140 | +| open (non-nil, not closed) | registration proposed, not yet committed | mutating encrypted writes block on it | |
| 141 | +| closed | registration committed | ungated (fast path: non-blocking `select` branch) | |
| 142 | + |
| 143 | +The channel MUST be created (open) **before** the barrier reference is |
| 144 | +visible to any write-dispatch goroutine — i.e., armed in the |
| 145 | +post-`buildShardGroups` phase (§3.1) before gRPC starts serving. A |
| 146 | +`nil` channel blocks forever on receive (would deadlock rather than |
| 147 | +ctx-cancel), so `nil` is reserved exclusively for the never-armed |
| 148 | +ungated case and the dispatch path checks `barrier == nil` first. |
| 149 | + |
| 150 | +**Gate injection — `ShardedCoordinator.Dispatch` (review #4).** The |
| 151 | +reads (`LinearizableRead`, `LeaseRead`, `VerifyLeader`) are separate |
| 152 | +`Coordinator` methods that never reach `Dispatch`, so they are |
| 153 | +structurally ungated. Inside `Dispatch`, the gate applies only to |
| 154 | +mutating encrypted writes. Go has no `closed()` builtin, so the |
| 155 | +open-vs-closed distinction is a non-blocking `select` after the `nil` |
| 156 | +check (review #2): |
| 157 | + |
| 158 | +```go |
| 159 | +if barrier != nil { |
| 160 | + select { |
| 161 | + case <-barrier: |
| 162 | + // closed → registration committed → pass through (fast path) |
| 163 | + default: |
| 164 | + // open → registration pending; gate only mutating encrypted writes |
| 165 | + if cache.StorageEnvelopeActive() && cache.activeKeyID != 0 && hasMutatingElems(reqs) { |
| 166 | + select { |
| 167 | + case <-barrier: // registration committed while we waited |
| 168 | + case <-ctx.Done(): |
| 169 | + return ctx.Err() // fail-closed: never emit an unregistered nonce |
| 170 | + } |
| 171 | + } |
| 172 | + } |
| 173 | +} |
| 174 | +``` |
| 175 | + |
| 176 | +`hasMutatingElems` is required (not just `IsTxn`): a read-only |
| 177 | +transaction carrying only `ReadKeys` (no PUT/DEL elements) must stay |
| 178 | +ungated even though it flows through `dispatchTxn`. Fail-closed: a |
| 179 | +write that cannot register within its `ctx` returns the ctx error |
| 180 | +rather than emitting an unregistered nonce. |
| 181 | + |
| 182 | +**Multi-shard note (review #6).** The barrier is node-global; the |
| 183 | +registration targets the default group. So in a multi-shard |
| 184 | +deployment a write to group 2 blocks until the group-1 registration |
| 185 | +Raft round-trip commits. This is correct — nonce uniqueness is |
| 186 | +DEK-wide, not shard-scoped — and only the first encrypted write pays |
| 187 | +the wait. |
| 188 | + |
| 189 | +**Arming + async commit (review #3).** §3.1's "propose" branch arms |
| 190 | +the barrier (creates the open channel) and starts a goroutine that |
| 191 | +proposes `RegisterEncryptionWriter` and waits for its apply (the same |
| 192 | +applied-index wait the admin RPC uses), then `close()`s the barrier. |
| 193 | +That goroutine MUST `select` on **both** the apply signal **and** the |
| 194 | +process run-context: on run-ctx cancellation (clean shutdown before |
| 195 | +commit) it returns **without** closing the barrier, leaving any |
| 196 | +in-flight encrypted writes blocked until they hit their own ctx |
| 197 | +deadline / the gRPC server drains — never closing the barrier on an |
| 198 | +uncommitted registration. No goroutine leak: it always exits on one of |
| 199 | +the two signals. |
| 200 | + |
| 201 | +**Transient propose failures (review #3, follow-on).** A propose that |
| 202 | +*starts* but fails mid-flight — leader change, proposal timeout, |
| 203 | +`ErrProposalDropped` — is retried inside the goroutine with bounded |
| 204 | +backoff, against the same run-context (so retries stop on shutdown). |
| 205 | +The barrier stays open across retries, so encrypted writes keep |
| 206 | +waiting (fail-closed) rather than proceeding unregistered. The retry |
| 207 | +loop only exits by (a) a committed registration → `close()` the |
| 208 | +barrier, or (b) run-ctx cancellation → return without closing. There |
| 209 | +is no terminal "give up and close" path: closing the barrier without a |
| 210 | +committed registration would let this node emit unregistered nonces, |
| 211 | +which is the exact hazard 7a exists to prevent. |
| 212 | + |
| 213 | +**Snapshot/compaction interaction (review #8).** If the registration |
| 214 | +entry is compacted into a snapshot before the goroutine observes its |
| 215 | +index, the engine's `AppliedIndex()` is already ≥ the snapshot index |
| 216 | +and the snapshot application updated the registry row, so the |
| 217 | +apply-wait completes correctly — same reasoning as the 6C-2b scanner. |
| 218 | + |
| 219 | +### 3.3 Leader routing for the propose |
| 220 | + |
| 221 | +`RegisterEncryptionWriter` proposes a §11.3 0x03 entry through the |
| 222 | +default group's leader (the existing 5B Proposer path). On a follower, |
| 223 | +the startup propose forwards to the current leader (same path the admin |
| 224 | +RPC uses). If no leader is available yet at startup, the barrier stays |
| 225 | +**open** (pending — the channel is non-nil and not yet closed, so |
| 226 | +mutating encrypted writes block per the §3.2 table; "closed" is |
| 227 | +reserved for the committed/ungated state) and the first encrypted |
| 228 | +write blocks until leadership settles — bounded by the client's `ctx` |
| 229 | +deadline (fail-closed: a write that cannot register does not get to |
| 230 | +emit an unregistered nonce). |
| 231 | + |
| 232 | +## 4. Scope |
| 233 | + |
| 234 | +### In scope (7a) |
| 235 | +- Startup registration-intent decision (defer / skip / propose). |
| 236 | +- Coordinator first-write barrier gating self-originated encrypted |
| 237 | + writes on registration commit. |
| 238 | +- Follower→leader forwarding of the startup registration propose. |
| 239 | +- Unit tests: the three intent branches; barrier blocks-then-releases; |
| 240 | + reads + non-encrypted writes never gated; Phase-0 defer; ctx-cancel |
| 241 | + while blocked. |
| 242 | + |
| 243 | +### Out of scope |
| 244 | +- **7b** — post-rotation re-registration (register against the new DEK |
| 245 | + on a `rotate-dek` apply, same barrier mechanism). |
| 246 | +- **7c** — ConfChange-time registration (leader pairs a |
| 247 | + `RegisterEncryptionWriter` with a learner/voter add before the |
| 248 | + conf-change commits; `ErrPeerNotEncryptionRegistered`). |
| 249 | +- Snapshot of registry rows across compaction (covered by the §5.5 |
| 250 | + resync path already shipped). |
| 251 | + |
| 252 | +## 5. Self-review checklist (for the implementation PR) |
| 253 | + |
| 254 | +- **Data loss / consistency** — registration is additive (a 0x03 entry |
| 255 | + + a registry-row insert/advance); no user data path changes. The |
| 256 | + gate only *delays* encrypted writes, never drops them. |
| 257 | +- **Concurrency** — the barrier is read on every encrypted write; once |
| 258 | + closed the fast path is a non-blocking `select` branch on the closed |
| 259 | + channel (O(1), no extra synchronization), and ctx-aware while |
| 260 | + pending. No deadlock if leadership never settles — ctx bound. The |
| 261 | + `nil` (never-armed) case is checked first so it is never received on. |
| 262 | +- **Performance** — fast path after first write is a non-blocking |
| 263 | + `select` on the closed channel (O(1)); the registry read + propose |
| 264 | + happen once per process load. |
| 265 | +- **Data consistency** — registry monotonicity (§4.1 case 2/3) |
| 266 | + preserved; skip-at-equal-epoch avoids the `ErrLocalEpochRollback` |
| 267 | + self-brick. |
| 268 | +- **Test coverage** — intent branches, barrier blocking/release, |
| 269 | + ungated reads, Phase-0 defer, ctx-cancel. |
| 270 | + |
| 271 | +## 6. Resolved design decisions (PR #835 round 1) |
| 272 | + |
| 273 | +All three original open questions were settled by the round-1 review: |
| 274 | + |
| 275 | +1. **Gate injection point** → `ShardedCoordinator.Dispatch`, the single |
| 276 | + adapter entry point. Reads (`LinearizableRead` / `LeaseRead` / |
| 277 | + `VerifyLeader`) are separate methods, structurally ungated. Gate |
| 278 | + condition gated on `hasMutatingElems(reqs)` (PUT/DEL/DEL_PREFIX), so |
| 279 | + read-only transactions stay ungated (§3.2). |
| 280 | +2. **Barrier representation** → `chan struct{}` with the explicit |
| 281 | + nil / open / closed three-state semantic (§3.2). Idiomatic, |
| 282 | + ctx-aware, no extra synchronization primitives. |
| 283 | +3. **Sync vs async propose** → asynchronous: arm the barrier and let |
| 284 | + only the first encrypted write wait, so reads serve immediately. The |
| 285 | + propose goroutine selects on both the apply signal and the process |
| 286 | + run-context to avoid a leak / a barrier-close on an uncommitted |
| 287 | + registration (§3.2, review #3). |
| 288 | + |
| 289 | +Remaining items to resolve **within the implementation PR** (not |
| 290 | +deferred further): the exact `hasMutatingElems` predicate over the |
| 291 | +`OperationGroup` shape, and the default-group `WriterRegistryStore` |
| 292 | +threading out of `buildShardGroups` (§3.1). |
0 commit comments