Skip to content

Commit dd0beb2

Browse files
committed
docs(encryption): PR893 round-1 — gemini HIGH (wrapOnPropose startup init) + 2 mediums
gemini HIGH on §2.1 (line 57) The original spec said wrapOnPropose is 'set by the cutover barrier and never reset within a process lifetime' but did not specify how the flag is initialized on startup. A node that restarts AFTER the cluster has cut over would default the flag to false, propose plaintext as a new leader, and the engine apply-hook (which uses the sidecar — NOT the in-process flag — as the source of truth) would attempt Unwrap on those plaintext entries, fail GCM, and halt apply cluster-wide. Fix: §2.1 now requires the flag to be initialized to true iff sidecar.RaftEnvelopeCutoverIndex != 0 at coordinator construction time. Two set-paths total (startup hydration, barrier step 5) and zero reset-paths within a process load. Added test TestCoordinatorWrap_StartupInitFromSidecar in §6.2 to pin the regression. gemini MEDIUM on §2.4 (line 143) applyNormalEntry signature in the proposed code snippet did not match the actual codebase. Verified at internal/raftengine/etcd/engine.go:2226 that the real signature is (entry raftpb.Entry) (any, error); setApplied and resolveProposal live in applyNormalCommitted (line 2173, 2186-2187) which calls applyNormalEntry. Updated the snippet + commentary to match: 6E-2 adds the unwrap shim inside applyNormalEntry without changing the caller contract; the error return is what makes applyNormalCommitted skip setApplied (the existing fail-closed shape). gemini MEDIUM on §5 header (line 268) Header said 'is unwrapped' but the body says 'is NOT unwrapped'. Header was a typo. Now reads 'NOT unwrapped'. Self-review (5-lens) on the doc change itself: 1. Data loss — startup-init rule directly prevents the halt-apply-cluster-wide scenario that the original spec left open. 2. Concurrency — the flag has two write paths (startup hydration on a single goroutine, barrier step 5 on a single goroutine), both serial; no race possible within the lifetime of one process load. 3. Performance — startup-init is a single sidecar read at coordinator construction (already happens for 6C-2); zero hot-path cost. 4. Data consistency — the engine apply-hook still reads the sidecar as the source of truth (per-replica deterministic); the in-process flag is the leader's write-side mirror, not a parallel decision oracle. 5. Test coverage — new test pinned; without it the gemini HIGH scenario would regress silently.
1 parent 5259d23 commit dd0beb2

1 file changed

Lines changed: 57 additions & 18 deletions

File tree

docs/design/2026_05_31_proposed_6e_enable_raft_envelope.md

Lines changed: 57 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -51,11 +51,29 @@ plaintext entry at `index > cutover`.
5151
wrap vs plaintext per entry.
5252
3. **Coordinator wrap-on-propose switch**
5353
`kv/coordinator.go` / `kv/sharded_coordinator.go` consult an
54-
in-process `wrapOnPropose` flag (set by the cutover barrier,
55-
reset never within a process lifetime) before calling
54+
in-process `wrapOnPropose` flag before calling
5655
`engine.Propose`. Phase-2 leaders wrap; Phase-0/1 leaders
5756
propose plaintext exactly as today.
5857

58+
**Startup initialization (load-bearing)**: the flag is
59+
`true` when the local sidecar's `RaftEnvelopeCutoverIndex
60+
!= 0`, otherwise `false`. The check runs once during
61+
coordinator construction (after the sidecar is hydrated
62+
per Stage 6C-2). Without this startup rule, a node that
63+
restarts after the cluster has already cut over would
64+
default to `false`, propose plaintext as a leader, and
65+
the engine apply-hook (which uses the sidecar — not the
66+
in-process flag — as the source of truth) would attempt
67+
`Unwrap` on those plaintext entries, fail GCM, and halt
68+
apply cluster-wide (gemini HIGH on PR #893).
69+
70+
Once set, the flag is never reset within a process
71+
lifetime. It is set by exactly two paths:
72+
- Startup: `sidecar.RaftEnvelopeCutoverIndex != 0`
73+
immediately `true` (the cluster cut over earlier).
74+
- Runtime: §7.1 step 5 of the cutover barrier (the
75+
cluster is cutting over right now on this leader).
76+
5977
### 2.2 The §7.1 6-step quiescence barrier
6078

6179
The barrier is the only place in the design that intentionally
@@ -111,12 +129,23 @@ notion of it. The unwrap therefore lives in
111129
`internal/raftengine/etcd/engine.go::applyNormalEntry`, **before**
112130
the FSM apply call. Concretely:
113131

132+
The actual codebase signature (verified against
133+
`internal/raftengine/etcd/engine.go:2226`):
134+
- `applyNormalEntry(entry) (any, error)` — returns the FSM
135+
response and any error. The caller
136+
(`applyNormalCommitted`, line 2173) does the `setApplied`
137+
and `resolveProposal` after this returns. On a non-nil
138+
error, `applyNormalCommitted` skips `setApplied` so the
139+
next restart replays the entry — the existing fail-closed
140+
shape. 6E-2 adds the unwrap shim inside
141+
`applyNormalEntry`; no caller signature change.
142+
114143
```go
115144
// internal/raftengine/etcd/engine.go::applyNormalEntry
116-
func (e *engine) applyNormalEntry(entry raftpb.Entry) error {
145+
func (e *Engine) applyNormalEntry(entry raftpb.Entry) (any, error) {
117146
id, maybeEncPayload, ok := decodeProposalEnvelope(entry.Data)
118147
if !ok {
119-
return nil // pre-envelope entry, leave intact
148+
return nil, nil // pre-envelope entry, leave intact
120149
}
121150
payload := maybeEncPayload
122151
// §6.3 hook: unwrap only at index strictly greater than
@@ -129,25 +158,25 @@ func (e *engine) applyNormalEntry(entry raftpb.Entry) error {
129158
payload, err = e.encryption.RaftDEK().Unwrap(maybeEncPayload)
130159
if err != nil {
131160
// GCM tag mismatch = sidecar/keystore divergence
132-
// or on-disk corruption. Halt apply WITHOUT
133-
// setApplied so the next restart replays this
134-
// entry under a corrected keystore. Silent skip
135-
// would diverge the FSM.
136-
return errors.Wrap(err, "raft envelope: unwrap")
137-
// engine treats this as ErrRaftUnwrapFailed → fatal
161+
// or on-disk corruption. Return the error so
162+
// applyNormalCommitted skips setApplied; the next
163+
// restart replays under a corrected keystore.
164+
// Silent skip would diverge the FSM.
165+
return nil, errors.Wrap(err, "raft envelope: unwrap")
166+
// applyNormalCommitted treats as ErrRaftUnwrapFailed → fatal
138167
}
139168
}
140-
response := e.fsm.Apply(payload)
141-
e.resolveProposal(entry.Index, entry.Data, response)
142-
return nil
169+
return e.fsm.Apply(payload), nil
143170
}
144171
```
145172

146173
**CRITICAL**: `decodeProposalEnvelope` must run FIRST so the
147-
proposal-ID handoff to `resolveProposal` stays intact. The raft
148-
envelope wraps the FSM payload *inside* the proposal envelope;
149-
wrapping `entry.Data` itself would clobber the proposal-envelope
150-
version byte and break every coordinator write (timeout forever).
174+
proposal-ID handoff that `applyNormalCommitted`'s
175+
`resolveProposal` call (line 2187) depends on stays intact.
176+
The raft envelope wraps the FSM payload *inside* the
177+
proposal envelope; wrapping `entry.Data` itself would
178+
clobber the proposal-envelope version byte and break every
179+
coordinator write (timeout forever).
151180

152181
## 3. Milestone breakdown
153182

@@ -265,7 +294,7 @@ applied index via the existing crash-durable
265294
field has been on disk since 6D shipped; only its
266295
load-bearingness changes per milestone.
267296

268-
## 5. Why the cutover entry is unwrapped at index == cutover
297+
## 5. Why the cutover entry is NOT unwrapped at index == cutover
269298

270299
§7.1 step 4 says: "wait for that entry to commit AND for the
271300
local FSM apply to set `raft_envelope_cutover_index`". The
@@ -330,6 +359,16 @@ or unwrapped-then-not-wrapped.
330359
`wrapOnPropose = false`, proposals are plaintext; flip to
331360
true; subsequent proposals are wrapped. Pin against any
332361
in-process race that could flip the flag mid-proposal.
362+
- `TestCoordinatorWrap_StartupInitFromSidecar` — pins the
363+
gemini-HIGH fix (PR #893): construct a coordinator with a
364+
sidecar whose `RaftEnvelopeCutoverIndex != 0`; the
365+
in-process `wrapOnPropose` flag MUST be `true` immediately
366+
after construction (before any cutover barrier runs).
367+
Construct with cutover = 0 → flag MUST be `false`. Without
368+
this, a post-cutover restart that becomes leader would
369+
propose plaintext at `index > cutover`, the engine
370+
apply-hook would attempt `Unwrap`, fail GCM, and halt
371+
apply cluster-wide.
333372
- `TestPhase2EndToEnd_ProposeWrapApplyUnwrap` — full
334373
round-trip: coordinator wraps, engine unwraps, FSM
335374
applies, value lands in the store correctly.

0 commit comments

Comments
 (0)