Skip to content

Commit c9a7870

Browse files
authored
feat(encryption): Stage 6D-2 — ErrNodeIDCollision + ErrLocalEpochRollback primitives + probe-node-id CLI (#788)
## Summary Stage 6D-2 per the [6D design doc §5](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_05_18_proposed_6d_enable_storage_envelope.md) (landed in PR #786). Ships the two bundled 6C-3 startup-guard primitives as pure functions in the encryption package, plus the `probe-node-id` CLI helper from §5.1 mitigation #2. **Operator-inert until 6D-6 wires the primitives into `main.go`'s startup-guard phase.** The `probe-node-id` CLI is reachable today but only computes locally — no operator action against the cluster. ## What this ships ### Two new sentinel errors (`internal/encryption/errors.go`) - `ErrNodeIDCollision` — two distinct `full_node_id` values hash to the same 16-bit `node_id` (would reuse GCM nonces under the active DEK). Triage message points at `probe-node-id` for pre-join verification. - `ErrLocalEpochRollback` — sidecar `local_epoch <= registry.LastSeenLocalEpoch` (strict-ahead invariant from §5.2). Triage message points at `resync-sidecar` or DEK rotation. ### Two pure-function guard primitives - `CheckNodeIDCollision(fullNodeIDs []uint64) error` (`node_id_collision.go`) - `CheckLocalEpochRollback(registry, fullNodeID, dekID, sidecarLocalEpoch) error` (`local_epoch_rollback.go`) Both use the shipped `uint16(full_node_id & 0xFFFF)` narrowing (matches `applier.go::nodeIDMask`). Local-state-only — no RPC, no engine access. Same-value-twice is NOT a collision (duplicate-membership-entry defense). Equality fires `ErrLocalEpochRollback` per the §5.2 strict-ahead invariant (replay would reuse the same `(node_id, local_epoch)` prefix). ### CLI: `elastickv-admin encryption probe-node-id` Per §5.1 mitigation #2 — operator runs against a candidate `full_node_id` BEFORE joining a node to verify the value is free in the cluster's allocated `node_id` space. Accepts decimal or `0x`-prefixed hex input; prints both the 16-hex-digit canonical `full_node_id` and the derived `node_id`. No RPC; pure local computation. ## Test coverage 19 new tests: | File | Tests | |---|---| | `node_id_collision_test.go` | Empty, SingleNode, NoCollision, Collision, DuplicateNotCollision (same value twice ≠ collision), OrderIndependent, ErrorIncludesBothIDs | | `local_epoch_rollback_test.go` | NoRow (freshly-joined-learner skip), SidecarStrictlyAhead, SidecarLessThan, SidecarEqual (strict-ahead replay case), PebbleError (not classified as rollback), NarrowingMatchesShippedConvention | | `encryption_test.go` (new tests added) | ProbeNodeID_Hex, ProbeNodeID_Decimal, RequiresFlag, RejectsBadInput, ProbeNodeIDSubcommand (dispatch reaches handler) | ## Caller audit No semantic changes to existing functions; only additions: - New sentinels in `errors.go` — exported package-surface additions, no callers to update. - New pure functions — no callers yet (operator-inert until 6D-6). - New CLI subcommand `probe-node-id` — additive; no existing subcommand changed. ## Five-lens self-review 1. **Data loss** — net-positive. Both guards catch nonce-reuse scenarios BEFORE any encrypted write that would corrupt confidentiality. 2. **Concurrency / distributed failures** — pure functions over local snapshots; no shared state. Run before any goroutine fan-out. 3. **Performance** — O(N) map walk for collision check; single Pebble Get for rollback check. Paid once per process start on happy path. 4. **Data consistency** — narrowing matches the shipped `nodeIDMask = 0xFFFF` convention from `applier.go`. The `NarrowingMatchesShippedConvention` test pins this against a future change that would diverge. 5. **Test coverage** — 19 new unit tests cover every documented branch + operationally important edges (equality for strict-ahead, narrowing equivalence, error propagation). ## Verification - `go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/` — PASS - `golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/` — 0 issues - `go build ./...` — PASS ## Plan Per the 6D design doc §7 decomposition, the next sub-PRs are: - **6D-3** — Voters ∪ Learners `CapabilityFanout` helper (used by 6D-6's pre-flight gate and by Stage 6E) - **6D-4** — Wire format addition (`RotateSubEnableStorageEnvelope = 0x04`) + FSM apply path with `len(Wrapped) == 0` check + sidecar `StorageEnvelopeCutoverIndex` write - **6D-5** — §6.2 storage-layer toggle (`PutAt` reads `StorageEnvelopeActive`) - **6D-6** — `EnableStorageEnvelope` admin RPC + CLI + integration test + `main.go` wiring of THIS PR's startup-guard primitives ## Test plan - [x] `go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/` — PASS (19 new tests + existing) - [x] `golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/` — 0 issues - [x] `go build ./...` — PASS - [ ] Jepsen — not relevant for an operator-inert PR; the 6D-6 integration test will exercise the wired guards end-to-end <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a CLI subcommand to display a provided full node identifier (decimal or hex) and its derived 16-bit node id. * Added startup validation checks that detect 16-bit node-id collisions and local-epoch rollback conditions to prevent unsafe encryption states. * **Tests** * Added unit and integration tests covering the new CLI parsing, collision detection, and local-epoch rollback behaviors. <!-- review_stack_entry_start --> [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/bootjp/elastickv/pull/788?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
2 parents 8c0d07f + 5aaa3e0 commit c9a7870

7 files changed

Lines changed: 861 additions & 2 deletions

File tree

cmd/elastickv-admin/encryption.go

Lines changed: 81 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ import (
77
"io"
88
"os"
99
"slices"
10+
"strconv"
1011
"time"
1112

1213
pb "github.com/bootjp/elastickv/proto"
@@ -42,21 +43,99 @@ func encryptionMain(args []string) error {
4243
return runEncryptionRegisterWriter(rest, os.Stdout)
4344
case "bootstrap":
4445
return runEncryptionBootstrap(rest, os.Stdout)
46+
case "probe-node-id":
47+
return runEncryptionProbeNodeID(rest, os.Stdout)
4548
case "-h", "--help", "help":
4649
// `-h` is the universal "show usage" affordance for CLI
4750
// subcommands; returning nil keeps the exit code at 0
4851
// so shell scripts using $? to detect success do not
4952
// trip on a help request.
50-
_, err := fmt.Fprintln(os.Stdout, "usage: elastickv-admin encryption <subcommand> [flags]\n\nsubcommands:\n status\n rotate-dek\n register-writer\n bootstrap")
53+
_, err := fmt.Fprintln(os.Stdout, "usage: elastickv-admin encryption <subcommand> [flags]\n\nsubcommands:\n status\n rotate-dek\n register-writer\n bootstrap\n probe-node-id")
5154
if err != nil {
5255
return errors.Wrap(err, "write usage")
5356
}
5457
return nil
5558
default:
56-
return errors.Errorf("encryption: unknown subcommand %q (supported: status, rotate-dek, register-writer, bootstrap)", sub)
59+
return errors.Errorf("encryption: unknown subcommand %q (supported: status, rotate-dek, register-writer, bootstrap, probe-node-id)", sub)
5760
}
5861
}
5962

63+
// runEncryptionProbeNodeID implements the §5.1 collision-
64+
// mitigation helper from the 6D design doc: it derives the
65+
// 16-bit `node_id` from a candidate `full_node_id` so the
66+
// operator can verify the value is free in the cluster's
67+
// allocated node_id space BEFORE joining the node and
68+
// triggering an `ErrNodeIDCollision` refusal.
69+
//
70+
// The narrowing — `uint16(full_node_id & 0xFFFF)` — matches the
71+
// shipped writer-registry keying and §4.1 GCM nonce prefix
72+
// (see `internal/encryption/applier.go::nodeIDMask`). Anything
73+
// other than this narrowing would compute a value that diverges
74+
// from what the runtime uses, defeating the purpose of the
75+
// probe.
76+
//
77+
// No RPC — this is a pure local computation. The operator runs
78+
// it against a candidate `full_node_id` and a current cluster's
79+
// allocated node_id list (out of scope for this subcommand; the
80+
// operator can read `gh ... encryption status` on each member
81+
// to gather them).
82+
func runEncryptionProbeNodeID(args []string, out io.Writer) error {
83+
fs := flag.NewFlagSet("encryption probe-node-id", flag.ContinueOnError)
84+
fullNodeIDStr := fs.String("full-node-id", "", "candidate 64-bit full_node_id to probe (decimal or 0x-prefixed hex)")
85+
if err := fs.Parse(args); err != nil {
86+
// flag.ContinueOnError reports `-h`/`--help` via the
87+
// sentinel flag.ErrHelp after writing usage to
88+
// fs.Output(). Match the convention used by every other
89+
// encryption subcommand (status, rotate-dek,
90+
// register-writer, bootstrap): treat ErrHelp as a usage
91+
// request, exit 0 so shell scripts that test $? on help
92+
// invocations don't trip.
93+
if errors.Is(err, flag.ErrHelp) {
94+
return nil
95+
}
96+
return errors.Wrap(err, "parse probe-node-id flags")
97+
}
98+
if *fullNodeIDStr == "" {
99+
return errors.New("--full-node-id is required (decimal or 0x-prefixed hex)")
100+
}
101+
full, err := parseUint64WithRadix(*fullNodeIDStr)
102+
if err != nil {
103+
return errors.Wrapf(err, "parse --full-node-id=%q", *fullNodeIDStr)
104+
}
105+
const nodeIDMask = 0xFFFF
106+
// keep masked as uint64: the value is only formatted for display, so the uint16 cast (and its gosec G115 nolint) is not load-bearing here.
107+
masked := full & nodeIDMask
108+
if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n",
109+
full, full, masked, masked); err != nil {
110+
return errors.Wrap(err, "write probe-node-id result")
111+
}
112+
return nil
113+
}
114+
115+
// parseUint64WithRadix accepts either decimal ("12345") or
116+
// 0x-prefixed hex ("0xDEADBEEF") so operators can paste values
117+
// in whichever form their inventory uses. Uses strconv.ParseUint
118+
// rather than fmt.Sscanf: ParseUint requires the ENTIRE input to
119+
// be valid for the chosen radix, whereas Sscanf stops at the
120+
// first non-matching character and silently returns the prefix.
121+
// The silent-prefix behaviour would let "0x1234ZZZZ" parse as
122+
// 0x1234 (or "1234abc" as 1234), which is unsafe for an
123+
// operator-visible identifier where every digit matters.
124+
func parseUint64WithRadix(s string) (uint64, error) {
125+
if len(s) >= 2 && (s[0:2] == "0x" || s[0:2] == "0X") {
126+
v, err := strconv.ParseUint(s[2:], 16, 64)
127+
if err != nil {
128+
return 0, errors.Wrap(err, "hex parse")
129+
}
130+
return v, nil
131+
}
132+
v, err := strconv.ParseUint(s, 10, 64)
133+
if err != nil {
134+
return 0, errors.Wrap(err, "decimal parse")
135+
}
136+
return v, nil
137+
}
138+
60139
type encryptionEndpointFlags struct {
61140
endpoint *string
62141
timeout *time.Duration

cmd/elastickv-admin/encryption_test.go

Lines changed: 137 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -462,3 +462,140 @@ func startCustomEncryptionAdminTestServer(t *testing.T, impl pb.EncryptionAdminS
462462
})
463463
return lis.Addr().String()
464464
}
465+
466+
// TestRunEncryptionProbeNodeID_Hex pins the §5.1 collision-
467+
// mitigation helper from the 6D design doc: a hex full_node_id
468+
// is parsed and narrowed via uint16(full_node_id & 0xFFFF) — the
469+
// same mask that applier.go (writer-registry keying) and §4.1
470+
// (GCM nonce prefix) already use. The CLI prints both the
471+
// full_node_id (canonical 16-hex-digit form) and the derived
472+
// node_id so the operator can verify the value is free in the
473+
// cluster's allocated node_id space before joining.
474+
func TestRunEncryptionProbeNodeID_Hex(t *testing.T) {
475+
t.Parallel()
476+
var buf bytes.Buffer
477+
err := runEncryptionProbeNodeID([]string{"--full-node-id=0xDEADBEEFCAFE1234"}, &buf)
478+
if err != nil {
479+
t.Fatalf("runEncryptionProbeNodeID: %v", err)
480+
}
481+
out := buf.String()
482+
for _, needle := range []string{
483+
"full_node_id: 0xdeadbeefcafe1234",
484+
"node_id: 0x1234",
485+
} {
486+
if !strings.Contains(out, needle) {
487+
t.Errorf("output missing %q:\n%s", needle, out)
488+
}
489+
}
490+
}
491+
492+
// TestRunEncryptionProbeNodeID_Decimal pins the dual-radix
493+
// parser: decimal input works alongside the 0x-prefixed hex case.
494+
func TestRunEncryptionProbeNodeID_Decimal(t *testing.T) {
495+
t.Parallel()
496+
var buf bytes.Buffer
497+
err := runEncryptionProbeNodeID([]string{"--full-node-id=305419896"}, &buf)
498+
if err != nil {
499+
t.Fatalf("runEncryptionProbeNodeID: %v", err)
500+
}
501+
// 305419896 = 0x12345678; low 16 bits = 0x5678
502+
out := buf.String()
503+
for _, needle := range []string{
504+
"full_node_id: 0x0000000012345678",
505+
"node_id: 0x5678",
506+
} {
507+
if !strings.Contains(out, needle) {
508+
t.Errorf("output missing %q:\n%s", needle, out)
509+
}
510+
}
511+
}
512+
513+
// TestRunEncryptionProbeNodeID_RequiresFlag pins the
514+
// required-flag contract: omitting --full-node-id returns an
515+
// explicit error rather than producing a misleading "probe of
516+
// zero" output.
517+
func TestRunEncryptionProbeNodeID_RequiresFlag(t *testing.T) {
518+
t.Parallel()
519+
var buf bytes.Buffer
520+
err := runEncryptionProbeNodeID([]string{}, &buf)
521+
if err == nil {
522+
t.Fatal("missing --full-node-id: want error, got nil")
523+
}
524+
if !strings.Contains(err.Error(), "--full-node-id is required") {
525+
t.Errorf("error message: want mention of required flag, got %v", err)
526+
}
527+
if buf.Len() != 0 {
528+
t.Errorf("must not print output on error: got %q", buf.String())
529+
}
530+
}
531+
532+
// TestRunEncryptionProbeNodeID_RejectsBadInput pins the
533+
// invalid-input case: non-numeric, non-hex input returns an
534+
// error rather than silently truncating.
535+
func TestRunEncryptionProbeNodeID_RejectsBadInput(t *testing.T) {
536+
t.Parallel()
537+
var buf bytes.Buffer
538+
err := runEncryptionProbeNodeID([]string{"--full-node-id=notanumber"}, &buf)
539+
if err == nil {
540+
t.Fatal("invalid input: want error, got nil")
541+
}
542+
}
543+
544+
// TestRunEncryptionProbeNodeID_HelpFlagExitsZero pins the
545+
// flag.ErrHelp special-case: every other encryption subcommand
546+
// (status, rotate-dek, register-writer, bootstrap) returns nil
547+
// on `-h`/`--help` so shell scripts that test $? on help
548+
// invocations don't trip. probe-node-id must match the
549+
// convention. Codex r2 flagged the absent ErrHelp branch.
550+
func TestRunEncryptionProbeNodeID_HelpFlagExitsZero(t *testing.T) {
551+
t.Parallel()
552+
var buf bytes.Buffer
553+
err := runEncryptionProbeNodeID([]string{"-h"}, &buf)
554+
if err != nil {
555+
t.Fatalf("-h: want nil, got %v", err)
556+
}
557+
}
558+
559+
// TestRunEncryptionProbeNodeID_RejectsPartialInput pins the
560+
// strconv.ParseUint vs fmt.Sscanf distinction: parseUint64WithRadix
561+
// MUST reject inputs where only a prefix is parseable, because
562+
// "0x1234ZZZZ" silently parsing as 0x1234 would mislead the
563+
// operator into joining a node under a different node_id than
564+
// the one they meant to probe. Sscanf accepts partial input;
565+
// ParseUint rejects it.
566+
func TestRunEncryptionProbeNodeID_RejectsPartialInput(t *testing.T) {
567+
t.Parallel()
568+
cases := []string{
569+
"--full-node-id=0x1234ZZZZ", // hex prefix with non-hex tail
570+
"--full-node-id=1234abc", // decimal prefix with letter tail
571+
"--full-node-id=0xDEAD GHI", // hex with embedded space
572+
}
573+
for _, tc := range cases {
574+
var buf bytes.Buffer
575+
err := runEncryptionProbeNodeID([]string{tc}, &buf)
576+
if err == nil {
577+
t.Errorf("partial-input %q: want error, got nil (output=%q)", tc, buf.String())
578+
}
579+
}
580+
}
581+
582+
// TestEncryptionMain_ProbeNodeIDSubcommand pins the dispatch in
583+
// encryptionMain: the new probe-node-id case must route to the
584+
// runner. Without this, a typo in encryptionMain's switch
585+
// statement would route the subcommand to the default branch
586+
// and fail with "unknown subcommand", and the dispatch test
587+
// would not catch it.
588+
func TestEncryptionMain_ProbeNodeIDSubcommand(t *testing.T) {
589+
t.Parallel()
590+
// Help-flag path: probe-node-id without flags errors with
591+
// "required" rather than "unknown subcommand". A successful
592+
// dispatch route reaches runEncryptionProbeNodeID's
593+
// missing-flag branch.
594+
err := encryptionMain([]string{"probe-node-id"})
595+
if err == nil {
596+
t.Fatal("probe-node-id with no flags: want error, got nil")
597+
}
598+
if !strings.Contains(err.Error(), "--full-node-id is required") {
599+
t.Errorf("dispatch reached wrong handler: got %v", err)
600+
}
601+
}

internal/encryption/errors.go

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -208,4 +208,65 @@ var (
208208
// from forcing a spurious refusal. See §5.5 for the
209209
// IsEncryptionRelevantOpcode rationale.
210210
ErrSidecarBehindRaftLog = errors.New("encryption: sidecar raft_applied_index is behind the raftengine's persisted applied index and the gap covers an encryption-relevant entry; refusing to start (run `encryption resync-sidecar` to advance the sidecar past the encryption-relevant entries before retrying — see §9.1 + §5.5)")
211+
212+
// ErrNodeIDCollision is the §9.1 / 6C-3 startup-refusal guard
213+
// raised when two distinct `full_node_id` values in the local
214+
// route-catalog snapshot map to the same 16-bit `node_id`. The
215+
// derivation is `uint16(full_node_id & 0xFFFF)` — same narrowing
216+
// the writer-registry keying and §4.1 GCM nonce prefix already
217+
// use (see `applier.go` `nodeIDMask`). A collision means the two
218+
// members would write under the same `(node_id, local_epoch)`
219+
// prefix and reuse the GCM counter under the same DEK, which
220+
// breaks AES-GCM's nonce-uniqueness requirement and is a
221+
// catastrophic confidentiality + integrity failure.
222+
//
223+
// The guard reads the LOCAL route-catalog snapshot only — no
224+
// RPC fan-out — because the startup-guard phase runs BEFORE the
225+
// gRPC server is up, and the local Raft log is authoritative
226+
// for cluster membership (every node applies the same
227+
// `ConfChange` entries). See `2026_05_18_proposed_6d_enable_storage_envelope.md`
228+
// §5.1 for the lifecycle rationale and the operator-mitigation
229+
// menu (`probe-node-id` CLI, full_node_id re-roll, etc.).
230+
//
231+
// Skip conditions: encryption disabled (no nonce-reuse risk),
232+
// empty membership snapshot (single-node pre-bootstrap; nothing
233+
// to compare yet).
234+
ErrNodeIDCollision = errors.New("encryption: two distinct full_node_id values in the local route-catalog hash to the same 16-bit node_id, which would reuse GCM nonces under the active DEK; refusing to start (re-roll one of the colliding full_node_id values — use `elastickv-admin encryption probe-node-id --full-node-id=<u64>` to verify before joining — see §9.1 + §5.1 of the 6D design doc)")
235+
236+
// ErrLocalEpochRollback is the §9.1 / 6C-3 startup-refusal
237+
// guard raised when this node's sidecar `local_epoch` for the
238+
// active storage DEK is less than OR equal to the local Pebble
239+
// writer-registry's `LastSeenLocalEpoch` for the
240+
// `(full_node_id, active_storage_dek_id)` row. The strict-ahead
241+
// posture — `sidecar > registry`, not `>=` — is the §5.2
242+
// nonce-monotonicity invariant: a node may only resume issuing
243+
// nonces when its sidecar is STRICTLY ahead of the registry
244+
// record. The equality case would replay the same
245+
// `(node_id, local_epoch)` prefix and reuse the GCM counter
246+
// under the same DEK, identical to the collision scenario
247+
// `ErrNodeIDCollision` catches but at the single-node-restart
248+
// timescale rather than the cluster-membership timescale.
249+
//
250+
// Classic trigger: operator restored the sidecar from an old
251+
// backup, leaving `local_epoch` behind the registry record
252+
// that the node has already used for prior writes.
253+
//
254+
// The guard reads LOCAL Pebble state only — no RPC. The
255+
// writer-registry is local state on every node post-bootstrap
256+
// (every node applies the same OpRegistration and OpBootstrap
257+
// entries), so the local row is authoritative for this node's
258+
// own monotonicity contract.
259+
//
260+
// Skip conditions: encryption disabled, sidecar's
261+
// `Active.Storage == 0` (bootstrap not yet committed). The
262+
// missing-registry-row case is NOT an unconditional skip —
263+
// the §5.2 split fires based on `sidecar.StorageEnvelopeActive`:
264+
// pre-cutover (active=false) the missing-row case is a
265+
// freshly-joined-learner skip and `CheckLocalEpochRollback`
266+
// returns nil; post-cutover (active=true) the missing-row
267+
// case is a missing rollback anchor and the function fires
268+
// `ErrLocalEpochRollback`. See the comprehensive doc on
269+
// `CheckLocalEpochRollback` (`local_epoch_rollback.go`) for
270+
// the full split.
271+
ErrLocalEpochRollback = errors.New("encryption: sidecar local_epoch for the active storage DEK is at or below the local writer-registry's last_seen_local_epoch, which would replay the GCM nonce prefix under the same DEK; refusing to start (verify the sidecar was not restored from an old backup; run `encryption resync-sidecar` if appropriate, or rotate the affected DEK — see §9.1 + §5.2 of the 6D design doc)")
211272
)

0 commit comments

Comments
 (0)