@@ -65,8 +65,8 @@ What bounds a single elastickv deployment today:
6565
66665 . ** Cross-group timestamps via per-group HLC, and not even per-group yet.**
6767 ` ShardedCoordinator.RunHLCLeaseRenewal ` proposes the physical-ceiling
68- renewal ** only to the default group** (` kv/sharded_coordinator.go:1914-1953 ` ,
69- ` group, ok := c.groups[c.defaultGroup] ` at ` :1915 ` ). A node that leads a
68+ renewal ** only to the default group** (` kv/sharded_coordinator.go:1960-1985 ` ,
69+ ` group, ok := c.groups[c.defaultGroup] ` at ` :1961 ` ). A node that leads a
7070 non-default group but is not a member of the default group never advances
7171 its ceiling from that group — exactly the cross-group monotonicity gap the
7272 centralized-TSO doc §1.1 describes. The doc's "near-term fix" (M1: iterate
@@ -158,12 +158,13 @@ memory each group's private cache/memtable pins.
158158 ` multiraft_runtime.go:246-254 ` mean every group in a multi-group process is
159159 single-voter, so there is nothing to transfer a leader * to* . Closing this is
160160 a prerequisite for write-throughput scaling beyond one node, and a companion
161- proposal is being written in parallel:
162- ` 2026_06_12_proposed_multinode_multigroup_bootstrap.md ` (extend
163- ` --raftGroups ` / a per-group members flag to accept a multi-node voter set
164- per group, lifting the ` len(groups)==1 ` guard). This roadmap references it
165- by name and treats it as the unblocking dependency for (b), (e), and the
166- region-balance gap in §3.
161+ proposal is in review as ** PR #955 **
162+ (` docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md ` , branch
163+ ` design/multinode-multigroup-bootstrap ` ): extend ` --raftGroups ` / a per-group
164+ members flag to accept a multi-node voter set per group, lifting the
165+ ` len(groups)==1 ` guard. Once PR #955 lands it is the authoritative spec for
166+ this gap; this roadmap treats it as the unblocking dependency for (b), (e),
167+ and the region-balance gap in §3.
167168
168169### (c) Read throughput
169170
@@ -181,9 +182,14 @@ memory each group's private cache/memtable pins.
181182 has ` add_learner ` / ` promote_learner ` (` cmd/raftadmin/main.go:204-206 ` ), and
182183 ` --raftJoinAsLearner ` exists (` main.go:104 ` ). The learner doc itself scopes
183184 follower-served reads as an explicit non-goal (§2 "Non-goals", §8 OQ-5):
184- learners forward ` LinearizableRead ` to the leader today. So the read-replica
185- attach primitive exists, but the design that consumes it for off-leader
186- reads ** does not** — ** follower-read design is missing** . Requirements to
185+ a ` LinearizableRead ` against a learner is ** not** forwarded by the engine —
186+ ` handleRead ` returns ` ErrNotLeader ` for any non-leader state
187+ (` internal/raftengine/etcd/engine.go:1583 ` ), and the * caller* must forward to
188+ the leader (` docs/raft_learner_operations.md ` , "Serve linearizable reads from
189+ the learner"). So the read-replica attach primitive exists, but the design
190+ that consumes it for off-leader reads ** does not** — ** follower-read design
191+ is missing** , and it must supply the forwarding/proxy or replica-read API the
192+ current primitive deliberately omits. Requirements to
187193 sketch (Gap 3): a leader-issued read-timestamp pipeline so a follower/learner
188194 serves a snapshot read at a ts the leader has vouched for (CLAUDE.md HLC
189195 rule: never use the local wall clock or a follower-issued ts for MVCC
@@ -210,7 +216,12 @@ memory each group's private cache/memtable pins.
210216 leading different groups is still not guaranteed without a shared oracle
211217 (TSO doc §6 "Guarantee"). Cross-group transactions (` kv/transaction.go ` ,
212218 ` kv/txn_codec.go ` ) that span groups led by different nodes need a single
213- ordering source for OCC commit-ts comparability. So: per-group renewal fix
219+ ordering source for OCC commit-ts comparability. (The shared-snapshot
220+ invariant — every operation in one txn reading at the * same* ` startTS ` — is
221+ already upheld: ` nextStartTS ` allocates one ` startTS ` for the whole txn and
222+ propagates it via ` reqs.StartTS ` to every participating group; the gap is the
223+ cross-* node* comparability of the per-txn ` commitTS ` , not the per-txn
224+ ` startTS ` . See OQ-1.) So: per-group renewal fix
214225 is in-scope-soon and load-bearing; the dedicated TSO group is only justified
215226 once (i) multi-node multi-group is real and (ii) cross-group transactions
216227 are common enough that the batch-allocator amortization pays for the extra
@@ -241,7 +252,7 @@ memory each group's private cache/memtable pins.
241252### (f) Operational scaling
242253
243254- ** keyviz** — the per-route load sampler is wired and allocation-free on the
244- hot path (` kv/sharded_coordinator.go:1795-1824 ` ), with proposed extensions
255+ hot path (` observeMutation ` , ` kv/sharded_coordinator.go:1841-1846 ` ), with proposed extensions
245256 for cluster fan-out (` 2026_04_27_proposed_keyviz_cluster_fanout.md ` ),
246257 subrange sampling (` 2026_05_25_proposed_keyviz_subrange_sampling.md ` ),
247258 hot-key top-K (` 2026_05_28_proposed_keyviz_hot_key_topk.md ` ), and per-cell
@@ -281,10 +292,10 @@ dimension. **Rough milestones:** (M1) extend `--raftGroups` / add a per-group
281292members flag to accept a multi-node voter set; lift the guard. (M2)
282293integration harness that stands up multi-voter groups across processes. (M3)
283294Jepsen multi-group multi-node workload. ** Depends-on:** nothing (it is the
284- root unblocker). * A companion proposal,
285- ` 2026_06_12_proposed_multinode_multigroup_bootstrap.md ` , is being written in
286- parallel and is the authoritative spec for this gap; this roadmap defers to
287- it.*
295+ root unblocker). * A companion proposal is in review as ** PR #955 **
296+ (` docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md ` , branch
297+ ` design/multinode-multigroup-bootstrap ` ); once it lands it is the authoritative
298+ spec for this gap and this roadmap defers to it.*
288299
289300### Gap 2 — Shared Pebble cache / resource pools
290301** Problem.** Each group's store allocates a private 256 MiB block cache
@@ -361,7 +372,7 @@ The ordering is driven by unblock-edges, not by perceived value in isolation.
361372 change; closes the cross-group monotonicity gap (§1.5) * before* the
362373 topology that exposes it exists. Land first so multi-node groups are safe
363374 on arrival.
364- 2 . ** Multi-node multi-group bootstrap** (Gap 1 /
375+ 2 . ** Multi-node multi-group bootstrap** (Gap 1 / ** PR #955 ** ,
365376 ` 2026_06_12_proposed_multinode_multigroup_bootstrap.md ` ). The root
366377 unblocker for (b), (c), (e), Gap 3, Gap 4. Nothing else multi-node-shaped
367378 can land until a group can have voters on more than one node.
@@ -386,13 +397,22 @@ The ordering is driven by unblock-edges, not by perceived value in isolation.
3863979 . ** Range merge** (Gap 5). After step 4 for cross-group merge.
38739810 . ** Streaming transport** (Gap 6). Any time after step 2 makes inter-node
388399 Raft traffic significant; pairs with the S3 blob-fetch RPC.
389- 11 . ** Dedicated TSO group** (TSO doc M6–M7) — only once cross-group
390- transactions across node-spanning groups are common enough to justify it
391- (§2(d)).
400+ 11 . ** Dedicated TSO group** (TSO doc M6–M7) — placed last * on the assumption
401+ that the per-group renewal fix (step 1) is sufficient until cross-group
402+ transactions across node-spanning groups are common enough to amortize the
403+ extra Raft group (§2(d))* . ** This placement must be revisited the moment
404+ step 2 lands** — see OQ-1. The decision is not an amortization tradeoff but
405+ a correctness one: the trigger to pull this forward (potentially to
406+ step 3) is ** the first cross-group transaction whose participating groups
407+ are led by different nodes** . The per-group fix gives per-node monotonicity
408+ only (TSO doc §6 "Guarantee"); commit-ts comparability across coordinators
409+ on different nodes is not covered by it (OQ-1). Treat step 11 as
410+ "deferred-pending-OQ-1-resolution", not "settled-last".
39241112 . ** Auto group lifecycle** (Gap 7) — long-term, after 2/4/8/9.
393412
394- In-flight PRs map cleanly: ** #953 ** is steps 3 (and its PR0 = step 2's intent),
395- ** #945 ** is step 4, ** #951 ** is step 5.
413+ In-flight PRs map cleanly: ** #955 ** is step 2 (Gap 1 bootstrap proposal),
414+ ** #953 ** is step 3 (and its PR0 = step 2's intent), ** #945 ** is step 4,
415+ ** #951 ** is step 5.
396416
397417---
398418
@@ -401,9 +421,58 @@ In-flight PRs map cleanly: **#953** is steps 3 (and its PR0 = step 2's intent),
4014211 . ** Per-group HLC fix vs full TSO ordering.** Is the per-group renewal fix
402422 (step 1) sufficient for cross-group OCC correctness in the multi-node
403423 topology, or does the first node-spanning cross-group transaction force the
404- dedicated TSO group earlier than step 11? The answer depends on how
405- ` kv/transaction.go ` compares commit timestamps issued by different nodes;
406- needs a focused correctness review before step 2 lands.
424+ dedicated TSO group earlier than step 11?
425+
426+ ** Where the timestamps come from (grounded in code).** A cross-group txn's
427+ coordinator issues * both* of its timestamps from one ` *HLC ` — ` c.clock ` on
428+ the coordinator's node:
429+ - ` startTS ` via ` nextStartTS ` (` kv/sharded_coordinator.go:1429 ` ):
430+ ` Observe(maxLatestCommitTS(keys)) ` then ` c.clock.NextFenced() ` . The
431+ ` maxLatestCommitTS ` floor is a per-key read against the store — it pins
432+ ` startTS ` above the latest commit on * the keys this txn touches* , nothing
433+ more.
434+ - ` commitTS ` via ` resolveTxnCommitTS ` → ` nextTxnTSAfter `
435+ (` kv/sharded_coordinator.go:1102 ` , ` :1376 ` ): ` c.clock.NextFenced() ` ,
436+ re-allocated after ` Observe(startTS) ` if it did not strictly exceed
437+ ` startTS ` . A caller-supplied ` commitTS ` is fed through ` c.clock.Observe `
438+ to keep the clock monotonic.
439+
440+ Both calls go through the same ` c.clock ` . The apply-time OCC/ownership check
441+ then compares these timestamps against stored ` CommitTS ` values (the
442+ Composed-1 guard, ` docs/design/2026_05_29_partial_composed1_cross_group_commit_guard.md `
443+ §4.2(a)/§4.4; FSM ` latest > startTS ` write-conflict check). OCC
444+ serializability depends on those ` commitTS ` values being ** mutually
445+ comparable** across all participating groups.
446+
447+ ** Why the per-group fix is necessary but not sufficient.** Today this is
448+ safe only because every group shares the * same* process-wide ` *HLC `
449+ (single-node groups, §1.1/§1.5): one clock issues every timestamp, so all
450+ ` commitTS ` are trivially comparable. The per-group renewal fix (step 1)
451+ extends correctness to * one node leading several groups* — but its guarantee
452+ is explicitly scoped: "all timestamps issued by a single node are strictly
453+ monotonic … monotonicity across nodes that lead * different* groups is not
454+ fully guaranteed without a shared TSO" (TSO doc §6 "Guarantee", §1.1). The
455+ moment step 2 (multi-node bootstrap) makes it possible for the two groups of
456+ a cross-group txn to be coordinated on * different nodes* , two concurrent
457+ cross-group txns can draw ` commitTS ` from two different ` *HLC ` instances
458+ whose ceilings were advanced independently. The ` maxLatestCommitTS ` /` Observe `
459+ floor does not close this: it orders writes only on the specific keys read,
460+ not the global commit order two unrelated cross-group txns need to be
461+ serializable against each other.
462+
463+ ** Conclusion / trigger.** OQ-1 therefore is * not* a "focused correctness
464+ review deferrable until just before step 2 lands" — the per-group fix cannot
465+ answer it in the affirmative for the cross-node case. The trigger to pull the
466+ dedicated TSO group forward from step 11 is concrete: ** the first cross-group
467+ transaction whose participating groups are led by different nodes** (i.e. the
468+ first real exercise of step 2). Until then the per-group fix + shared-` *HLC `
469+ property holds. Open sub-question: whether an interim measure short of the
470+ full TSO group — e.g. pinning every cross-group txn's timestamp allocation to
471+ a single designated group's leader (the default group's ` c.clock ` ), so all
472+ cross-group ` commitTS ` still come from one clock — can bridge the gap between
473+ step 2 and step 11 without the full batch-allocator TSO. That interim option
474+ should be evaluated as part of the step-2 design (PR #955 ) rather than left
475+ to step 11.
4074762 . ** Shared cache (Gap 2) vs per-group isolation (workload isolation doc).** A
408477 single shared block cache trades isolation for density: one hot group can
409478 evict a latency-sensitive group's working set. How does Gap 2's per-group
@@ -426,3 +495,19 @@ In-flight PRs map cleanly: **#953** is steps 3 (and its PR0 = step 2's intent),
4264956 . ** Auto group lifecycle (Gap 7) trigger.** What signal creates a new group —
427496 node join, aggregate range count crossing a threshold, or operator action?
428497 Premature auto-creation interacts badly with merge (create/merge thrash).
498+ 7 . ** Live cutover / rolling-upgrade strategy for the single-node→multi-node
499+ transition (Gap 1).** Moving a deployment from "one process hosts every
500+ group, each single-voter" to genuine multi-node multi-group is a topology
501+ change, not just a flag flip: a group that bootstrapped single-member must
502+ gain voters on other nodes via live ` AddVoter ` conf-changes (the primitive
503+ exists, §2(e)), and the cluster may run mixed binary versions mid-upgrade.
504+ What is the supported path — operator-driven ` AddVoter ` expansion of an
505+ existing single-voter group, blue/green with a dual-write proxy
506+ (` proxy/ ` , the existing Redis-migration pattern), or a fresh cluster +
507+ data migration? And what version-skew / capability-gating discipline keeps
508+ a rolling upgrade safe while only some nodes understand the new per-group
509+ members wiring? Answering this properly belongs to the companion bootstrap
510+ proposal (PR #955 ); the roadmap flags it here so it is not lost. (Note: the TSO doc
511+ §7 already specifies a phased dual-write/shadow-read/feature-flag cutover for
512+ the * timestamp* migration; the bootstrap cutover should mirror that
513+ structure.)
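
To make the OQ-1 argument concrete, the following is a minimal Go sketch of why `commitTS` values stay comparable when every coordinator draws from one clock, and stop being comparable once two nodes each use their own. Everything in it is hypothetical: `toyHLC`, its `Observe`/`Next` methods, and `allocTxn` are illustrative stand-ins, not elastickv's `*HLC`, `NextFenced`, `nextStartTS`, or `nextTxnTSAfter`. It models only two properties taken from the text above: a single clock instance is strictly monotonic, and the `maxLatestCommitTS` floor covers only the keys a txn touches.

```go
package main

import "fmt"

// toyHLC is a hypothetical stand-in for a hybrid logical clock. It is NOT
// elastickv's *HLC; it models only "Observe raises the floor" and
// "Next is strictly monotonic on this one instance".
type toyHLC struct {
	last uint64
}

// Observe raises the clock so that later timestamps exceed ts.
func (c *toyHLC) Observe(ts uint64) {
	if ts > c.last {
		c.last = ts
	}
}

// Next returns a timestamp strictly greater than anything this instance
// has issued or observed.
func (c *toyHLC) Next() uint64 {
	c.last++
	return c.last
}

// allocTxn mirrors the shape described in OQ-1 (assumed, simplified):
// startTS is floored by the latest commit on the txn's own keys, and
// commitTS must strictly exceed startTS.
func allocTxn(c *toyHLC, latestCommitOnKeys uint64) (startTS, commitTS uint64) {
	c.Observe(latestCommitOnKeys)
	startTS = c.Next()
	commitTS = c.Next()
	if commitTS <= startTS {
		// Unreachable with a single monotonic clock; kept to mirror the
		// "re-allocate after Observe(startTS)" rule from the text.
		c.Observe(startTS)
		commitTS = c.Next()
	}
	return startTS, commitTS
}

func main() {
	// One shared clock: the txn that commits later gets the larger commitTS.
	shared := &toyHLC{last: 100}
	_, c1 := allocTxn(shared, 0)
	_, c2 := allocTxn(shared, 0)
	fmt.Println("shared clock ordered:", c1 < c2) // prints: shared clock ordered: true

	// Two independent clocks (coordinators on different nodes). Node B's
	// clock lags, and the txns touch disjoint keys, so the per-key floor
	// (passed as 0 here) does nothing to align them.
	nodeA := &toyHLC{last: 100}
	nodeB := &toyHLC{last: 40}
	_, ca := allocTxn(nodeA, 0) // commits first in real time
	_, cb := allocTxn(nodeB, 0) // commits second in real time
	fmt.Println("independent clocks ordered:", ca < cb) // prints: independent clocks ordered: false
}
```

In this toy model the second case is exactly the situation step 2 makes possible: the later txn receives the smaller `commitTS`. The interim option raised in OQ-1 (routing every cross-group allocation through one designated clock) corresponds to the first case.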