Commit 0903148

docs(design): address review — scaling roadmap
Issue 1: correct stale line anchors in kv/sharded_coordinator.go
- keyviz sampler observeMutation: 1795-1824 -> 1841-1846
- RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961

Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line).

Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition.

Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.
1 parent a1abe35 commit 0903148

1 file changed

Lines changed: 111 additions & 26 deletions

docs/design/2026_06_12_proposed_scaling_roadmap.md

@@ -65,8 +65,8 @@ What bounds a single elastickv deployment today:
6565

6666
5. **Cross-group timestamps via per-group HLC, and not even per-group yet.**
6767
`ShardedCoordinator.RunHLCLeaseRenewal` proposes the physical-ceiling
68-
renewal **only to the default group** (`kv/sharded_coordinator.go:1914-1953`,
69-
`group, ok := c.groups[c.defaultGroup]` at `:1915`). A node that leads a
68+
renewal **only to the default group** (`kv/sharded_coordinator.go:1960-1985`,
69+
`group, ok := c.groups[c.defaultGroup]` at `:1961`). A node that leads a
7070
non-default group but is not a member of the default group never advances
7171
its ceiling from that group — exactly the cross-group monotonicity gap the
7272
centralized-TSO doc §1.1 describes. The doc's "near-term fix" (M1: iterate
@@ -158,12 +158,13 @@ memory each group's private cache/memtable pins.
158158
`multiraft_runtime.go:246-254` mean every group in a multi-group process is
159159
single-voter, so there is nothing to transfer a leader *to*. Closing this is
160160
a prerequisite for write-throughput scaling beyond one node, and a companion
161-
proposal is being written in parallel:
162-
`2026_06_12_proposed_multinode_multigroup_bootstrap.md` (extend
163-
`--raftGroups` / a per-group members flag to accept a multi-node voter set
164-
per group, lifting the `len(groups)==1` guard). This roadmap references it
165-
by name and treats it as the unblocking dependency for (b), (e), and the
166-
region-balance gap in §3.
161+
proposal is in review as **PR #955**
162+
(`docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`, branch
163+
`design/multinode-multigroup-bootstrap`): extend `--raftGroups` / a per-group
164+
members flag to accept a multi-node voter set per group, lifting the
165+
`len(groups)==1` guard. Once PR #955 lands it is the authoritative spec for
166+
this gap; this roadmap treats it as the unblocking dependency for (b), (e),
167+
and the region-balance gap in §3.
167168

168169
### (c) Read throughput
169170

@@ -181,9 +182,14 @@ memory each group's private cache/memtable pins.
181182
has `add_learner` / `promote_learner` (`cmd/raftadmin/main.go:204-206`), and
182183
`--raftJoinAsLearner` exists (`main.go:104`). The learner doc itself scopes
183184
follower-served reads as an explicit non-goal (§2 "Non-goals", §8 OQ-5):
184-
learners forward `LinearizableRead` to the leader today. So the read-replica
185-
attach primitive exists, but the design that consumes it for off-leader
186-
reads **does not** exist: the **follower-read design is missing**. Requirements to
185+
a `LinearizableRead` against a learner is **not** forwarded by the engine —
186+
`handleRead` returns `ErrNotLeader` for any non-leader state
187+
(`internal/raftengine/etcd/engine.go:1583`), and the *caller* must forward to
188+
the leader (`docs/raft_learner_operations.md`, "Serve linearizable reads from
189+
the learner"). So the read-replica attach primitive exists, but the design
190+
that consumes it for off-leader reads **does not** exist: the **follower-read design
191+
is missing**, and it must supply the forwarding/proxy or replica-read API the
192+
current primitive deliberately omits. Requirements to
187193
sketch (Gap 3): a leader-issued read-timestamp pipeline so a follower/learner
188194
serves a snapshot read at a ts the leader has vouched for (CLAUDE.md HLC
189195
rule: never use the local wall clock or a follower-issued ts for MVCC
@@ -210,7 +216,12 @@ memory each group's private cache/memtable pins.
210216
leading different groups is still not guaranteed without a shared oracle
211217
(TSO doc §6 "Guarantee"). Cross-group transactions (`kv/transaction.go`,
212218
`kv/txn_codec.go`) that span groups led by different nodes need a single
213-
ordering source for OCC commit-ts comparability. So: per-group renewal fix
219+
ordering source for OCC commit-ts comparability. (The shared-snapshot
220+
invariant — every operation in one txn reading at the *same* `startTS` — is
221+
already upheld: `nextStartTS` allocates one `startTS` for the whole txn and
222+
propagates it via `reqs.StartTS` to every participating group; the gap is the
223+
cross-*node* comparability of the per-txn `commitTS`, not the per-txn
224+
`startTS`. See OQ-1.) So: per-group renewal fix
214225
is in-scope-soon and load-bearing; the dedicated TSO group is only justified
215226
once (i) multi-node multi-group is real and (ii) cross-group transactions
216227
are common enough that the batch-allocator amortization pays for the extra
@@ -241,7 +252,7 @@ memory each group's private cache/memtable pins.
241252
### (f) Operational scaling
242253

243254
- **keyviz** — the per-route load sampler is wired and allocation-free on the
244-
hot path (`kv/sharded_coordinator.go:1795-1824`), with proposed extensions
255+
hot path (`observeMutation`, `kv/sharded_coordinator.go:1841-1846`), with proposed extensions
245256
for cluster fan-out (`2026_04_27_proposed_keyviz_cluster_fanout.md`),
246257
subrange sampling (`2026_05_25_proposed_keyviz_subrange_sampling.md`),
247258
hot-key top-K (`2026_05_28_proposed_keyviz_hot_key_topk.md`), and per-cell
@@ -281,10 +292,10 @@ dimension. **Rough milestones:** (M1) extend `--raftGroups` / add a per-group
281292
members flag to accept a multi-node voter set; lift the guard. (M2)
282293
integration harness that stands up multi-voter groups across processes. (M3)
283294
Jepsen multi-group multi-node workload. **Depends-on:** nothing (it is the
284-
root unblocker). *A companion proposal,
285-
`2026_06_12_proposed_multinode_multigroup_bootstrap.md`, is being written in
286-
parallel and is the authoritative spec for this gap; this roadmap defers to
287-
it.*
295+
root unblocker). *A companion proposal is in review as **PR #955**
296+
(`docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`, branch
297+
`design/multinode-multigroup-bootstrap`); once it lands it is the authoritative
298+
spec for this gap and this roadmap defers to it.*
288299

289300
### Gap 2 — Shared Pebble cache / resource pools
290301
**Problem.** Each group's store allocates a private 256 MiB block cache
@@ -361,7 +372,7 @@ The ordering is driven by unblock-edges, not by perceived value in isolation.
361372
change; closes the cross-group monotonicity gap (§1.5) *before* the
362373
topology that exposes it exists. Land first so multi-node groups are safe
363374
on arrival.
364-
2. **Multi-node multi-group bootstrap** (Gap 1 /
375+
2. **Multi-node multi-group bootstrap** (Gap 1 / **PR #955**,
365376
`2026_06_12_proposed_multinode_multigroup_bootstrap.md`). The root
366377
unblocker for (b), (c), (e), Gap 3, Gap 4. Nothing else multi-node-shaped
367378
can land until a group can have voters on more than one node.
@@ -386,13 +397,22 @@ The ordering is driven by unblock-edges, not by perceived value in isolation.
386397
9. **Range merge** (Gap 5). After step 4 for cross-group merge.
387398
10. **Streaming transport** (Gap 6). Any time after step 2 makes inter-node
388399
Raft traffic significant; pairs with the S3 blob-fetch RPC.
389-
11. **Dedicated TSO group** (TSO doc M6–M7) — only once cross-group
390-
transactions across node-spanning groups are common enough to justify it
391-
(§2(d)).
400+
11. **Dedicated TSO group** (TSO doc M6–M7) — placed last *on the assumption
401+
that the per-group renewal fix (step 1) is sufficient until cross-group
402+
transactions across node-spanning groups are common enough to amortize the
403+
extra Raft group (§2(d))*. **This placement must be revisited the moment
404+
step 2 lands** — see OQ-1. The decision is not an amortization tradeoff but
405+
a correctness one: the trigger to pull this forward (potentially to
406+
step 3) is **the first cross-group transaction whose participating groups
407+
are led by different nodes**. The per-group fix gives per-node monotonicity
408+
only (TSO doc §6 "Guarantee"); commit-ts comparability across coordinators
409+
on different nodes is not covered by it (OQ-1). Treat step 11 as
410+
"deferred-pending-OQ-1-resolution", not "settled-last".
392411
12. **Auto group lifecycle** (Gap 7) — long-term, after 2/4/8/9.
393412

394-
In-flight PRs map cleanly: **#953** is steps 3 (and its PR0 = step 2's intent),
395-
**#945** is step 4, **#951** is step 5.
413+
In-flight PRs map cleanly: **#955** is step 2 (Gap 1 bootstrap proposal),
414+
**#953** is step 3 (and its PR0 = step 2's intent), **#945** is step 4,
415+
**#951** is step 5.
396416

397417
---
398418

@@ -401,9 +421,58 @@ In-flight PRs map cleanly: **#953** is steps 3 (and its PR0 = step 2's intent),
401421
1. **Per-group HLC fix vs full TSO ordering.** Is the per-group renewal fix
402422
(step 1) sufficient for cross-group OCC correctness in the multi-node
403423
topology, or does the first node-spanning cross-group transaction force the
404-
dedicated TSO group earlier than step 11? The answer depends on how
405-
`kv/transaction.go` compares commit timestamps issued by different nodes;
406-
needs a focused correctness review before step 2 lands.
424+
dedicated TSO group earlier than step 11?
425+
426+
**Where the timestamps come from (grounded in code).** A cross-group txn's
427+
coordinator issues *both* of its timestamps from one `*HLC`, `c.clock`, on
428+
the coordinator's node:
429+
- `startTS` via `nextStartTS` (`kv/sharded_coordinator.go:1429`):
430+
`Observe(maxLatestCommitTS(keys))` then `c.clock.NextFenced()`. The
431+
`maxLatestCommitTS` floor is a per-key read against the store — it pins
432+
`startTS` above the latest commit on *the keys this txn touches*, nothing
433+
more.
434+
- `commitTS` via `resolveTxnCommitTS` / `nextTxnTSAfter`
435+
(`kv/sharded_coordinator.go:1102`, `:1376`): `c.clock.NextFenced()`,
436+
re-allocated after `Observe(startTS)` if it did not strictly exceed
437+
`startTS`. A caller-supplied `commitTS` is fed through `c.clock.Observe`
438+
to keep the clock monotonic.
439+
440+
Both calls go through the same `c.clock`. The apply-time OCC/ownership check
441+
then compares these timestamps against stored `CommitTS` values (the
442+
Composed-1 guard, `docs/design/2026_05_29_partial_composed1_cross_group_commit_guard.md`
443+
§4.2(a)/§4.4; FSM `latest > startTS` write-conflict check). OCC
444+
serializability depends on those `commitTS` values being **mutually
445+
comparable** across all participating groups.
446+
447+
**Why the per-group fix is necessary but not sufficient.** Today this is
448+
safe only because every group shares the *same* process-wide `*HLC`
449+
(single-node groups, §1.1/§1.5): one clock issues every timestamp, so all
450+
`commitTS` are trivially comparable. The per-group renewal fix (step 1)
451+
extends correctness to *one node leading several groups* — but its guarantee
452+
is explicitly scoped: "all timestamps issued by a single node are strictly
453+
monotonic … monotonicity across nodes that lead *different* groups is not
454+
fully guaranteed without a shared TSO" (TSO doc §6 "Guarantee", §1.1). The
455+
moment step 2 (multi-node bootstrap) makes it possible for the two groups of
456+
a cross-group txn to be coordinated on *different nodes*, two concurrent
457+
cross-group txns can draw `commitTS` from two different `*HLC` instances
458+
whose ceilings were advanced independently. The `maxLatestCommitTS`/`Observe`
459+
floor does not close this: it orders writes only on the specific keys read,
460+
not the global commit order two unrelated cross-group txns need to be
461+
serializable against each other.
462+
463+
**Conclusion / trigger.** OQ-1 therefore is *not* a "focused correctness
464+
review deferrable until just before step 2 lands" — the per-group fix cannot
465+
answer it in the affirmative for the cross-node case. The trigger to pull the
466+
dedicated TSO group forward from step 11 is concrete: **the first cross-group
467+
transaction whose participating groups are led by different nodes** (i.e. the
468+
first real exercise of step 2). Until then the per-group fix + shared-`*HLC`
469+
property holds. Open sub-question: whether an interim measure short of the
470+
full TSO group — e.g. pinning every cross-group txn's timestamp allocation to
471+
a single designated group's leader (the default group's `c.clock`), so all
472+
cross-group `commitTS` still come from one clock — can bridge the gap between
473+
step 2 and step 11 without the full batch-allocator TSO. That interim option
474+
should be evaluated as part of the step-2 design (PR #955) rather than left
475+
to step 11.
407476
2. **Shared cache (Gap 2) vs per-group isolation (workload isolation doc).** A
408477
single shared block cache trades isolation for density: one hot group can
409478
evict a latency-sensitive group's working set. How does Gap 2's per-group
@@ -426,3 +495,19 @@ In-flight PRs map cleanly: **#953** is steps 3 (and its PR0 = step 2's intent),
426495
6. **Auto group lifecycle (Gap 7) trigger.** What signal creates a new group —
427496
node join, aggregate range count crossing a threshold, or operator action?
428497
Premature auto-creation interacts badly with merge (create/merge thrash).
498+
7. **Live cutover / rolling-upgrade strategy for the single-node→multi-node
499+
transition (Gap 1).** Moving a deployment from "one process hosts every
500+
group, each single-voter" to genuine multi-node multi-group is a topology
501+
change, not just a flag flip: a group that bootstrapped single-member must
502+
gain voters on other nodes via live `AddVoter` conf-changes (the primitive
503+
exists, §2(e)), and the cluster may run mixed binary versions mid-upgrade.
504+
What is the supported path — operator-driven `AddVoter` expansion of an
505+
existing single-voter group, blue/green with a dual-write proxy
506+
(`proxy/`, the existing Redis-migration pattern), or a fresh cluster +
507+
data migration? And what version-skew / capability-gating discipline keeps
508+
a rolling upgrade safe while only some nodes understand the new per-group
509+
members wiring? This is properly a question for the companion bootstrap proposal (PR #955)
510+
to answer; the roadmap flags it here so it is not lost. (Note: the TSO doc
511+
§7 already specifies a phased dual-write/shadow-read/feature-flag cutover for
512+
the *timestamp* migration; the bootstrap cutover should mirror that
513+
structure.)
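
The OQ-1 argument above (two coordinators on different nodes drawing `commitTS` from independently advanced clocks) can be illustrated with a toy model. This is a minimal sketch, not the elastickv implementation: `toyClock`, `Next`, and `allocTxn` are hypothetical names, and the clock is a bare counter that ignores the real HLC's physical ceiling and fencing. It only shows the shape of the problem: per-node monotonicity holds, yet timestamps from two clocks are not ordered by real time.

```go
package main

import "fmt"

// toyClock is a hypothetical stand-in for a coordinator's *HLC. It models
// only a monotonic counter, not physical-ceiling renewal or fencing.
type toyClock struct{ last uint64 }

// Observe raises the clock to at least ts; it never lowers it.
func (c *toyClock) Observe(ts uint64) {
	if ts > c.last {
		c.last = ts
	}
}

// Next returns a timestamp strictly greater than anything this clock has
// issued or observed so far.
func (c *toyClock) Next() uint64 {
	c.last++
	return c.last
}

// allocTxn follows the shape described in OQ-1: startTS is floored by the
// latest commit on the txn's own keys, then commitTS comes from the same clock.
func allocTxn(c *toyClock, latestCommitOnKeys uint64) (startTS, commitTS uint64) {
	c.Observe(latestCommitOnKeys)
	startTS = c.Next()
	commitTS = c.Next()
	// Defensive re-allocation mirroring the described "re-allocate after
	// Observe(startTS)" step; it cannot trigger with this toy counter.
	if commitTS <= startTS {
		c.Observe(startTS)
		commitTS = c.Next()
	}
	return startTS, commitTS
}

func main() {
	// One shared clock: the second txn always gets the larger commitTS.
	shared := &toyClock{last: 100}
	_, c1 := allocTxn(shared, 0)
	_, c2 := allocTxn(shared, 0)
	fmt.Println(c1, c2, c1 < c2) // prints: 102 104 true

	// Two per-node clocks advanced independently, txns on disjoint keys:
	// the txn on node B runs later in real time but gets a smaller commitTS.
	nodeA := &toyClock{last: 100}
	nodeB := &toyClock{last: 40}
	_, commitA := allocTxn(nodeA, 0)
	_, commitB := allocTxn(nodeB, 0)
	fmt.Println(commitA, commitB, commitB < commitA) // prints: 102 42 true
}
```

Because the two transactions touch disjoint keys, the per-key `Observe` floor never pulls node B's clock up to node A's. That is the gap the OQ-1 text attributes to the per-group fix alone.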
