Commit f5d2ad7

docs(design): M5 — multi-table workload + post-review revisions (PR #905)
Codex P1 + 3 gemini medium findings on the original PR #905 revision (ffb9c73). All addressed by revising sections 3.2, 3.3, and 4 (milestone breakdown) and adding OQ-5 / OQ-6:

* codex P1: "Don't rely on item-key splits to shard DynamoDB txns." Verified against kv/shard_key.go:94-124: every DynamoDB table-metadata, item, and GSI key normalises to a SINGLE per-table route key (!ddb|route|table|<tableSegment>). Splitting inside a single-table workload's item keyspace cannot put two items on different shards, so the 2PC path (dispatchMultiShardTxn, secondary commits, ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) would never fire, invalidating G2. Fix: replace single-key-split (Option A) with a NEW workload variant dynamodb-append-multi-table-workload that creates N=4 tables (jepsen_append_t1 … jepsen_append_t4) and writes to >=2 distinct tables per TransactWriteItems. The router maps each table to its own route key, so cross-table txns naturally fan out across shards. The setup hook splits the table-route keyspace at !ddb|route|table|jepsen_append_t2.
* gemini medium R1: "Lexicographical Shard Split Issue." The prior /split/<int> split-key prefix was lexicographically smaller than the workload's keyspace ("/" < "0" in ASCII), so every workload key ended up on the rightmost shard and G2 was never exercised. Fix: anchor split keys to the table-route prefix !ddb|route|table|... so the split lands INSIDE the active workload route range.
* gemini medium R2: "Route ID Resolution for SplitRange." Successful SplitRange deletes the parent route ID and creates two child IDs, so a cached ID from a one-time setup-time ListRoutes call is stale on the next shuffle. Fix: nemesis re-queries ListRoutes on every :start invocation, walks the snapshot to find the route covering the chosen split key, and uses that route's ID + snapshot.version as expected_catalog_version. Catalog drift surfaces as ErrCatalogVersionMismatch from the server and the nemesis refreshes on the next tick.
* gemini medium R3: "Gating of Initial Split in Setup Hook." Jepsen db/setup! runs on EVERY node; an ungated initial split would be attempted concurrently by all nodes. Fix: gate the setup-time split on (when (= node (first (:nodes test))) ...) so only the first node attempts it.

Also:

* Updated §4 milestone table: M5a now ships the new workload variant (not just a setup hook), so it is meaningfully bigger than the original §4 row suggested.
* Added OQ-5 (is N=4 the right default?) and OQ-6 (first-node gate semantics) as follow-ups for implementation time.
* Resolved OQ-4: PR #900 has merged, so the parent doc rename *_proposed_*.md → *_partial_*.md should now land as a separate small doc-only PR.
1 parent ffb9c73 commit f5d2ad7

1 file changed

Lines changed: 102 additions & 48 deletions

docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md

@@ -136,7 +136,12 @@ function returning a `jepsen.nemesis/Nemesis` instance:
136136
(reify nemesis/Nemesis
137137
(setup! [this test] ...)
138138
(invoke! [this test op] ...
139-
;; shell out to elastickv-split with a fresh split key
139+
;; 1. call ListRoutes to find the route currently covering
140+
;; the chosen split key — route IDs change after every
141+
;; split, so a cached ID from setup is stale
142+
;; 2. pick a split key inside that route's range
143+
;; 3. shell out to elastickv-split with route-id +
144+
;; split-key + expected-version from ListRoutes
140145
)
141146
(teardown! [this test] ...)))
142147
```
@@ -146,39 +151,76 @@ The nemesis is composed with the existing
146151
via `jepsen.nemesis/compose`. The combined nemesis becomes the
147152
workload's `:nemesis`.
148153

149-
**Split key picking strategy.** A simple monotonically-increasing
150-
counter: every `:start` invocation appends a fresh integer
151-
suffix to a fixed key prefix the workload reserves. This avoids
152-
collisions with the workload's keyspace and guarantees the
153-
split always picks a key that's between existing keys (so the
154-
operation succeeds against a real catalog).
155-
156-
**Expected version.** The nemesis calls `ListRoutes` once at
157-
setup to learn the current catalog version, then increments its
158-
local copy by 1 after each successful split. Catalog drift
159-
(another split landing concurrently) is rare in practice — if it
160-
happens, the nemesis logs and refreshes from `ListRoutes`.
161-
162-
### 3.3 Multi-shard workload guarantee
163-
164-
The existing `dynamodb-append-workload` writes to a per-key
165-
queue. With a single shard layout, every write goes to that
166-
shard — no 2PC, no Composed-1 exposure.
167-
168-
M5 needs the workload to consistently span shards. Two options:
169-
170-
| Option | Mechanism | Pro | Con |
171-
|---|---|---|---|
172-
| **A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key |
173-
| **B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) |
174-
175-
**Choose A.** Less invasive to the workload, and the
176-
route-shuffle nemesis itself increases the shard count over
177-
time, giving organic multi-shard coverage.
178-
179-
The setup hook (`db/setup!` in Jepsen parlance, or the test's
180-
`:setup` map) runs `elastickv-split` once with a split key in
181-
the middle of the workload's keyspace.
154+
**Split key picking strategy (gemini medium R1).** Pick a split
155+
key from inside the DynamoDB **table-route** key space
156+
(`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`).
157+
Concretely, with N tables `jepsen_append_t1` …
158+
`jepsen_append_tN` per §3.3, the route key for table `tK` is
159+
`!ddb|route|table|jepsen_append_tK`. Splits happen between
160+
adjacent table-route keys — e.g. between `…jepsen_append_t2`
161+
and `…jepsen_append_t3`. This guarantees:
162+
163+
- The split key falls **inside** the active workload route
164+
range (not lexicographically before or after, which would
165+
leave all workload keys on one side of the split).
166+
- Each side of the split owns a distinct set of tables, so
167+
cross-table `TransactWriteItems` actually exercises 2PC.
168+
169+
A prior revision of this doc proposed a `/split/<int>` prefix.
170+
That was lexicographically smaller than the workload's keyspace
171+
(`/` < `0` in ASCII), so every workload key ended up on the
172+
rightmost shard and the 2PC path was never exercised. Fixed
173+
above by anchoring split keys to the table-route prefix.
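
The ordering argument above can be checked with a small Go sketch. It is illustrative only: `routeKey` and `sideOf` are hypothetical helpers, not functions from `kv/shard_key.go`, and the sketch assumes that route ranges compare keys bytewise and that the split key becomes the first key of the right-hand child.

```go
package main

import "fmt"

// routePrefix is the table-route prefix quoted in this doc.
const routePrefix = "!ddb|route|table|"

// routeKey builds the per-table route key for a table name.
// Hypothetical helper; the real normalisation lives in kv/shard_key.go.
func routeKey(table string) string {
	return routePrefix + table
}

// sideOf reports which child of a split would own key.
// ASSUMPTION: ranges are half-open, so the split key itself
// is the first key of the right-hand child.
func sideOf(key, splitKey string) string {
	// Go compares strings bytewise, i.e. lexicographically.
	if key < splitKey {
		return "left"
	}
	return "right"
}

func main() {
	split := routeKey("jepsen_append_t2")
	tables := []string{
		"jepsen_append_t1",
		"jepsen_append_t2",
		"jepsen_append_t3",
		"jepsen_append_t4",
	}
	for _, t := range tables {
		fmt.Printf("%s -> %s\n", t, sideOf(routeKey(t), split))
	}
}
```

Under those assumptions, the table names only sort in numeric order while N stays below 10: `jepsen_append_t10` would compare less than `jepsen_append_t2`, which may matter if the default table count is ever raised.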
174+
175+
**Route ID resolution (gemini medium R2).** The nemesis CANNOT
176+
rely on a single `ListRoutes` call + a local counter — every
177+
successful split deletes the parent route ID and creates two
178+
fresh child IDs, so a cached route ID is stale on the next
179+
shuffle. On every `:start` invocation the nemesis re-queries
180+
`ListRoutes`, walks the returned snapshot to find the route
181+
whose range contains the chosen split key, and uses that
182+
route's ID + the snapshot's `version` as the
183+
`SplitRangeRequest`'s `expected_catalog_version`. Catalog
184+
drift (another split landing concurrently between
185+
`ListRoutes` and `SplitRange`) surfaces as
186+
`ErrCatalogVersionMismatch` from the server; the nemesis logs
187+
and refreshes on the next tick.
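
The lookup step described above might look like the following Go sketch. The `Route` and `Snapshot` types, their field names, and `findCoveringRoute` are all hypothetical stand-ins for whatever `ListRoutes` actually returns; the sketch also assumes half-open ranges with an empty `End` meaning "unbounded". The nemesis itself is Clojure, so this only illustrates the logic.

```go
package main

import (
	"errors"
	"fmt"
)

// Route is a hypothetical view of one catalog route.
type Route struct {
	ID    uint64
	Start string // inclusive lower bound
	End   string // exclusive upper bound; "" means unbounded (assumption)
}

// Snapshot is a hypothetical view of a ListRoutes response.
type Snapshot struct {
	Version uint64
	Routes  []Route
}

var errNoCoveringRoute = errors.New("no route covers the split key")

// findCoveringRoute walks the snapshot and returns the route
// whose [Start, End) range contains splitKey.
func findCoveringRoute(s Snapshot, splitKey string) (Route, error) {
	for _, r := range s.Routes {
		if splitKey >= r.Start && (r.End == "" || splitKey < r.End) {
			return r, nil
		}
	}
	return Route{}, errNoCoveringRoute
}

func main() {
	// Example snapshot after one split: the parent route is gone
	// and two children with fresh IDs have replaced it.
	snap := Snapshot{Version: 2, Routes: []Route{
		{ID: 2, Start: "", End: "!ddb|route|table|jepsen_append_t2"},
		{ID: 3, Start: "!ddb|route|table|jepsen_append_t2", End: ""},
	}}
	r, err := findCoveringRoute(snap, "!ddb|route|table|jepsen_append_t3")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// These two values would feed the split request.
	fmt.Printf("route-id=%d expected-version=%d\n", r.ID, snap.Version)
}
```

Because the ID and version both come from the same fresh snapshot, a concurrent split between the lookup and the request should surface as a version mismatch rather than a split against a deleted route.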
188+
189+
### 3.3 Multi-shard workload guarantee (revised post-codex P1)
190+
191+
**Original §3.3 (Option A: single-key split in workload keyspace)
192+
was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB
193+
table-metadata, item, and GSI key to a single per-table route
194+
key (`!ddb|route|table|<tableSegment>`). So every
195+
`jepsen_append` item resolves to the SAME catalog point
196+
regardless of its partition-key value, and a `SplitRange`
197+
inside the item keyspace cannot put two items on different
198+
shards. The 2PC path (`dispatchMultiShardTxn`, secondary
199+
commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
200+
sentinel) would never fire — invalidating G2 (codex P1 on
201+
PR #905).
202+
203+
**Revised strategy: multi-table workload.** The M5 workload
204+
creates `N` tables (default `N = 4`): `jepsen_append_t1` …
205+
`jepsen_append_t4`. Each `TransactWriteItems` operation writes
206+
to **at least two** distinct tables. The router maps each
207+
table to its own table-route key, so a cross-table txn
208+
naturally fans out across whichever shards own those route
209+
keys. The setup hook splits the table-route keyspace at
210+
`!ddb|route|table|jepsen_append_t2` so table 1 lives on one
211+
shard and tables 2–4 on another from t=0.
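
One way to pick the tables for each transaction is sketched below in Go. This is an assumption about how the selection could work, not the workload's actual code (which would be Clojure): `tableNames`, `pickTables`, and `newRNG` are hypothetical names, and uniform random selection is just one reasonable policy.

```go
package main

import (
	"fmt"
	"math/rand"
)

// tableNames returns jepsen_append_t1 ... jepsen_append_tN.
func tableNames(n int) []string {
	names := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		names = append(names, fmt.Sprintf("jepsen_append_t%d", i))
	}
	return names
}

// newRNG returns a seeded generator so runs can be reproduced.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// pickTables returns k distinct tables chosen at random.
// k is raised to 2 (the cross-table minimum) and capped at
// len(tables); with fewer than 2 tables it returns all of them.
func pickTables(rng *rand.Rand, tables []string, k int) []string {
	if k < 2 {
		k = 2
	}
	if k > len(tables) {
		k = len(tables)
	}
	// A random permutation guarantees no table is picked twice.
	perm := rng.Perm(len(tables))
	out := make([]string, 0, k)
	for _, i := range perm[:k] {
		out = append(out, tables[i])
	}
	return out
}

func main() {
	rng := newRNG(1)
	tables := tableNames(4)
	for txn := 0; txn < 3; txn++ {
		fmt.Println(pickTables(rng, tables, 2))
	}
}
```

Note that a random pair does not always straddle the initial split: with table 1 on one shard and tables 2 to 4 on the other, only pairs that include `jepsen_append_t1` cross shards until the nemesis has split further.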
212+
213+
| Concern | Resolution |
214+
|---|---|
215+
| Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). |
216+
| Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. |
217+
| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. |
218+
219+
The setup hook (Jepsen `db/setup!`) is gated to run only on
220+
the FIRST node (`(when (= node (first (:nodes test))) …)`) so
221+
the initial split is not attempted concurrently by every
222+
cluster node and does not cause catalog-version conflicts
223+
during bootstrap (gemini medium R3).
182224

183225
### 3.4 Success criterion
184226

@@ -221,8 +263,8 @@ mergeable on its own.
221263

222264
| Phase | Title | Scope | Done when |
223265
|---|---|---|---|
224-
| M5a | CLI + workload setup | `cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup` issues the initial split; no nemesis yet. | `./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). |
225-
| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. |
266+
| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
267+
| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
226268

227269
M5a is a small, focused PR (Go CLI + Clojure setup hook +
228270
docs). M5b carries the nemesis itself plus the cadence-tuning
@@ -238,23 +280,35 @@ analysis.
238280
answer: no, the partition nemesis is enough; adding a
239281
prewrite-interrupt would test `abortPreparedTxn`, which is
240282
out of M5's scope.
241-
- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is
242-
cleaner but doubles the review burden. Tentative answer: two
243-
if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in
244-
a single screen. Decide at implementation time.
283+
- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two
284+
is cleaner but doubles the review burden. With the §3.3
285+
revision M5a is now meaningfully bigger (a new workload
286+
variant, not just a setup hook), so two-PR is now the more
287+
likely shape. Decide at implementation time.
245288
- **OQ-3.** Where does the new `cmd/elastickv-split` slot in
246289
the README and the `make` targets? Likely add it to
247290
`make tools`, mirror in `docs/operations/` (does this dir
248291
exist? — check at implementation). Out of scope for the
249292
design doc itself.
250-
- **OQ-4.** Should the M5 design doc rename happen with PR #900
251-
merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle
252-
guidance: rename `*_proposed_*.md` → `*_partial_*.md` after
253-
PR #900 lands, then this M5 doc tracks the open milestone.
254-
When M5 ships, rename the parent to `*_implemented_*.md` and
255-
this M5 doc to `*_implemented_*.md` as well (or fold the M5
256-
content back into the parent — tentative answer: keep them
257-
separate so the M5 design history isn't lost).
293+
- **OQ-4** (resolved post-PR #900 merge). The parent doc
294+
rename `*_proposed_*.md` → `*_partial_*.md` should land as a
295+
separate small doc-only PR now that PR #900 is merged. When
296+
M5 ships, rename both this doc and the parent to
297+
`*_implemented_*.md` (tentative — keep both files separate
298+
so the M5 design history isn't lost).
299+
- **OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the
300+
right default? Trade-offs: more tables = better 2PC
301+
fan-out coverage but slower setup and noisier history. The
302+
workload's existing `:concurrency` defaults to 5, so 4
303+
tables means each client touches ~all of them per txn on
304+
average. Defer to implementation; revisit if the workload
305+
becomes I/O-bound on table-meta lookups.
306+
- **OQ-6** (new, gemini medium R3 follow-up). The first-node
307+
gate for setup splits assumes Jepsen's `(first (:nodes test))`
308+
is stable across nodes; verify this matches actual Jepsen
309+
semantics (it should — `:nodes` is the test config, not a
310+
per-node view). Out of scope to design more carefully; will
311+
test at M5a implementation.
258312

259313
---
260314
