docs(design): M5 — multi-table workload + post-review revisions (PR #905)

bootjp · bootjp · commit f5d2ad7ac7c6 · 2026-06-02T16:15:09.000+09:00
Codex P1 + 3 gemini medium findings on the original PR #905 revision (ffb9c73). All addressed by revising sections 3.2, 3.3, and 4 (milestone breakdown) and adding OQ-5 / OQ-6:

* codex P1: "Don't rely on item-key splits to shard DynamoDB txns." Verified against kv/shard_key.go:94-124: every DynamoDB table-metadata, item, and GSI key normalises to a SINGLE per-table route key (!ddb|route|table|<tableSegment>). Splitting inside a single-table workload's item keyspace cannot put two items on different shards, so the 2PC path (dispatchMultiShardTxn, secondary commits, ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) would never fire, invalidating G2. Fix: replace single-key-split (Option A) with a NEW workload variant dynamodb-append-multi-table-workload that creates N=4 tables (jepsen_append_t1 … jepsen_append_t4) and writes to >=2 distinct tables per TransactWriteItems. The router maps each table to its own route key, so cross-table txns naturally fan out across shards. The setup hook splits the table-route keyspace at !ddb|route|table|jepsen_append_t2.

* gemini medium R1: "Lexicographical Shard Split Issue." The prior /split/<int> split-key prefix was lexicographically smaller than the workload's keyspace ("/" < "0" in ASCII), so every workload key ended up on the rightmost shard and G2 was never exercised. Fix: anchor split keys to the table-route prefix !ddb|route|table|... so the split lands INSIDE the active workload route range.

* gemini medium R2: "Route ID Resolution for SplitRange." Successful SplitRange deletes the parent route ID and creates two child IDs, so a cached ID from a one-time setup-time ListRoutes call is stale on the next shuffle. Fix: nemesis re-queries ListRoutes on every :start invocation, walks the snapshot to find the route covering the chosen split key, and uses that route's ID + snapshot.version as expected_catalog_version. Catalog drift surfaces as ErrCatalogVersionMismatch from the server and the nemesis refreshes on the next tick.

* gemini medium R3: "Gating of Initial Split in Setup Hook." Jepsen db/setup! runs on EVERY node; an ungated initial split would be attempted concurrently by all nodes. Fix: gate the setup-time split on (when (= node (first (:nodes test))) ...) so only the first node attempts it.

Also:

* Updated §4 milestone table: M5a now ships the new workload variant (not just a setup hook), so it is meaningfully bigger than the original §4 row suggested.

* Added OQ-5 (is N=4 the right default?) and OQ-6 (first-node gate semantics) as follow-ups for implementation time.

* Resolved OQ-4: PR #900 has merged, so the parent doc rename *_proposed_*.md → *_partial_*.md should now land as a separate small doc-only PR.
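
To make the R2 and codex P1 fixes concrete, here is a minimal Go sketch of the lookup the nemesis performs against a route snapshot and of what a split at the table-route key does to table placement. Everything in it is an assumption made for illustration: the `Route` struct, the half-open `[Start, End)` range convention, the helper names `tableRouteKey`, `findCoveringRoute`, and `splitRoute`, and the literal route IDs. The real key encoding lives in `kv/shard_key.go` and the real snapshot shape comes from `ListRoutes`, so both may differ.

```go
package main

import (
	"bytes"
	"fmt"
)

// Route is a hypothetical stand-in for one entry in a ListRoutes snapshot.
// Ranges are assumed half-open: Start inclusive, End exclusive (nil = unbounded).
type Route struct {
	ID    uint64
	Start []byte
	End   []byte
}

// tableRouteKey builds a per-table route key in the form quoted in the
// commit message. The actual encoding in kv/shard_key.go may differ.
func tableRouteKey(table string) []byte {
	return []byte("!ddb|route|table|" + table)
}

// findCoveringRoute returns the route whose range contains key.
func findCoveringRoute(routes []Route, key []byte) (Route, bool) {
	for _, r := range routes {
		afterStart := bytes.Compare(key, r.Start) >= 0
		beforeEnd := r.End == nil || bytes.Compare(key, r.End) < 0
		if afterStart && beforeEnd {
			return r, true
		}
	}
	return Route{}, false
}

// splitRoute models a successful split: the parent ID disappears and two
// children with fresh IDs (leftID, rightID) take over its range.
func splitRoute(routes []Route, parentID uint64, key []byte, leftID, rightID uint64) []Route {
	out := make([]Route, 0, len(routes)+1)
	for _, r := range routes {
		if r.ID != parentID {
			out = append(out, r)
			continue
		}
		out = append(out,
			Route{ID: leftID, Start: r.Start, End: key},
			Route{ID: rightID, Start: key, End: r.End},
		)
	}
	return out
}

func main() {
	// One route covering the whole keyspace, as before the setup-time split.
	routes := []Route{{ID: 1, Start: []byte{}, End: nil}}
	splitKey := tableRouteKey("jepsen_append_t2")

	// Resolve the covering route from the current snapshot, then split it.
	parent, _ := findCoveringRoute(routes, splitKey)
	routes = splitRoute(routes, parent.ID, splitKey, 2, 3)

	tables := []string{"jepsen_append_t1", "jepsen_append_t2", "jepsen_append_t3", "jepsen_append_t4"}
	for _, table := range tables {
		r, _ := findCoveringRoute(routes, tableRouteKey(table))
		fmt.Printf("%s -> route %d\n", table, r.ID)
	}
}
```

Under these assumptions, route 1 no longer exists after the split, `jepsen_append_t1` resolves to the left child, and `jepsen_append_t2` through `jepsen_append_t4` resolve to the right child. That is why a cached parent ID cannot be reused on the next shuffle, and why a transaction touching `t1` plus any other table spans two routes.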
diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -136,7 +136,12 @@ function returning a `jepsen.nemesis/Nemesis` instance:
   (reify nemesis/Nemesis
     (setup! [this test] ...)
     (invoke! [this test op] ...
-       ;; shell out to elastickv-split with a fresh split key
+       ;; 1. call ListRoutes to find the route currently covering
+       ;;    the chosen split key — route IDs change after every
+       ;;    split, so a cached ID from setup is stale
+       ;; 2. pick a split key inside that route's range
+       ;; 3. shell out to elastickv-split with route-id +
+       ;;    split-key + expected-version from ListRoutes
        )
     (teardown! [this test] ...)))
 ```
@@ -146,39 +151,76 @@ The nemesis is composed with the existing
 via `jepsen.nemesis/compose`. The combined nemesis becomes the
 workload's `:nemesis`.
 
-**Split key picking strategy.** A simple monotonically-increasing
-counter: every `:start` invocation appends a fresh integer
-suffix to a fixed key prefix the workload reserves. This avoids
-collisions with the workload's keyspace and guarantees the
-split always picks a key that's between existing keys (so the
-operation succeeds against a real catalog).
-
-**Expected version.** The nemesis calls `ListRoutes` once at
-setup to learn the current catalog version, then increments its
-local copy by 1 after each successful split. Catalog drift
-(another split landing concurrently) is rare in practice — if it
-happens, the nemesis logs and refreshes from `ListRoutes`.
-
-### 3.3 Multi-shard workload guarantee
-
-The existing `dynamodb-append-workload` writes to a per-key
-queue. With a single shard layout, every write goes to that
-shard — no 2PC, no Composed-1 exposure.
-
-M5 needs the workload to consistently span shards. Two options:
-
-| Option | Mechanism | Pro | Con |
-|---|---|---|---|
-| **A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key |
-| **B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) |
-
-**Choose A.** Less invasive to the workload, and the
-route-shuffle nemesis itself increases the shard count over
-time, giving organic multi-shard coverage.
-
-The setup hook (`db/setup!` in Jepsen parlance, or the test's
-`:setup` map) runs `elastickv-split` once with a split key in
-the middle of the workload's keyspace.
+**Split key picking strategy (gemini medium R1).** Pick a split
+key from inside the DynamoDB **table-route** key space
+(`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`).
+Concretely, with N tables `jepsen_append_t1` …
+`jepsen_append_tN` per §3.3, the route key for table `tK` is
+`!ddb|route|table|jepsen_append_tK`. Splits happen between
+adjacent table-route keys — e.g. between `…jepsen_append_t2`
+and `…jepsen_append_t3`. This guarantees:
+
+- The split key falls **inside** the active workload route
+  range (not lexicographically before or after, which would
+  leave all workload keys on one side of the split).
+- Each side of the split owns a distinct set of tables, so
+  cross-table `TransactWriteItems` actually exercises 2PC.
+
+A prior revision of this doc proposed a `/split/<int>` prefix.
+That was lexicographically smaller than the workload's keyspace
+(`/` < `0` in ASCII), so every workload key ended up on the
+rightmost shard and the 2PC path was never exercised. Fixed
+above by anchoring split keys to the table-route prefix.
+
+**Route ID resolution (gemini medium R2).** The nemesis CANNOT
+rely on a single `ListRoutes` call + a local counter — every
+successful split deletes the parent route ID and creates two
+fresh child IDs, so a cached route ID is stale on the next
+shuffle. On every `:start` invocation the nemesis re-queries
+`ListRoutes`, walks the returned snapshot to find the route
+whose range contains the chosen split key, and sends that
+route's ID in the `SplitRangeRequest`, with the snapshot's
+`version` as `expected_catalog_version`. Catalog
+drift (another split landing concurrently between
+`ListRoutes` and `SplitRange`) surfaces as
+`ErrCatalogVersionMismatch` from the server; the nemesis logs
+and refreshes on the next tick.
+
+### 3.3 Multi-shard workload guarantee (revised post-codex P1)
+
+**Original §3.3 (Option A: single-key split in workload keyspace)
+was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB
+table-metadata, item, and GSI key to a single per-table route
+key (`!ddb|route|table|<tableSegment>`). So every
+`jepsen_append` item resolves to the SAME catalog point
+regardless of its partition-key value, and a `SplitRange`
+inside the item keyspace cannot put two items on different
+shards. The 2PC path (`dispatchMultiShardTxn`, secondary
+commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
+sentinel) would never fire — invalidating G2 (codex P1 on
+PR #905).
+
+**Revised strategy: multi-table workload.** The M5 workload
+creates `N` tables (default `N = 4`): `jepsen_append_t1` …
+`jepsen_append_t4`. Each `TransactWriteItems` operation writes
+to **at least two** distinct tables. The router maps each
+table to its own table-route key, so a cross-table txn
+naturally fans out across whichever shards own those route
+keys. The setup hook splits the table-route keyspace at
+`!ddb|route|table|jepsen_append_t2` so table 1 lives on one
+shard and tables 2–4 on another from t=0.
+
+| Concern | Resolution |
+|---|---|
+| Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). |
+| Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. |
+| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. |
+
+The setup hook (Jepsen `db/setup!`) is gated to run only on
+the FIRST node (`(when (= node (first (:nodes test))) …)`) so
+the initial split is not attempted concurrently by every
+cluster node and does not cause catalog-version conflicts
+during bootstrap (gemini medium R3).
 
 ### 3.4 Success criterion
 
@@ -221,8 +263,8 @@ mergeable on its own.
 
 | Phase | Title | Scope | Done when |
 |---|---|---|---|
-| M5a | CLI + workload setup | `cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup` issues the initial split; no nemesis yet. | `./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). |
-| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. |
+| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
+| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
 
 M5a is a small, focused PR (Go CLI + Clojure setup hook +
 docs). M5b carries the nemesis itself plus the cadence-tuning
@@ -238,23 +280,35 @@ analysis.
   answer: no, the partition nemesis is enough; adding a
   prewrite-interrupt would test `abortPreparedTxn`, which is
   out of M5's scope.
-- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is
-  cleaner but doubles the review burden. Tentative answer: two
-  if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in
-  a single screen. Decide at implementation time.
+- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two
+  is cleaner but doubles the review burden. With the §3.3
+  revision M5a is now meaningfully bigger (a new workload
+  variant, not just a setup hook), so two-PR is now the more
+  likely shape. Decide at implementation time.
 - **OQ-3.** Where does the new `cmd/elastickv-split` slot in
   the README and the `make` targets? Likely add it to
   `make tools`, mirror in `docs/operations/` (does this dir
   exist? — check at implementation). Out of scope for the
   design doc itself.
-- **OQ-4.** Should the M5 design doc rename happen with PR #900
-  merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle
-  guidance: rename `*_proposed_*.md` → `*_partial_*.md` after
-  PR #900 lands, then this M5 doc tracks the open milestone.
-  When M5 ships, rename the parent to `*_implemented_*.md` and
-  this M5 doc to `*_implemented_*.md` as well (or fold the M5
-  content back into the parent — tentative answer: keep them
-  separate so the M5 design history isn't lost).
+- **OQ-4** (resolved post-PR #900 merge). The parent doc
+  rename `*_proposed_*.md` → `*_partial_*.md` should land as a
+  separate small doc-only PR now that PR #900 is merged. When
+  M5 ships, rename both this doc and the parent to
+  `*_implemented_*.md` (tentative — keep both files separate
+  so the M5 design history isn't lost).
+- **OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the
+  right default? Trade-offs: more tables = better 2PC
+  fan-out coverage but slower setup and noisier history. The
+  workload's existing `:concurrency` defaults to 5, so 4
+  tables means each client touches ~all of them per txn on
+  average. Defer to implementation; revisit if the workload
+  becomes I/O-bound on table-meta lookups.
+- **OQ-6** (new, gemini medium R3 follow-up). The first-node
+  gate for setup splits assumes Jepsen's `(first (:nodes test))`
+  is stable across nodes; verify this matches actual Jepsen
+  semantics (it should — `:nodes` is the test config, not a
+  per-node view). Out of scope to design more carefully; will
+  test at M5a implementation.
 
 ---