You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Codex P1 + 3 gemini medium findings on the original PR #905
revision (ffb9c73). All addressed by revising sections 3.2,
3.3, and 4 (milestone breakdown) and adding OQ-5 / OQ-6:
* codex P1 — "Don't rely on item-key splits to shard DynamoDB
txns." Verified against kv/shard_key.go:94-124: every DynamoDB
table-metadata, item, and GSI key normalises to a SINGLE per-
table route key (!ddb|route|table|<tableSegment>). Splitting
inside a single-table workload's item keyspace cannot put two
items on different shards, so the 2PC path
(dispatchMultiShardTxn, secondary commits,
ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) would never
fire — invalidating G2.
Fix: replace single-key-split (Option A) with a NEW workload
variant dynamodb-append-multi-table-workload that creates N=4
tables (jepsen_append_t1 … jepsen_append_t4) and writes to >=2
distinct tables per TransactWriteItems. The router maps each
table to its own route key, so cross-table txns naturally fan
out across shards. The setup hook splits the table-route
keyspace at !ddb|route|table|jepsen_append_t2.
* gemini medium R1 — "Lexicographical Shard Split Issue." The
prior /split/<int> split-key prefix was lexicographically
smaller than the workload's keyspace ("/" < "0" in ASCII),
so every workload key ended up on the rightmost shard and
G2 was never exercised.
Fix: anchor split keys to the table-route prefix
!ddb|route|table|... so the split lands INSIDE the active
workload route range.
* gemini medium R2 — "Route ID Resolution for SplitRange."
Successful SplitRange deletes the parent route ID and
creates two child IDs, so a cached ID from a one-time
setup-time ListRoutes call is stale on the next shuffle.
Fix: nemesis re-queries ListRoutes on every :start
invocation, walks the snapshot to find the route covering
the chosen split key, and uses that route's ID +
snapshot.version as expected_catalog_version. Catalog drift
surfaces as ErrCatalogVersionMismatch from the server and the
nemesis refreshes on the next tick.
* gemini medium R3 — "Gating of Initial Split in Setup Hook."
Jepsen db/setup! runs on EVERY node; an ungated initial
split would be attempted concurrently by all nodes.
Fix: gate the setup-time split on
(when (= node (first (:nodes test))) ...) so only the first
node attempts it.
Also:
* Updated §4 milestone table: M5a now ships the new workload
variant (not just a setup hook), so it is meaningfully
bigger than the original §4 row suggested.
* Added OQ-5 (is N=4 the right default?) and OQ-6 (first-node
gate semantics) as follow-ups for implementation time.
* Resolved OQ-4: PR #900 has merged, so the parent doc rename
*_proposed_*.md → *_partial_*.md should now land as a
separate small doc-only PR.
@@ -136,7 +136,12 @@ function returning a `jepsen.nemesis/Nemesis` instance:
136
136
(reify nemesis/Nemesis
137
137
(setup! [this test] ...)
138
138
(invoke! [this test op] ...
139
-
;; shell out to elastickv-split with a fresh split key
139
+
;; 1. call ListRoutes to find the route currently covering
140
+
;; the chosen split key — route IDs change after every
141
+
;; split, so a cached ID from setup is stale
142
+
;; 2. pick a split key inside that route's range
143
+
;; 3. shell out to elastickv-split with route-id +
144
+
;; split-key + expected-version from ListRoutes
140
145
)
141
146
(teardown! [this test] ...)))
142
147
```
@@ -146,39 +151,76 @@ The nemesis is composed with the existing
146
151
via `jepsen.nemesis/compose`. The combined nemesis becomes the
147
152
workload's `:nemesis`.
148
153
149
-
**Split key picking strategy.** A simple monotonically-increasing
150
-
counter: every `:start` invocation appends a fresh integer
151
-
suffix to a fixed key prefix the workload reserves. This avoids
152
-
collisions with the workload's keyspace and guarantees the
153
-
split always picks a key that's between existing keys (so the
154
-
operation succeeds against a real catalog).
155
-
156
-
**Expected version.** The nemesis calls `ListRoutes` once at
157
-
setup to learn the current catalog version, then increments its
158
-
local copy by 1 after each successful split. Catalog drift
159
-
(another split landing concurrently) is rare in practice — if it
160
-
happens, the nemesis logs and refreshes from `ListRoutes`.
161
-
162
-
### 3.3 Multi-shard workload guarantee
163
-
164
-
The existing `dynamodb-append-workload` writes to a per-key
165
-
queue. With a single shard layout, every write goes to that
166
-
shard — no 2PC, no Composed-1 exposure.
167
-
168
-
M5 needs the workload to consistently span shards. Two options:
169
-
170
-
| Option | Mechanism | Pro | Con |
171
-
|---|---|---|---|
172
-
|**A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key |
173
-
|**B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) |
174
-
175
-
**Choose A.** Less invasive to the workload, and the
176
-
route-shuffle nemesis itself increases the shard count over
177
-
time, giving organic multi-shard coverage.
178
-
179
-
The setup hook (`db/setup!` in Jepsen parlance, or the test's
180
-
`:setup` map) runs `elastickv-split` once with a split key in
181
-
the middle of the workload's keyspace.
154
+
**Split key picking strategy (gemini medium R1).** Pick a split
155
+
key from inside the DynamoDB **table-route** key space
156
+
(`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`).
157
+
Concretely, with N tables `jepsen_append_t1` …
158
+
`jepsen_append_tN` per §3.3, the route key for table `tK` is
159
+
`!ddb|route|table|jepsen_append_tK`. Splits happen between
160
+
adjacent table-route keys — e.g. between `…jepsen_append_t2`
161
+
and `…jepsen_append_t3`. This guarantees:
162
+
163
+
- The split key falls **inside** the active workload route
164
+
range (not lexicographically before or after, which would
165
+
leave all workload keys on one side of the split).
166
+
- Each side of the split owns a distinct set of tables, so
`jepsen_append_t4`. Each `TransactWriteItems` operation writes
206
+
to **at least two** distinct tables. The router maps each
207
+
table to its own table-route key, so a cross-table txn
208
+
naturally fans out across whichever shards own those route
209
+
keys. The setup hook splits the table-route keyspace at
210
+
`!ddb|route|table|jepsen_append_t2` so tables 1 lives on one
211
+
shard and tables 2–4 on another from t=0.
212
+
213
+
| Concern | Resolution |
214
+
|---|---|
215
+
| Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). |
216
+
| Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. |
217
+
| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. |
218
+
219
+
The setup hook (Jepsen `db/setup!`) is gated to run only on
220
+
the FIRST node (`(when (= node (first (:nodes test))) …)`) so
221
+
the initial split is not attempted concurrently by every
222
+
cluster node and does not cause catalog-version conflicts
223
+
during bootstrap (gemini medium R3).
182
224
183
225
### 3.4 Success criterion
184
226
@@ -221,8 +263,8 @@ mergeable on its own.
221
263
222
264
| Phase | Title | Scope | Done when |
223
265
|---|---|---|---|
224
-
| M5a | CLI + workload setup |`cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup`issues the initial split; no nemesis yet. |`./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). |
225
-
| M5b | Route-shuffle nemesis |`jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. |
266
+
| M5a | CLI + multi-table workload |`cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. |`./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
267
+
| M5b | Route-shuffle nemesis |`jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
226
268
227
269
M5a is a small, focused PR (Go CLI + Clojure setup hook +
228
270
docs). M5b carries the nemesis itself plus the cadence-tuning
@@ -238,23 +280,35 @@ analysis.
238
280
answer: no, the partition nemesis is enough; adding a
239
281
prewrite-interrupt would test `abortPreparedTxn`, which is
240
282
out of M5's scope.
241
-
-**OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is
242
-
cleaner but doubles the review burden. Tentative answer: two
243
-
if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in
244
-
a single screen. Decide at implementation time.
283
+
-**OQ-2.** Do we ship M5a + M5b in a single PR or two? Two
284
+
is cleaner but doubles the review burden. With the §3.3
285
+
revision M5a is now meaningfully bigger (a new workload
286
+
variant, not just a setup hook), so two-PR is now the more
287
+
likely shape. Decide at implementation time.
245
288
-**OQ-3.** Where does the new `cmd/elastickv-split` slot in
246
289
the README and the `make` targets? Likely add it to
247
290
`make tools`, mirror in `docs/operations/` (does this dir
248
291
exist? — check at implementation). Out of scope for the
249
292
design doc itself.
250
-
-**OQ-4.** Should the M5 design doc rename happen with PR #900
251
-
merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle
252
-
guidance: rename `*_proposed_*.md` → `*_partial_*.md` after
253
-
PR #900 lands, then this M5 doc tracks the open milestone.
254
-
When M5 ships, rename the parent to `*_implemented_*.md` and
255
-
this M5 doc to `*_implemented_*.md` as well (or fold the M5
256
-
content back into the parent — tentative answer: keep them
257
-
separate so the M5 design history isn't lost).
293
+
-**OQ-4** (resolved post-PR #900 merge). The parent doc
294
+
rename `*_proposed_*.md` → `*_partial_*.md` should land as a
295
+
separate small doc-only PR now that PR #900 is merged. When
296
+
M5 ships, rename both this doc and the parent to
297
+
`*_implemented_*.md` (tentative — keep both files separate
298
+
so the M5 design history isn't lost).
299
+
-**OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the
300
+
right default? Trade-offs: more tables = better 2PC
301
+
fan-out coverage but slower setup and noisier history. The
302
+
workload's existing `:concurrency` defaults to 5, so 4
303
+
tables means each client touches ~all of them per txn on
304
+
average. Defer to implementation; revisit if the workload
305
+
becomes I/O-bound on table-meta lookups.
306
+
-**OQ-6** (new, gemini medium R3 follow-up). The first-node
307
+
gate for setup splits assumes Jepsen's `(first (:nodes test))`
308
+
is stable across nodes; verify this matches actual Jepsen
309
+
semantics (it should — `:nodes` is the test config, not a
310
+
per-node view). Out of scope to design more carefully; will
0 commit comments