Skip to content
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
338 changes: 338 additions & 0 deletions docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,338 @@
# Composed-1 M5 — Jepsen route-shuffle workload

Status: Proposed
Author: bootjp
Date: 2026-06-02
Parent design:
[`2026_05_29_proposed_composed1_cross_group_commit_guard.md`](2026_05_29_proposed_composed1_cross_group_commit_guard.md)

> **Forward-looking proposal, same posture as the parent doc.**
> Today's `SplitRange` is same-group only (per CLAUDE.md and
> `adapter/distribution_server.go`'s implementation), so the
> Composed-1 hazard the M3/M4 guard catches cannot yet be
> *triggered* in production. M5 ships the integration-test
> scaffolding — workload shape, nemesis, success criterion — so
> that:
>
> 1. The current `SplitRange` is exercised under realistic
> concurrent multi-shard write load and proved non-regressing
> (the workload finds **no** G1c, which is the baseline M4
> contract).
> 2. When a future PR introduces a route-mutating RPC that DOES
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Outdated
> shift ownership across groups (cross-group `SplitRange`,
> `MoveRange`, online rebalancer), the M5 workload — with a
> one-line nemesis change to call the new RPC — becomes the
> integration-level proof that M3+M4 hold under cross-group
> churn.

---

## 1. Goals and non-goals

### 1.1 Goals

- **G1.** Add a `route-shuffle` nemesis to the DynamoDB Jepsen
suite that issues `SplitRange` against the cluster at a
configurable cadence concurrently with the existing DynamoDB
workload's `TransactWriteItems` traffic.
- **G2.** Force the workload to issue **multi-shard** transactions
with high probability so the 2PC path (`dispatchMultiShardTxn`,
`commitSecondaryTxns`, the new
`ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` sentinel) is
exercised.
- **G3.** Verify the workload's Elle checker reports **zero G1c**
cycles after the run. This is M4's "Done when" criterion
promoted from a unit test to an integration test.
- **G4.** Land a tiny CLI helper (`cmd/elastickv-split`, or a
subcommand in `elastickv-admin`) that issues a single
`SplitRange` RPC by route ID + split key + expected version.
The Jepsen nemesis shells out to it rather than re-implementing
the gRPC client in Clojure.

### 1.2 Non-goals

- **NG1.** A cross-group `SplitRange` or `MoveRange` RPC. The
parent doc explicitly defers this; M5 must not depend on it.
- **NG2.** Reproducing a real Composed-1 anomaly. With same-group
`SplitRange` only, no such anomaly is reachable; the workload's
job is to prove the gate is non-regressing today and to be
ready for tomorrow.
- **NG3.** New Jepsen workload primitives (new operation types,
new generators outside the route-shuffle nemesis). The
existing `dynamodb-append-workload` is the right surface.
- **NG4.** Changing the DynamoDB adapter or Composed-1 code on
the server side. M5 is purely test-harness work.

---

## 2. Why this matters now

PR #900 lands M1+M2+M3+M4 on `feat/composed1-m4-retry`. The unit
tests cover each milestone in isolation. Two gaps remain that
only an integration test can close:

1. **End-to-end ordering.** The M3 gate runs inside the FSM
apply path; the M4 retry runs inside `ShardedCoordinator`;
the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
sentinel surfaces from `commitSecondaryTxns`. Each is unit-
tested in isolation. None of the tests run the full
prewrite → primary-commit → secondary-commit chain on real
Raft groups under concurrent client load, which is where
subtle apply-ordering bugs hide.
2. **Workload realism.** PR #900's nine review rounds each
surfaced an auto-pin overreach (read-write, caller-StartTS,
2PC secondary, resolver-claimed). The review process is
thorough, but a Jepsen run is the empirical check: if the
workload's Elle checker finds G1c against the M4 build, the
review missed something.

If we ship M4 to `main` without M5, every later change to
routing, OCC, or the FSM apply path lacks the integration-level
sentinel that would catch a regression. M5 closes that gap.

---

## 3. Design

### 3.1 SplitRange invocation CLI (`cmd/elastickv-split`)

A standalone Go binary, ~80 lines:

```
elastickv-split \
--address 127.0.0.1:50051 \
--route-id 100 \
--split-key /q1/00001 \
--expected-version 7
```
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Outdated

Reads the four flags, dials the leader, issues
`proto.Distribution/SplitRange`, prints the new catalog version
and the two child route IDs on success. Non-zero exit on any
error so the Jepsen nemesis sees the failure.

The CLI lives in `cmd/elastickv-split/main.go`. No tests beyond
a smoke test (`main_test.go`) that runs `elastickv-split --help`
and asserts non-zero exit on missing flags. The real coverage
is the Jepsen run itself.

**Alternative considered:** add a `split` subcommand to
`cmd/elastickv-admin/`. Rejected because `elastickv-admin` is
the HTTP fanout admin and conflating it with a gRPC control-
plane invocation would muddy its scope. A standalone tool is
clearer.

### 3.2 Route-shuffle nemesis (`jepsen/src/elastickv/composed1_nemesis.clj`)

A new Clojure file with a single `route-shuffle-nemesis`
function returning a `jepsen.nemesis/Nemesis` instance:

```
(defn route-shuffle-nemesis
"Periodically invokes elastickv-split against the cluster.
:start -> shuffle one route (pick a non-edge split key)
:stop -> no-op (splits are durable, no rollback)"
[opts]
(reify nemesis/Nemesis
(setup! [this test] ...)
(invoke! [this test op] ...
;; 1. call ListRoutes to find the route currently covering
;; the chosen split key — route IDs change after every
;; split, so a cached ID from setup is stale
;; 2. pick a split key inside that route's range
;; 3. shell out to elastickv-split with route-id +
;; split-key + expected-version from ListRoutes
)
(teardown! [this test] ...)))
```

The nemesis is composed with the existing
`jepsen.nemesis.combined/nemesis-package` (partitions + kills)
via `jepsen.nemesis/compose`. The combined nemesis becomes the
workload's `:nemesis`.

**Split key picking strategy (gemini medium R1).** Pick a split
key from inside the DynamoDB **table-route** key space
(`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`).
Concretely, with N tables `jepsen_append_t1` …
`jepsen_append_tN` per §3.3, the route key for table `tK` is
`!ddb|route|table|jepsen_append_tK`. Splits happen between
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Encode table segments before choosing split keys

The proposed split key uses the raw table name, but DynamoDB route keys are built from encodeDynamoSegment(tableName) (base64 RawURLEncoding in adapter/dynamodb.go) and kv.dynamoRouteTableKey appends that encoded segment. For tables like jepsen_append_t1, the real route keys start with !ddb|route|table|am..., so splitting at !ddb|route|table|jepsen_append_tK sorts after all of the workload's table-route keys and leaves every table on the same side; the route-shuffle workload would still miss the intended 2PC path.

Useful? React with 👍 / 👎.

adjacent table-route keys — e.g. between `…jepsen_append_t2`
and `…jepsen_append_t3`. This guarantees:

- The split key falls **inside** the active workload route
range (not lexicographically before or after, which would
leave all workload keys on one side of the split).
- Each side of the split owns a distinct set of tables, so
cross-table `TransactWriteItems` actually exercises 2PC.

A prior revision of this doc proposed a `/split/<int>` prefix.
That was lexicographically smaller than the workload's keyspace
(`/` < `0` in ASCII), so every workload key ended up on the
rightmost shard and the 2PC path was never exercised. Fixed
above by anchoring split keys to the table-route prefix.

**Route ID resolution (gemini medium R2).** The nemesis CANNOT
rely on a single `ListRoutes` call + a local counter — every
successful split deletes the parent route ID and creates two
fresh child IDs, so a cached route ID is stale on the next
shuffle. On every `:start` invocation the nemesis re-queries
`ListRoutes`, walks the returned snapshot to find the route
whose range contains the chosen split key, and uses that
route's ID + the snapshot's `version` as the
`SplitRangeRequest`'s `expected_catalog_version`. Catalog
drift (another split landing concurrently between
`ListRoutes` and `SplitRange`) surfaces as
`ErrCatalogVersionMismatch` from the server; the nemesis logs
and refreshes on the next tick.

### 3.3 Multi-shard workload guarantee (revised post-codex P1)

**Original §3.3 (Option A: single-key split in workload keyspace)
was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB
table-metadata, item, and GSI key to a single per-table route
key (`!ddb|route|table|<tableSegment>`). So every
`jepsen_append` item resolves to the SAME catalog point
regardless of its partition-key value, and a `SplitRange`
inside the item keyspace cannot put two items on different
shards. The 2PC path (`dispatchMultiShardTxn`, secondary
commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
sentinel) would never fire — invalidating G2 (codex P1 on
PR #905).

**Revised strategy: multi-table workload.** The M5 workload
creates `N` tables (default `N = 4`): `jepsen_append_t1` …
`jepsen_append_t4`. Each `TransactWriteItems` operation writes
to **at least two** distinct tables. The router maps each
table to its own table-route key, so a cross-table txn
naturally fans out across whichever shards own those route
keys. The setup hook splits the table-route keyspace at
`!ddb|route|table|jepsen_append_t2` so tables 1 lives on one
shard and tables 2–4 on another from t=0.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Seed cross-group routes before claiming 2PC coverage

Fresh evidence is that the revised setup hook now relies on SplitRange at the table-route boundary, but the current SplitRange only creates child routes with the parent's GroupID, while dispatchTxn enters dispatchMultiShardTxn only when mutations/read keys group to more than one Raft group. With just this initial split, tables 1 and 2–4 become separate ranges in the same group and the workload can satisfy the documented setup while still taking the single-shard transaction path; M5a needs an initial --shardRanges/routing plan that puts the table route keys on distinct groups (or a future cross-group mover) before using it as 2PC evidence.

Useful? React with 👍 / 👎.


| Concern | Resolution |
|---|---|
| Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). |
| Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. |
| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. |

The setup hook (Jepsen `db/setup!`) is gated to run only on
the FIRST node (`(when (= node (first (:nodes test))) …)`) so
the initial split is not attempted concurrently by every
cluster node and does not cause catalog-version conflicts
during bootstrap (gemini medium R3).

### 3.4 Success criterion

The workload's existing Elle checker emits a `:valid?` boolean
and a list of detected cycles (`:G0`, `:G1a`, `:G1b`, `:G1c`,
`:G-single`, etc.). M5's pass condition:

```
(and (:valid? results)
(zero? (count (filter #(= (:type %) :G1c) (:anomalies results))))
(zero? (count (filter #(= (:type %) :G-single) (:anomalies results)))))
```

`G1c` is the parent doc's explicit safety violation; `G-single`
is the closely-related single-item anomaly we already chase in
the existing workload. Other anomaly types (G0, G1a, G1b)
indicate orthogonal bugs and should also fail the run, but the
parent doc's M5 row names G1c as the headline criterion.

### 3.5 Cadence

Default `:route-shuffle-interval` = `30s`. Configurable via the
test CLI. Rationale: the workload's typical txn rate is ~10
ops/sec across 5 concurrent clients (= 50 ops/sec), so a
30s shuffle puts ~1500 txns between shuffles — enough to
plausibly catch a mid-2PC race, but rare enough that the run
doesn't degenerate into "every txn races a split."

The route-shuffle nemesis composes with the existing
partition+kill nemesis. The combined nemesis fires at the
shortest of its members' intervals (Jepsen default
behaviour); kills/partitions remain at their existing 40s.

---

## 4. Milestone breakdown

Two phases. The phases land in this order; the first is
mergeable on its own.

| Phase | Title | Scope | Done when |
|---|---|---|---|
| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |

M5a is a small, focused PR (Go CLI + Clojure setup hook +
docs). M5b carries the nemesis itself plus the cadence-tuning
analysis.

---

## 5. Open questions

- **OQ-1.** Should the nemesis also issue an `Abort`-shaped
fault that interrupts an in-flight 2PC mid-prewrite? The
existing partition nemesis effectively does this. Tentative
answer: no, the partition nemesis is enough; adding a
prewrite-interrupt would test `abortPreparedTxn`, which is
out of M5's scope.
- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two
is cleaner but doubles the review burden. With the §3.3
revision M5a is now meaningfully bigger (a new workload
variant, not just a setup hook), so two-PR is now the more
likely shape. Decide at implementation time.
- **OQ-3.** Where does the new `cmd/elastickv-split` slot in
the README and the `make` targets? Likely add it to
`make tools`, mirror in `docs/operations/` (does this dir
exist? — check at implementation). Out of scope for the
design doc itself.
- **OQ-4** (resolved post-PR #900 merge). The parent doc
rename `*_proposed_*.md` → `*_partial_*.md` should land as a
separate small doc-only PR now that PR #900 is merged. When
M5 ships, rename both this doc and the parent to
`*_implemented_*.md` (tentative — keep both files separate
so the M5 design history isn't lost).
- **OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the
right default? Trade-offs: more tables = better 2PC
fan-out coverage but slower setup and noisier history. The
workload's existing `:concurrency` defaults to 5, so 4
tables means each client touches ~all of them per txn on
average. Defer to implementation; revisit if the workload
becomes I/O-bound on table-meta lookups.
- **OQ-6** (new, gemini medium R3 follow-up). The first-node
gate for setup splits assumes Jepsen's `(first (:nodes test))`
is stable across nodes; verify this matches actual Jepsen
semantics (it should — `:nodes` is the test config, not a
per-node view). Out of scope to design more carefully; will
test at M5a implementation.

---

## 6. Self-review summary

Five-pass per CLAUDE.md:

1. **Data loss.** No new write paths; the CLI invokes the
existing `SplitRange` RPC which already has full unit + e2e
coverage. Nemesis-driven calls of an existing RPC can't lose
committed writes (worst case: a split fails and the test
fails, no data effect).
2. **Concurrency / distributed failures.** The nemesis runs
under Jepsen's existing concurrency harness alongside
partitions + kills. Combined behaviour is the *point* of the
test — if anything breaks, the workload finds it. No new
server-side concurrency code is being introduced.
3. **Performance.** Nemesis fires every 30s; CLI invocation is a
single short-lived gRPC call. No measurable impact on hot
paths.
4. **Data consistency.** This IS the data-consistency check
(G1c = serializability violation). The success criterion is
the property we want.
5. **Test coverage.** M5 ships the integration test; the
smoke test on the CLI is the only unit-level coverage,
correctly. The CLI's logic is thin enough that a smoke test
plus the Jepsen run constitute adequate coverage.
Loading