-
Notifications
You must be signed in to change notification settings - Fork 2
docs(design): Composed-1 M5 — Jepsen route-shuffle workload proposal #905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
ffb9c73
f5d2ad7
2481286
673bee8
3ca2a7f
87dcfe3
b752a89
6f024aa
f92a029
58d6ac5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,338 @@ | ||
| # Composed-1 M5 — Jepsen route-shuffle workload | ||
|
|
||
| Status: Proposed | ||
| Author: bootjp | ||
| Date: 2026-06-02 | ||
| Parent design: | ||
| [`2026_05_29_proposed_composed1_cross_group_commit_guard.md`](2026_05_29_proposed_composed1_cross_group_commit_guard.md) | ||
|
|
||
| > **Forward-looking proposal, same posture as the parent doc.** | ||
| > Today's `SplitRange` is same-group only (per CLAUDE.md and | ||
| > `adapter/distribution_server.go`'s implementation), so the | ||
| > Composed-1 hazard the M3/M4 guard catches cannot yet be | ||
| > *triggered* in production. M5 ships the integration-test | ||
| > scaffolding — workload shape, nemesis, success criterion — so | ||
| > that: | ||
| > | ||
| > 1. The current `SplitRange` is exercised under realistic | ||
| > concurrent multi-shard write load and proved non-regressing | ||
| > (the workload finds **no** G1c, which is the baseline M4 | ||
| > contract). | ||
| > 2. When a future PR introduces a route-mutating RPC that DOES | ||
| > shift ownership across groups (cross-group `SplitRange`, | ||
| > `MoveRange`, online rebalancer), the M5 workload — with a | ||
| > one-line nemesis change to call the new RPC — becomes the | ||
| > integration-level proof that M3+M4 hold under cross-group | ||
| > churn. | ||
|
|
||
| --- | ||
|
|
||
| ## 1. Goals and non-goals | ||
|
|
||
| ### 1.1 Goals | ||
|
|
||
| - **G1.** Add a `route-shuffle` nemesis to the DynamoDB Jepsen | ||
| suite that issues `SplitRange` against the cluster at a | ||
| configurable cadence concurrently with the existing DynamoDB | ||
| workload's `TransactWriteItems` traffic. | ||
| - **G2.** Force the workload to issue **multi-shard** transactions | ||
| with high probability so the 2PC path (`dispatchMultiShardTxn`, | ||
| `commitSecondaryTxns`, the new | ||
| `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` sentinel) is | ||
| exercised. | ||
| - **G3.** Verify the workload's Elle checker reports **zero G1c** | ||
| cycles after the run. This is M4's "Done when" criterion | ||
| promoted from a unit test to an integration test. | ||
| - **G4.** Land a tiny CLI helper (`cmd/elastickv-split`, or a | ||
| subcommand in `elastickv-admin`) that issues a single | ||
| `SplitRange` RPC by route ID + split key + expected version. | ||
| The Jepsen nemesis shells out to it rather than re-implementing | ||
| the gRPC client in Clojure. | ||
|
|
||
| ### 1.2 Non-goals | ||
|
|
||
| - **NG1.** A cross-group `SplitRange` or `MoveRange` RPC. The | ||
| parent doc explicitly defers this; M5 must not depend on it. | ||
| - **NG2.** Reproducing a real Composed-1 anomaly. With same-group | ||
| `SplitRange` only, no such anomaly is reachable; the workload's | ||
| job is to prove the gate is non-regressing today and to be | ||
| ready for tomorrow. | ||
| - **NG3.** New Jepsen workload primitives (new operation types, | ||
| new generators outside the route-shuffle nemesis). The | ||
| existing `dynamodb-append-workload` is the right surface. | ||
| - **NG4.** Changing the DynamoDB adapter or Composed-1 code on | ||
| the server side. M5 is purely test-harness work. | ||
|
|
||
| --- | ||
|
|
||
| ## 2. Why this matters now | ||
|
|
||
| PR #900 lands M1+M2+M3+M4 on `feat/composed1-m4-retry`. The unit | ||
| tests cover each milestone in isolation. Two gaps remain that | ||
| only an integration test can close: | ||
|
|
||
| 1. **End-to-end ordering.** The M3 gate runs inside the FSM | ||
| apply path; the M4 retry runs inside `ShardedCoordinator`; | ||
| the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` | ||
| sentinel surfaces from `commitSecondaryTxns`. Each is unit- | ||
| tested in isolation. None of the tests run the full | ||
| prewrite → primary-commit → secondary-commit chain on real | ||
| Raft groups under concurrent client load, which is where | ||
| subtle apply-ordering bugs hide. | ||
| 2. **Workload realism.** PR #900's nine review rounds each | ||
| surfaced an auto-pin overreach (read-write, caller-StartTS, | ||
| 2PC secondary, resolver-claimed). The review process is | ||
| thorough, but a Jepsen run is the empirical check: if the | ||
| workload's Elle checker finds G1c against the M4 build, the | ||
| review missed something. | ||
|
|
||
| If we ship M4 to `main` without M5, every later change to | ||
| routing, OCC, or the FSM apply path lacks the integration-level | ||
| sentinel that would catch a regression. M5 closes that gap. | ||
|
|
||
| --- | ||
|
|
||
| ## 3. Design | ||
|
|
||
| ### 3.1 SplitRange invocation CLI (`cmd/elastickv-split`) | ||
|
|
||
| A standalone Go binary, ~80 lines: | ||
|
|
||
| ``` | ||
| elastickv-split \ | ||
| --address 127.0.0.1:50051 \ | ||
| --route-id 100 \ | ||
| --split-key /q1/00001 \ | ||
| --expected-version 7 | ||
| ``` | ||
|
coderabbitai[bot] marked this conversation as resolved.
Outdated
|
||
|
|
||
| Reads the four flags, dials the leader, issues | ||
| `proto.Distribution/SplitRange`, prints the new catalog version | ||
| and the two child route IDs on success. Non-zero exit on any | ||
| error so the Jepsen nemesis sees the failure. | ||
|
|
||
| The CLI lives in `cmd/elastickv-split/main.go`. No tests beyond | ||
| a smoke test (`main_test.go`) that runs `elastickv-split --help` | ||
| and asserts non-zero exit on missing flags. The real coverage | ||
| is the Jepsen run itself. | ||
|
|
||
| **Alternative considered:** add a `split` subcommand to | ||
| `cmd/elastickv-admin/`. Rejected because `elastickv-admin` is | ||
| the HTTP fanout admin and conflating it with a gRPC control- | ||
| plane invocation would muddy its scope. A standalone tool is | ||
| clearer. | ||
|
|
||
| ### 3.2 Route-shuffle nemesis (`jepsen/src/elastickv/composed1_nemesis.clj`) | ||
|
|
||
| A new Clojure file with a single `route-shuffle-nemesis` | ||
| function returning a `jepsen.nemesis/Nemesis` instance: | ||
|
|
||
| ``` | ||
| (defn route-shuffle-nemesis | ||
| "Periodically invokes elastickv-split against the cluster. | ||
| :start -> shuffle one route (pick a non-edge split key) | ||
| :stop -> no-op (splits are durable, no rollback)" | ||
| [opts] | ||
| (reify nemesis/Nemesis | ||
| (setup! [this test] ...) | ||
| (invoke! [this test op] ... | ||
| ;; 1. call ListRoutes to find the route currently covering | ||
| ;; the chosen split key — route IDs change after every | ||
| ;; split, so a cached ID from setup is stale | ||
| ;; 2. pick a split key inside that route's range | ||
| ;; 3. shell out to elastickv-split with route-id + | ||
| ;; split-key + expected-version from ListRoutes | ||
| ) | ||
| (teardown! [this test] ...))) | ||
| ``` | ||
|
|
||
| The nemesis is composed with the existing | ||
| `jepsen.nemesis.combined/nemesis-package` (partitions + kills) | ||
| via `jepsen.nemesis/compose`. The combined nemesis becomes the | ||
| workload's `:nemesis`. | ||
|
|
||
| **Split key picking strategy (gemini medium R1).** Pick a split | ||
| key from inside the DynamoDB **table-route** key space | ||
| (`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`). | ||
| Concretely, with N tables `jepsen_append_t1` … | ||
| `jepsen_append_tN` per §3.3, the route key for table `tK` is | ||
| `!ddb|route|table|jepsen_append_tK`. Splits happen between | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The proposed split key uses the raw table name, but DynamoDB route keys are built from Useful? React with 👍 / 👎. |
||
| adjacent table-route keys — e.g. between `…jepsen_append_t2` | ||
| and `…jepsen_append_t3`. This guarantees: | ||
|
|
||
| - The split key falls **inside** the active workload route | ||
| range (not lexicographically before or after, which would | ||
| leave all workload keys on one side of the split). | ||
| - Each side of the split owns a distinct set of tables, so | ||
| cross-table `TransactWriteItems` actually exercises 2PC. | ||
|
|
||
| A prior revision of this doc proposed a `/split/<int>` prefix. | ||
| That was lexicographically smaller than the workload's keyspace | ||
| (`/` < `0` in ASCII), so every workload key ended up on the | ||
| rightmost shard and the 2PC path was never exercised. Fixed | ||
| above by anchoring split keys to the table-route prefix. | ||
|
|
||
| **Route ID resolution (gemini medium R2).** The nemesis CANNOT | ||
| rely on a single `ListRoutes` call + a local counter — every | ||
| successful split deletes the parent route ID and creates two | ||
| fresh child IDs, so a cached route ID is stale on the next | ||
| shuffle. On every `:start` invocation the nemesis re-queries | ||
| `ListRoutes`, walks the returned snapshot to find the route | ||
| whose range contains the chosen split key, and uses that | ||
| route's ID + the snapshot's `version` as the | ||
| `SplitRangeRequest`'s `expected_catalog_version`. Catalog | ||
| drift (another split landing concurrently between | ||
| `ListRoutes` and `SplitRange`) surfaces as | ||
| `ErrCatalogVersionMismatch` from the server; the nemesis logs | ||
| and refreshes on the next tick. | ||
|
|
||
| ### 3.3 Multi-shard workload guarantee (revised post-codex P1) | ||
|
|
||
| **Original §3.3 (Option A: single-key split in workload keyspace) | ||
| was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB | ||
| table-metadata, item, and GSI key to a single per-table route | ||
| key (`!ddb|route|table|<tableSegment>`). So every | ||
| `jepsen_append` item resolves to the SAME catalog point | ||
| regardless of its partition-key value, and a `SplitRange` | ||
| inside the item keyspace cannot put two items on different | ||
| shards. The 2PC path (`dispatchMultiShardTxn`, secondary | ||
| commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` | ||
| sentinel) would never fire — invalidating G2 (codex P1 on | ||
| PR #905). | ||
|
|
||
| **Revised strategy: multi-table workload.** The M5 workload | ||
| creates `N` tables (default `N = 4`): `jepsen_append_t1` … | ||
| `jepsen_append_t4`. Each `TransactWriteItems` operation writes | ||
| to **at least two** distinct tables. The router maps each | ||
| table to its own table-route key, so a cross-table txn | ||
| naturally fans out across whichever shards own those route | ||
| keys. The setup hook splits the table-route keyspace at | ||
| `!ddb|route|table|jepsen_append_t2` so tables 1 lives on one | ||
| shard and tables 2–4 on another from t=0. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Fresh evidence is that the revised setup hook now relies on Useful? React with 👍 / 👎. |
||
|
|
||
| | Concern | Resolution | | ||
| |---|---| | ||
| | Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). | | ||
| | Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. | | ||
| | Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. | | ||
|
|
||
| The setup hook (Jepsen `db/setup!`) is gated to run only on | ||
| the FIRST node (`(when (= node (first (:nodes test))) …)`) so | ||
| the initial split is not attempted concurrently by every | ||
| cluster node and does not cause catalog-version conflicts | ||
| during bootstrap (gemini medium R3). | ||
|
|
||
| ### 3.4 Success criterion | ||
|
|
||
| The workload's existing Elle checker emits a `:valid?` boolean | ||
| and a list of detected cycles (`:G0`, `:G1a`, `:G1b`, `:G1c`, | ||
| `:G-single`, etc.). M5's pass condition: | ||
|
|
||
| ``` | ||
| (and (:valid? results) | ||
| (zero? (count (filter #(= (:type %) :G1c) (:anomalies results)))) | ||
| (zero? (count (filter #(= (:type %) :G-single) (:anomalies results))))) | ||
| ``` | ||
|
|
||
| `G1c` is the parent doc's explicit safety violation; `G-single` | ||
| is the closely-related single-item anomaly we already chase in | ||
| the existing workload. Other anomaly types (G0, G1a, G1b) | ||
| indicate orthogonal bugs and should also fail the run, but the | ||
| parent doc's M5 row names G1c as the headline criterion. | ||
|
|
||
| ### 3.5 Cadence | ||
|
|
||
| Default `:route-shuffle-interval` = `30s`. Configurable via the | ||
| test CLI. Rationale: the workload's typical txn rate is ~10 | ||
| ops/sec across 5 concurrent clients (= 50 ops/sec), so a | ||
| 30s shuffle puts ~1500 txns between shuffles — enough to | ||
| plausibly catch a mid-2PC race, but rare enough that the run | ||
| doesn't degenerate into "every txn races a split." | ||
|
|
||
| The route-shuffle nemesis composes with the existing | ||
| partition+kill nemesis. The combined nemesis fires at the | ||
| shortest of its members' intervals (Jepsen default | ||
| behaviour); kills/partitions remain at their existing 40s. | ||
|
|
||
| --- | ||
|
|
||
| ## 4. Milestone breakdown | ||
|
|
||
| Two phases. The phases land in this order; the first is | ||
| mergeable on its own. | ||
|
|
||
| | Phase | Title | Scope | Done when | | ||
| |---|---|---|---| | ||
| | M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | | ||
| | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. | | ||
|
|
||
| M5a is a small, focused PR (Go CLI + Clojure setup hook + | ||
| docs). M5b carries the nemesis itself plus the cadence-tuning | ||
| analysis. | ||
|
|
||
| --- | ||
|
|
||
| ## 5. Open questions | ||
|
|
||
| - **OQ-1.** Should the nemesis also issue an `Abort`-shaped | ||
| fault that interrupts an in-flight 2PC mid-prewrite? The | ||
| existing partition nemesis effectively does this. Tentative | ||
| answer: no, the partition nemesis is enough; adding a | ||
| prewrite-interrupt would test `abortPreparedTxn`, which is | ||
| out of M5's scope. | ||
| - **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two | ||
| is cleaner but doubles the review burden. With the §3.3 | ||
| revision M5a is now meaningfully bigger (a new workload | ||
| variant, not just a setup hook), so two-PR is now the more | ||
| likely shape. Decide at implementation time. | ||
| - **OQ-3.** Where does the new `cmd/elastickv-split` slot in | ||
| the README and the `make` targets? Likely add it to | ||
| `make tools`, mirror in `docs/operations/` (does this dir | ||
| exist? — check at implementation). Out of scope for the | ||
| design doc itself. | ||
| - **OQ-4** (resolved post-PR #900 merge). The parent doc | ||
| rename `*_proposed_*.md` → `*_partial_*.md` should land as a | ||
| separate small doc-only PR now that PR #900 is merged. When | ||
| M5 ships, rename both this doc and the parent to | ||
| `*_implemented_*.md` (tentative — keep both files separate | ||
| so the M5 design history isn't lost). | ||
| - **OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the | ||
| right default? Trade-offs: more tables = better 2PC | ||
| fan-out coverage but slower setup and noisier history. The | ||
| workload's existing `:concurrency` defaults to 5, so 4 | ||
| tables means each client touches ~all of them per txn on | ||
| average. Defer to implementation; revisit if the workload | ||
| becomes I/O-bound on table-meta lookups. | ||
| - **OQ-6** (new, gemini medium R3 follow-up). The first-node | ||
| gate for setup splits assumes Jepsen's `(first (:nodes test))` | ||
| is stable across nodes; verify this matches actual Jepsen | ||
| semantics (it should — `:nodes` is the test config, not a | ||
| per-node view). Out of scope to design more carefully; will | ||
| test at M5a implementation. | ||
|
|
||
| --- | ||
|
|
||
| ## 6. Self-review summary | ||
|
|
||
| Five-pass per CLAUDE.md: | ||
|
|
||
| 1. **Data loss.** No new write paths; the CLI invokes the | ||
| existing `SplitRange` RPC which already has full unit + e2e | ||
| coverage. Nemesis-driven calls of an existing RPC can't lose | ||
| committed writes (worst case: a split fails and the test | ||
| fails, no data effect). | ||
| 2. **Concurrency / distributed failures.** The nemesis runs | ||
| under Jepsen's existing concurrency harness alongside | ||
| partitions + kills. Combined behaviour is the *point* of the | ||
| test — if anything breaks, the workload finds it. No new | ||
| server-side concurrency code is being introduced. | ||
| 3. **Performance.** Nemesis fires every 30s; CLI invocation is a | ||
| single short-lived gRPC call. No measurable impact on hot | ||
| paths. | ||
| 4. **Data consistency.** This IS the data-consistency check | ||
| (G1c = serializability violation). The success criterion is | ||
| the property we want. | ||
| 5. **Test coverage.** M5 ships the integration test; the | ||
| smoke test on the CLI is the only unit-level coverage, | ||
| correctly. The CLI's logic is thin enough that a smoke test | ||
| plus the Jepsen run constitute adequate coverage. | ||
Uh oh!
There was an error while loading. Please reload this page.