Commit afd2b1c
feat(sqs): Jepsen HT-FIFO workload (Phase 3.D PR 7b) (#738)
## Summary
Phase 3.D PR 7b — Jepsen HT-FIFO workload that stresses partitioned-FIFO
queues against the three contracts AWS HT-FIFO is supposed to honour
even under partition and node-loss faults: **within-group ordering**,
**no message loss**, **no duplicates**.
Pattern follows [aphyr's Jepsen RabbitMQ
analysis](https://aphyr.com/posts/315-jepsen-rabbitmq): track every
`:send` and `:recv` in the operation history, then a custom checker
verifies the contracts against the recorded events at the end of the
run.
## What's in this PR
- **`jepsen/project.clj`** — Adds `com.cognitect.aws/sqs` at the same
version as the existing dynamodb dep, so the SDK wire protocol (auth,
retry classification, error parsing) is exercised end-to-end against
elastickv rather than a hand-rolled HTTP layer.
- **`jepsen/src/elastickv/db.clj`** — Extends `start-node!` to accept
`:sqs-port` (a port spec shaped like `:dynamo-port`) and `:sqs-region`. Both are
optional, so the node arguments generated for existing dynamodb / s3 / redis
test specs are byte-identical when `:sqs-port` is absent.
- **`jepsen/src/elastickv/jepsen_test.clj`** — Registers
`elastickv-sqs-htfifo-test` alongside the other workloads.
- **`jepsen/src/elastickv/sqs_htfifo_workload.clj`** (new, ~430 lines) —
The workload. Uses cognitect/aws-api SQS, creates an HT-FIFO queue with
`PartitionCount=4` + `ContentBasedDeduplication`, runs sends and
receives across N `MessageGroupId` values, and the custom
`ht-fifo-checker` validates the three contracts.
- **`jepsen/test/elastickv/sqs_htfifo_workload_test.clj`** (new) —
Pure-function tests for the checker plus integration smoke tests for the
test-spec builder. 11 tests / 27 assertions.
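
The `db.clj` change relies on the new options being strictly additive. As a rough illustration of that property (a Python sketch, not the Clojure in `start-node!`), the helper below appends SQS arguments only when a port is supplied. The `node_args` helper, the flag names `--exampleSqsAddress` and `--exampleSqsRegion`, and the treatment of the port as a plain integer are all hypothetical placeholders, not elastickv's real interface.

```python
def node_args(base_args, opts):
    """Return the argument list for one node.

    `base_args` stands for whatever the non-SQS workloads already pass.
    The flag names below are placeholders, not elastickv's real flags.
    """
    args = list(base_args)  # copy so the caller's list is never mutated
    sqs_port = opts.get("sqs_port")
    if sqs_port is not None:
        args += ["--exampleSqsAddress", f"0.0.0.0:{sqs_port}"]
        # Assumption: the region only matters once the SQS listener is on.
        region = opts.get("sqs_region")
        if region is not None:
            args += ["--exampleSqsRegion", region]
    return args
```

With no `sqs_port` in `opts`, the result equals `base_args` element for element, which is the property that keeps the other workloads' arguments unchanged.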
## Checker contracts
For each `MessageGroupId` independently:
1. **Within-group ordering** — the sequence of received `seq` values,
sorted by global completion time across all consumers, is monotonically
non-decreasing.
2. **No loss** — every `(group, seq)` successfully `:sent` eventually
appears in the `:recv` history. Sends with `:info` status are treated as
possibly-committed and not counted as lost.
3. **No duplicates** — every `(group, seq)` appears at most once in the
`:recv` history. `ContentBasedDeduplication` on the queue + a unique
`(group, seq)` body is what enforces this server-side; a duplicate here
is a real bug (e.g. a deletion that did not commit).
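
To make the three contracts concrete, here is a minimal Python sketch of a checker with the same shape. It is not the `ht-fifo-checker` from this PR (which is Clojure). The history format is an assumption for illustration: a list of completed operations in completion order, each a dict with `f`, `type`, `group`, and `seq`, and one message per receive.

```python
from collections import defaultdict

def check_ht_fifo(history):
    """Check ordering, loss, and duplicates per MessageGroupId.

    `history` is assumed to be a list of completed ops in completion order.
    Each op is a dict with "f" ("send" or "recv"), "type" ("ok", "info",
    or "fail"), "group", and "seq".
    """
    sent_ok = defaultdict(set)    # group -> seqs whose send was acknowledged
    received = defaultdict(list)  # group -> seqs in receive-completion order

    for op in history:
        if op["f"] == "send" and op["type"] == "ok":
            sent_ok[op["group"]].add(op["seq"])
        elif op["f"] == "recv" and op["type"] == "ok":
            received[op["group"]].append(op["seq"])
        # "info" and "fail" sends are ignored, so they are never reported lost.

    errors = {"out-of-order": [], "lost": [], "duplicated": []}
    for group in sorted(set(sent_ok) | set(received)):
        seqs = received[group]
        # Contract 1: received seqs never go backwards within a group.
        for prev, cur in zip(seqs, seqs[1:]):
            if cur < prev:
                errors["out-of-order"].append((group, prev, cur))
        # Contract 2: every acknowledged send appears in the receive history.
        for seq in sorted(sent_ok[group] - set(seqs)):
            errors["lost"].append((group, seq))
        # Contract 3: no (group, seq) is received more than once.
        seen = set()
        for seq in seqs:
            if seq in seen:
                errors["duplicated"].append((group, seq))
            seen.add(seq)

    return {"valid": not any(errors.values()), **errors}
```

Because each group is checked independently, receives from different groups may interleave freely without tripping the ordering contract.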
## Open-endpoint mode
The elastickv server starts without `--sqsCredentialsFile`, so the SQS
adapter accepts any signed request (mirroring how the S3 adapter is
wired in jepsen today). The SDK client signs with dummy credentials, so
the SigV4 path is still exercised end-to-end at the protocol level.
## Self-review (5 lenses)
1. **Data loss** — N/A; this is a test-only PR. The workload's whole
purpose is to *detect* data loss in the system under test.
2. **Concurrency** — The shared per-group `seq-counter` is an `atom`
updated via `swap!` (CAS-based), so concurrent sends from different
worker threads always assign distinct seqs. The checker is pure; no
shared mutable state.
3. **Performance** — Test-only code, runs at low rate (5
ops/sec/worker). Not on any hot path.
4. **Data consistency** — The checker compares committed sends against
the receive history globally, so every consistency assertion is
evaluated at end-of-run against the complete history. Sends with `:info`
(uncertain commit) are excluded from the loss set, matching Jepsen's
standard approach.
5. **Test coverage** — The 11-test suite pins the checker's contract
surface (clean / loss / info-not-loss / duplicates / within-group
ordering / cross-group interleaving / failed-send-not-counted /
empty-receive) and, through integration smoke tests, the test-spec
builder. The workload itself runs end-to-end against a real cluster via
`lein run -m elastickv.sqs-htfifo-workload`; that run is operator-driven
and not part of the merge gate.
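
The concurrency point in item 2 can be illustrated with a small Python sketch. It is a stand-in rather than the PR's code: the workload uses a Clojure `atom` with `swap!` (a CAS retry loop), whereas this sketch uses a `threading.Lock` to make the read-increment-write step atomic. The class name `GroupSeqCounter`, the `message_body` helper, and the body format are hypothetical.

```python
import threading
from collections import defaultdict

class GroupSeqCounter:
    """Hands out a distinct, increasing seq per message group.

    A lock plays the role that swap! plays on a Clojure atom: no two
    callers can observe the same counter value for the same group.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._last = defaultdict(int)  # group -> last seq handed out

    def next_seq(self, group):
        with self._lock:
            self._last[group] += 1
            return self._last[group]

def message_body(group, seq):
    # Unique per (group, seq), so content-based deduplication cannot
    # collapse two distinct sends into one.
    return f"{group}:{seq}"
```

Each worker would call `next_seq(group)` immediately before sending, so the seq recorded in the history is the one embedded in the message body.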
## Test plan
- [x] `lein test elastickv.sqs-htfifo-workload-test` — 11 tests / 27
assertions pass
- [x] `lein test` for non-redis suite (dynamodb / dynamodb-types / s3 /
cli / sqs-htfifo) — 21 tests / 41 assertions pass
- [ ] End-to-end live cluster run — operator-driven (out of scope for
the merge gate; relies on a 3-node cluster setup)
The `elastickv.redis-workload` namespace fails to load due to the empty
`redis/src/` tree, which is pre-existing on main and unrelated to this
PR.
## Out of scope (next milestones)
- Wiring the workload into `scripts/run-jepsen-local.sh` — the existing
script is dynamodb-only; an sqs counterpart lands as a follow-up.
- Multi-shard cluster topology that lands distinct partitions on
distinct Raft groups. This PR's `PartitionCount=4` routes to the default
group on a single-shard cluster — partitioning logic (different keys per
partition, ordering preserved within group) is fully exercised, but the
cross-shard scaling story is gated on separate work.
- Design-doc lifecycle rename (`*_proposed_*.md` → `*_partial_*.md`) —
that is §11 PR 8 in the design doc and is tracked separately.
## Refs
- `docs/design/2026_04_26_proposed_sqs_split_queue_fifo.md` §11 PR 7.
- Closes the testing half of §11 PR 7. PR 7a (metrics) shipped at #737.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Added AWS SQS integration with HT-FIFO support, SQS port/region configuration, and runtime options to exercise FIFO dedupe/order semantics
* **Tests**
  * Added comprehensive unit and workload tests validating ordering, no-loss, no-duplicates, and option handling
* **Chores**
  * CI updated to run the SQS HT-FIFO workload as part of Jepsen test runs
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

7 files changed: 823 additions & 3 deletions
File tree
- .github/workflows
- jepsen
- src/elastickv
- test/elastickv