[testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2)#6344
Open
ma2bd wants to merge 11 commits into
Open
[testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2)#6344ma2bd wants to merge 11 commits into
ma2bd wants to merge 11 commits into
Conversation
added 5 commits
May 19, 2026 17:26
Previously, every ScyllaDB `write_batch` containing a `DeletePrefix` that overlapped a `Put` in the same batch paid a `find_keys_by_prefix` read inside `UnorderedBatch::expand_colliding_prefix_deletions`. The expansion existed because all statements in a CQL unlogged batch share one write timestamp by default, and at equal timestamps a range tombstone shadows a same-batch insert. This fired on the hot path (notably `reset_chain_manager` clearing `pending_*_blobs` views). Split the write into two sequential unlogged batches: phase 1 issues the prefix-deletes only, phase 2 issues the puts and single-key deletes. The `.await` between phases plus token-aware routing gives phase 2 a strictly later coordinator timestamp, so LWW resolves the intended ordering without any read. In exclusive mode we additionally set explicit client timestamps `T` and `T+1` (via `Batch::set_timestamp`) to keep ordering robust against client/coordinator clock interactions; the per-store timestamp floor is seeded once on first write by reading `WRITETIME` of a reserved sentinel row (empty clustering key, which is unused by views and journaling alike), then advanced monotonically by 2 µs per batch. Shared mode uses coordinator-assigned timestamps. Drop the now-unused `expand_colliding_prefix_deletions`. The `SimplifiedBatch::from_batch` impl for `UnorderedBatch` becomes pure (`Ok(batch.simplify())`), so the per-batch read disappears at the journaling-layer boundary.
In exclusive mode we own the write timestamps, so there is no need to split write_batch into two sequential CQL batches (the split exists only for shared mode, where the .await between phases lets the coordinator stamp the data later). Issuing two batches also broke the atomicity that write_batch callers rely on: if the prefix-delete batch committed and the data batch failed, the partition was left with prefixes deleted but their replacement data missing. Issue the whole exclusive write as one unlogged batch with explicit per-statement USING TIMESTAMP: prefix-deletions at T, single-key deletions, insertions, and the sentinel at T+1. The higher data timestamp keeps a range tombstone from shadowing a same-batch insert, and ordering is pinned by the timestamps rather than by send order, so atomicity is preserved. Shared mode keeps its two-phase, coordinator-timestamped path.
afck
approved these changes
May 20, 2026
Co-authored-by: Andreas Fackler <afck@users.noreply.github.com> Signed-off-by: Mathieu Baudet <1105398+ma2bd@users.noreply.github.com>
added 4 commits
May 20, 2026 21:28
In exclusive mode, `write_batch` writes a per-store timestamp sentinel at the reserved empty clustering key. That row is an internal implementation detail, but `find_keys_by_prefix` / `find_key_values_by_prefix` did not hide it: an empty-prefix scan (`k >= ''`) matches everything, including the sentinel. The row then reached the `ValueSplittingDatabase` wrapper, whose `read_index_from_key` requires a >=4-byte chunk-index suffix and failed the 0-byte sentinel key with `TooShortKey` (seen as a `test_reads_scylla_db` panic in CI). Skip the sentinel row in both prefix-scan helpers. It can only ever match an empty-prefix scan, since for any non-empty prefix `p`, `k >= p` excludes the empty key.
Exclusive mode reserves the empty clustering key (WRITETIME_SENTINEL_KEY) for the per-store timestamp sentinel, and prefix scans now deliberately hide that key. So any caller content written at the empty key would be silently invisible to reads (besides colliding with the sentinel row). Enforce the reservation: `write_batch` rejects any insertion or single-key deletion with a zero-length key via a new `ZeroLengthKey` error, so a violation fails loudly. This also matches DynamoDB, which already forbids zero-length keys. Prefix deletions are left untouched: an empty prefix is a legitimate range operation, and in exclusive mode the sentinel is rewritten at T+1 within the same batch.
`d.as_micros()` returns u128; the `as i64` casts in the timestamp-floor seeding tripped `clippy::cast_possible_truncation` (denied in CI). Convert via `i64::try_from(..).ok()`, falling back to the existing default. Also drop the redundant turbofish on `batch_values` (the element type is inferred from the pushes), matching the other two batch helpers.
Align run_reads with `main`: iterate prefixes of length 0..=len (was 1..=len) so the empty-prefix scan is exercised. This covers `find_keys_by_prefix(&[])` against the exclusive ScyllaDB store, which must not surface the timestamp sentinel row.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Reland of #6334 (merged then reverted), now fixing the atomicity regression that the first attempt introduced.
Previously, every ScyllaDB
write_batchcontaining aDeletePrefixthat overlapped aPutin the same batch paid afind_keys_by_prefixread insideUnorderedBatch::expand_colliding_prefix_deletions. The expansion existed because all statements in a CQL unlogged batch share one write timestamp by default, and at equal timestamps a range tombstone shadows a same-batch insert. This fired on the hot path (notablyreset_chain_managerclearingpending_*_blobsviews).Proposal
Resolve the in-batch prefix/insert collision with explicit write timestamps instead of a read, while preserving
write_batchatomicity.USING TIMESTAMP— prefix-deletions atT, single-key deletions, insertions, and a sentinel atT+1. The higher data timestamp keeps a range tombstone from shadowing a same-batch insert, and the intended ordering is pinned by the timestamps rather than by send order. The per-store timestamp floor is seeded once on first write by reading theWRITETIMEof a reserved sentinel row (empty clustering key, unused by views and journaling alike), then advanced monotonically.This supersedes the reverted #6334: that version also split the exclusive write into two sequential batches, which broke atomicity — a failure between phases could leave prefixes deleted but their replacement data missing. Since exclusive mode controls the timestamps, the ordering no longer depends on issuing two batches, so the whole write is now one atomic batch.
Drop the now-unused
expand_colliding_prefix_deletions.UnorderedBatch'sSimplifiedBatch::from_batchbecomes pure (Ok(batch.simplify())), so the per-batch read disappears at the journaling-layer boundary.SimpleUnorderedBatchis unchanged and still used by DynamoDB, which needs full expansion as it has no native range-delete primitive.Test Plan
CI, plus new exclusive-mode tests in
linera-views/tests/store_tests.rs:test_scylla_db_writes_from_state_exclusive— routes theDeletePrefix-overlapping-Putcases through the explicitT/T+1path.test_scylla_db_exclusive_seed_after_restart— checks that a reconnected store reseeds its timestamp floor from the persisted sentinel and resumes strictly above prior data.Release Plan
front-port to
main