Skip to content

[testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2)#6344

Open
ma2bd wants to merge 11 commits into
testnet_conwayfrom
timestamp_scylladb
Open

[testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2)#6344
ma2bd wants to merge 11 commits into
testnet_conwayfrom
timestamp_scylladb

Conversation

@ma2bd
Copy link
Copy Markdown
Contributor

@ma2bd ma2bd commented May 20, 2026

Motivation

Reland of #6334 (merged then reverted), now fixing the atomicity regression that the first attempt introduced.

Previously, every ScyllaDB write_batch containing a DeletePrefix that overlapped a Put in the same batch paid a find_keys_by_prefix read inside UnorderedBatch::expand_colliding_prefix_deletions. The expansion existed because all statements in a CQL unlogged batch share one write timestamp by default, and at equal timestamps a range tombstone shadows a same-batch insert. This fired on the hot path (notably reset_chain_manager clearing pending_*_blobs views).

Proposal

Resolve the in-batch prefix/insert collision with explicit write timestamps instead of a read, while preserving write_batch atomicity.

  • Exclusive mode (single writer per partition): we own the write timestamps, so we issue the whole write as a single atomic unlogged batch with explicit per-statement USING TIMESTAMP — prefix-deletions at T, single-key deletions, insertions, and a sentinel at T+1. The higher data timestamp keeps a range tombstone from shadowing a same-batch insert, and the intended ordering is pinned by the timestamps rather than by send order. The per-store timestamp floor is seeded once on first write by reading the WRITETIME of a reserved sentinel row (empty clustering key, unused by views and journaling alike), then advanced monotonically.
  • Shared mode (coordinator-assigned timestamps): we split the write into two sequential unlogged batches — phase 1 the prefix-deletes, phase 2 the puts and single-key deletes.

This supersedes the reverted #6334: that version also split the exclusive write into two sequential batches, which broke atomicity — a failure between phases could leave prefixes deleted but their replacement data missing. Since exclusive mode controls the timestamps, the ordering no longer depends on issuing two batches, so the whole write is now one atomic batch.

Drop the now-unused expand_colliding_prefix_deletions. UnorderedBatch's SimplifiedBatch::from_batch becomes pure (Ok(batch.simplify())), so the per-batch read disappears at the journaling-layer boundary. SimpleUnorderedBatch is unchanged and still used by DynamoDB, which needs full expansion as it has no native range-delete primitive.

Test Plan

CI, plus new exclusive-mode tests in linera-views/tests/store_tests.rs:

  • test_scylla_db_writes_from_state_exclusive — routes the DeletePrefix-overlapping-Put cases through the explicit T/T+1 path.
  • test_scylla_db_exclusive_seed_after_restart — checks that a reconnected store reseeds its timestamp floor from the persisted sentinel and resumes strictly above prior data.

Release Plan

front-port to main

Mathieu Baudet added 5 commits May 19, 2026 17:26
Previously, every ScyllaDB `write_batch` containing a `DeletePrefix`
that overlapped a `Put` in the same batch paid a `find_keys_by_prefix`
read inside `UnorderedBatch::expand_colliding_prefix_deletions`. The
expansion existed because all statements in a CQL unlogged batch share
one write timestamp by default, and at equal timestamps a range
tombstone shadows a same-batch insert. This fired on the hot path
(notably `reset_chain_manager` clearing `pending_*_blobs` views).

Split the write into two sequential unlogged batches: phase 1 issues
the prefix-deletes only, phase 2 issues the puts and single-key
deletes. The `.await` between phases plus token-aware routing gives
phase 2 a strictly later coordinator timestamp, so LWW resolves the
intended ordering without any read. In exclusive mode we additionally
set explicit client timestamps `T` and `T+1` (via `Batch::set_timestamp`)
to keep ordering robust against client/coordinator clock interactions;
the per-store timestamp floor is seeded once on first write by reading
`WRITETIME` of a reserved sentinel row (empty clustering key, which is
unused by views and journaling alike), then advanced monotonically by
2 µs per batch. Shared mode uses coordinator-assigned timestamps.

Drop the now-unused `expand_colliding_prefix_deletions`. The
`SimplifiedBatch::from_batch` impl for `UnorderedBatch` becomes pure
(`Ok(batch.simplify())`), so the per-batch read disappears at the
journaling-layer boundary.
In exclusive mode we own the write timestamps, so there is no need to
split write_batch into two sequential CQL batches (the split exists only
for shared mode, where the .await between phases lets the coordinator
stamp the data later). Issuing two batches also broke the atomicity that
write_batch callers rely on: if the prefix-delete batch committed and the
data batch failed, the partition was left with prefixes deleted but their
replacement data missing.

Issue the whole exclusive write as one unlogged batch with explicit
per-statement USING TIMESTAMP: prefix-deletions at T, single-key
deletions, insertions, and the sentinel at T+1. The higher data timestamp
keeps a range tombstone from shadowing a same-batch insert, and ordering
is pinned by the timestamps rather than by send order, so atomicity is
preserved. Shared mode keeps its two-phase, coordinator-timestamped path.
@ma2bd ma2bd requested review from Twey and afck May 20, 2026 14:29
Comment thread linera-views/src/backends/scylla_db.rs Outdated
Comment thread linera-views/src/backends/scylla_db.rs Outdated
Comment thread linera-views/src/backends/scylla_db.rs
ma2bd and others added 2 commits May 20, 2026 12:38
Co-authored-by: Andreas Fackler <afck@users.noreply.github.com>
Signed-off-by: Mathieu Baudet <1105398+ma2bd@users.noreply.github.com>
Mathieu Baudet added 4 commits May 20, 2026 21:28
In exclusive mode, `write_batch` writes a per-store timestamp sentinel at
the reserved empty clustering key. That row is an internal implementation
detail, but `find_keys_by_prefix` / `find_key_values_by_prefix` did not
hide it: an empty-prefix scan (`k >= ''`) matches everything, including
the sentinel. The row then reached the `ValueSplittingDatabase` wrapper,
whose `read_index_from_key` requires a >=4-byte chunk-index suffix and
failed the 0-byte sentinel key with `TooShortKey` (seen as a
`test_reads_scylla_db` panic in CI).

Skip the sentinel row in both prefix-scan helpers. It can only ever match
an empty-prefix scan, since for any non-empty prefix `p`, `k >= p`
excludes the empty key.
Exclusive mode reserves the empty clustering key (WRITETIME_SENTINEL_KEY)
for the per-store timestamp sentinel, and prefix scans now deliberately
hide that key. So any caller content written at the empty key would be
silently invisible to reads (besides colliding with the sentinel row).

Enforce the reservation: `write_batch` rejects any insertion or single-key
deletion with a zero-length key via a new `ZeroLengthKey` error, so a
violation fails loudly. This also matches DynamoDB, which already forbids
zero-length keys.

Prefix deletions are left untouched: an empty prefix is a legitimate
range operation, and in exclusive mode the sentinel is rewritten at T+1
within the same batch.
`d.as_micros()` returns u128; the `as i64` casts in the timestamp-floor
seeding tripped `clippy::cast_possible_truncation` (denied in CI). Convert
via `i64::try_from(..).ok()`, falling back to the existing default. Also
drop the redundant turbofish on `batch_values` (the element type is
inferred from the pushes), matching the other two batch helpers.
Align run_reads with `main`: iterate prefixes of length 0..=len (was
1..=len) so the empty-prefix scan is exercised. This covers
`find_keys_by_prefix(&[])` against the exclusive ScyllaDB store, which
must not surface the timestamp sentinel row.
@ma2bd ma2bd force-pushed the timestamp_scylladb branch from b6fdfe8 to aaf2dc5 Compare May 21, 2026 02:03
@ma2bd ma2bd changed the title [testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes [testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2) May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants