[testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2) by ma2bd · Pull Request #6344 · linera-io/linera-protocol

ma2bd · 2026-05-20T14:26:28Z

Motivation

Reland of #6334 (merged then reverted), now fixing the atomicity regression that the first attempt introduced.

Previously, every ScyllaDB write_batch containing a DeletePrefix that overlapped a Put in the same batch paid a find_keys_by_prefix read inside UnorderedBatch::expand_colliding_prefix_deletions. The expansion existed because all statements in a CQL unlogged batch share one write timestamp by default, and at equal timestamps a range tombstone shadows a same-batch insert. This fired on the hot path (notably reset_chain_manager clearing pending_*_blobs views).
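
The tie-breaking rule described above can be pictured with a toy last-write-wins model. This is only a sketch and not ScyllaDB's reconciliation code: the `RangeTombstone` and `LiveCell` types and the `is_visible` helper are hypothetical, and the only rule modelled is the one stated in the paragraph (at equal timestamps the range tombstone wins).

```rust
/// Toy model of a range tombstone covering every key with a given prefix.
#[derive(Debug, Clone, Copy)]
struct RangeTombstone<'a> {
    prefix: &'a [u8],
    timestamp: i64,
}

/// Toy model of an inserted cell with its write timestamp.
#[derive(Debug, Clone, Copy)]
struct LiveCell<'a> {
    key: &'a [u8],
    timestamp: i64,
}

/// A cell stays visible only if it is strictly newer than every
/// tombstone that covers its key (ties go to the tombstone).
fn is_visible(cell: LiveCell<'_>, tombstones: &[RangeTombstone<'_>]) -> bool {
    tombstones
        .iter()
        .filter(|t| cell.key.starts_with(t.prefix))
        .all(|t| cell.timestamp > t.timestamp)
}

fn main() {
    let key: &[u8] = b"pending/blob1";
    let tombstones = [RangeTombstone { prefix: b"pending/", timestamp: 100 }];

    // Same timestamp as the prefix delete: the insert is shadowed.
    println!("{}", is_visible(LiveCell { key, timestamp: 100 }, &tombstones));
    // One tick later: the insert survives.
    println!("{}", is_visible(LiveCell { key, timestamp: 101 }, &tombstones));
}
```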

Proposal

Resolve the in-batch prefix/insert collision with explicit write timestamps instead of a read, while preserving write_batch atomicity.

Exclusive mode (single writer per partition): we own the write timestamps, so we issue the whole write as a single atomic unlogged batch with an explicit per-statement USING TIMESTAMP: prefix deletions at T; single-key deletions, insertions, and a sentinel at T+1. The higher data timestamp keeps a range tombstone from shadowing a same-batch insert, and the intended ordering is pinned by the timestamps rather than by send order. The per-store timestamp floor is seeded once on first write by reading the WRITETIME of a reserved sentinel row (empty clustering key, unused by views and journaling alike), then advanced monotonically.
Shared mode (coordinator-assigned timestamps): we split the write into two sequential unlogged batches: phase 1 issues the prefix deletes, phase 2 the puts and single-key deletes.
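
A rough sketch of the exclusive-mode timestamp assignment, assuming a hypothetical table `kv.t` with columns `root_key`, `k`, and `v`. The `Op` enum and both helpers are illustrative names, not the PR's code. Only the `USING TIMESTAMP` clause and the T / T+1 split come from the description above.

```rust
/// Hypothetical kinds of statement in one exclusive-mode batch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    DeletePrefix,
    DeleteKey,
    Put,
    Sentinel,
}

/// Prefix deletions are stamped at T; everything else at T+1, so data
/// written in the same batch is strictly newer than the range tombstone.
fn timestamp_for(op: Op, t: i64) -> i64 {
    match op {
        Op::DeletePrefix => t,
        Op::DeleteKey | Op::Put | Op::Sentinel => t + 1,
    }
}

/// CQL text with the timestamp as a bind marker (table and column names
/// are assumptions made for this sketch).
fn cql_for(op: Op) -> &'static str {
    match op {
        Op::DeletePrefix => {
            "DELETE FROM kv.t USING TIMESTAMP ? WHERE root_key = ? AND k >= ? AND k < ?"
        }
        Op::DeleteKey => "DELETE FROM kv.t USING TIMESTAMP ? WHERE root_key = ? AND k = ?",
        Op::Put | Op::Sentinel => {
            "INSERT INTO kv.t (root_key, k, v) VALUES (?, ?, ?) USING TIMESTAMP ?"
        }
    }
}

fn main() {
    let t = 1_000_000;
    for op in [Op::DeletePrefix, Op::DeleteKey, Op::Put, Op::Sentinel] {
        println!("{} -- timestamp {}", cql_for(op), timestamp_for(op, t));
    }
}
```

Because every statement carries its own timestamp, the order in which the statements appear inside the batch does not affect the outcome.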

This supersedes the reverted #6334: that version also split the exclusive write into two sequential batches, which broke atomicity — a failure between phases could leave prefixes deleted but their replacement data missing. Since exclusive mode controls the timestamps, the ordering no longer depends on issuing two batches, so the whole write is now one atomic batch.

Drop the now-unused expand_colliding_prefix_deletions. UnorderedBatch's SimplifiedBatch::from_batch becomes pure (Ok(batch.simplify())), so the per-batch read disappears at the journaling-layer boundary. SimpleUnorderedBatch is unchanged and still used by DynamoDB, which needs full expansion as it has no native range-delete primitive.

Test Plan

CI, plus new exclusive-mode tests in linera-views/tests/store_tests.rs:

test_scylla_db_writes_from_state_exclusive — routes the DeletePrefix-overlapping-Put cases through the explicit T/T+1 path.
test_scylla_db_exclusive_seed_after_restart — checks that a reconnected store reseeds its timestamp floor from the persisted sentinel and resumes strictly above prior data.

Release Plan

front-port to main

Previously, every ScyllaDB `write_batch` containing a `DeletePrefix` that overlapped a `Put` in the same batch paid a `find_keys_by_prefix` read inside `UnorderedBatch::expand_colliding_prefix_deletions`. The expansion existed because all statements in a CQL unlogged batch share one write timestamp by default, and at equal timestamps a range tombstone shadows a same-batch insert. This fired on the hot path (notably `reset_chain_manager` clearing `pending_*_blobs` views).

Split the write into two sequential unlogged batches: phase 1 issues the prefix-deletes only, phase 2 issues the puts and single-key deletes. The `.await` between phases plus token-aware routing gives phase 2 a strictly later coordinator timestamp, so LWW resolves the intended ordering without any read.

In exclusive mode we additionally set explicit client timestamps `T` and `T+1` (via `Batch::set_timestamp`) to keep ordering robust against client/coordinator clock interactions; the per-store timestamp floor is seeded once on first write by reading `WRITETIME` of a reserved sentinel row (empty clustering key, which is unused by views and journaling alike), then advanced monotonically by 2 µs per batch. Shared mode uses coordinator-assigned timestamps.

Drop the now-unused `expand_colliding_prefix_deletions`. The `SimplifiedBatch::from_batch` impl for `UnorderedBatch` becomes pure (`Ok(batch.simplify())`), so the per-batch read disappears at the journaling-layer boundary.
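
The timestamp floor mentioned in this commit message could look roughly like the following. `TimestampFloor`, `seed`, and `reserve` are hypothetical names, and seeding from the maximum of the local clock and the sentinel's write time is an assumption of this sketch. Only "seed once from the sentinel, then advance by 2 µs per batch" comes from the message.

```rust
/// Hypothetical per-store allocator of write timestamps (microseconds).
struct TimestampFloor {
    next: i64,
}

impl TimestampFloor {
    /// Seed strictly above the sentinel's persisted WRITETIME, if any.
    /// Taking the max with the local clock is an assumption of this sketch.
    fn seed(sentinel_writetime: Option<i64>, now_micros: i64) -> Self {
        let above_sentinel = sentinel_writetime.map_or(i64::MIN, |w| w.saturating_add(1));
        Self { next: now_micros.max(above_sentinel) }
    }

    /// Reserve the pair (T, T+1) for one batch and advance the floor by 2.
    fn reserve(&mut self) -> (i64, i64) {
        let t = self.next;
        self.next += 2;
        (t, t + 1)
    }
}

fn main() {
    // A restarted store whose sentinel was last written at 500 µs,
    // while the local clock reads 100 µs.
    let mut floor = TimestampFloor::seed(Some(500), 100);
    println!("{:?}", floor.reserve());
    println!("{:?}", floor.reserve());
}
```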

In exclusive mode we own the write timestamps, so there is no need to split `write_batch` into two sequential CQL batches (the split exists only for shared mode, where the `.await` between phases lets the coordinator stamp the data later). Issuing two batches also broke the atomicity that `write_batch` callers rely on: if the prefix-delete batch committed and the data batch failed, the partition was left with prefixes deleted but their replacement data missing.

Issue the whole exclusive write as one unlogged batch with an explicit per-statement `USING TIMESTAMP`: prefix deletions at T; single-key deletions, insertions, and the sentinel at T+1. The higher data timestamp keeps a range tombstone from shadowing a same-batch insert, and ordering is pinned by the timestamps rather than by send order, so atomicity is preserved. Shared mode keeps its two-phase, coordinator-timestamped path.

Co-authored-by: Andreas Fackler <afck@users.noreply.github.com> Signed-off-by: Mathieu Baudet <1105398+ma2bd@users.noreply.github.com>

In exclusive mode, `write_batch` writes a per-store timestamp sentinel at the reserved empty clustering key. That row is an internal implementation detail, but `find_keys_by_prefix` / `find_key_values_by_prefix` did not hide it: an empty-prefix scan (`k >= ''`) matches everything, including the sentinel. The row then reached the `ValueSplittingDatabase` wrapper, whose `read_index_from_key` requires a >=4-byte chunk-index suffix and failed the 0-byte sentinel key with `TooShortKey` (seen as a `test_reads_scylla_db` panic in CI). Skip the sentinel row in both prefix-scan helpers. It can only ever match an empty-prefix scan, since for any non-empty prefix `p`, `k >= p` excludes the empty key.
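
A minimal in-memory sketch of the filtering idea, assuming keys are plain byte vectors. The `find_keys_by_prefix` below is a stand-in written for illustration, not the ScyllaDB helper of the same name, and it returns full keys for simplicity.

```rust
/// The reserved empty clustering key that holds the timestamp sentinel.
const WRITETIME_SENTINEL_KEY: &[u8] = &[];

/// In-memory stand-in for a prefix scan that hides the sentinel row.
fn find_keys_by_prefix(rows: &[Vec<u8>], prefix: &[u8]) -> Vec<Vec<u8>> {
    rows.iter()
        .filter(|k| k.starts_with(prefix))
        // The empty key can only match when `prefix` is empty.
        .filter(|k| k.as_slice() != WRITETIME_SENTINEL_KEY)
        .cloned()
        .collect()
}

fn main() {
    let rows: Vec<Vec<u8>> = vec![vec![], vec![1, 2], vec![1, 3], vec![2]];
    // The empty-prefix scan matches every row but drops the sentinel.
    println!("{:?}", find_keys_by_prefix(&rows, &[]));
    // A non-empty prefix never matches the empty key in the first place.
    println!("{:?}", find_keys_by_prefix(&rows, &[1]));
}
```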

Exclusive mode reserves the empty clustering key (`WRITETIME_SENTINEL_KEY`) for the per-store timestamp sentinel, and prefix scans now deliberately hide that key. So any caller content written at the empty key would be silently invisible to reads (besides colliding with the sentinel row). Enforce the reservation: `write_batch` rejects any insertion or single-key deletion with a zero-length key via a new `ZeroLengthKey` error, so a violation fails loudly. This also matches DynamoDB, which already forbids zero-length keys. Prefix deletions are left untouched: an empty prefix is a legitimate range operation, and in exclusive mode the sentinel is rewritten at T+1 within the same batch.
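
The reservation check might be sketched as follows. `WriteOp`, `WriteError`, and `check_keys` are hypothetical names for this illustration. Only the `ZeroLengthKey` error name and the rule (reject empty keys for insertions and single-key deletions, allow empty prefixes) come from the commit message.

```rust
#[derive(Debug, PartialEq)]
enum WriteError {
    ZeroLengthKey,
}

/// Hypothetical batch operations, modelled on the three kinds in the text.
#[allow(dead_code)]
enum WriteOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
    DeletePrefix { prefix: Vec<u8> },
}

/// Reject any insertion or single-key deletion at the reserved empty key.
/// Prefix deletions are not checked: an empty prefix is a valid range.
fn check_keys(ops: &[WriteOp]) -> Result<(), WriteError> {
    for op in ops {
        match op {
            WriteOp::Put { key, .. } | WriteOp::Delete { key } if key.is_empty() => {
                return Err(WriteError::ZeroLengthKey);
            }
            _ => {}
        }
    }
    Ok(())
}

fn main() {
    let bad = [WriteOp::Put { key: vec![], value: vec![1] }];
    println!("{:?}", check_keys(&bad));

    let fine = [
        WriteOp::DeletePrefix { prefix: vec![] },
        WriteOp::Put { key: vec![1], value: vec![2] },
    ];
    println!("{:?}", check_keys(&fine));
}
```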

`d.as_micros()` returns u128; the `as i64` casts in the timestamp-floor seeding tripped `clippy::cast_possible_truncation` (denied in CI). Convert via `i64::try_from(..).ok()`, falling back to the existing default. Also drop the redundant turbofish on `batch_values` (the element type is inferred from the pushes), matching the other two batch helpers.
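
For illustration, the checked conversion could be written as below. `micros_since_epoch`, `clock_floor`, and `DEFAULT_FLOOR_MICROS` are hypothetical names, and the fallback value of 0 is an assumption, since the commit message does not state the existing default.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical fallback used when the clock value is unusable.
const DEFAULT_FLOOR_MICROS: i64 = 0;

/// `Duration::as_micros` returns `u128`; convert without a truncating cast.
fn micros_since_epoch(d: Duration) -> Option<i64> {
    i64::try_from(d.as_micros()).ok()
}

/// Clock reading in microseconds, falling back to the default on any error
/// (clock before the epoch, or a value that does not fit in `i64`).
fn clock_floor(now: SystemTime) -> i64 {
    now.duration_since(UNIX_EPOCH)
        .ok()
        .and_then(micros_since_epoch)
        .unwrap_or(DEFAULT_FLOOR_MICROS)
}

fn main() {
    println!("{}", clock_floor(SystemTime::now()));
}
```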

Align `run_reads` with `main`: iterate prefixes of length `0..=len` (was `1..=len`) so the empty-prefix scan is exercised. This covers `find_keys_by_prefix(&[])` against the exclusive ScyllaDB store, which must not surface the timestamp sentinel row.
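
The change in range can be illustrated with a small helper. `prefixes` is a hypothetical function written for this sketch and is not taken from the test code.

```rust
/// All prefixes of `key`, from the empty prefix up to the full key.
/// Starting the range at 1 instead of 0 would skip the empty prefix.
fn prefixes(key: &[u8]) -> Vec<&[u8]> {
    (0..=key.len()).map(|n| &key[..n]).collect()
}

fn main() {
    let key = [7u8, 8, 9];
    for p in prefixes(&key) {
        println!("{:?}", p);
    }
}
```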

Mathieu Baudet added 5 commits May 19, 2026 17:26

fix formatting

11c4ec1

nit: simplify writetime arithmetic

52ebbc3

add tests

00b3ce1

ma2bd requested review from Twey and afck May 20, 2026 14:29

afck approved these changes May 20, 2026

ma2bd and others added 2 commits May 20, 2026 12:38

Update linera-views/src/backends/scylla_db.rs

b2f7175

Co-authored-by: Andreas Fackler <afck@users.noreply.github.com> Signed-off-by: Mathieu Baudet <1105398+ma2bd@users.noreply.github.com>

more nits

5c0c290

ma2bd force-pushed the timestamp_scylladb branch from a771d20 to 5c0c290 May 20, 2026 16:44

ma2bd mentioned this pull request May 20, 2026

[main] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes #6342

Open

ma2bd requested review from ndr-ds and removed request for Twey May 20, 2026 23:29

Mathieu Baudet added 4 commits May 20, 2026 21:28

ma2bd force-pushed the timestamp_scylladb branch from b6fdfe8 to aaf2dc5 May 21, 2026 02:03

ma2bd changed the title ~~[testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes~~ [testnet] ScyllaDB: eliminate per-batch read in write_batch via two-phase writes (take 2) May 21, 2026

ma2bd wants to merge 11 commits into testnet_conway from timestamp_scylladb
