lsm: fix snapshot::get() skipping tombstones in SST files
A run (buildkite/82930, h/t Willem) of DebugRowsTest
test_read_and_write_rows caught a metastore validation failure where
partition metadata appeared to still exist after all rows had been
deleted. The partition_validator found stale metadata (via a get())
while correctly seeing that extents and terms were deleted (via iterator
scans).
In this particular test, a metastore flush happened to occur between
deleting our rows and the post-test validation running, which exposed
a bug in the LSM.
The SST point lookup path (snapshot::get -> impl::get -> version::get
-> sst::reader::internal_get) constructs an internal seek key with
the snapshot's seqno and value_type::value (=0). However, our LSM
encodes tombstone=1 and value=0. Internal keys sort by (seqno, type)
descending, so a tombstone (type=1) sorts before a value (type=0) at the
same seqno. The seek for the first entry >= the lookup key therefore
skips the tombstone and lands on the stale value from a prior seqno.
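The ordering problem can be reproduced in isolation. The following is a
minimal sketch, not the actual reader code: internal_key, internal_less
and seek are hypothetical stand-ins, and only the encoding (value=0,
tombstone=1) and the descending (seqno, type) order are taken from the
description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encoding as described in this commit message: value=0, tombstone=1.
enum class value_type : uint8_t { value = 0, tombstone = 1 };

// Hypothetical internal key layout, for illustration only.
struct internal_key {
    std::string user_key;
    uint64_t seqno;
    value_type type;
};

// User key ascending, then (seqno, type) descending.
bool internal_less(const internal_key& a, const internal_key& b) {
    if (a.user_key != b.user_key) return a.user_key < b.user_key;
    if (a.seqno != b.seqno) return a.seqno > b.seqno;
    return static_cast<uint8_t>(a.type) > static_cast<uint8_t>(b.type);
}

// Seek to the first entry >= lookup; return it only if the user key matches.
// The input must already be sorted by internal_less.
const internal_key* seek(const std::vector<internal_key>& sst,
                         const internal_key& lookup) {
    auto it = std::lower_bound(sst.begin(), sst.end(), lookup, internal_less);
    if (it == sst.end() || it->user_key != lookup.user_key) return nullptr;
    return &*it;
}

// A tiny "SST": key k was written at seqno 3 and deleted at seqno 7.
std::vector<internal_key> example_sst() {
    return {
        {"k", 7, value_type::tombstone},
        {"k", 3, value_type::value},
    };
}
```

With this model, seeking with {"k", 7, value_type::value} lands on the
stale entry at seqno 3, because the tombstone at seqno 7 sorts before
the lookup key. Seeking with {"k", 7, value_type::tombstone} lands on
the tombstone.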
Iterator-based reads are unaffected because the db_iterator resolves
tombstones during iteration by deduplicating entries per user key.
The memtable's get() is also unaffected because it strips the type
before doing a lower_bound lookup.
This commit fixes snapshot::get() and database::get() to build the seek
key with value_type::tombstone, the highest type value, so that the seek
key is the earliest possible internal key for a given (user_key, seqno)
pair. It also removes the now-stale dassert in memtable::get() that
rejected non-value lookup key types.
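As a rough illustration of why the highest type value is the right
choice, here is a sketch under stated assumptions: make_seek_key is a
hypothetical helper, not a function in the tree, and internal_key and
internal_less only model the ordering described in this message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <utility>

enum class value_type : uint8_t { value = 0, tombstone = 1 };

// Hypothetical internal key layout, for illustration only.
struct internal_key {
    std::string user_key;
    uint64_t seqno;
    value_type type;
};

// User key ascending, then (seqno, type) descending. Swapping seqno and
// type between the two sides of the comparison yields the descending order.
bool internal_less(const internal_key& a, const internal_key& b) {
    return std::tie(a.user_key, b.seqno, b.type)
         < std::tie(b.user_key, a.seqno, a.type);
}

// Hypothetical helper: build the seek key with the highest type value so
// that no entry with the same (user_key, seqno) can sort before it.
internal_key make_seek_key(std::string user_key, uint64_t snapshot_seqno) {
    return internal_key{std::move(user_key), snapshot_seqno,
                        value_type::tombstone};
}
```

In this model, neither a value nor a tombstone at the snapshot's seqno
compares less than the key returned by make_seek_key, so a seek for the
first entry >= that key cannot skip either of them. Entries with a
newer seqno still sort before it and stay invisible to the snapshot.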
(cherry picked from commit 76009fc)