Skip to content

Implement forward and reverse iterators for the find_{keys,key_values}_by_prefix#6202

Open
MathieuDutSik wants to merge 27 commits into
linera-io:mainfrom
MathieuDutSik:find_key_async_iterator
Open

Implement forward and reverse iterators for the find_{keys,key_values}_by_prefix#6202
MathieuDutSik wants to merge 27 commits into
linera-io:mainfrom
MathieuDutSik:find_key_async_iterator

Conversation

@MathieuDutSik
Copy link
Copy Markdown
Contributor

Motivation

The find_{keys,key_values}_by_prefix functions are building sets that are intrinsically big.
Async iterators are actually the right solution here.

Proposal

Add the functions in the trait and a default implementation. The default implementation is used for the linera-storage-service, indexed-db, System-api and ViewContainer.

The types are inspired from #4975
The reverse iterators are inspired from #6183 which is itself inspired by #6171

For RocksDB, DynamoDB, ScyllaDB, the functionality essentially already exists in the code.
The `ValueSplitting is the more challenging part.

Test Plan

CI

Tests have been added to the run_reads functions.

Release Plan

Can be backported to testnet_conway.

Links

None

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Instruction Count Benchmark Results

Baseline: 43bf309cfa

Deterministic metrics — reproducible across runs (34 benchmarks)
Benchmark Instructions Total R+W
Cold Load
load_1000 693,857 (+0.00%) 1,010,338 (+0.00%)
CollectionView
indices_100 192,361 (-0.01%) 267,630 (-0.01%)
load_all_100_from_storage 638,229 (-0.00%) 901,200 (-0.00%)
load_all_100_in_memory 340,885 (No change) 477,162 (No change)
pre_save_100 265,880 (-0.01%) 367,671 (-0.01%)
try_load_10_from_100 100,570 (+0.02%) 142,401 (+0.02%)
MapView
contains_key_10_from_100 52,512 (-0.04%) 74,525 (-0.03%)
contains_key_10_from_1000 355,093 (No change) 501,852 (No change)
get_10_from_100 54,865 (-0.30%) 77,973 (-0.29%)
get_10_from_1000 357,605 (-0.01%) 505,527 (-0.01%)
get_100_missing_from_1000 610,063 (-0.00%) 851,207 (-0.00%)
indices_100 100,440 (-0.01%) 138,318 (-0.01%)
indices_1000 948,231 (+0.00%) 1,322,215 (+0.00%)
insert_100 257,243 (No change) 355,688 (No change)
insert_1000 2,963,505 (No change) 4,013,290 (No change)
post_save_1000 1,027,028 (No change) 1,481,044 (No change)
pre_save_100 332,564 (-0.01%) 462,461 (-0.01%)
pre_save_1000 3,381,713 (-0.00%) 4,758,733 (-0.00%)
remove_500_from_1000 1,189,614 (No change) 1,660,519 (No change)
QueueView / BucketQueueView
delete_500_from_1000 22,407 (No change) 34,370 (No change)
front_100_from_1000 5,701 (No change) 8,420 (No change)
pre_save_1000 43,122 (No change) 60,496 (No change)
push_1000 24,367 (No change) 33,322 (No change)
delete_500_from_1000 10,243 (No change) 12,351 (No change)
front_100_from_1000 9,137 (No change) 13,881 (No change)
pre_save_1000 1,042,902 (No change) 1,498,608 (No change)
push_1000 24,294 (No change) 33,225 (No change)
ReentrantCollectionView
contains_key_10_from_100 141,897 (No change) 201,820 (No change)
indices_100 237,116 (No change) 332,341 (No change)
load_all_100_from_storage 803,398 (No change) 1,132,456 (-0.00%)
load_all_100_in_memory 411,214 (-0.01%) 566,246 (-0.00%)
pre_save_100 350,753 (-0.02%) 488,392 (-0.01%)
RegisterView
get_set_100 81,292 (+0.20%) 120,126 (+0.19%)
pre_save 5,485 (No change) 8,089 (No change)

Regression threshold: 1%${\color{red}\textbf{red}}$ = regression, ${\color{green}\textbf{green}}$ = improvement.

Cache-dependent metrics — expect fluctuations between runs (34 benchmarks)
Benchmark L1 Hits LLC Hits RAM Hits Est. Cycles
Cold Load
load_1000 1,001,810 (+0.00%) 8,353 (-0.06%) 175 (No change) 1,049,700 (+0.00%)
CollectionView
indices_100 266,374 (-0.01%) 860 (+0.82%) 396 (No change) 284,534 (+0.00%)
load_all_100_from_storage 896,638 (-0.00%) 3,884 (+0.05%) 678 (${\color{red}\textbf{+2.11\%%}}$) 939,788 (+0.05%)
load_all_100_in_memory 475,012 (-0.00%) 1,398 (+0.72%) 752 (${\color{red}\textbf{+1.35\%%}}$) 508,322 (+0.07%)
pre_save_100 365,732 (-0.01%) 1,342 (+0.15%) 597 (No change) 393,337 (-0.00%)
try_load_10_from_100 141,538 (+0.02%) 634 (-0.63%) 229 (+0.88%) 152,723 (+0.05%)
MapView
contains_key_10_from_100 74,228 (-0.03%) 90 (${\color{green}\textbf{-2.17\%%}}$) 207 (-0.48%) 81,923 (-0.08%)
contains_key_10_from_1000 498,664 (-0.00%) 2,980 (+0.10%) 208 (No change) 520,844 (+0.00%)
get_10_from_100 77,659 (-0.31%) 102 (${\color{red}\textbf{+17.24\%%}}$) 212 (-0.47%) 85,589 (-0.23%)
get_10_from_1000 502,332 (-0.01%) 2,983 (+0.10%) 212 (-0.47%) 524,667 (-0.01%)
get_100_missing_from_1000 847,990 (-0.00%) 2,988 (+0.20%) 229 (No change) 870,945 (-0.00%)
indices_100 137,689 (-0.02%) 226 (+0.44%) 403 (${\color{red}\textbf{+1.77\%%}}$) 152,924 (+0.15%)
indices_1000 1,314,540 (+0.00%) 6,487 (-0.08%) 1,188 (+0.76%) 1,388,555 (+0.02%)
insert_100 354,944 (-0.00%) 89 (${\color{red}\textbf{+3.49\%%}}$) 655 (+0.77%) 378,314 (+0.05%)
insert_1000 4,006,263 (-0.00%) 3,041 (-0.10%) 3,986 (+0.10%) 4,160,978 (+0.00%)
post_save_1000 1,469,655 (-0.00%) 11,206 (+0.02%) 183 (No change) 1,532,090 (+0.00%)
pre_save_100 461,079 (-0.01%) 771 (+0.78%) 611 (-0.65%) 486,319 (-0.03%)
pre_save_1000 4,744,816 (-0.00%) 10,104 (-0.09%) 3,813 (-0.10%) 4,928,791 (-0.00%)
remove_500_from_1000 1,656,138 (-0.00%) 4,202 (No change) 179 (+0.56%) 1,683,413 (+0.00%)
QueueView / BucketQueueView
delete_500_from_1000 34,174 (+0.01%) 33 (${\color{green}\textbf{-10.81\%%}}$) 163 (No change) 40,044 (-0.04%)
front_100_from_1000 8,247 (+0.01%) 36 (${\color{green}\textbf{-2.70\%%}}$) 137 (No change) 13,222 (-0.03%)
pre_save_1000 60,152 (-0.00%) 61 (${\color{green}\textbf{-3.17\%%}}$) 283 (${\color{red}\textbf{+1.07\%%}}$) 70,362 (+0.13%)
push_1000 33,115 (+0.02%) 46 (${\color{green}\textbf{-11.54\%%}}$) 161 (-0.62%) 38,980 (-0.15%)
delete_500_from_1000 12,180 (+0.04%) 34 (${\color{green}\textbf{-10.53\%%}}$) 137 (-0.72%) 17,145 (-0.29%)
front_100_from_1000 13,684 (+0.01%) 34 (${\color{green}\textbf{-2.86\%%}}$) 163 (No change) 19,559 (-0.02%)
pre_save_1000 1,493,909 (No change) 2,737 (No change) 1,962 (No change) 1,576,264 (No change)
push_1000 33,021 (+0.02%) 46 (${\color{green}\textbf{-8.00\%%}}$) 158 (${\color{green}\textbf{-1.25\%%}}$) 38,781 (-0.22%)
ReentrantCollectionView
contains_key_10_from_100 200,593 (-0.00%) 1,027 (+0.79%) 200 (+0.50%) 212,728 (+0.03%)
indices_100 330,770 (-0.00%) 1,198 (+0.34%) 373 (+0.27%) 349,815 (+0.01%)
load_all_100_from_storage 1,125,888 (-0.00%) 6,162 (+0.28%) 406 (-0.25%) 1,170,908 (+0.00%)
load_all_100_in_memory 563,853 (-0.01%) 1,849 (+0.43%) 544 (-0.55%) 592,138 (-0.02%)
pre_save_100 485,466 (-0.02%) 2,238 (+0.31%) 688 (No change) 520,736 (-0.01%)
RegisterView
get_set_100 119,905 (+0.18%) 41 (${\color{red}\textbf{+28.12\%%}}$) 180 (-0.55%) 126,410 (+0.18%)
pre_save 7,885 (No change) 41 (${\color{green}\textbf{-2.38\%%}}$) 163 (+0.62%) 13,795 (+0.22%)

Cache metrics fluctuate because anything that changes the virtual memory layout
shifts which data lands on which cache lines, changing the L1/LLC/RAM distribution.
Probable causes: ASLR (even across identical binaries), executable binary size changes,
shared library size changes, and even filename length differences.

Cachegrind simulates a two-level cache (L1 + LLC) auto-detected from the host CPU.
Est. Cycles = L1 hits + 5 × LLC hits + 35 × RAM hits.

Runner cache sizes: L1d cache: 64 KiB (2 instances);L1i cache: 64 KiB (2 instances) L2 cache: 1 MiB (2 instances);L3 cache: 32 MiB (1 instance)

@MathieuDutSik MathieuDutSik force-pushed the find_key_async_iterator branch from 8a2d3e7 to ab7fa4b Compare May 2, 2026 06:30
@MathieuDutSik MathieuDutSik marked this pull request as ready for review May 2, 2026 09:43
@MathieuDutSik MathieuDutSik requested review from afck, ma2bd and ndr-ds May 2, 2026 09:43
@MathieuDutSik MathieuDutSik changed the title Implement forward and Reverse iterators for the find_{keys,key_values}_by_prefix Implement forward and reverse iterators for the find_{keys,key_values}_by_prefix May 2, 2026
@ma2bd
Copy link
Copy Markdown
Contributor

ma2bd commented May 4, 2026

If I had to choose, I think I prefer this one over #6183.

To which extend do you the tests cover the new code? Especially value-splitting and LRU-caching?

@MathieuDutSik
Copy link
Copy Markdown
Contributor Author

If I had to choose, I think I prefer this one over #6183.

To which extend do you the tests cover the new code? Especially value-splitting and LRU-caching?

The test test_lru_cache_serves_find_by_prefix does the test when the entry is present in the cache.

The test test_value_splitting4_find_key_iters_with_leftovers does the test of ValueSplitting when there are some leftover keys.

The run_reads does the systematic check of writing (key-values) to storage and exercising the read functionality. It is in my opinion fairly good, but does not have perfect coverage.

Note that PR #6183 also implements some function for MapView that are not in this PR.

@MathieuDutSik MathieuDutSik force-pushed the find_key_async_iterator branch from a461945 to 42adda2 Compare May 4, 2026 10:06
Comment thread linera-views/src/test_utils/mod.rs Outdated
}
assert_eq!(set_key_value1, set_key_value2);
// Streaming variants must agree with the eager methods.
let keys_iter: Vec<Vec<u8>> = store
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The policy is to put type annotations like these on the method instead (try_collect in this case); also below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, corrected.

Comment thread linera-views/src/backends/journaling.rs Outdated
Comment on lines +204 to +209
Box::pin(async_stream::stream! {
let mut stream = self.store.find_keys_by_prefix_iter(key_prefix);
while let Some(item) = stream.next().await {
yield item.map_err(JournalingError::Inner);
}
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just stream.map(|item| item.map_err(JournalingError::Inner))?
(Also below.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works.

}
let mut stream = self.store.find_keys_by_prefix_iter(key_prefix);
while let Some(item) = stream.next().await {
yield item;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not cache them?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator can be prematurely ended (for example when doing a find_first_key_in_prefix). So, it may not be possible to cache it. But let me look at that, maybe we can do it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, yes we can.

Comment thread linera-views/src/backends/rocks_db.rs Outdated
/// uses a fixed-length prefix extractor; without it, `seek_for_prev` would
/// only search within the bloom-prefix scope and could miss keys whose
/// extractor-prefix differs from the seek target.
fn get_find_prefix_reverse_iterator(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Elsewhere we use rev_iter rather than reverse_iterator.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what would you prefer?
For myself, rev_iter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer the short form, too, yes. 👍

Comment thread linera-views/src/backends/rocks_db.rs Outdated
Comment on lines +607 to +611
Box::pin(async_stream::stream! {
if let Err(error) = check_key_size(&key_prefix) {
yield Err(error);
return;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work, too?

Suggested change
Box::pin(async_stream::stream! {
if let Err(error) = check_key_size(&key_prefix) {
yield Err(error);
return;
}
Box::pin(async_stream::try_stream! {
check_key_size(&key_prefix)?;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done. Use try_stream when possible.

Comment thread linera-views/src/backends/scylla_db.rs
None => false,
};
if continues {
state.as_mut().unwrap().2.push(value);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this unwrap can be avoided if it's done inside the match arm.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.

Comment on lines +362 to +370
let mut big_value = Vec::new();
for (rev_pos, val) in segs.iter().rev().enumerate() {
let idx = rev_pos as u32;
if idx == 0 {
big_value.extend_from_slice(&val[4..]);
} else if idx < count {
big_value.extend_from_slice(val);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut big_value = Vec::new();
for (rev_pos, val) in segs.iter().rev().enumerate() {
let idx = rev_pos as u32;
if idx == 0 {
big_value.extend_from_slice(&val[4..]);
} else if idx < count {
big_value.extend_from_slice(val);
}
}
let big_value = vec![segment_zero_value[4..].to_vec()];
for val in segs[1..count].iter().rev() {
big_value.extend_from_slice(val);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed.

Comment thread linera-views/src/store.rs
fn find_keys_by_prefix_iter<'a>(
&'a self,
key_prefix: &'a [u8],
) -> FindKeysStream<'a, Self::Error> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @Twey had the idea to make the return type impl Stream<Item = Result<Vec<u8>>>; would that make it Send or !Send automatically as needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the design that I propose (and exercised first on the read_multi_values_bytes_iter) it compiles on Wasm32.
So, I am not sure what problem is left to resolve.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that it would be simpler (if it works!) and use less conditional compilation.

Comment thread linera-views/src/store.rs
yield Ok(key_value);
}
})
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these all use map instead of async_stream::stream!?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the trait implementation, we can improve but that is not so simple since for a start the _iter function is not async.

yield item;
}
let mut cache = cache.lock().unwrap();
cache.insert_find_keys(key_prefix.to_vec(), &accumulated);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, maybe that was a bad idea: A lot of time could have passed since we loaded all those values; couldn't they have changed in the meantime?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caching of find_key_values_by_prefix is only done when we have exclusive access. So, not with blobs, events, and similar.

So, could the state have changed meanwhile? The answer is yes: The construction of the iterator does not create a lock, in general. It certainly does not for DynamoDB and ScyllaDB. But if some keys are inserted or deleted during the operations then your iterator will be unstable. It cannot guarantee that it works correctly.

So, my verdict: Yes, it is fine to save the keys because yes, if that has changed, then your code would anyway not have worked correctly.

But this makes me realize another concern. Could it be that establishing an iterator prevents other operations from working? In other words, is the following code valid:

let mut iter = store.find_keys_by_prefix_iter(&key)?;
let mut values = Vec::new();
while let Some(key) = iter.next().await? {
  let value = store.read_value(&key).await?;
  values.push(value);
}

And the answer is that it is a problem for ScyllaDB. So, do not acquire a semaphore for the creation of an iterator so as not to introduce a deadlock. Correction done. I also added a test for that deadlocking in run_reads.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if the iterator can be unstable when the underlying state changes while it iterates, it should actually prevent such changes? But I guess that depends on what the use cases are.
Would it make the most sense if only operations on the keys that match the iterator's prefix were blocked? But perhaps it is not easy to achieve.

};
if index == 0 {
let (key, top, segs) = state.take().unwrap();
let count = Self::read_count_from_value(segs.last().unwrap())?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want these unwraps? I see a few

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants