Implement forward and reverse iterators for the `find_{keys,key_values}_by_prefix` by MathieuDutSik · Pull Request #6202 · linera-io/linera-protocol

MathieuDutSik · 2026-05-01T20:05:20Z

Motivation

The find_{keys,key_values}_by_prefix functions build their full result sets in memory, and those sets can be intrinsically big.
Async iterators are a better fit here.

Proposal

Add the functions to the trait together with a default implementation. The default implementation is used for the linera-storage-service, indexed-db, System-api and ViewContainer.

The types are inspired from #4975
The reverse iterators are inspired from #6183 which is itself inspired by #6171

For RocksDB, DynamoDB, ScyllaDB, the functionality essentially already exists in the code.
The `ValueSplitting` part is the most challenging.

Test Plan

CI

Tests have been added to the run_reads functions.

Release Plan

Can be backported to testnet_conway.

Links

None

github-actions · 2026-05-01T20:10:04Z

Instruction Count Benchmark Results

Baseline: 43bf309cfa

Deterministic metrics — reproducible across runs (34 benchmarks)

Benchmark	Instructions	Total R+W
Cold Load
`load_1000`	693,857 (+0.00%)	1,010,338 (+0.00%)
CollectionView
`indices_100`	192,361 (-0.01%)	267,630 (-0.01%)
`load_all_100_from_storage`	638,229 (-0.00%)	901,200 (-0.00%)
`load_all_100_in_memory`	340,885 (No change)	477,162 (No change)
`pre_save_100`	265,880 (-0.01%)	367,671 (-0.01%)
`try_load_10_from_100`	100,570 (+0.02%)	142,401 (+0.02%)
MapView
`contains_key_10_from_100`	52,512 (-0.04%)	74,525 (-0.03%)
`contains_key_10_from_1000`	355,093 (No change)	501,852 (No change)
`get_10_from_100`	54,865 (-0.30%)	77,973 (-0.29%)
`get_10_from_1000`	357,605 (-0.01%)	505,527 (-0.01%)
`get_100_missing_from_1000`	610,063 (-0.00%)	851,207 (-0.00%)
`indices_100`	100,440 (-0.01%)	138,318 (-0.01%)
`indices_1000`	948,231 (+0.00%)	1,322,215 (+0.00%)
`insert_100`	257,243 (No change)	355,688 (No change)
`insert_1000`	2,963,505 (No change)	4,013,290 (No change)
`post_save_1000`	1,027,028 (No change)	1,481,044 (No change)
`pre_save_100`	332,564 (-0.01%)	462,461 (-0.01%)
`pre_save_1000`	3,381,713 (-0.00%)	4,758,733 (-0.00%)
`remove_500_from_1000`	1,189,614 (No change)	1,660,519 (No change)
QueueView / BucketQueueView
`delete_500_from_1000`	22,407 (No change)	34,370 (No change)
`front_100_from_1000`	5,701 (No change)	8,420 (No change)
`pre_save_1000`	43,122 (No change)	60,496 (No change)
`push_1000`	24,367 (No change)	33,322 (No change)
`delete_500_from_1000`	10,243 (No change)	12,351 (No change)
`front_100_from_1000`	9,137 (No change)	13,881 (No change)
`pre_save_1000`	1,042,902 (No change)	1,498,608 (No change)
`push_1000`	24,294 (No change)	33,225 (No change)
ReentrantCollectionView
`contains_key_10_from_100`	141,897 (No change)	201,820 (No change)
`indices_100`	237,116 (No change)	332,341 (No change)
`load_all_100_from_storage`	803,398 (No change)	1,132,456 (-0.00%)
`load_all_100_in_memory`	411,214 (-0.01%)	566,246 (-0.00%)
`pre_save_100`	350,753 (-0.02%)	488,392 (-0.01%)
RegisterView
`get_set_100`	81,292 (+0.20%)	120,126 (+0.19%)
`pre_save`	5,485 (No change)	8,089 (No change)

Regression threshold: 1% — ${\color{red}\textbf{red}}$ = regression, ${\color{green}\textbf{green}}$ = improvement.

Cache-dependent metrics — expect fluctuations between runs (34 benchmarks)

Benchmark	L1 Hits	LLC Hits	RAM Hits	Est. Cycles
Cold Load
`load_1000`	1,001,810 (+0.00%)	8,353 (-0.06%)	175 (No change)	1,049,700 (+0.00%)
CollectionView
`indices_100`	266,374 (-0.01%)	860 (+0.82%)	396 (No change)	284,534 (+0.00%)
`load_all_100_from_storage`	896,638 (-0.00%)	3,884 (+0.05%)	678 (${\color{red}\textbf{+2.11\%}}$)	939,788 (+0.05%)
`load_all_100_in_memory`	475,012 (-0.00%)	1,398 (+0.72%)	752 (${\color{red}\textbf{+1.35\%}}$)	508,322 (+0.07%)
`pre_save_100`	365,732 (-0.01%)	1,342 (+0.15%)	597 (No change)	393,337 (-0.00%)
`try_load_10_from_100`	141,538 (+0.02%)	634 (-0.63%)	229 (+0.88%)	152,723 (+0.05%)
MapView
`contains_key_10_from_100`	74,228 (-0.03%)	90 (${\color{green}\textbf{-2.17\%}}$)	207 (-0.48%)	81,923 (-0.08%)
`contains_key_10_from_1000`	498,664 (-0.00%)	2,980 (+0.10%)	208 (No change)	520,844 (+0.00%)
`get_10_from_100`	77,659 (-0.31%)	102 (${\color{red}\textbf{+17.24\%}}$)	212 (-0.47%)	85,589 (-0.23%)
`get_10_from_1000`	502,332 (-0.01%)	2,983 (+0.10%)	212 (-0.47%)	524,667 (-0.01%)
`get_100_missing_from_1000`	847,990 (-0.00%)	2,988 (+0.20%)	229 (No change)	870,945 (-0.00%)
`indices_100`	137,689 (-0.02%)	226 (+0.44%)	403 (${\color{red}\textbf{+1.77\%}}$)	152,924 (+0.15%)
`indices_1000`	1,314,540 (+0.00%)	6,487 (-0.08%)	1,188 (+0.76%)	1,388,555 (+0.02%)
`insert_100`	354,944 (-0.00%)	89 (${\color{red}\textbf{+3.49\%}}$)	655 (+0.77%)	378,314 (+0.05%)
`insert_1000`	4,006,263 (-0.00%)	3,041 (-0.10%)	3,986 (+0.10%)	4,160,978 (+0.00%)
`post_save_1000`	1,469,655 (-0.00%)	11,206 (+0.02%)	183 (No change)	1,532,090 (+0.00%)
`pre_save_100`	461,079 (-0.01%)	771 (+0.78%)	611 (-0.65%)	486,319 (-0.03%)
`pre_save_1000`	4,744,816 (-0.00%)	10,104 (-0.09%)	3,813 (-0.10%)	4,928,791 (-0.00%)
`remove_500_from_1000`	1,656,138 (-0.00%)	4,202 (No change)	179 (+0.56%)	1,683,413 (+0.00%)
QueueView / BucketQueueView
`delete_500_from_1000`	34,174 (+0.01%)	33 (${\color{green}\textbf{-10.81\%}}$)	163 (No change)	40,044 (-0.04%)
`front_100_from_1000`	8,247 (+0.01%)	36 (${\color{green}\textbf{-2.70\%}}$)	137 (No change)	13,222 (-0.03%)
`pre_save_1000`	60,152 (-0.00%)	61 (${\color{green}\textbf{-3.17\%}}$)	283 (${\color{red}\textbf{+1.07\%}}$)	70,362 (+0.13%)
`push_1000`	33,115 (+0.02%)	46 (${\color{green}\textbf{-11.54\%}}$)	161 (-0.62%)	38,980 (-0.15%)
`delete_500_from_1000`	12,180 (+0.04%)	34 (${\color{green}\textbf{-10.53\%}}$)	137 (-0.72%)	17,145 (-0.29%)
`front_100_from_1000`	13,684 (+0.01%)	34 (${\color{green}\textbf{-2.86\%}}$)	163 (No change)	19,559 (-0.02%)
`pre_save_1000`	1,493,909 (No change)	2,737 (No change)	1,962 (No change)	1,576,264 (No change)
`push_1000`	33,021 (+0.02%)	46 (${\color{green}\textbf{-8.00\%}}$)	158 (${\color{green}\textbf{-1.25\%}}$)	38,781 (-0.22%)
ReentrantCollectionView
`contains_key_10_from_100`	200,593 (-0.00%)	1,027 (+0.79%)	200 (+0.50%)	212,728 (+0.03%)
`indices_100`	330,770 (-0.00%)	1,198 (+0.34%)	373 (+0.27%)	349,815 (+0.01%)
`load_all_100_from_storage`	1,125,888 (-0.00%)	6,162 (+0.28%)	406 (-0.25%)	1,170,908 (+0.00%)
`load_all_100_in_memory`	563,853 (-0.01%)	1,849 (+0.43%)	544 (-0.55%)	592,138 (-0.02%)
`pre_save_100`	485,466 (-0.02%)	2,238 (+0.31%)	688 (No change)	520,736 (-0.01%)
RegisterView
`get_set_100`	119,905 (+0.18%)	41 (${\color{red}\textbf{+28.12\%}}$)	180 (-0.55%)	126,410 (+0.18%)
`pre_save`	7,885 (No change)	41 (${\color{green}\textbf{-2.38\%}}$)	163 (+0.62%)	13,795 (+0.22%)

Cache metrics fluctuate because anything that changes the virtual memory layout
shifts which data lands on which cache lines, changing the L1/LLC/RAM distribution.
Probable causes: ASLR (even across identical binaries), executable binary size changes,
shared library size changes, and even filename length differences.

Cachegrind simulates a two-level cache (L1 + LLC) auto-detected from the host CPU.
Est. Cycles = L1 hits + 5 × LLC hits + 35 × RAM hits.
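The formula can be checked against the table rows with a few lines of Rust. This is only a sketch of the arithmetic; the function name `est_cycles` is made up for illustration and is not part of the benchmark tooling.

```rust
/// Estimated cycles using the weights quoted above:
/// 1 per L1 hit, 5 per LLC hit, 35 per RAM hit.
/// (`est_cycles` is a hypothetical name, used here for illustration only.)
fn est_cycles(l1_hits: u64, llc_hits: u64, ram_hits: u64) -> u64 {
    l1_hits + 5 * llc_hits + 35 * ram_hits
}
```

For the `load_1000` row this gives 1,001,810 + 5 × 8,353 + 35 × 175 = 1,049,700, which matches the Est. Cycles column.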

Runner cache sizes: L1d cache: 64 KiB (2 instances); L1i cache: 64 KiB (2 instances); L2 cache: 1 MiB (2 instances); L3 cache: 32 MiB (1 instance)

ma2bd · 2026-05-04T04:53:24Z

If I had to choose, I think I prefer this one over #6183.

To what extent do the tests cover the new code? Especially value-splitting and LRU-caching?

MathieuDutSik · 2026-05-04T08:53:27Z

> If I had to choose, I think I prefer this one over #6183.
>
> To what extent do the tests cover the new code? Especially value-splitting and LRU-caching?

The test test_lru_cache_serves_find_by_prefix covers the case where the entry is present in the cache.

The test test_value_splitting4_find_key_iters_with_leftovers covers ValueSplitting when there are some leftover keys.

run_reads systematically writes key-values to storage and exercises the read functionality. In my opinion it is fairly good, but it does not give perfect coverage.

Note that PR #6183 also implements some functions for MapView that are not in this PR.

afck · 2026-05-04T11:53:22Z

        }
        assert_eq!(set_key_value1, set_key_value2);
+        // Streaming variants must agree with the eager methods.
+        let keys_iter: Vec<Vec<u8>> = store


The policy is to put type annotations like these on the method instead (try_collect in this case); also below.

Ok, corrected.
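As a sketch of the annotation style being asked for, here is a synchronous analogue using only the standard library. `scan_keys` and `collect_keys` are hypothetical stand-ins; the real code works on a stream, where `try_collect` from the futures crate accepts a turbofish in the same position (`try_collect::<Vec<_>>()`).

```rust
/// Hypothetical stand-in for a fallible key scan.
/// The PR itself uses an async stream; this is a plain iterator.
fn scan_keys() -> impl Iterator<Item = Result<Vec<u8>, String>> {
    vec![Ok(vec![1u8, 2]), Ok(vec![1, 3])].into_iter()
}

fn collect_keys() -> Result<Vec<Vec<u8>>, String> {
    // The type annotation sits on the method (turbofish), not on the binding.
    let keys = scan_keys().collect::<Result<Vec<_>, _>>()?;
    Ok(keys)
}
```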

afck · 2026-05-04T12:03:12Z

+        Box::pin(async_stream::stream! {
+            let mut stream = self.store.find_keys_by_prefix_iter(key_prefix);
+            while let Some(item) = stream.next().await {
+                yield item.map_err(JournalingError::Inner);
+            }
+        })


Why not just stream.map(|item| item.map_err(JournalingError::Inner))?
(Also below.)

That works.
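The same adapter shape can be shown with a plain iterator. This is a minimal sketch with the standard library only: the `JournalingError` enum below is a stand-in with a single variant (the real type is not shown in this thread), and `wrap_errors` is a hypothetical name.

```rust
/// Stand-in for the error wrapper named in the diff; the real type may differ.
#[derive(Debug, PartialEq)]
enum JournalingError<E> {
    Inner(E),
}

/// Wraps every error of the inner sequence, passing `Ok` items through untouched.
/// With a stream, `StreamExt::map` takes the same closure.
fn wrap_errors<I, T, E>(inner: I) -> impl Iterator<Item = Result<T, JournalingError<E>>>
where
    I: Iterator<Item = Result<T, E>>,
{
    inner.map(|item| item.map_err(JournalingError::Inner))
}
```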

afck · 2026-05-04T12:06:29Z

+            }
+            let mut stream = self.store.find_keys_by_prefix_iter(key_prefix);
+            while let Some(item) = stream.next().await {
+                yield item;


Why not cache them?

The iterator can be prematurely ended (for example when doing a find_first_key_in_prefix). So, it may not be possible to cache it. But let me look at that, maybe we can do it.

So, yes we can.
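One way to reconcile caching with early termination is to fill the cache only once the underlying sequence has been fully drained. The sketch below shows that idea with a synchronous iterator and a `HashMap`; `CachingIter`, `scan` and the cache layout are assumptions made for illustration and do not mirror the PR's LRU cache.

```rust
use std::collections::HashMap;

/// Hypothetical wrapper: remembers every key it yields and stores the full
/// list in `cache` only when the inner iterator is exhausted.
struct CachingIter<'a, I> {
    inner: I,
    prefix: Vec<u8>,
    seen: Vec<Vec<u8>>,
    cache: &'a mut HashMap<Vec<u8>, Vec<Vec<u8>>>,
    done: bool,
}

impl<'a, I: Iterator<Item = Vec<u8>>> Iterator for CachingIter<'a, I> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        match self.inner.next() {
            Some(key) => {
                self.seen.push(key.clone());
                Some(key)
            }
            None => {
                // Reached the end: the accumulated list is complete, so cache it once.
                if !self.done {
                    self.done = true;
                    self.cache
                        .insert(self.prefix.clone(), std::mem::take(&mut self.seen));
                }
                None
            }
        }
    }
}

/// Reads at most `limit` keys (all of them if `limit` is `None`).
fn scan(
    cache: &mut HashMap<Vec<u8>, Vec<Vec<u8>>>,
    prefix: &[u8],
    keys: Vec<Vec<u8>>,
    limit: Option<usize>,
) -> Vec<Vec<u8>> {
    let iter = CachingIter {
        inner: keys.into_iter(),
        prefix: prefix.to_vec(),
        seen: Vec::new(),
        cache,
        done: false,
    };
    match limit {
        Some(n) => iter.take(n).collect(),
        None => iter.collect(),
    }
}
```

A consumer that stops after the first key never reaches the `None` branch, so nothing is cached in that case.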

afck · 2026-05-04T12:08:47Z

+    /// uses a fixed-length prefix extractor; without it, `seek_for_prev` would
+    /// only search within the bloom-prefix scope and could miss keys whose
+    /// extractor-prefix differs from the seek target.
+    fn get_find_prefix_reverse_iterator(


(Elsewhere we use rev_iter rather than reverse_iterator.)

So, what would you prefer?
For myself, rev_iter.

I think I'd prefer the short form, too, yes. 👍

afck · 2026-05-04T12:12:37Z

+        Box::pin(async_stream::stream! {
+            if let Err(error) = check_key_size(&key_prefix) {
+                yield Err(error);
+                return;
+            }


Does this work, too?

Suggested change

-        Box::pin(async_stream::stream! {
-            if let Err(error) = check_key_size(&key_prefix) {
-                yield Err(error);
-                return;
-            }
+        Box::pin(async_stream::try_stream! {
+            check_key_size(&key_prefix)?;

Ok, done. Use try_stream when possible.

afck · 2026-05-04T12:34:48Z

+                    None => false,
+                };
+                if continues {
+                    state.as_mut().unwrap().2.push(value);


Maybe this unwrap can be avoided if it's done inside the match arm.
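A sketch of what moving the push into the match arm could look like. The shape of `State` is guessed from the `.2.push(value)` in the hunk, and the `starts_with` test is only a placeholder for whatever condition the real code uses to decide that a segment continues the current group.

```rust
/// Hypothetical accumulator: (base key, top index, collected segments).
type State = Option<(Vec<u8>, u32, Vec<Vec<u8>>)>;

/// Pushes `value` onto the current group if `key` continues it.
/// Returns whether the value was pushed. No `unwrap` is needed because the
/// pattern already gives mutable access to the segment list.
fn push_if_continues(state: &mut State, key: &[u8], value: Vec<u8>) -> bool {
    match state {
        // Placeholder condition: the key extends the group's base key.
        Some((base, _top, segs)) if key.starts_with(base.as_slice()) => {
            segs.push(value);
            true
        }
        _ => false,
    }
}
```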

afck · 2026-05-04T12:38:39Z

+                    let mut big_value = Vec::new();
+                    for (rev_pos, val) in segs.iter().rev().enumerate() {
+                        let idx = rev_pos as u32;
+                        if idx == 0 {
+                            big_value.extend_from_slice(&val[4..]);
+                        } else if idx < count {
+                            big_value.extend_from_slice(val);
+                        }
+                    }


Suggested change

-                    let mut big_value = Vec::new();
-                    for (rev_pos, val) in segs.iter().rev().enumerate() {
-                        let idx = rev_pos as u32;
-                        if idx == 0 {
-                            big_value.extend_from_slice(&val[4..]);
-                        } else if idx < count {
-                            big_value.extend_from_slice(val);
-                        }
-                    }
+                    let big_value = vec![segment_zero_value[4..].to_vec()];
+                    for val in segs[1..count].iter().rev() {
+                        big_value.extend_from_slice(val);
+                    }

Ok, changed.
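For reference, a compiling sketch of the reassembly step under explicit assumptions. It assumes the segments were collected during a reverse scan, so segment 0 is the last element of `segs`, and that segment 0 starts with a 4-byte big-endian segment count. The byte order and the name `reassemble_rev` are guesses made for illustration; the PR's `read_count_from_value` is not shown here. Note that the suggestion as written would additionally need a `mut` binding and a flat `Vec<u8>` rather than a `vec![...]` of vectors.

```rust
/// Hypothetical helper: rebuilds a split value from segments gathered in
/// reverse order (highest index first, segment 0 last).
/// Assumes a 4-byte big-endian count header on segment 0.
fn reassemble_rev(segs: &[Vec<u8>]) -> Option<Vec<u8>> {
    let first = segs.last()?;
    if first.len() < 4 {
        return None;
    }
    let count = u32::from_be_bytes([first[0], first[1], first[2], first[3]]) as usize;
    if count == 0 || count > segs.len() {
        return None;
    }
    // Segment 0 contributes everything after the header.
    let mut big_value = first[4..].to_vec();
    // Segments 1..count in forward order: walk `segs` backwards, skip segment 0,
    // and ignore any leftover segments beyond `count`.
    for seg in segs.iter().rev().skip(1).take(count - 1) {
        big_value.extend_from_slice(seg);
    }
    Some(big_value)
}
```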

afck · 2026-05-04T12:41:36Z

+    fn find_keys_by_prefix_iter<'a>(
+        &'a self,
+        key_prefix: &'a [u8],
+    ) -> FindKeysStream<'a, Self::Error> {


I think @Twey had the idea to make the return type impl Stream<Item = Result<Vec<u8>>>; would that make it Send or !Send automatically as needed?

With the design that I propose (and exercised first on the read_multi_values_bytes_iter) it compiles on Wasm32.
So, I am not sure what problem is left to resolve.

Just that it would be simpler (if it works!) and use less conditional compilation.
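As background for the question, the sketch below shows how an `impl Trait` return type on a free function exposes `Send` to callers without naming it in the signature. It uses plain iterators and the standard library; all function names are made up. Whether the same convenience carries over to a trait method, where generic callers only see what the trait declares, is not checked here.

```rust
use std::rc::Rc;

/// The hidden type is built from a `Vec`, so it is `Send`.
fn keys_send() -> impl Iterator<Item = Vec<u8>> {
    vec![vec![1u8], vec![2u8]].into_iter()
}

/// The hidden type captures an `Rc`, so it is not `Send`.
fn keys_local() -> impl Iterator<Item = Vec<u8>> {
    let shared = Rc::new(vec![vec![1u8], vec![2u8]]);
    (0..shared.len()).map(move |i| shared[i].clone())
}

/// Requires `Send`: the iterator is consumed on another thread.
fn count_on_thread<I>(iter: I) -> usize
where
    I: Iterator<Item = Vec<u8>> + Send + 'static,
{
    std::thread::spawn(move || iter.count()).join().unwrap()
}

// `count_on_thread(keys_send())` compiles, although `Send` is not written in
// the signature of `keys_send`.
// `count_on_thread(keys_local())` is rejected by the compiler because of the `Rc`.
```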

afck · 2026-05-04T12:42:12Z

+                yield Ok(key_value);
+            }
+        })
+    }


Could these all use map instead of async_stream::stream!?

For the trait implementation, we can improve but that is not so simple since for a start the _iter function is not async.

afck · 2026-05-04T15:44:42Z

-                yield item;
            }
+            let mut cache = cache.lock().unwrap();
+            cache.insert_find_keys(key_prefix.to_vec(), &accumulated);


OK, maybe that was a bad idea: A lot of time could have passed since we loaded all those values; couldn't they have changed in the meantime?

The caching of find_key_values_by_prefix is only done when we have exclusive access. So, not with blobs, events, and similar.

So, could the state have changed meanwhile? The answer is yes: The construction of the iterator does not create a lock, in general. It certainly does not for DynamoDB and ScyllaDB. But if some keys are inserted or deleted during the operations then your iterator will be unstable. It cannot guarantee that it works correctly.

So, my verdict: Yes, it is fine to save the keys because yes, if that has changed, then your code would anyway not have worked correctly.

But this makes me realize another concern. Could it be that establishing an iterator prevents other operations from working? In other words, is the following code valid:

    let mut iter = store.find_keys_by_prefix_iter(&key_prefix);
    let mut values = Vec::new();
    while let Some(key) = iter.try_next().await? {
        let value = store.read_value(&key).await?;
        values.push(value);
    }

And the answer is that it is a problem for ScyllaDB. So, do not acquire a semaphore for the creation of an iterator so as not to introduce a deadlock. Correction done. I also added a test for that deadlocking in run_reads.
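The deadlock pattern can be illustrated with a toy, single-threaded semaphore. Everything below is a sketch: `Semaphore`, `Permit` and `reads_served` are invented for illustration and do not mirror the ScyllaDB client code. With an async semaphore, the failed `try_acquire` would instead be an `await` that never completes.

```rust
use std::cell::Cell;

/// Toy semaphore standing in for a store's concurrency limiter (hypothetical).
struct Semaphore {
    permits: Cell<usize>,
}

/// A permit returns itself to the semaphore when dropped.
struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Cell::new(permits) }
    }

    fn try_acquire(&self) -> Option<Permit<'_>> {
        let available = self.permits.get();
        if available == 0 {
            return None;
        }
        self.permits.set(available - 1);
        Some(Permit { sem: self })
    }
}

impl<'a> Drop for Permit<'a> {
    fn drop(&mut self) {
        self.sem.permits.set(self.sem.permits.get() + 1);
    }
}

/// Iterates over three keys and counts how many per-key reads get a permit.
/// If `iterator_holds_permit` is true, the iteration itself keeps one permit
/// for its whole lifetime, as an iterator created under the semaphore would.
fn reads_served(sem: &Semaphore, iterator_holds_permit: bool) -> usize {
    let _iterator_permit = if iterator_holds_permit { sem.try_acquire() } else { None };
    let mut served = 0;
    for _key in 0..3 {
        // Each read needs its own permit, released right after the read.
        if sem.try_acquire().is_some() {
            served += 1;
        }
    }
    served
}
```

With a single permit, no read is served while the iteration holds it, and all three are served when it does not.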

Maybe if the iterator can be unstable when the underlying state changes while it iterates, it should actually prevent such changes? But I guess that depends on what the use cases are.
Would it make the most sense if only operations on the keys that match the iterator's prefix were blocked? But perhaps it is not easy to achieve.

ndr-ds · 2026-05-15T15:17:20Z

+                };
+                if index == 0 {
+                    let (key, top, segs) = state.take().unwrap();
+                    let count = Self::read_count_from_value(segs.last().unwrap())?;


Do we want these unwraps? I see a few

MathieuDutSik force-pushed the find_key_async_iterator branch from 8a2d3e7 to ab7fa4b Compare May 2, 2026 06:30

MathieuDutSik marked this pull request as ready for review May 2, 2026 09:43

MathieuDutSik requested review from afck, ma2bd and ndr-ds May 2, 2026 09:43

MathieuDutSik changed the title ~~Implement forward and Reverse iterators for the find_{keys,key_values}_by_prefix~~ Implement forward and reverse iterators for the find_{keys,key_values}_by_prefix May 2, 2026

MathieuDutSik added 14 commits May 4, 2026 12:06

Implement the find_keys_by_prefix_iter (first step)

bedefae

Correct the rocks_db implementation.

dbc767e

Some simplification.

56d7ec3

Demonstrate the use of the find_keys_by_prefix to the functionality.

5a073b9

Implement the reverse iterators.

0b3fab6

Simplify the code.

d0f173e

Simplify the DynamoDB code for the iterators.

99869fa

Reformat.

69c23a7

Resolve the bug in value splitting.

5af87f1

Correct the Cargo.lock

a04cfa8

Some corrections for CI.

39ab1a4

A forgotten entry in the Cargo.lock

0c19c82

Add a test that exercise the ValueSplitting part of the code.

65c36dd

Update the tests.

42adda2

MathieuDutSik force-pushed the find_key_async_iterator branch from a461945 to 42adda2 Compare May 4, 2026 10:06

Correction from CI.

65b9b7a

afck reviewed May 4, 2026

View reviewed changes

MathieuDutSik added 4 commits May 4, 2026 14:59

Add another test for an entry forgotten from the coverage.

46bed05

Changes to dual and journaling.

e463c9a

Add the insertion of the values in the cache if the iterator ends.

d867a2e

Simplify the LRU caching code.

3fce0d5

MathieuDutSik added 7 commits May 4, 2026 16:23

Some additional cleanups by using try_stream.

41075b9

More type annotations eliminated.

a46c0a5

More type annotation eliminated.

3b14200

Simplify the trait implementation and rename _reverse_iterator as _rev_iter.

d46cfc4

Some edit.

6bb04f0

Reformat.

4483dc8

Restructure the code.

2dafd78

afck reviewed May 4, 2026

View reviewed changes

Remove a possible deadlock in ScyllaDB and add a test for such deadlocks in run_reads.

1dc8cad

ndr-ds reviewed May 15, 2026

View reviewed changes

5 participants