bench: add benchmark for map_extract function#21251
bench: add benchmark for map_extract function#21251lyne7-sc wants to merge 3 commits intoapache:mainfrom
Conversation
| .invoke_with_args(ScalarFunctionArgs { | ||
| args: vec![map_arg.clone(), key_arg.clone()], | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows, | ||
| return_field: Arc::clone(&return_field), | ||
| config_options: Arc::clone(&config_options), | ||
| }) |
There was a problem hiding this comment.
I think it will be cleaner to extract the ScalarFunctionArgs creation and just do clone here
| }); | ||
| } | ||
|
|
||
| fn criterion_benchmark(c: &mut Criterion) { |
There was a problem hiding this comment.
can you also add benchmark when not found
| gen_unique_values(rng, |value| value) | ||
| } | ||
|
|
||
| fn list_array(values: ArrayRef, row_count: usize, values_per_row: usize) -> ArrayRef { |
There was a problem hiding this comment.
the number of values per row should not be the same for each list/map to simulate real world data
| let offsets = (0..=row_count) | ||
| .map(|index| (index * values_per_row) as i32) | ||
| .collect::<Vec<_>>(); |
There was a problem hiding this comment.
this is incorrect as the offset length must be the row_count + 1
to make it simpler and fix the issue I just outlined you can use OffsetBuffer::from_lengths(<iter>)
| ))) as ArrayRef; | ||
| let values = list_array(values, MAP_ROWS, MAP_KEYS_PER_ROW); | ||
|
|
||
| let map_extract_cases = [ |
There was a problem hiding this comment.
I find it hard to understand the data for the benchmark, can you make it cleaner please
| ))) as ArrayRef; | ||
| let values = list_array(values, MAP_ROWS, MAP_KEYS_PER_ROW); | ||
|
|
||
| let map_extract_cases = [ |
There was a problem hiding this comment.
Couple of problems I see here with the data:
- it looks like in every list/map the value to find is always in the same position
- It looks like the key to extract is always in the same position
- the key to extract is the same, which is fine but I would mention that in the benchmark description and instead of getting array with all values being the same I would get a
Scalarwhich is what you will get when the argument is literal - all the list/maps have the same lists/maps (i.e.
list[i] == list[j]) which is not real world case
| fn gen_utf8_values(rng: &mut ThreadRng) -> Vec<String> { | ||
| gen_unique_values(rng, |value| value.to_string()) | ||
| } | ||
|
|
||
| fn gen_binary_values(rng: &mut ThreadRng) -> Vec<Vec<u8>> { | ||
| gen_unique_values(rng, |value| value.to_le_bytes().to_vec()) | ||
| } |
There was a problem hiding this comment.
This makes that every binary item is the same length which is not that common.
| use std::sync::Arc; | ||
|
|
||
| const MAP_ROWS: usize = 1000; | ||
| const MAP_KEYS_PER_ROW: usize = 1000; |
There was a problem hiding this comment.
This is possible but not very common, I would have number of entries much smaller ranging from 0-10
|
@rluvaton Thanks for the detailed review. I’ve reworked the benchmark data generation to better reflect common map_extract usage. could you please take another look? |
Which issue does this PR close?
Rationale for this change
This PR separates the
map_extractbenchmark changes from #21237 for easier review and performance comparison.What changes are included in this PR?
add a dedicated benchmark for
map_extractAre these changes tested?
this PR contains benchmark-only changes.
Are there any user-facing changes?
No.