Skip to content

bench: add benchmark for map_extract function#21251

Open
lyne7-sc wants to merge 3 commits intoapache:mainfrom
lyne7-sc:tests/map_extract
Open

bench: add benchmark for map_extract function#21251
lyne7-sc wants to merge 3 commits intoapache:mainfrom
lyne7-sc:tests/map_extract

Conversation

@lyne7-sc
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR separates the map_extract benchmark changes from #21237 for easier review and performance comparison.

What changes are included in this PR?

add a dedicated benchmark for map_extract

Are these changes tested?

this PR contains benchmark-only changes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the functions Changes to functions implementation label Mar 30, 2026
Comment on lines +141 to +147
.invoke_with_args(ScalarFunctionArgs {
args: vec![map_arg.clone(), key_arg.clone()],
arg_fields: arg_fields.clone(),
number_rows,
return_field: Arc::clone(&return_field),
config_options: Arc::clone(&config_options),
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be cleaner to extract the ScalarFunctionArgs creation and just do clone here

});
}

fn criterion_benchmark(c: &mut Criterion) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add benchmark when not found

gen_unique_values(rng, |value| value)
}

fn list_array(values: ArrayRef, row_count: usize, values_per_row: usize) -> ArrayRef {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of values per row should not be the same for each list/map to simulate real world data

Comment on lines +78 to +80
let offsets = (0..=row_count)
.map(|index| (index * values_per_row) as i32)
.collect::<Vec<_>>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is incorrect as the offset length must be the row_count + 1
to make it simpler and fix the issue I just outlined you can use OffsetBuffer::from_lengths(<iter>)

))) as ArrayRef;
let values = list_array(values, MAP_ROWS, MAP_KEYS_PER_ROW);

let map_extract_cases = [
Copy link
Copy Markdown
Member

@rluvaton rluvaton Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it hard to understand the data for the benchmark, can you make it cleaner please

))) as ArrayRef;
let values = list_array(values, MAP_ROWS, MAP_KEYS_PER_ROW);

let map_extract_cases = [
Copy link
Copy Markdown
Member

@rluvaton rluvaton Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of problems I see here with the data:

  1. it looks like in every list/map the value to find is always in the same position
  2. It looks like the key to extract is always in the same position
  3. the key to extract is the same, which is fine but I would mention that in the benchmark description and instead of getting array with all values being the same I would get a Scalar which is what you will get when the argument is literal
  4. all the list/maps have the same lists/maps (i.e. list[i] == list[j]) which is not real world case

Comment on lines +65 to +71
fn gen_utf8_values(rng: &mut ThreadRng) -> Vec<String> {
gen_unique_values(rng, |value| value.to_string())
}

fn gen_binary_values(rng: &mut ThreadRng) -> Vec<Vec<u8>> {
gen_unique_values(rng, |value| value.to_le_bytes().to_vec())
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes that every binary item is the same length which is not that common.

use std::sync::Arc;

const MAP_ROWS: usize = 1000;
const MAP_KEYS_PER_ROW: usize = 1000;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possible but not very common, I would have number of entries much smaller ranging from 0-10

@lyne7-sc
Copy link
Copy Markdown
Contributor Author

lyne7-sc commented Apr 7, 2026

@rluvaton Thanks for the detailed review. I’ve reworked the benchmark data generation to better reflect common map_extract usage. could you please take another look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants