Skip to content

Optimize Parquet metadata row-group level statistics collection#22462

Open
AdamGS wants to merge 3 commits into
apache:mainfrom
AdamGS:adamg/parquet-stats-perf
Open

Optimize Parquet metadata row-group level statistics collection#22462
AdamGS wants to merge 3 commits into
apache:mainfrom
AdamGS:adamg/parquet-stats-perf

Conversation

@AdamGS
Copy link
Copy Markdown
Contributor

@AdamGS AdamGS commented May 22, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

The current stats aggregation does a bunch of unnecessary work, this PR tries to do the minimal amount of work at every step.

What changes are included in this PR?

In addition to splitting up the summarization logic into some clearer functions and a reusable function for min/max, I've tried to do the minimal amount of work at each step:

  1. Only allocate boolean masks if there's a mix of exact/inexact stats between row groups.
  2. No need to allocate an Arrow array for null count.
  3. No need to re-calculate the parquet column index - its already in stats_converter, as far as I can tell its exactly the same code path.
  4. No need to recalculate the number of rows - we already know it.

I've also included a benchmark, the effect on my laptop is:

parquet_metadata_statistics/wide_one_row_group
                        time:   [2.9945 ms 3.0313 ms 3.0487 ms]
                        change: [−44.473% −43.790% −43.044%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking parquet_metadata_statistics/moderate_width_many_row_groups: Collecting 10 samples in estimated 5
parquet_metadata_statistics/moderate_width_many_row_groups
                        time:   [236.75 µs 237.37 µs 238.48 µs]
                        change: [−22.330% −21.550% −20.794%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking parquet_metadata_statistics/wide_many_row_groups: Collecting 10 samples in estimated 5.0127 s (7
parquet_metadata_statistics/wide_many_row_groups
                        time:   [628.67 µs 636.88 µs 645.79 µs]
                        change: [−29.409% −28.225% −26.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Are these changes tested?

Existing tests and few additional small unit tests.

Are there any user-facing changes?

None

AdamGS added 2 commits May 22, 2026 14:58
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
@github-actions github-actions Bot added the datasource Changes to the datasource crate label May 22, 2026
@AdamGS
Copy link
Copy Markdown
Contributor Author

AdamGS commented May 22, 2026

@asolimando here is my change

},
};

// This is the same logic as parquet_column but we start from arrow schema index
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this value is already stats_converter.parquet_column_index()

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Copy link
Copy Markdown
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really impressive numbers (22-44% improvement) and the refactoring improves readability via focused helpers, and the ExactnessSummary enum is a nice touch.

I could only find two concerns (see inline comments), but I should caveat that I'm not deeply familiar yet with this area of the codebase so my review shouldn't be considered comprehensive.

.columns()
.get(parquet_idx)
.and_then(|column| column.statistics())
.map(|statistics| statistics.null_count_opt().unwrap_or(0))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it seems that we aren't honoring StatisticsConverter::missing_null_counts_as_zero when set to false, due to the .unwrap_or(0), the old code was explicitly checking the config property and I think it's worth restoring that behavior. WDYT?

{
ExactnessSummary::AllExact => Some(true),
ExactnessSummary::NoneExact => Some(false),
ExactnessSummary::Mixed => {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed it, but Mixed case doesn't seem to be covered in benchmarks, while it's interesting IMO as it's the edge case where things aren't "obvious". If it's confirmed to be uncovered, would you mind adding it to the benchmarks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants