Optimize Parquet metadata row-group level statistics collection by AdamGS · Pull Request #22462 · apache/datafusion

AdamGS · 2026-05-22T15:39:22Z

Which issue does this PR close?

Closes #.

Rationale for this change

The current stats aggregation does a bunch of unnecessary work, this PR tries to do the minimal amount of work at every step.

What changes are included in this PR?

In addition to splitting up the summarization logic into some clearer functions and a reusable function for min/max, I've tried to do the minimal amount of work at each step:

Only allocate boolean masks if there's a mix of exact/inexact stats between row groups.
No need to allocate an Arrow array for null count.
No need to re-calculate the parquet column index - its already in stats_converter, as far as I can tell its exactly the same code path.
No need to recalculate the number of rows - we already know it.

I've also included a benchmark, the effect on my laptop is:

parquet_metadata_statistics/wide_one_row_group
                        time:   [2.9945 ms 3.0313 ms 3.0487 ms]
                        change: [−44.473% −43.790% −43.044%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking parquet_metadata_statistics/moderate_width_many_row_groups: Collecting 10 samples in estimated 5
parquet_metadata_statistics/moderate_width_many_row_groups
                        time:   [236.75 µs 237.37 µs 238.48 µs]
                        change: [−22.330% −21.550% −20.794%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking parquet_metadata_statistics/wide_many_row_groups: Collecting 10 samples in estimated 5.0127 s (7
parquet_metadata_statistics/wide_many_row_groups
                        time:   [628.67 µs 636.88 µs 645.79 µs]
                        change: [−29.409% −28.225% −26.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Are these changes tested?

Existing tests and few additional small unit tests.

Are there any user-facing changes?

None

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>

AdamGS · 2026-05-22T15:40:07Z

@asolimando here is my change

AdamGS · 2026-05-22T15:40:50Z

-        },
-    };
-
-    // This is the same logic as parquet_column but we start from arrow schema index


this value is already stats_converter.parquet_column_index()

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>

asolimando

Really impressive numbers (22-44% improvement) and the refactoring improves readability via focused helpers, and the ExactnessSummary enum is a nice touch.

I could only find two concerns (see inline comments), but I should caveat that I'm not deeply familiar yet with this area of the codebase so my review shouldn't be considered comprehensive.

asolimando · 2026-05-22T21:35:12Z

+            .columns()
+            .get(parquet_idx)
+            .and_then(|column| column.statistics())
+            .map(|statistics| statistics.null_count_opt().unwrap_or(0))


Here it seems that we aren't honoring StatisticsConverter::missing_null_counts_as_zero when set to false, due to the .unwrap_or(0), the old code was explicitly checking the config property and I think it's worth restoring that behavior. WDYT?

asolimando · 2026-05-22T21:37:13Z

+        {
+            ExactnessSummary::AllExact => Some(true),
+            ExactnessSummary::NoneExact => Some(false),
+            ExactnessSummary::Mixed => {


I might have missed it, but Mixed case doesn't seem to be covered in benchmarks, while it's interesting IMO as it's the edge case where things aren't "obvious". If it's confirmed to be uncovered, would you mind adding it to the benchmarks?

AdamGS added 2 commits May 22, 2026 14:58

work

5e9c69d

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>

benchmark

902b156

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>

github-actions Bot added the datasource Changes to the datasource crate label May 22, 2026

AdamGS commented May 22, 2026

View reviewed changes

fix clippy

9aede6b

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>

asolimando reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize Parquet metadata row-group level statistics collection#22462

Optimize Parquet metadata row-group level statistics collection#22462
AdamGS wants to merge 3 commits into
apache:mainfrom
AdamGS:adamg/parquet-stats-perf

AdamGS commented May 22, 2026

Uh oh!

AdamGS commented May 22, 2026

Uh oh!

AdamGS May 22, 2026

Uh oh!

asolimando left a comment

Uh oh!

asolimando May 22, 2026

Uh oh!

asolimando May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdamGS commented May 22, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

AdamGS commented May 22, 2026

Uh oh!

AdamGS May 22, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando left a comment

Choose a reason for hiding this comment

Uh oh!

asolimando May 22, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants