Optimize Parquet metadata row-group level statistics collection#22462
Optimize Parquet metadata row-group level statistics collection#22462AdamGS wants to merge 3 commits into
Conversation
|
@asolimando here is my change |
| }, | ||
| }; | ||
|
|
||
| // This is the same logic as parquet_column but we start from arrow schema index |
There was a problem hiding this comment.
this value is already stats_converter.parquet_column_index()
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
asolimando
left a comment
There was a problem hiding this comment.
Really impressive numbers (22-44% improvement) and the refactoring improves readability via focused helpers, and the ExactnessSummary enum is a nice touch.
I could only find two concerns (see inline comments), but I should caveat that I'm not deeply familiar yet with this area of the codebase so my review shouldn't be considered comprehensive.
| .columns() | ||
| .get(parquet_idx) | ||
| .and_then(|column| column.statistics()) | ||
| .map(|statistics| statistics.null_count_opt().unwrap_or(0)) |
There was a problem hiding this comment.
Here it seems that we aren't honoring StatisticsConverter::missing_null_counts_as_zero when set to false, due to the .unwrap_or(0), the old code was explicitly checking the config property and I think it's worth restoring that behavior. WDYT?
| { | ||
| ExactnessSummary::AllExact => Some(true), | ||
| ExactnessSummary::NoneExact => Some(false), | ||
| ExactnessSummary::Mixed => { |
There was a problem hiding this comment.
I might have missed it, but Mixed case doesn't seem to be covered in benchmarks, while it's interesting IMO as it's the edge case where things aren't "obvious". If it's confirmed to be uncovered, would you mind adding it to the benchmarks?
Which issue does this PR close?
Rationale for this change
The current stats aggregation does a bunch of unnecessary work, this PR tries to do the minimal amount of work at every step.
What changes are included in this PR?
In addition to splitting up the summarization logic into some clearer functions and a reusable function for min/max, I've tried to do the minimal amount of work at each step:
stats_converter, as far as I can tell its exactly the same code path.I've also included a benchmark, the effect on my laptop is:
Are these changes tested?
Existing tests and few additional small unit tests.
Are there any user-facing changes?
None