feat: add max_row_group_bytes option to ParquetOptions (apache#22649)
## Which issue does this PR close?
- Closes apache#22650.
## Rationale for this change
arrow-rs 58.0 added `WriterProperties::set_max_row_group_bytes` (PR:
apache/arrow-rs#9357, issue: apache/arrow-rs#1213), which flushes a row
group when either the row-count limit or the byte limit is reached,
whichever comes first, matching parquet-mr's `parquet.block.size`.
DataFusion already depends on at least this version of arrow but does
not yet expose the new byte-based setter through its config.
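
The "whichever comes first" rule can be illustrated with a small, self-contained sketch. This is not the arrow-rs writer: `RowGroupLimits` and `split_into_row_groups` are hypothetical names, and the sketch assumes the encoded size of each row is known up front, whereas a real writer works on batches and size estimates.

```rust
/// Hypothetical limits mirroring the two settings discussed above.
#[derive(Debug, Clone, Copy)]
struct RowGroupLimits {
    /// Row-count limit per row group.
    max_rows: usize,
    /// Optional byte limit per row group; `None` disables the byte check.
    max_bytes: Option<usize>,
}

/// Splits a sequence of per-row sizes (in bytes) into row groups and
/// returns the number of rows in each group. A group is closed as soon
/// as either limit is reached, whichever happens first.
fn split_into_row_groups(row_sizes: &[usize], limits: RowGroupLimits) -> Vec<usize> {
    let mut groups = Vec::new();
    let mut rows = 0usize;
    let mut bytes = 0usize;

    for &size in row_sizes {
        rows += 1;
        bytes += size;

        let rows_full = rows >= limits.max_rows;
        let bytes_full = limits.max_bytes.map_or(false, |max| bytes >= max);

        if rows_full || bytes_full {
            groups.push(rows);
            rows = 0;
            bytes = 0;
        }
    }

    // Emit the trailing, partially filled group if any rows remain.
    if rows > 0 {
        groups.push(rows);
    }
    groups
}
```

With ten 100-byte rows and a row-count limit of 4, a byte limit of 250 closes each group after three rows, before the row-count limit is ever hit; with no byte limit, the groups fall back to four rows each.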
## What changes are included in this PR?
- Add `max_row_group_bytes: Option<usize>` (default `None`) to
`ParquetOptions` in `datafusion/common/src/config.rs`.
- Wire it through `ParquetOptions::into_writer_properties_builder` to
`WriterPropertiesBuilder::set_max_row_group_bytes`, with a guard that
rejects `Some(0)` as a configuration error (arrow-rs panics on a zero byte
limit).
- Plumb the field through protobuf serialization: add it to the
`ParquetOptions` proto message and the proto-common/proto conversions,
with regenerated bindings.
- Expose it as the `max_row_group_bytes` COPY / CREATE EXTERNAL TABLE format
option alongside `max_row_group_size`.
- Update the generated config docs and the format options table doc.
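
The zero-value guard described in the list above might look roughly like the following. This is a minimal sketch, not the code in this PR: `ConfigError` and `validate_max_row_group_bytes` are hypothetical stand-ins for DataFusion's own error type and for the check inside `into_writer_properties_builder`.

```rust
/// Hypothetical stand-in for a configuration error type.
#[derive(Debug, PartialEq)]
struct ConfigError(String);

/// Validates the optional byte limit before it is handed to the writer.
/// `None` means "no byte limit" and passes through unchanged; `Some(0)`
/// is rejected so the caller gets an error instead of a downstream panic.
fn validate_max_row_group_bytes(value: Option<usize>) -> Result<Option<usize>, ConfigError> {
    match value {
        Some(0) => Err(ConfigError(
            "max_row_group_bytes must be greater than 0".to_string(),
        )),
        other => Ok(other),
    }
}
```

Returning an error at configuration time keeps the failure next to the bad setting, rather than surfacing it later as a panic in the writer.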
## Are these changes tested?
Yes. The following tests were run locally and pass:
Unit (datafusion-common, parquet_writer.rs):
- defaults to None, so no byte limit is propagated to WriterProperties.
- a configured value propagates to WriterProperties.
- Some(0) is rejected with a configuration error.
- the existing table_parquet_opts_to_writer_props round-trip and
test_defaults_match tests were extended to cover the new field.
Protobuf round-trip (datafusion-proto-common):
- new test_parquet_options_max_row_group_bytes_round_trip confirms the
option survives serialization to protobuf and back.
SLTs:
- new test_files/parquet_max_row_group_bytes.slt writes Parquet with the
option set (via both COPY ... OPTIONS and session config), reads it
back, asserts the data round-trips, and asserts a zero value is
rejected.
- copy.slt exercises the option inside the existing "all supported
statement overrides" COPY test.
- information_schema.slt updated for the new option in SHOW ALL.
Commands run locally (all pass):
cargo test -p datafusion-common --features parquet
cargo test -p datafusion-proto-common
cargo test -p datafusion-proto
cargo test --test sqllogictests -- parquet_max_row_group_bytes
cargo test --test sqllogictests -- information_schema
cargo test --test sqllogictests -- copy
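
The property checked by the protobuf round-trip test can be sketched without any protobuf dependency. Everything here is an assumption made for illustration: `ProtoParquetOptionsSketch`, `ParquetOptionsSketch`, `to_proto`, and `from_proto` are hypothetical, and the sketch simply assumes the wire side carries the limit as an optional `u64`, which may differ from the actual proto definition.

```rust
use std::convert::TryFrom;

/// Hypothetical wire-side message: the byte limit as an optional u64.
#[derive(Debug, Default, PartialEq)]
struct ProtoParquetOptionsSketch {
    max_row_group_bytes: Option<u64>,
}

/// Hypothetical in-memory options struct holding only the new field.
#[derive(Debug, Default, PartialEq)]
struct ParquetOptionsSketch {
    max_row_group_bytes: Option<usize>,
}

/// Converts the in-memory options to the wire form, widening usize to u64.
fn to_proto(opts: &ParquetOptionsSketch) -> ProtoParquetOptionsSketch {
    ProtoParquetOptionsSketch {
        max_row_group_bytes: opts.max_row_group_bytes.map(|v| v as u64),
    }
}

/// Converts the wire form back, failing if the value does not fit in usize.
fn from_proto(proto: &ProtoParquetOptionsSketch) -> Result<ParquetOptionsSketch, String> {
    let max_row_group_bytes = proto
        .max_row_group_bytes
        .map(|v| usize::try_from(v).map_err(|e| e.to_string()))
        .transpose()?;
    Ok(ParquetOptionsSketch { max_row_group_bytes })
}
```

A round-trip test then asserts that `from_proto(&to_proto(&opts))` yields the original `opts` for both an unset and a set value, so that `None` is not silently turned into a concrete limit or vice versa.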
## Are there any user-facing changes?
Additive only: a new optional setting is introduced, and existing options are unaffected.
---------
Co-authored-by: Yongting You <2010youy01@gmail.com>