refactor: give parquet CDC options an explicit enabled flag#22632
Open
kszucs wants to merge 1 commit into
Open
refactor: give parquet CDC options an explicit enabled flag#22632kszucs wants to merge 1 commit into
enabled flag#22632kszucs wants to merge 1 commit into
Conversation
Content-defined chunking (CDC) write options were added in apache#21110 and have not been released yet (current workspace is 53.x; CDC is slated for 54.0.0), so the config and proto surfaces can still be changed freely. This reworks it before it ships. What changes: * Rename the `ParquetOptions` field `use_content_defined_chunking` -> `content_defined_chunking`. * `CdcOptions` becomes a plain `config_namespace!` with an explicit `enabled: bool` field alongside the chunking parameters, and the field is a bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff `content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` / `CdcOptions::disabled()` shorthand constructors. * Drop the bespoke `impl ConfigField for CdcOptions` / `impl ConfigField for Option<CdcOptions>` and the `#[expect(clippy::should_implement_trait)]` workaround that backed the old bare-boolean form. Everything is now generated by the macro. * Add an `enabled` field to the proto `CdcOptions` message so the proto <-> config mapping is a direct field copy, dropping the previous presence-encoding and the zero-sentinel fallback for the chunk sizes. Why this is better: * Naming matches parquet-rs. parquet's `WriterProperties` exposes `content_defined_chunking()` / `set_content_defined_chunking(...)` with no `use_` prefix; the field name now lines up across the boundary. * Explicit, not magic. CDC is toggled with a real `content_defined_chunking.enabled = true|false` key instead of a special bare-boolean parse, and setting a chunking parameter no longer silently turns CDC on. * No order-dependence on the SQL side. Format options in `COPY ... OPTIONS` and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in non-deterministic order. With a separate `enabled` flag, the flag and the parameters are set independently, so the resolved config never depends on the order in which the keys happen to be applied. * Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto serialization is a plain field copy in both directions. Tests, generated config docs, and the information_schema snapshot are updated accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior (enable toggle, parameter-does-not-enable, order independence).
3e66f6e to
30090c1
Compare
Member
Author
|
I had second thoughts about the cdc options configuration and actually found some weird implicit behavior (e.g. order dependent configuration), so I switched to a more verbose but explicit one (I also plan to follow this approach in iceberg). @alamb could we include it in the release so we don't need to break the API later on? |
55 tasks
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
The CDC options currently work as
use_content_defined_chunking: Option<CdcOptions>with aConfigFieldimpl that accepts a bareuse_content_defined_chunking = true|falseand otherwise enables CDC implicitly when any sub-field is set. This has a few problems:WriterPropertiesexposescontent_defined_chunking()/set_content_defined_chunking(Option<CdcOptions>)with nouse_prefix.COPY ... OPTIONS/CREATE EXTERNAL TABLE ... OPTIONSare applied from aHashMap(non-deterministic order). With the old bare-boolean form, mixing... = falsewith a sub-field, or setting a sub-field after= false, could resolve to enabled or disabled depending on iteration order.impl ConfigField for CdcOptions+impl ConfigField for Option<CdcOptions>and a#[expect(clippy::should_implement_trait)]workaround, plus a zero-sentinel fallback in the proto mapping.Since CDC is unreleased, the config/proto surface can still be changed freely.
What changes are included in this PR?
ParquetOptionsfielduse_content_defined_chunking->content_defined_chunking(matches parquet-rs).CdcOptionsa plainconfig_namespace!with an explicitenabled: boolfield alongside the chunking parameters; the field is a bareCdcOptions(no longerOption<CdcOptions>). CDC is on ifcontent_defined_chunking.enabledis true. Setting a parameter no longer implicitly enables CDC, and the result is independent of key order.CdcOptions::enabled()/CdcOptions::disabled()shorthand constructors.ConfigFieldimpls and theshould_implement_traitworkaround — all generated by the macro now.enabledfield to the protoCdcOptionsmessage so the proto <-> config mapping is a plain field copy in both directions (removes the presence-encoding and the zero-sentinel fallback).information_schemasnapshot, and addparquet_cdc_config.sltdocumenting the resolution behavior.Are these changes tested?
Yes:
datafusion-commonconfig + writer unit tests (enable toggle, parameter-does-not-enable, validation, writer round-trip).datafusion-proto-commonproto round-trip tests (enabled / disabled / negative norm level).datafusion/coreparquet integration tests (data round-trip, page boundaries).parquet_cdc.slt(end-to-end) and a newparquet_cdc_config.slt(config resolution / order independence).Are there any user-facing changes?
Yes, but only to the unreleased CDC options:
datafusion.execution.parquet.use_content_defined_chunking->datafusion.execution.parquet.content_defined_chunking.enabled(plus.min_chunk_size/.max_chunk_size/.norm_level).content_defined_chunking.enabled = true|false.No released API is affected.
🤖 Generated with Claude Code