Skip to content

refactor: give parquet CDC options an explicit enabled flag#22632

Open
kszucs wants to merge 1 commit into
apache:mainfrom
kszucs:cdc-options-align-parquet-rs
Open

refactor: give parquet CDC options an explicit enabled flag#22632
kszucs wants to merge 1 commit into
apache:mainfrom
kszucs:cdc-options-align-parquet-rs

Conversation

@kszucs
Copy link
Copy Markdown
Member

@kszucs kszucs commented May 29, 2026

Which issue does this PR close?

  • None

Rationale for this change

The CDC options currently work as use_content_defined_chunking: Option<CdcOptions> with a ConfigField impl that accepts a bare use_content_defined_chunking = true|false and otherwise enables CDC implicitly when any sub-field is set. This has a few problems:

  • Naming diverges from parquet-rs. WriterProperties exposes content_defined_chunking() / set_content_defined_chunking(Option<CdcOptions>) with no use_ prefix.
  • Implicit / order-dependent on the SQL side. Format options in COPY ... OPTIONS / CREATE EXTERNAL TABLE ... OPTIONS are applied from a HashMap (non-deterministic order). With the old bare-boolean form, mixing ... = false with a sub-field, or setting a sub-field after = false, could resolve to enabled or disabled depending on iteration order.
  • Extra machinery. Supporting the bare boolean required a hand-written impl ConfigField for CdcOptions + impl ConfigField for Option<CdcOptions> and a #[expect(clippy::should_implement_trait)] workaround, plus a zero-sentinel fallback in the proto mapping.

Since CDC is unreleased, the config/proto surface can still be changed freely.

What changes are included in this PR?

  • Rename the ParquetOptions field use_content_defined_chunking -> content_defined_chunking (matches parquet-rs).
  • Make CdcOptions a plain config_namespace! with an explicit enabled: bool field alongside the chunking parameters; the field is a bare CdcOptions (no longer Option<CdcOptions>). CDC is on if content_defined_chunking.enabled is true. Setting a parameter no longer implicitly enables CDC, and the result is independent of key order.
  • Add CdcOptions::enabled() / CdcOptions::disabled() shorthand constructors.
  • Drop the ConfigField impls and the should_implement_trait workaround — all generated by the macro now.
  • Add an enabled field to the proto CdcOptions message so the proto <-> config mapping is a plain field copy in both directions (removes the presence-encoding and the zero-sentinel fallback).
  • Update unit tests, regenerate config docs + the information_schema snapshot, and add parquet_cdc_config.slt documenting the resolution behavior.

Are these changes tested?

Yes:

  • datafusion-common config + writer unit tests (enable toggle, parameter-does-not-enable, validation, writer round-trip).
  • datafusion-proto-common proto round-trip tests (enabled / disabled / negative norm level).
  • datafusion/core parquet integration tests (data round-trip, page boundaries).
  • sqllogictest: parquet_cdc.slt (end-to-end) and a new parquet_cdc_config.slt (config resolution / order independence).

Are there any user-facing changes?

Yes, but only to the unreleased CDC options:

  • Config key datafusion.execution.parquet.use_content_defined_chunking -> datafusion.execution.parquet.content_defined_chunking.enabled (plus .min_chunk_size / .max_chunk_size / .norm_level).
  • The bare-boolean form is removed; enable/disable via content_defined_chunking.enabled = true|false.

No released API is affected.

🤖 Generated with Claude Code

@github-actions github-actions Bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate datasource Changes to the datasource crate labels May 29, 2026
Content-defined chunking (CDC) write options were added in apache#21110 and have
not been released yet (current workspace is 53.x; CDC is slated for 54.0.0),
so the config and proto surfaces can still be changed freely. This reworks it
before it ships.

What changes:

* Rename the `ParquetOptions` field `use_content_defined_chunking` ->
  `content_defined_chunking`.
* `CdcOptions` becomes a plain `config_namespace!` with an explicit
  `enabled: bool` field alongside the chunking parameters, and the field is a
  bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff
  `content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` /
  `CdcOptions::disabled()` shorthand constructors.
* Drop the bespoke `impl ConfigField for CdcOptions` /
  `impl ConfigField for Option<CdcOptions>` and the
  `#[expect(clippy::should_implement_trait)]` workaround that backed the old
  bare-boolean form. Everything is now generated by the macro.
* Add an `enabled` field to the proto `CdcOptions` message so the proto <->
  config mapping is a direct field copy, dropping the previous
  presence-encoding and the zero-sentinel fallback for the chunk sizes.

Why this is better:

* Naming matches parquet-rs. parquet's `WriterProperties` exposes
  `content_defined_chunking()` / `set_content_defined_chunking(...)` with no
  `use_` prefix; the field name now lines up across the boundary.

* Explicit, not magic. CDC is toggled with a real
  `content_defined_chunking.enabled = true|false` key instead of a special
  bare-boolean parse, and setting a chunking parameter no longer silently turns
  CDC on.

* No order-dependence on the SQL side. Format options in `COPY ... OPTIONS`
  and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in
  non-deterministic order. With a separate `enabled` flag, the flag and the
  parameters are set independently, so the resolved config never depends on the
  order in which the keys happen to be applied.

* Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto
  serialization is a plain field copy in both directions.

Tests, generated config docs, and the information_schema snapshot are updated
accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior
(enable toggle, parameter-does-not-enable, order independence).
@kszucs kszucs force-pushed the cdc-options-align-parquet-rs branch from 3e66f6e to 30090c1 Compare May 29, 2026 23:37
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented May 29, 2026

I had second thoughts about the cdc options configuration and actually found some weird implicit behavior (e.g. order dependent configuration), so I switched to a more verbose but explicit one (I also plan to follow this approach in iceberg).

@alamb could we include it in the release so we don't need to break the API later on?

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 29, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  99.147s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.035s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  97.538s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.034s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.606s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 200.349s] datafusion
    Building datafusion-common v53.1.0 (current)
       Built [  32.835s] (current)
     Parsing datafusion-common v53.1.0 (current)
      Parsed [   0.058s] (current)
    Building datafusion-common v53.1.0 (baseline)
       Built [  33.254s] (baseline)
     Parsing datafusion-common v53.1.0 (baseline)
      Parsed [   0.059s] (baseline)
    Checking datafusion-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.630s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ParquetOptions.content_defined_chunking in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:759
  field CdcOptions.enabled in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:712

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field use_content_defined_chunking of struct ParquetOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/cd68ab5c41e8206f2f33fe7ec754f88fd478e052/datafusion/common/src/config.rs:843

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  68.580s] datafusion-common
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  43.092s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.028s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  42.155s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.147s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  87.242s] datafusion-datasource-parquet
    Building datafusion-proto v53.1.0 (current)
       Built [  56.917s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  57.762s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.250s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 117.192s] datafusion-proto
    Building datafusion-proto-common v53.1.0 (current)
       Built [  21.317s] (current)
     Parsing datafusion-proto-common v53.1.0 (current)
      Parsed [   0.047s] (current)
    Building datafusion-proto-common v53.1.0 (baseline)
       Built [  21.750s] (baseline)
     Parsing datafusion-proto-common v53.1.0 (baseline)
      Parsed [   0.049s] (baseline)
    Checking datafusion-proto-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.115s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field CdcOptions.enabled in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:980
  field CdcOptions.enabled in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:980
  field CdcOptions.enabled in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:980

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  45.891s] datafusion-proto-common
    Building datafusion-proto-models v53.1.0 (current)
       Built [  24.532s] (current)
     Parsing datafusion-proto-models v53.1.0 (current)
      Parsed [   0.123s] (current)
    Building datafusion-proto-models v53.1.0 (baseline)
       Built [  24.769s] (baseline)
     Parsing datafusion-proto-models v53.1.0 (baseline)
      Parsed [   0.125s] (baseline)
    Checking datafusion-proto-models v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.648s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field CdcOptions.enabled in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/datafusion_proto_common.rs:980
  field CdcOptions.enabled in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/datafusion_proto_common.rs:980

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  52.761s] datafusion-proto-models
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 167.730s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 168.506s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.098s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 340.486s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant