refactor: give parquet CDC options an explicit enabled flag
Content-defined chunking (CDC) write options were added in #21110 and have
not been released yet (current workspace is 53.x; CDC is slated for 54.0.0),
so the config and proto surfaces can still be changed freely. This commit
reworks them before they ship.
What changes:
* Rename the `ParquetOptions` field `use_content_defined_chunking` ->
`content_defined_chunking`.
* `CdcOptions` becomes a plain `config_namespace!` with an explicit
`enabled: bool` field alongside the chunking parameters, and the field is a
bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff
`content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` /
`CdcOptions::disabled()` shorthand constructors.
* Drop the bespoke `impl ConfigField for CdcOptions` /
`impl ConfigField for Option<CdcOptions>` and the
`#[expect(clippy::should_implement_trait)]` workaround that backed the old
bare-boolean form. Everything is now generated by the macro.
* Add an `enabled` field to the proto `CdcOptions` message so the proto <->
config mapping is a direct field copy, dropping the previous
presence-encoding and the zero-sentinel fallback for the chunk sizes.
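As a rough illustration of the shape described above, here is a minimal sketch that uses a plain struct in place of the `config_namespace!`-generated type. The field types, the `ProtoCdcOptions` struct, and the assumption that CDC defaults to disabled are hypothetical; only the field names and the 256 KiB / 1 MiB / 0 defaults come from the option docs in this change.

```rust
/// Hypothetical stand-in for the macro-generated config struct.
/// Field types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct CdcOptions {
    /// CDC is on iff this flag is true.
    pub enabled: bool,
    /// Minimum chunk size in bytes (documented default: 256 KiB).
    pub min_chunk_size: usize,
    /// Maximum chunk size in bytes (documented default: 1 MiB).
    pub max_chunk_size: usize,
    /// Normalization level (documented default: 0).
    pub norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        Self {
            enabled: false, // assumption: CDC is off unless requested
            min_chunk_size: 256 * 1024,
            max_chunk_size: 1024 * 1024,
            norm_level: 0,
        }
    }
}

impl CdcOptions {
    /// Shorthand: default parameters with CDC switched on.
    pub fn enabled() -> Self {
        Self { enabled: true, ..Self::default() }
    }

    /// Shorthand: default parameters with CDC switched off.
    pub fn disabled() -> Self {
        Self { enabled: false, ..Self::default() }
    }
}

/// Hypothetical mirror of the proto `CdcOptions` message.
#[derive(Debug, Clone, PartialEq)]
pub struct ProtoCdcOptions {
    pub enabled: bool,
    pub min_chunk_size: u64,
    pub max_chunk_size: u64,
    pub norm_level: i32,
}

// With an explicit `enabled` field, both directions are a field-by-field
// copy: no presence encoding and no zero sentinel for the chunk sizes.
impl From<&CdcOptions> for ProtoCdcOptions {
    fn from(o: &CdcOptions) -> Self {
        Self {
            enabled: o.enabled,
            min_chunk_size: o.min_chunk_size as u64,
            max_chunk_size: o.max_chunk_size as u64,
            norm_level: o.norm_level,
        }
    }
}

impl From<&ProtoCdcOptions> for CdcOptions {
    fn from(p: &ProtoCdcOptions) -> Self {
        Self {
            enabled: p.enabled,
            min_chunk_size: p.min_chunk_size as usize,
            max_chunk_size: p.max_chunk_size as usize,
            norm_level: p.norm_level,
        }
    }
}
```

In this sketch, `CdcOptions::enabled()` and `CdcOptions::disabled()` differ only in the flag, and converting to `ProtoCdcOptions` and back yields an equal value.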
Why this is better:
* Naming matches parquet-rs. parquet's `WriterProperties` exposes
`content_defined_chunking()` / `set_content_defined_chunking(...)` with no
`use_` prefix; the field name now lines up across the boundary.
* Explicit, not magic. CDC is toggled with a real
`content_defined_chunking.enabled = true|false` key instead of a special
bare-boolean parse, and setting a chunking parameter no longer silently turns
CDC on.
* No order-dependence on the SQL side. Format options in `COPY ... OPTIONS`
and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in
non-deterministic order. With a separate `enabled` flag, the flag and the
parameters are set independently, so the resolved config never depends on the
order in which the keys happen to be applied.
* Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto
serialization is a plain field copy in both directions.
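The order-independence argument can be sketched with a toy key/value resolver. The `set` and `resolve` helpers, the bare key names without the `content_defined_chunking.` prefix, and the field types are assumptions for illustration, not the macro-generated code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the config struct; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct CdcOptions {
    pub enabled: bool,
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        Self {
            enabled: false, // assumption: off by default
            min_chunk_size: 256 * 1024,
            max_chunk_size: 1024 * 1024,
            norm_level: 0,
        }
    }
}

impl CdcOptions {
    /// Hypothetical setter: each key writes exactly one field, so no key
    /// has a side effect on any other field.
    pub fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        match key {
            "enabled" => {
                self.enabled = value.parse::<bool>().map_err(|e| e.to_string())?
            }
            "min_chunk_size" => {
                self.min_chunk_size = value.parse::<usize>().map_err(|e| e.to_string())?
            }
            "max_chunk_size" => {
                self.max_chunk_size = value.parse::<usize>().map_err(|e| e.to_string())?
            }
            "norm_level" => {
                self.norm_level = value.parse::<i32>().map_err(|e| e.to_string())?
            }
            other => return Err(format!("unknown key: {other}")),
        }
        Ok(())
    }
}

/// Applies key/value pairs in whatever order the iterator yields them.
pub fn resolve<'a>(
    pairs: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> Result<CdcOptions, String> {
    let mut opts = CdcOptions::default();
    for (key, value) in pairs {
        opts.set(key, value)?;
    }
    Ok(opts)
}

/// Same as `resolve`, but fed from a `HashMap`, whose iteration order is
/// unspecified.
pub fn resolve_from_map(map: &HashMap<String, String>) -> Result<CdcOptions, String> {
    resolve(map.iter().map(|(k, v)| (k.as_str(), v.as_str())))
}
```

Because each key touches a single field, applying `enabled` before or after `min_chunk_size` resolves to the same struct, and setting only `min_chunk_size` leaves `enabled` at its default.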
Tests, generated config docs, and the information_schema snapshot are updated
accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior
(enable toggle, parameter-does-not-enable, order independence).
  self.min_chunk_size.visit(v, &key, "Minimum chunk size in bytes. The rolling hash will not trigger a split until this many bytes have been accumulated. Default is 256 KiB.");
- let key = format!("{key_prefix}.max_chunk_size");
- self.max_chunk_size.visit(v, &key, "Maximum chunk size in bytes. A split is forced when the accumulated size exceeds this value. Default is 1 MiB.");
- let key = format!("{key_prefix}.norm_level");
- self.norm_level.visit(v, &key, "Normalization level. Increasing this improves deduplication ratio but increases fragmentation. Recommended range is [-3, 3], default is 0.");
- }
+ /// Maximum chunk size in bytes. A split is forced when the accumulated