Commit c8b784a
refactor(parquet-datasource): split sink and schema_coercion out of file_format.rs (#22347)
## Which issue does this PR close?
Relates to the discussion in #22024 about the Parquet datasource crate
becoming hard to navigate. Split out of #22156, which bundled several
code moves into one PR. This is one of three smaller, independently
reviewable PRs that replace it.
## Rationale for this change
`file_format.rs` had grown to ~2,000 LOC, bundling several distinct
responsibilities into one file. That makes it hard to read and hard to
review changes in isolation. This PR is **pure code motion**: no
behavior change and no public API change.
## What changes are included in this PR?
Extracts two responsibilities from `file_format.rs` into focused modules
(`file_format.rs` drops to ~660 LOC):
- `sink.rs` — `ParquetSink` and the parallel-write machinery
(`column_serializer_task`, `spawn_column_parallel_row_group_writer`,
`output_single_parquet_file_parallelized`,
`concatenate_parallel_row_groups`, etc.).
- `schema_coercion.rs` — the Arrow-schema coercion utilities
(`apply_file_schema_type_coercions`, `coerce_int96_to_resolution`,
`coerce_file_schema_to_view_type`, `coerce_file_schema_to_string_type`,
`transform_schema_to_view`, `transform_binary_to_string`,
`field_with_new_type`) and their tests.
Every previously-public item is still reachable at the same path: the
crate root re-exports `sink::ParquetSink` and the `schema_coercion::*`
functions, and the historical `file_format::ParquetSink` path is
preserved via `pub use` (datafusion-proto depends on it).
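
The re-export arrangement above can be sketched roughly as follows. This is a simplified stand-in, not the crate's actual code: the module and re-export names come from this description, while the struct body, `coerce_placeholder`, `describe`, and `all_paths_agree` are hypothetical and exist only to show that all three paths name the same type.

```rust
// Simplified stand-in for the layout described above. Only the module
// names and the `ParquetSink` re-exports come from the PR description;
// everything else is a hypothetical placeholder.

pub mod sink {
    /// Placeholder for the real `ParquetSink`, which holds far more state.
    #[derive(Debug, Clone, PartialEq)]
    pub struct ParquetSink {
        pub path: String,
    }
}

pub mod schema_coercion {
    /// Hypothetical function standing in for the coercion utilities.
    pub fn coerce_placeholder(type_name: &str) -> String {
        format!("{}View", type_name)
    }
}

pub mod file_format {
    // Keeps the historical `file_format::ParquetSink` path working.
    pub use super::sink::ParquetSink;
}

// Crate-root re-exports: the sink type and the coercion functions.
pub use self::schema_coercion::*;
pub use self::sink::ParquetSink;

/// Accepts the type by its defining path (`sink::ParquetSink`).
pub fn describe(s: &sink::ParquetSink) -> String {
    format!("sink -> {}", s.path)
}

/// Builds a value via the root path, rebinds it via the historical
/// path, and passes both to a function typed with the defining path.
pub fn all_paths_agree() -> bool {
    let via_root = ParquetSink { path: "a.parquet".to_string() };
    let via_old: file_format::ParquetSink = via_root.clone();
    describe(&via_root) == describe(&via_old)
}
```

Because `pub use` re-exports the item itself rather than creating a new type, a value built through any one of the paths is accepted wherever another path is named, which is why downstream imports keep compiling after the move.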
## Are these changes tested?
Yes, covered by existing tests (the `coerce_int96_to_resolution_*` tests
moved with the function to `schema_coercion.rs`). `cargo test -p
datafusion-datasource-parquet --all-features` (122 passing) and `cargo
clippy -p datafusion-datasource-parquet --all-targets --all-features --
-D warnings` both pass. `datafusion-proto` (a downstream `ParquetSink`
consumer) builds clean.
## Are there any user-facing changes?
No. Public API is unchanged — every previously-public item is still
reachable at the same crate-root path. The only difference is the file
organization inside the crate.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b4739e5 commit c8b784a
4 files changed
Lines changed: 1473 additions & 1392 deletions