feat(parquet-datasource): support virtual columns (e.g. row number) in the parquet opener#22515

Closed
JanKaul wants to merge 1 commit into
apache:mainfrom
Embucket:virtual-row-number

@JanKaul JanKaul commented May 25, 2026

Contributor

Which issue does this PR close?

Rationale for this change

The upstream parquet crate exposes "virtual columns": columns that aren't physically stored in the file but are materialized by the reader (e.g. the per-row RowNumber extension type). DataFusion's parquet datasource didn't wire this through. A user who included a RowNumber-tagged field in their file schema would get a "column not found in parquet schema" error or a misaligned projection mask, for two reasons: the opener fed the full Arrow schema to ArrowReaderOptions::with_schema (which expects real fields only), and it built ProjectionMask::roots from indices that included virtual fields with no parquet leaves.

What changes are included in this PR?

  • datafusion/datasource-parquet/src/opener/mod.rs
    • Add split_virtual_fields helper that partitions a schema into (real-only schema, virtual field list).
    • In PreparedParquetOpen::load, extract virtuals from logical_file_schema and pass them to ArrowReaderOptions::with_virtual_columns; store the list on MetadataLoadedParquetOpen so later stages can re-supply the real-only schema.
    • In MetadataLoadedParquetOpen, strip virtuals before each options.with_schema(...) call via a resupply_schema closure (the underlying API rejects virtuals).
    • Drop the now-unneeded #[cfg(feature = "parquet_encryption")] let mut options = options; shadow. The new with_virtual_columns write site is unconditional, so options always has a potential mutation and the cfg-gated shadow is no longer required to silence unused_mut.
  • datafusion/datasource-parquet/src/row_filter.rs
    • In build_projection_read_plan, filter virtual roots out of the indices passed to ProjectionMask::roots / leaf_indices_for_roots (virtuals have no parquet column to mask).
    • Add a comment on build_parquet_read_plan documenting why no symmetric filter is needed: leaf_indices_for_roots silently drops indices with no matching leaf, and the decoder appends virtuals to every batch regardless of the mask, so projected_schema stays aligned.
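
The partitioning and root-filtering steps listed above can be sketched in isolation. This is a minimal model, not the code from this PR: `Field` here is a hypothetical stand-in for an Arrow field, with a boolean `is_virtual` flag in place of the extension-type check, and `real_root_indices` is an illustrative name for the filtering done in build_projection_read_plan. Only the helper name split_virtual_fields comes from the description; its real signature is not shown there, so the one below is an assumption.

```rust
// Hypothetical stand-in for an Arrow field. `is_virtual` models the
// extension-type tag (e.g. RowNumber); the real check is not modeled here.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    is_virtual: bool,
}

impl Field {
    fn new(name: &str, is_virtual: bool) -> Self {
        Field { name: name.to_string(), is_virtual }
    }
}

/// Partition a schema into (real-only fields, virtual fields).
/// Relative order is preserved within each half.
fn split_virtual_fields(schema: &[Field]) -> (Vec<Field>, Vec<Field>) {
    schema.iter().cloned().partition(|f| !f.is_virtual)
}

/// Keep only the root indices that point at real fields, since a virtual
/// field has no parquet column to include in a projection mask.
/// Out-of-range indices are dropped as well.
fn real_root_indices(schema: &[Field], roots: &[usize]) -> Vec<usize> {
    roots
        .iter()
        .copied()
        .filter(|&i| schema.get(i).map_or(false, |f| !f.is_virtual))
        .collect()
}
```

For a schema of one real field `a` followed by a virtual `row_num`, the split yields `a` on the real side and `row_num` on the virtual side, and the roots `[0, 1]` reduce to `[0]`. The sketch does not model any index remapping between the logical schema and the parquet schema.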

Are these changes tested?

Yes. Added two tests in opener/mod.rs:

  • test_virtual_row_number_column — round-trips a [Int32, RowNumber(Int64)] schema across two row groups and asserts row_num == [0..6).
  • test_virtual_row_number_column_with_row_group_pruning — same setup with a predicate (a >= 20) that prunes the first row group and confirms the row numbers come back as [3, 4, 5] (i.e. the virtual column reflects the file's absolute row indices, not the post-pruning sequence).
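
The behavior the second test checks (row numbers are absolute file positions, unaffected by pruning) can be illustrated with a small model. This sketch does not use the parquet reader: `RowGroup` and `read_with_row_numbers` are hypothetical names, pruning is simulated by comparing each group's maximum value against the threshold, and the column values in the usage note are assumed for illustration since the PR description does not list them.

```rust
/// Hypothetical model of a row group: only the values of column `a`.
struct RowGroup {
    a: Vec<i32>,
}

/// Return (a, row_number) pairs for every row in each row group that
/// survives pruning for the predicate `a >= min`.
/// Pruning works on whole groups: a group is skipped when its maximum
/// value cannot satisfy the predicate. Rows inside a kept group are not
/// filtered individually in this model.
fn read_with_row_numbers(groups: &[RowGroup], min: i32) -> Vec<(i32, i64)> {
    let mut out = Vec::new();
    // Absolute index of the first row of the current group within the file.
    let mut first_row: i64 = 0;
    for group in groups {
        let keep = group.a.iter().copied().max().map_or(false, |m| m >= min);
        if keep {
            for (offset, &value) in group.a.iter().enumerate() {
                out.push((value, first_row + offset as i64));
            }
        }
        // Advance past the group even when it is pruned, so later groups
        // keep their absolute positions.
        first_row += group.a.len() as i64;
    }
    out
}
```

With two groups holding `[1, 2, 3]` and `[20, 30, 40]`, a threshold of 20 skips the first group and the surviving rows carry row numbers 3, 4 and 5 rather than 0, 1 and 2. The key design point is that the running offset advances for pruned groups too.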

Are there any user-facing changes?

Yes — purely additive. A Field tagged with the RowNumber extension type (or any other parquet virtual-column extension) in a file schema is now honored by the parquet datasource. No existing API signatures change.

@github-actions github-actions Bot added the datasource Changes to the datasource crate label May 25, 2026
@JanKaul JanKaul closed this May 25, 2026