Allow specifying an arrow schema for PartitionedFile by fpetkovski · Pull Request #22360 · apache/datafusion

fpetkovski · 2026-05-19T10:32:20Z

Which issue does this PR close?

Closes Reading arrow schemas from parquet files is expensive #22200.

Rationale for this change

As described in the linked issue, parsing the arrow schema from parquet metadata can be expensive for point lookups, relative to the rest of the query execution pipeline. If the user knows the arrow schema of the file, they should be able to specify it explicitly.

What changes are included in this PR?

Add a arrow_schema: SchemaRef field to PartitionedFile
Use the arrow_schema field in the parquet opener to bypass schema inference from the ARROW:schema metadata field.

Are these changes tested?

Added unit tests for both matching and mismatching schemas.

Are there any user-facing changes?

There are no breaking changes, the new field is optional and is set to None by default.

kosiew

@fpetkovski
Thanks for working on this. I think there is one end-to-end serialization issue that needs to be addressed before this lands.

kosiew · 2026-05-26T10:46:03Z

    pub metadata_size_hint: Option<usize>,
    pub table_reference: Option<TableReference>,
+    /// A user-provided arrow schema for the file.
+    pub arrow_schema: Option<SchemaRef>,


I think this needs a follow-up before merge. PartitionedFile::arrow_schema introduces a new user-provided scan contract, but physical plan proto serialization currently appears to drop it. datafusion/proto/src/physical_plan/to_proto.rs builds protobuf::PartitionedFile without this field, and datafusion/proto/proto/datafusion.proto does not seem to have a schema field for it.

As a result, a Parquet scan that is serialized and deserialized would lose the explicit schema and fall back to parsing ARROW:schema, so the main guarantee from this change would not hold end to end.

Could you please add this field to the proto model and conversions, plus a roundtrip test showing that PartitionedFile::with_arrow_schema(...) survives physical plan or PartitionedFile proto serialization?

Thanks, I updated the protos to serialize and deserialize the file arrow schema as well. There is a proto test now which verifies the round trip.

kosiew

@fpetkovski

This iteration looks good to me

alamb

Looks good to me too @fpetkovski

I think it would also help if you (perhaps as a follow on PR) added an example that shows how to use this API with the background / explanation for why it is better to specify the schema if it is already known

alamb · 2026-05-28T20:41:13Z

+        let options = {
+            let mut options =
+                ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip);
+            if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() {
+                options = options.with_schema(schema.to_owned());
+            }
+            options
+        };


I think this could be collapsed / simplified:

Suggested change

let options = {

let mut options =

ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip);

if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() {

options = options.with_schema(schema.to_owned());

}

options

};

let mut options = ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip);

if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() {

options = options.with_schema(schema.to_owned());

}

alamb · 2026-05-28T20:42:30Z

+            let mut options =
+                ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip);
+            if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() {
+                options = options.with_schema(schema.to_owned());


I also think the to_owned() call is a non standard way of cloning a schema

options = options.with_schema(schema.to_owned());

I think the more standard / explicit way is

options = options.with_schema(Arc::clone(schema));

alamb · 2026-05-28T20:43:52Z

Thank you @kosiew for the review

github-actions · 2026-05-29T09:17:19Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-catalog-listing v53.1.0 (current)
       Built [  42.234s] (current)
     Parsing datafusion-catalog-listing v53.1.0 (current)
      Parsed [   0.014s] (current)
    Building datafusion-catalog-listing v53.1.0 (baseline)
       Built [  41.159s] (baseline)
     Parsing datafusion-catalog-listing v53.1.0 (baseline)
      Parsed [   0.012s] (baseline)
    Checking datafusion-catalog-listing v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.089s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 104.510s] datafusion-catalog-listing
    Building datafusion-datasource v53.1.0 (current)
       Built [  34.876s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.030s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  35.099s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.235s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field PartitionedFile.arrow_schema in /home/runner/work/datafusion/datafusion/datafusion/datasource/src/mod.rs:176

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  71.833s] datafusion-datasource
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  41.041s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.028s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  41.495s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.161s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  84.444s] datafusion-datasource-parquet
    Building datafusion-proto v53.1.0 (current)
       Built [  55.980s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  55.779s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.251s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 113.604s] datafusion-proto
    Building datafusion-proto-models v53.1.0 (current)
       Built [  23.366s] (current)
     Parsing datafusion-proto-models v53.1.0 (current)
      Parsed [   0.138s] (current)
    Building datafusion-proto-models v53.1.0 (baseline)
       Built [  23.407s] (baseline)
     Parsing datafusion-proto-models v53.1.0 (baseline)
      Parsed [   0.121s] (baseline)
    Checking datafusion-proto-models v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.610s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field PartitionedFile.arrow_schema in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/prost.rs:2138
  field PartitionedFile.arrow_schema in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/prost.rs:2138

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  49.953s] datafusion-proto-models

fpetkovski · 2026-05-29T13:13:29Z

Thanks @alamb I addressed both of your comments. I am happy to follow up with an example, need to get familiar with the patterns that are used there first.

github-actions Bot added catalog Related to the catalog crate datasource Changes to the datasource crate labels May 19, 2026

kosiew requested changes May 26, 2026

View reviewed changes

fpetkovski force-pushed the partitioned-file-schema branch 6 times, most recently from 1bdaf0e to bf0c9ab Compare May 26, 2026 17:01

github-actions Bot added the proto Related to proto crate label May 26, 2026

fpetkovski added 6 commits May 27, 2026 15:07

Allow specifying an arrow schema for PartitionedFile

cf490a4

Fix cargo fmt

2db9b93

Fix clippy

021b9bb

Fix cargo fmt

9be11f3

Integrate arrow schema in proto

d011c64

Use TryFromProto

2c719c8

fpetkovski force-pushed the partitioned-file-schema branch from 7feb2ce to 2c719c8 Compare May 27, 2026 13:08

fpetkovski requested a review from kosiew May 27, 2026 13:53

kosiew approved these changes May 28, 2026

View reviewed changes

alamb approved these changes May 28, 2026

View reviewed changes

Simplify options init

451c718

github-actions Bot added the auto detected api change Auto detected API change label May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow specifying an arrow schema for PartitionedFile#22360

Allow specifying an arrow schema for PartitionedFile#22360
fpetkovski wants to merge 7 commits into
apache:mainfrom
fpetkovski:partitioned-file-schema

fpetkovski commented May 19, 2026

Uh oh!

kosiew left a comment •

edited

Loading

Uh oh!

kosiew May 26, 2026

Uh oh!

fpetkovski May 26, 2026

Uh oh!

Uh oh!

kosiew left a comment

Uh oh!

alamb left a comment

Uh oh!

alamb May 28, 2026

Uh oh!

alamb May 28, 2026

Uh oh!

alamb commented May 28, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

fpetkovski commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fpetkovski commented May 19, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

kosiew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kosiew May 26, 2026

Choose a reason for hiding this comment

Uh oh!

fpetkovski May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kosiew left a comment

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb May 28, 2026

Choose a reason for hiding this comment

Uh oh!

alamb May 28, 2026

Choose a reason for hiding this comment

Uh oh!

alamb commented May 28, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

fpetkovski commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kosiew left a comment •

edited

Loading