Allow specifying an arrow schema for PartitionedFile#22360
Conversation
There was a problem hiding this comment.
@fpetkovski
Thanks for working on this. I think there is one end-to-end serialization issue that needs to be addressed before this lands.
| pub metadata_size_hint: Option<usize>, | ||
| pub table_reference: Option<TableReference>, | ||
| /// A user-provided arrow schema for the file. | ||
| pub arrow_schema: Option<SchemaRef>, |
There was a problem hiding this comment.
I think this needs a follow-up before merge. PartitionedFile::arrow_schema introduces a new user-provided scan contract, but physical plan proto serialization currently appears to drop it. datafusion/proto/src/physical_plan/to_proto.rs builds protobuf::PartitionedFile without this field, and datafusion/proto/proto/datafusion.proto does not seem to have a schema field for it.
As a result, a Parquet scan that is serialized and deserialized would lose the explicit schema and fall back to parsing ARROW:schema, so the main guarantee from this change would not hold end to end.
Could you please add this field to the proto model and conversions, plus a roundtrip test showing that PartitionedFile::with_arrow_schema(...) survives physical plan or PartitionedFile proto serialization?
There was a problem hiding this comment.
Thanks, I updated the protos to serialize and deserialize the file arrow schema as well. There is a proto test now which verifies the round trip.
1bdaf0e to
bf0c9ab
Compare
7feb2ce to
2c719c8
Compare
kosiew
left a comment
There was a problem hiding this comment.
This iteration looks good to me
alamb
left a comment
There was a problem hiding this comment.
Looks good to me too @fpetkovski
I think it would also help if you (perhaps as a follow on PR) added an example that shows how to use this API with the background / explanation for why it is better to specify the schema if it is already known
| let options = { | ||
| let mut options = | ||
| ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip); | ||
| if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() { | ||
| options = options.with_schema(schema.to_owned()); | ||
| } | ||
| options | ||
| }; |
There was a problem hiding this comment.
I think this could be collapsed / simplified:
| let options = { | |
| let mut options = | |
| ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip); | |
| if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() { | |
| options = options.with_schema(schema.to_owned()); | |
| } | |
| options | |
| }; | |
| let mut options = ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip); | |
| if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() { | |
| options = options.with_schema(schema.to_owned()); | |
| } |
| let mut options = | ||
| ArrowReaderOptions::new().with_page_index_policy(PageIndexPolicy::Skip); | ||
| if let Some(schema) = self.partitioned_file.arrow_schema.as_ref() { | ||
| options = options.with_schema(schema.to_owned()); |
There was a problem hiding this comment.
I also think the to_owned() call is a non standard way of cloning a schema
options = options.with_schema(schema.to_owned());I think the more standard / explicit way is
options = options.with_schema(Arc::clone(schema));|
Thank you @kosiew for the review |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
Thanks @alamb I addressed both of your comments. I am happy to follow up with an example, need to get familiar with the patterns that are used there first. |
Which issue does this PR close?
Rationale for this change
As described in the linked issue, parsing the arrow schema from parquet metadata can be expensive for point lookups, relative to the rest of the query execution pipeline. If the user knows the arrow schema of the file, they should be able to specify it explicitly.
What changes are included in this PR?
arrow_schema: SchemaReffield toPartitionedFilearrow_schemafield in the parquet opener to bypass schema inference from theARROW:schemametadata field.Are these changes tested?
Added unit tests for both matching and mismatching schemas.
Are there any user-facing changes?
There are no breaking changes, the new field is optional and is set to None by default.