fix: propagate parent struct null mask in GetStructField #4523
Open
schenksj wants to merge 2 commits into
A field of a NULL struct must be NULL (Spark semantics). Arrow stores a StructArray's child arrays with their own validity, INDEPENDENT of the parent struct's null buffer, so the raw child value at a row where the struct itself is null can be non-null (e.g. parquet files where a logically-null struct column still carries a populated child buffer). GetStructField.evaluate returned the child column verbatim, so isnotnull(struct.field) wrongly evaluated TRUE for a null struct.

Fix: union the parent struct's null mask into the extracted child (null where the struct is null OR the child is null). Adds a standalone unit test that fails without the fix and passes with it.

Closes apache#4432
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Which issue does this PR close?
Closes #4432.
Rationale for this change
A field of a NULL struct must be NULL (Spark semantics). Arrow stores a StructArray's child arrays with their own validity, independent of the parent struct's null buffer, so the raw child value at a row where the struct itself is null can be non-null (e.g. Parquet files where a logically-null struct column still carries a populated child buffer). GetStructField::evaluate returned the child column verbatim, so isnotnull(struct.field) wrongly evaluated to TRUE for a null struct.
What changes are included in this PR?
GetStructField now unions the parent struct's null mask into the extracted child (null where the struct is null OR the child is null), via a project_field helper used by both the array and scalar-struct evaluation paths.
How are these changes tested?
Added a standalone unit test, field_of_null_struct_is_null, that builds a StructArray whose child buffer is non-null at every row while the struct is null at some rows. The test fails without the fix (the field comes back non-null for the null-struct rows) and passes with it.
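For readers who want the idea without the Arrow types, here is a minimal, dependency-free sketch of the null-mask union and of the scenario the test exercises. It is not the code in this PR: Int32Column, StructColumn, is_not_null and sample_struct are hypothetical stand-ins, validity is modeled as a plain Vec<bool> (true = valid) instead of an Arrow null buffer, and project_field only borrows the helper's name, with a body that is an assumption about its behavior.

```rust
/// Hypothetical stand-in for an Arrow Int32 array: values plus a validity
/// mask (true = valid, false = null). This is NOT the arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
struct Int32Column {
    values: Vec<i32>,
    validity: Vec<bool>,
}

/// Hypothetical stand-in for a StructArray with a single child field.
/// The child keeps its own validity, independent of the parent's.
struct StructColumn {
    child: Int32Column,
    validity: Vec<bool>,
}

/// Assumed behavior of a project_field-style helper: the extracted field is
/// valid only where BOTH the struct and the child are valid, i.e. it is null
/// where the struct is null OR the child is null.
fn project_field(parent: &StructColumn) -> Int32Column {
    let validity = parent
        .validity
        .iter()
        .zip(&parent.child.validity)
        .map(|(parent_valid, child_valid)| *parent_valid && *child_valid)
        .collect();
    Int32Column {
        values: parent.child.values.clone(),
        validity,
    }
}

/// Row-wise isnotnull over a column in this simplified model.
fn is_not_null(col: &Int32Column) -> Vec<bool> {
    col.validity.clone()
}

/// Builds the shape described for the unit test: the child is non-null at
/// every row, while the struct itself is null at row 1.
fn sample_struct() -> StructColumn {
    StructColumn {
        child: Int32Column {
            values: vec![1, 2, 3],
            validity: vec![true, true, true],
        },
        validity: vec![true, false, true],
    }
}
```

In this model, returning the child verbatim reports every row as non-null, which is the bug described above; going through project_field marks row 1 as null while leaving the underlying values untouched.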