Related to the work on struct array handling:
When filtering on struct fields (e.g. `WHERE s['value'] > 5`), DataFusion currently cannot prune row groups using Parquet column statistics, even though the underlying leaf columns have valid min/max statistics stored in the Parquet metadata.
The issue is in the pruning predicate system. When it encounters a `GetField` expression such as `GetField(Column("s"), "value")`, the column extraction logic only sees the parent struct `Column(s)` and does not resolve through to the nested field.
Fixing this would mean teaching the pruning system to resolve `GetField` expressions down to their leaf columns and then look up the corresponding Parquet column statistics. Note that the statistics themselves are already present in the Parquet metadata; they are just never consulted for nested field access.
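To make the idea concrete, here is a minimal standalone sketch of the resolution step. It does not use DataFusion's real types: `Expr`, `MinMax`, `leaf_path`, and `can_prune_gt` are hypothetical names invented for illustration, and it assumes statistics are keyed by a dotted leaf path such as `s.value`. It only handles a `> literal` predicate over integer stats.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    GetField(Box<Expr>, String),
}

/// Min/max statistics for one leaf column in one row group (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Resolve a (possibly nested) field access to a dotted leaf path,
/// e.g. GetField(Column("s"), "value") becomes "s.value".
fn leaf_path(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => name.clone(),
        Expr::GetField(base, field) => format!("{}.{}", leaf_path(base), field),
    }
}

/// Returns true when a row group can be skipped for `expr > threshold`.
/// Without statistics for the leaf, the row group must be kept.
fn can_prune_gt(expr: &Expr, threshold: i64, stats: &HashMap<String, MinMax>) -> bool {
    match stats.get(&leaf_path(expr)) {
        // No value in the row group exceeds the threshold, so nothing can match.
        Some(s) => s.max <= threshold,
        None => false,
    }
}

fn main() {
    // s['value'] > 5
    let expr = Expr::GetField(Box::new(Expr::Column("s".to_string())), "value".to_string());

    // Per-row-group statistics, keyed by leaf path (assumed layout).
    let row_groups = vec![MinMax { min: 0, max: 5 }, MinMax { min: 3, max: 20 }];

    for (i, mm) in row_groups.iter().enumerate() {
        let mut stats = HashMap::new();
        stats.insert("s.value".to_string(), *mm);
        println!(
            "row group {} (min={}, max={}): prune = {}",
            i,
            mm.min,
            mm.max,
            can_prune_gt(&expr, 5, &stats)
        );
    }
}
```

In this sketch the first row group is skipped because its maximum does not exceed 5, while the second must still be read. A real fix would also need to handle other operators, nulls, and non-integer types.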
On tables with many row groups, this could significantly reduce the amount of data read for queries with struct field predicates.