Commit a540c0b

feat: provide expected stats schema (delta-io#1592)

## What changes are proposed in this pull request?

related: delta-io#1075

This PR adds `SchemaTransform`s and methods to get the expected stats schema for a given table. Ultimately the aim is to support writing stats as `stats_parsed` to checkpoint files.

When putting up the PR some questions arose.

First, when would we even want to use an explicit stats schema? All current code paths that use the schema seem better served by the current approach of assuming there might be stats for all columns. IIRC, the implementation is quite efficient in handling the potentially large number of null columns. During log replay, I'd assume we would apply the same approach when parsing the stats JSON read from a file, and request the full stats schema when reading from a checkpoint?

For now I see two cases where we may want to make use of an explicit stats schema.

* as a hint for which statistics an engine should collect when writing files.
* when parsing the `stats` field from JSON commits to write to a checkpoint.

Secondly, about the specifics of constructing the schema.

* should we error if a column within the column names property is not skipping-eligible?
* should we count non-eligible fields towards the number of indexed cols?
* is the `nullCount` schema always derived from the min/max schema? It seems null counts could be collected for arbitrary columns.

The Java kernel constructs a full schema first, omitting non-eligible fields for min/max stats. Then the schema is pruned based on the leaves referenced in the skipping predicate. The use case for writing to checkpoints does not seem to be supported (yet)?

## Update

After some investigation, this is what we do now ...

After going through the delta-spark implementation I do hope we have something now that corresponds to how Spark does it.

The way we do things now, we first compute a base schema from the table configuration - i.e. `dataSkippingNumIndexedCols` and `dataSkippingStatsColumns`.

Two things that may not be immediately obvious:

* num indexed cols does not omit any fields, but treats maps and arrays as leaf fields and then counts all leaves.
* stats columns allows specifying struct fields, which implies all child fields are included as well.

For `nullCount` we just alter all the data types in the base schema, also treating maps and arrays as leaves. For min/max we prune the base schema to only include skipping-eligible fields - most primitive types.

The one big question that came up is how to treat Variant. So far I just treat them the same as maps and arrays ...

## How was this change tested?

Additional unit tests.

---------

Signed-off-by: Robert Pack <robstar.pack@gmail.com>
Co-authored-by: Drake Lin <drakelin18@gmail.com>
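The derivation described in the Update section can be sketched roughly as follows. This is a minimal illustration on a toy schema model, not the code from this commit: `DataType`, `Field`, `field`, `base_schema`, `null_count_schema` and `min_max_schema` are hypothetical names, only the num-indexed-cols path is covered (explicit stats columns are left out), and the set of skipping-eligible types is assumed to be just `Long` and `String` here.

```rust
// Toy schema model for illustration only; these are NOT the delta-kernel-rs types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Long,
    String,
    Array,   // element type elided; counted as a single leaf
    Map,     // key/value types elided; counted as a single leaf
    Variant, // treated like maps and arrays
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Base schema: walk the fields in order and keep leaves until the
/// budget `remaining` (the number of indexed columns) is used up.
/// Only structs are descended into; everything else counts as one leaf.
fn base_schema(fields: &[Field], remaining: &mut usize) -> Vec<Field> {
    let mut out = Vec::new();
    for f in fields {
        if *remaining == 0 {
            break;
        }
        match &f.data_type {
            DataType::Struct(children) => {
                let kept = base_schema(children, remaining);
                if !kept.is_empty() {
                    out.push(Field { name: f.name.clone(), data_type: DataType::Struct(kept) });
                }
            }
            _ => {
                *remaining -= 1;
                out.push(f.clone());
            }
        }
    }
    out
}

/// nullCount schema: same shape as the base schema, every leaf becomes Long.
fn null_count_schema(fields: &[Field]) -> Vec<Field> {
    fields
        .iter()
        .map(|f| Field {
            name: f.name.clone(),
            data_type: match &f.data_type {
                DataType::Struct(children) => DataType::Struct(null_count_schema(children)),
                _ => DataType::Long,
            },
        })
        .collect()
}

/// min/max schema: prune leaves that are not skipping-eligible, and drop
/// structs that end up empty. Eligibility (Long, String) is an assumption.
fn min_max_schema(fields: &[Field]) -> Vec<Field> {
    fields
        .iter()
        .filter_map(|f| match &f.data_type {
            DataType::Struct(children) => {
                let kept = min_max_schema(children);
                (!kept.is_empty())
                    .then(|| Field { name: f.name.clone(), data_type: DataType::Struct(kept) })
            }
            DataType::Long | DataType::String => Some(f.clone()),
            DataType::Array | DataType::Map | DataType::Variant => None,
        })
        .collect()
}
```

With a budget of three leaves and a schema of `id: Long`, `tags: Array`, `user: { name: String, attrs: Map }`, `extra: String`, the base schema keeps `id`, `tags` and `user.name`; the `nullCount` schema turns all three into `Long`, and the min/max schema drops `tags`.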

1 parent 98a9372 commit a540c0b

3 files changed

kernel/src
- scan
  - data_skipping.rs
  - data_skipping
    - stats_schema.rs
- table_configuration.rs

`kernel/src/scan/data_skipping.rs`

     Engine, EngineData, ExpressionEvaluator, JsonHandler, PredicateEvaluator, RowVisitor as _,
 };
 +pub(crate) mod stats_schema;
 #[cfg(test)]
 mod tests;
