Commit a540c0b
feat: provide expected stats schema (delta-io#1592)
## What changes are proposed in this pull request?
related: delta-io#1075
This PR adds `SchemaTransform`s and methods to get the expected stats
schema for a given table. Ultimately the aim is to support writing stats
as `stats_parsed` to checkpoint files.
When putting up the PR some questions arose.
First, when would we even want to use an explicit stats schema?
All current code paths, that use the schema seem better served using the
current approach of assuming the might be stats for all columns. IIRC,
the implementation is quite efficient in handling the potentially large
number of null columns.
During log replay when reading the stats from a file I'd assume we would
apply the same approach when parsing the stats json and request the full
stats schema when reading from a checkpoint?
For now I see two cases where we may want to make use of an explicit
stats schema.
* as a hint what statistics an engine should collect when writing files.
* when parsing the `stats` field from json commits to write to a
checkpoint.
Secondly about the specifics of constructing the schema.
* should we error if a column within the column names property is not
skipping eligible?
* should we count non eligible fields towards the number of indexed
cols?
* is the `nullCount` schema always derived from the min/max schema, it
seems null counts could be collected for arbitrary columns?
The java kernel constructs a full schema first, omitting non eligible
fields for min max stats. Then the schema is pruned based on the leafs
referenced in the skipping predicate. The use case for writing to
checkpoints does not seem to be supported (yet)?
## Update
After some investigation, this is what we do now ...
After going through the delta-spark implementation I do hope we have
something now that corresponds to how spark does it.
The way we do things now, first we compute a base schema based on the
table configuration - i.e. dataSkippingNumIndexedCols and
dataSkippingStatsColumns. Two things that may not immediately be
obvious.
num indexed cols does not omit any fields but treats maps and arrays as
leaf fields. Then counts all leafs.
stats columns allows for specifying struct fields which implies all
child fields are included as well.
For nullCount we just alter all the data-types also treating maps,
arrays as leafs.
For min/max we prune the above schema to only include skipping-eligible
fields - most primitive types.
The one big question that came up - how to treat Variant. So far I just
treat them is same as maps and arrays ...
## How was this change tested?
Additional unit tests.
---------
Signed-off-by: Robert Pack <robstar.pack@gmail.com>
Co-authored-by: Drake Lin <drakelin18@gmail.com>1 parent 98a9372 commit a540c0b
3 files changed
Lines changed: 689 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
| 23 | + | |
23 | 24 | | |
24 | 25 | | |
25 | 26 | | |
| |||
0 commit comments