Update the Variant RFC to include more details for scalars and nullability #31
Merged
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,11 +23,15 @@ enum Variant { | |
| } | ||
| ``` | ||
|
|
||
| Here the semantic `null` value inside the variant payload is represented as | ||
| `Scalar::null(DType::Null)`. That is distinct from the outer nullability of the | ||
| `Variant` dtype itself. | ||
|
|
||
| Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema at all. | ||
|
|
||
| Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns. | ||
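To make the shredding idea concrete, here is a minimal sketch of pulling one integer key out of each row into a dense typed column while keeping the rest in its flexible form. Everything in it is an assumption for illustration: `Value`, `Shredded`, and `shred_int_key` are hypothetical names, and the layout is not the Vortex or Parquet encoding.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a variant value; not a real Vortex or Parquet type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Result of shredding one key: a dense typed column plus the leftover values.
#[derive(Debug, PartialEq)]
struct Shredded {
    /// One slot per row; `None` when the key is missing or is not an integer.
    typed: Vec<Option<i64>>,
    /// What remains of each row, kept in the flexible representation.
    residual: Vec<Value>,
}

/// Move integer values stored under `key` into a typed column.
fn shred_int_key(rows: &[Value], key: &str) -> Shredded {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Object(fields) => match fields.get(key) {
                Some(Value::Int(i)) => {
                    // The value lives in the typed column and leaves the residual.
                    let mut rest = fields.clone();
                    rest.remove(key);
                    typed.push(Some(*i));
                    residual.push(Value::Object(rest));
                }
                // Missing key or a different type: nothing to shred for this row.
                _ => {
                    typed.push(None);
                    residual.push(row.clone());
                }
            },
            // Non-object rows are kept untouched.
            other => {
                typed.push(None);
                residual.push(other.clone());
            }
        }
    }
    Shredded { typed, residual }
}
```

Reads of the shredded key only need to touch `typed`, which is where the performance benefit comes from; rows whose value has an unexpected type stay fully in `residual`, so no data is lost.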
|
|
||
| This document proposed adding a new `DType` variant named `Variant`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below). | ||
| This document proposes adding a new `DType::Variant(Nullability)`, a logical type describing this group of data encodings and behavior, with its own canonical representation (see below). | ||
|
|
||
| ### Arrow representation | ||
|
|
||
|
|
@@ -37,9 +41,25 @@ Supporting extension types requires replacing the target `DataType` and nullabil | |
|
|
||
| ### Nullability | ||
|
|
||
| In order to support data with a changing or unexpected schema, Variant arrays are always nullable, even for a specific key/path, its value might change type between items which will cause null values in shredded children. | ||
| `Variant` should follow the same top-level nullability model as every other Vortex dtype: | ||
| `DType::Variant(Nullability)` can be nullable or non-nullable. A nullable variant allows the | ||
| array slot itself to be absent. A non-nullable variant guarantees that the slot is present, but it | ||
| does **not** guarantee that extracted paths will be non-null. | ||
|
|
||
| This is distinct from the semantic null value inside the variant payload, which I'll call | ||
| `variantnull`. A `variantnull` is a present variant value whose payload is | ||
| `null`, while an outer null is the absence of the variant value itself. | ||
| In scalar form this is the difference between `Scalar::null(DType::Variant(Nullability::Nullable))` | ||
| and `Scalar::variant(Scalar::null(DType::Null))`. | ||
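The two kinds of null can be sketched with a plain `Option` around a payload enum. This is only a toy model under my own assumptions: `Payload`, `Slot`, `NullKind`, and `classify` are hypothetical names, not Vortex types.

```rust
/// Hypothetical payload model; `Payload::Null` plays the role of `variantnull`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Payload {
    Null,
    Int(i64),
    Str(String),
}

/// One slot of a nullable variant column: `None` is the outer null.
type Slot = Option<Payload>;

#[derive(Debug, PartialEq)]
enum NullKind {
    /// The variant value itself is absent.
    OuterNull,
    /// The variant value is present and its payload is null.
    VariantNull,
    /// The variant value is present with a non-null payload.
    NotNull,
}

/// Tell the two kinds of null apart for a single slot.
fn classify(slot: &Slot) -> NullKind {
    match slot {
        None => NullKind::OuterNull,
        Some(Payload::Null) => NullKind::VariantNull,
        Some(_) => NullKind::NotNull,
    }
}
```

In this model a non-nullable variant column would simply store `Payload` instead of `Slot`, which rules out `OuterNull` but still allows `VariantNull`.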
|
|
||
| Combined with shredding, handling nulls can be complex and is encoding dependent (Like this [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) for handling arrays). | ||
| Typed extraction from a variant should therefore still return nullable arrays even when the source | ||
| variant column is non-nullable. A path can be missing in a given row, have an unexpected type, or | ||
| evaluate to `variantnull`, and each of those cases becomes null in the extracted child. | ||
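A small sketch of why the extracted child has to be nullable, assuming a toy `Value` enum and hypothetical helpers `lookup` and `extract_i64` (none of these are Vortex APIs). The input column has no outer nulls, yet three of the cases described above still produce null.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a variant value; `Value::Null` is `variantnull`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Walk `path` through nested objects; `None` if any segment is missing.
fn lookup<'a>(value: &'a Value, path: &[&str]) -> Option<&'a Value> {
    let mut current = value;
    for key in path {
        match current {
            Value::Object(fields) => current = fields.get(*key)?,
            // Trying to descend into a non-object counts as a missing path.
            _ => return None,
        }
    }
    Some(current)
}

/// Extract `path` as `i64` from every row of a non-nullable variant column.
fn extract_i64(rows: &[Value], path: &[&str]) -> Vec<Option<i64>> {
    rows.iter()
        .map(|row| match lookup(row, path) {
            Some(Value::Int(i)) => Some(*i),
            // Missing path, unexpected type, and variantnull all become null.
            _ => None,
        })
        .collect()
}
```

Note that the three null-producing cases are indistinguishable in the output; a caller that needs to tell them apart would have to inspect the variant itself rather than the typed child.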
|
|
||
| Combined with shredding, handling nulls can still be complex and is encoding dependent (like this | ||
| [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) | ||
| for handling arrays), but that is separate from whether the outer `Variant` column itself is | ||
| nullable. | ||
|
|
||
| ### Expressions | ||
|
|
||
|
|
@@ -54,7 +74,14 @@ Every variant encoding will need to be able to dispatch these behaviors, returni | |
|
|
||
| ### Scalar | ||
|
|
||
| While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above. | ||
| While there has been talk for a long time of converting the Vortex scalar system from an enum to | ||
| length 1 arrays, I do believe the current system actually works very well for variants. A variant | ||
| scalar can simply wrap another row-specific `Scalar`, rather than needing a dedicated scalar enum | ||
| just for variants. | ||

Review comment on the first line of this paragraph (Contributor): I think we are moving away from this now

Reply (Collaborator, Author): I'll leave it here as an historical artifact? IDK where that's discussed 🤷
|
|
||
| That model also makes the null semantics explicit. `Scalar::null(DType::Variant(Nullability::Nullable))` | ||
| means the variant scalar itself is missing. `Scalar::variant(Scalar::null(DType::Null))` means the | ||
| variant is present and its payload is `variantnull`. | ||
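As a rough illustration of the wrapping model, the sketch below defines a toy `Scalar` whose `Variant` case boxes another `Scalar`. The constructor names `Scalar::null` and `Scalar::variant` mirror the ones used in this RFC, but the enum layout and the `is_outer_null` / `is_variant_null` helpers are my own assumptions, not the actual Vortex definitions.

```rust
/// Toy copies of the RFC's names; not the real Vortex types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Nullability {
    NonNullable,
    Nullable,
}

#[derive(Debug, Clone, PartialEq)]
enum DType {
    Null,
    Variant(Nullability),
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    /// A null scalar of the given dtype.
    Null(DType),
    /// A plain integer scalar, here only so a variant has something to wrap.
    I64(i64),
    /// A present variant wrapping a row-specific inner scalar.
    Variant(Box<Scalar>),
}

impl Scalar {
    fn null(dtype: DType) -> Self {
        Scalar::Null(dtype)
    }

    fn variant(inner: Scalar) -> Self {
        Scalar::Variant(Box::new(inner))
    }

    /// True when the variant scalar itself is missing (outer null).
    fn is_outer_null(&self) -> bool {
        matches!(self, Scalar::Null(DType::Variant(_)))
    }

    /// True when the variant is present and its payload is `variantnull`.
    fn is_variant_null(&self) -> bool {
        match self {
            Scalar::Variant(inner) => matches!(**inner, Scalar::Null(DType::Null)),
            _ => false,
        }
    }
}
```

Because the wrapper reuses the existing scalar cases, a variant holding an integer is just `Scalar::variant(Scalar::I64(3))`, and the two nulls compare as different values without any special casing.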
|
|
||
| Just like when extracting child arrays, Variants need to support an additional expression, `get_variant_scalar(idx, path, dtype)`, that will indicate the desired dtype. | ||
|
|
||
|
|
@@ -113,7 +140,7 @@ As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type | |
| - Iceberg seems to support the variant type (as described in [this](https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/edit?tab=t.0) proposal), but the docs are minimal. | ||
| - Datafusion's variant support is being developed [here](https://github.com/datafusion-contrib/datafusion-variant), it's unclear to me how much effort is going into it and whether it's going to be merged upstream. | ||
| - DuckDB doesn't support a variant type. It does have a [Union](https://duckdb.org/docs/stable/sql/data_types/union) type, but it's basically a struct. It also seems to have support for Parquet's shredding, but I can't find any docs and it seems like PRs are being merged as I'm looking through their issues. | ||
| - Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions). | ||
| - Databricks supports some specialized [variant functions](https://docs.databricks.com/gcp/en/sql/language-manual/sql-ref-functions-builtin#variant-functions), and their docs show a [good example](https://docs.databricks.com/aws/en/sql/language-manual/functions/is_variant_null) of null vs variant null. | ||
|
|
||
| ## Unresolved Questions | ||
|
|
||
|
|
||