This is a tracking issue for adding the DType::Union variant and canonical encoding for Union.
Motivation
See the parent epic issue: #7705
Design
We will be following the Arrow specification of Union, though we have an important decision to make here. Arrow specifies both a dense and a sparse union layout. Both layouts (encodings in Vortex) would have child arrays for each type that a value could be, as well as a types child array that specifies which type in the Union array a given row is (and thus, which child array to look at to fetch the correct value).
Dense vs Sparse
Dense encoding: each child array contains only the values that actually belong to that type, packed contiguously. An additional offsets buffer tells us which slot of the chosen child holds the value for that row.
Slicing and filtering requires walking type_ids to recompute per-child offsets.
type_ids: [0, 1, 0, 1, 0]
offsets: [0, 0, 1, 1, 2]
child 0 (int): [10, 20, 30]
child 1 (string): ["a", "b"]
Sparse encoding: every child array has the same length as the union itself, so the value at row i is simply found at position i of child type_ids[i].
The "inactive" slots of each child are unused but must still be addressable. There is no offsets buffer.
type_ids: [0, 1, 0, 1, 0]
child 0 (int): [10, _, 20, _, 30] // _ slots are inactive
child 1 (string): [_, "a", _, "b", _]
Why Sparse should be Canonical (and not Dense)
We want to support quick execution of things like slice and filter on our canonical arrays, and sparse is clearly superior to dense in this case (since dense would have to perform compute just to figure out how to slice child arrays).
Additionally, the space loss of sparse in Arrow does not necessarily translate to Vortex because our canonical types do not require the children to also be canonical. This means that actually sparse children under a sparse Union (for example, if I have 99000 integers and 1000 strings all separated) can be encoded with a SparseArray, so we would not lose much space compared to the dense union encoding.
It then follows that DenseUnionArray can just be a different physical encoding of Union. I think the compressor might have to be smarter about compressing into Dense though, since SparseUnion is basically always more performant for the compute we want to do.
Type IDs
Per Arrow's Schema.fbs: "By default ids in the type vector refer to the offsets in the children. Optionally typeIds provides an indirection between the child offset and the type id for each child."
So the per-row type tag defaults to consecutive 0..N matching child offsets, but an optional typeIds: [i8] array can provide indirection. For example, children [a, b, c] with typeIds = [0, 5, 7] means the data buffer uses tags 0, 5, 7 to select them (This matters for schema evolution because it means removing a child doesn't renumber the others).
UnionVariants will have to carry this information.
Also note that UnionVariants will enforce that variant names cannot be duplicated, and must be unique. The arrow specification says nothing about this, but we can enforce this behavior as the alternative of having duplicate variant names is insane.
Unresolved questions
Steps
Implementation history
This is a tracking issue for adding the
DType::Unionvariant and canonical encoding forUnion.Motivation
See the parent epic issue: #7705
Design
We will be following the Arrow specification of Union, though we have an important decision to make here. Arrow specifies both a dense and a sparse union layout. Both layouts (encodings in Vortex) would have child arrays for each type that a value could be, as well as a types child array that specifies which type in the
Unionarray a given row is (and thus, which child array to look at to fetch the correct value).Dense vs Sparse
Dense encoding: each child array contains only the values that actually belong to that type, packed contiguously. An additional
offsetsbuffer tells us which slot of the chosen child holds the value for that row.Slicing and filtering requires walking
type_idsto recompute per-child offsets.Sparse encoding: every child array has the same length as the union itself, so the value at row
iis simply found at positioniof childtype_ids[i].The "inactive" slots of each child are unused but must still be addressable. There is no
offsetsbuffer.Why Sparse should be Canonical (and not Dense)
We want to support quick execution of things like slice and filter on our canonical arrays, and sparse is clearly superior to dense in this case (since dense would have to perform compute just to figure out how to slice child arrays).
Additionally, the space loss of sparse in Arrow does not necessarily translate to Vortex because our canonical types do not require the children to also be canonical. This means that actually sparse children under a sparse
Union(for example, if I have 99000 integers and 1000 strings all separated) can be encoded with aSparseArray, so we would not lose much space compared to the dense union encoding.It then follows that
DenseUnionArraycan just be a different physical encoding ofUnion. I think the compressor might have to be smarter about compressing intoDensethough, sinceSparseUnionis basically always more performant for the compute we want to do.Type IDs
Per Arrow's
Schema.fbs: "By default ids in the type vector refer to the offsets in the children. OptionallytypeIdsprovides an indirection between the child offset and the type id for each child."So the per-row type tag defaults to consecutive
0..Nmatching child offsets, but an optionaltypeIds: [i8]array can provide indirection. For example, children[a, b, c]withtypeIds = [0, 5, 7]means the data buffer uses tags0,5,7to select them (This matters for schema evolution because it means removing a child doesn't renumber the others).UnionVariantswill have to carry this information.Also note that
UnionVariantswill enforce that variant names cannot be duplicated, and must be unique. The arrow specification says nothing about this, but we can enforce this behavior as the alternative of having duplicate variant names is insane.Unresolved questions
Steps
DType::Unionvariant +UnionVariants(flatbuffer/proto schema updates, round-trip tests)UnionArray(vtable, validation,Canonical::Union,Canonical::empty)to_arrow/from_arrowround-trip (handles both dense and sparse Arrow unions → canonical sparse Vortex)UnionBuilder+UnionScalar(+ scalar.proto update)DenseUnionArrayas a separate non-canonical physical encoding + compressor changes?Implementation history
DType::UnionandUnionVariants#7890DType::Unionvariant carrying justNullability#7901UnionVariantstype and embed it inDType::Union#7902