Tracking Issue: Add `DType::Union`

This is a tracking issue for adding the `DType::Union` variant and canonical encoding for `Union`.

## Motivation

See the parent epic issue: https://github.com/vortex-data/vortex/issues/7705

## Design

We will be following the [Arrow specification of Union](https://arrow.apache.org/docs/format/Columnar.html#union-layout), though we have an important decision to make here. Arrow specifies both a dense and a sparse union layout. Both layouts (encodings in Vortex) would have child arrays for each type that a value could be, as well as a types child array that specifies which type in the `Union` array a given row is (and thus, which child array to look at to fetch the correct value).

### Dense vs Sparse

[Dense](https://arrow.apache.org/docs/format/Columnar.html#dense-union) encoding: each child array contains only the values that actually belong to that type, packed contiguously. An additional `offsets` buffer tells us which slot of the chosen child holds the value for that row.

Slicing and filtering requires walking `type_ids` to recompute per-child offsets.

```
type_ids: [0, 1, 0, 1, 0]
offsets:  [0, 0, 1, 1, 2]
child 0 (int):    [10, 20, 30]
child 1 (string): ["a", "b"]
```

[Sparse](https://arrow.apache.org/docs/format/Columnar.html#sparse-union) encoding: every child array has the same length as the union itself, so the value at row `i` is simply found at position `i` of child `type_ids[i]`.

The "inactive" slots of each child are unused but must still be addressable. There is no `offsets` buffer.

```
type_ids: [0, 1, 0, 1, 0]
child 0 (int):    [10, _, 20, _, 30]   // _ slots are inactive
child 1 (string): [_, "a", _, "b", _]
```

### Why Sparse should be Canonical (and not Dense)

We want to support quick execution of things like slice and filter on our canonical arrays, and sparse is clearly superior to dense in this case (since dense would have to perform compute just to figure out how to slice child arrays).

Additionally, the space loss of sparse in Arrow does not necessarily translate to Vortex because our canonical types do not require the children to also be canonical. This means that actually sparse children under a sparse `Union` (for example, if I have 99000 integers and 1000 strings all separated) can be encoded with a `SparseArray`, so we would not lose much space compared to the dense union encoding.

It then follows that `DenseUnionArray` can just be a different physical encoding of `Union`. I think the compressor might have to be smarter about compressing into `Dense` though, since `SparseUnion` is basically _always_ more performant for the compute we want to do.

### Type IDs

Per [Arrow's `Schema.fbs`](https://github.com/apache/arrow/blob/main/format/Schema.fbs): *"By default ids in the type vector refer to the offsets in the children. Optionally `typeIds` provides an indirection between the child offset and the type id for each child."*

So the per-row type tag defaults to consecutive `0..N` matching child offsets, but an optional `typeIds: [i8]` array can provide indirection. For example, children `[a, b, c]` with `typeIds = [0, 5, 7]` means the data buffer uses tags `0`, `5`, `7` to select them (This matters for schema evolution because it means removing a child doesn't renumber the others).

`UnionVariants` will have to carry this information.

Also note that `UnionVariants` will enforce that variant names cannot be duplicated, and must be unique. The arrow specification says nothing about this, but we can enforce this behavior as the alternative of having duplicate variant names is insane.

## Unresolved questions

- [ ] The Arrow specification states that the nullability of the union array is determined solely by the child arrays and the types array (it is whatever the nullability of the chosen type array is according to the types array). Is this something we need to follow? This would be different from the rest of our canonical types, which all track their validity in a dedicated validity mask.
    - I think the answer is that it just doesn't give any more information that we can't already express. So we will follow the Arrow spec, and validity will be a lazily-computed thing that is essentially a gather from a bunch of different validity masks from the children.
- [ ] TODO?

## Steps

- [ ] Add `DType::Union` variant + `UnionVariants` (flatbuffer/proto schema updates, round-trip tests)
- [ ] Canonical sparse `UnionArray` (vtable, validation, `Canonical::Union`, `Canonical::empty`)
- [ ] Compute kernels: slice, take, filter, mask, scalar accessor, cast
- [ ] Arrow `to_arrow` / `from_arrow` round-trip (handles both dense and sparse Arrow unions → canonical sparse Vortex)
- [ ] `UnionBuilder` + `UnionScalar` (+ scalar.proto update)
- [ ] `DenseUnionArray` as a separate non-canonical physical encoding + compressor changes?

## Implementation history

- https://github.com/vortex-data/vortex/pull/7890
- https://github.com/vortex-data/vortex/pull/7901
- https://github.com/vortex-data/vortex/pull/7902

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking Issue: Add `DType::Union` #7882

Motivation

Design

Dense vs Sparse

Why Sparse should be Canonical (and not Dense)

Type IDs

Unresolved questions

Steps

Implementation history

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Tracking Issue: Add DType::Union #7882

Description

Motivation

Design

Dense vs Sparse

Why Sparse should be Canonical (and not Dense)

Type IDs

Unresolved questions

Steps

Implementation history

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Tracking Issue: Add `DType::Union` #7882