Add UnionVariants type and embed it in DType::Union#7902
Conversation
Merging this PR will improve performance by 19.98%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | varbinview_zip_block_mask |
3.7 ms | 2.9 ms | +27.71% |
| ⚡ | Simulation | varbinview_zip_fragmented_mask |
6.9 ms | 6.2 ms | +12.72% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ct/union-variants (bb803d1) with develop (c59e16f)
| /// child at offset `i`. Supporting non-consecutive tags from v1 lets the schema remove children | ||
| /// without renumbering the remaining tags. | ||
| /// | ||
| /// Variant names must be distinct. Unlike [`StructFields`](crate::dtype::StructFields), which |
There was a problem hiding this comment.
I am not sure I agree with this.
This will stop arrow round trip.
There was a problem hiding this comment.
Allowing this would prevent duckdb round trip which explicitly disallows duplicates.
So allowing duplicates just because arrow forgot to specify it in the spec does not seem worth being completely incompatible with duckdb.
There was a problem hiding this comment.
Looks like arrow doesn’t care about names since the children are resolved via types buffer. Why does duckdb care? Duplicate names will never conflict since children are always mutually exclusive.
| /// Returns an error if names, dtypes, or type IDs do not all have the same length, or if there | ||
| /// are any duplicate names or type ids. | ||
| pub fn try_new(names: FieldNames, dtypes: Vec<DType>, type_ids: Vec<i8>) -> VortexResult<Self> { | ||
| Self::validate_shape(&names, dtypes.len(), &type_ids)?; |
There was a problem hiding this comment.
this look too expensive for release.
There was a problem hiding this comment.
What do you mean release? Are you just saying that we need a new_unchecked?
Even if there were 128 relatively long names for each of the union variants, I dont think this is that expensive, and if somehow this ends up in a hot loop we can easily optimize this.
| /// | ||
| /// `names` and `dtypes` must have the same length, and `names.len()` cannot be more than | ||
| /// `i8::MAX as usize + 1` (128). | ||
| pub fn new_consecutive(names: FieldNames, dtypes: Vec<DType>) -> VortexResult<Self> { |
There was a problem hiding this comment.
shall we just call this new?
|
|
||
| /// Find the offset of a variant by name. Returns `None` if no variant has the name. | ||
| pub fn find(&self, name: impl AsRef<str>) -> Option<usize> { | ||
| self.0.indices().get(name.as_ref()).copied() |
There was a problem hiding this comment.
i would think this is slower for small variants?
4d92593 to
06f2495
Compare
f7a0f34 to
fef8707
Compare
1cfc585 to
fce9864
Compare
Replaces `DType::Union(Nullability)` with `DType::Union(UnionVariants, Nullability)`. `UnionVariants` carries a `FieldNames` list, per-variant `FieldDType`s, and an `i8` type-id vector (supporting non-consecutive tags like `[0, 5, 7]` so children can be removed without renumbering). The flatbuffer and protobuf `Union` schemas are extended with the `names`/`dtypes`/`type_ids` fields, and serde, flatbuffers, and protobuf paths are filled in with round-trip tests and nullability-constraint checks. All the `todo!()` stubs added in the previous commit for `DType::Union(..)` arms are left in place; concrete implementations land alongside the array work. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
fce9864 to
bb803d1
Compare
|
I need to override the flatc backcompat check here because we split this into 2 PRs |
e290a2b to
bb803d1
Compare
Summary
Tracking issue: #7882
Replaces
DType::Union(Nullability)withDType::Union(UnionVariants, Nullability).UnionVariantscarries aFieldNameslist, per-variantFieldDTypes, and ani8type-id vector (supporting non-consecutive tags like[0, 5, 7]so children can be removed without renumbering). We usei8because that is what arrow uses, and because there is already a max of 128 different union variants for any given array.A lot of the code is essentially duplicated from implementations of Struct. It could have been nice if we could reuse StructFields, but we need to additionally store a type IDs field that maps the different variant types to tag indexes (i.e. 0, 1, 2, 5), so we might as well have a new type for this that is named more appropriately.
Also note that the tags do not have to be consecutive (to support potential schema evolution, and also it does make things like casts easier).
I've also enforced that variant names are distinct. The Arrow spec does not make any guarantee about this, but I've chosen the sane path that DuckDB also chooses.
I'm going to guess that some of us will want to just allow duplicate variant names, so we should just discuss below.
API Changes
Changes
DType::Unionto carryUnionVariants.Testing
Some basic roundtrip testing for the serialization formats.