Commit f9feb1a

Extension Types (#5)
[Rendered](https://github.com/vortex-data/rfcs/blob/ct/ext-types/proposed/0005-extension.md)

This RFC proposes extending the `ExtVTable` trait to support richer behavior for extension types beyond forwarding to the storage type. It covers the already completed vtable infrastructure, the proposed trait design for types/scalars/arrays, and identifies open questions around the extension array API and compute dispatch.

The big open question here is whether we need to do this at all. And if we do need to do this, what should the extension array API look like? Importantly, how should we define expressions over extension type arrays?

---------

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

1 parent 7a6f9c0 commit f9feb1a

1 file changed: +251 -0 lines changed

proposed/0005-extension.md

Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
- Start Date: (2026-02-27)
- RFC PR: [vortex-data/rfcs#5](https://github.com/vortex-data/rfcs/pull/5)
- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)

## Summary

We would like to build a more robust system for extension data types (or `DType`s). This RFC
proposes a direction for extending the `ExtVTable` trait to support richer behavior (beyond
forwarding to the storage type), lays out the completed and in-progress work, and identifies the
open questions that remain.

## Motivation

A limitation of the current type system in Vortex is that we cannot easily add new logical types.
For example, the effort to add `FixedSizeList`
([vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) and also change `List` to
`ListView` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) was very intrusive.
It is much easier to add wrappers around canonical types (treating the canonical dtype as a
"storage type") and implement some additional logic than to add a new variant to the `DType` enum.

### Storage DTypes

Extension types work by wrapping an existing canonical `DType`, called the **storage dtype**. The
storage dtype is itself a logical type (e.g., `Primitive`, `Struct`, `List`), and the extension
type is a logical wrapper over it that layers on additional semantics such as validation, display
formatting, and (eventually) custom compute logic.

For example, a `Timestamp` extension type has a `Primitive` storage dtype. Under the hood, a
timestamp array is just a primitive array of integers, but the extension layer knows that those
integers represent microseconds since the Unix epoch. Similarly, a `Union` extension type might
use `Struct` as its storage dtype, wrapping a struct of fields with union-specific dispatch logic.

This separation means that adding a new logical type does not require changes to the core canonical
type system, the compressor, or the I/O layer. Extension types get compression for free because
data is always read from and written to disk as the underlying storage dtype.

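To make the wrapping idea concrete, here is a minimal sketch of an extension type layered over a storage dtype. Everything in it (`ToyStorageDType`, `ToyExtDType`, the `toy.timestamp` ID, and `display_timestamp_micros`) is a hypothetical stand-in invented for illustration and is not part of the Vortex API.

```rust
// Hypothetical stand-in for a canonical storage dtype. NOT the Vortex `DType`.
#[derive(Debug, Clone, PartialEq)]
enum ToyStorageDType {
    I64,
    Utf8,
}

/// A toy extension dtype: an ID plus the storage dtype it wraps.
#[derive(Debug, Clone, PartialEq)]
struct ToyExtDType {
    id: &'static str,
    storage: ToyStorageDType,
}

impl ToyExtDType {
    /// Builds a timestamp-like extension type, rejecting non-integer storage.
    /// This is the "validation" layer the extension adds on top of storage.
    fn timestamp(storage: ToyStorageDType) -> Result<Self, String> {
        match storage {
            ToyStorageDType::I64 => Ok(Self { id: "toy.timestamp", storage }),
            other => Err(format!("timestamp requires I64 storage, got {other:?}")),
        }
    }
}

/// Renders a raw storage value using the extension's semantics
/// (assumed here: microseconds since the Unix epoch).
/// This is the "display formatting" layer; the stored data is still just an i64.
fn display_timestamp_micros(micros: i64) -> String {
    let secs = micros.div_euclid(1_000_000);
    let frac = micros.rem_euclid(1_000_000);
    format!("{secs}.{frac:06}s since epoch")
}
```

The point of the sketch is that the extension only adds checks and interpretation; the data it describes never leaves its storage representation.
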
### Current State

Vortex provides an `Extension` variant of `DType` to help with this. Currently, implementors can
add a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`)
and specifying a storage dtype. For example, the time extension types use a primitive storage dtype,
meaning they wrap the primitive scalars or primitive arrays with some extra logic on top (mostly
validating that the timestamps are valid).

We would like to add many more extension types. Some notable extension types (and their likely
storage types) include:

- **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond
  to levels of nesting. There are many open questions on the design of this, but that is out of
  scope for this RFC.
- **Union**: The sum type of an algebraic data type, like a Rust enum. One approach is to implement
  this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`).
  Vortex is well suited to represent this because it can compress each of the type field arrays
  independently, so we do not need to distinguish between "Sparse" and "Dense" unions.
- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of
  scope for this RFC.

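The `Struct { Primitive, Struct { types } }` layout mentioned for **Union** can be sketched as a tag column paired with one child column per variant. The types below (`ToyUnionColumn`, `ToyUnionChildren`, `ToyUnionValue`) and the choice of two variants are assumptions made purely for illustration; they say nothing about how Vortex would actually implement a union.

```rust
/// Per-variant child columns (the inner `Struct { types }`).
/// Each child has one slot per row; a slot is `None` when that variant is inactive.
struct ToyUnionChildren {
    ints: Vec<Option<i64>>,
    strs: Vec<Option<String>>,
}

/// A toy columnar union of `i64 | String` (the outer struct: tag + children).
struct ToyUnionColumn {
    /// One tag per row: 0 selects `ints`, 1 selects `strs`.
    tags: Vec<u8>,
    children: ToyUnionChildren,
}

/// The native value a row resolves to.
#[derive(Debug, PartialEq)]
enum ToyUnionValue<'a> {
    Int(i64),
    Str(&'a str),
}

impl ToyUnionColumn {
    /// Union-specific dispatch: read the tag, then read the matching child.
    /// Returns `None` for out-of-bounds rows, unknown tags, or empty slots.
    fn get(&self, row: usize) -> Option<ToyUnionValue<'_>> {
        match *self.tags.get(row)? {
            0 => self.children.ints.get(row).copied().flatten().map(ToyUnionValue::Int),
            1 => self.children.strs.get(row)?.as_deref().map(ToyUnionValue::Str),
            _ => None,
        }
    }
}
```

Because each child column is stored separately, a compressor is free to encode the mostly-empty children compactly, which is the property the bullet above relies on.
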
The issue with the current system is that it only forwards logic to the underlying storage type.
The only other behavior we support is serializing and pretty-printing extension arrays. This means
that we cannot define custom compute logic for extension types.

Take the time extension types as an example of where this limitation does not matter. If we want to
run a `compare` expression over a timestamp array, we just run the `compare` over the underlying
primitive array. For simple types like timestamps, this is sufficient (and this is what we do right
now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also
fine.

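As a rough illustration of what "forwarding to the storage type" means for `compare`, the sketch below runs a less-than comparison on a toy timestamp array by delegating to its storage. `ToyTimestampArray`, `lt_storage`, and `lt_timestamps` are hypothetical names, and the sketch assumes both inputs use the same time unit.

```rust
/// Hypothetical extension array: a timestamp column stored as plain i64 values.
struct ToyTimestampArray {
    storage: Vec<i64>,
}

/// Element-wise `<` over the storage representation.
/// This function knows nothing about timestamps.
fn lt_storage(lhs: &[i64], rhs: &[i64]) -> Result<Vec<bool>, String> {
    if lhs.len() != rhs.len() {
        return Err(format!("length mismatch: {} vs {}", lhs.len(), rhs.len()));
    }
    Ok(lhs.iter().zip(rhs).map(|(l, r)| l < r).collect())
}

/// The extension-level compare adds no logic of its own: it forwards to storage.
/// Assumption: both arrays share the same time unit, so integer order is time order.
fn lt_timestamps(
    lhs: &ToyTimestampArray,
    rhs: &ToyTimestampArray,
) -> Result<Vec<bool>, String> {
    lt_storage(&lhs.storage, &rhs.storage)
}
```
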
However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely
insufficient as these types need custom compute logic. Given that, we want a more robust
implementation path instead of wrapping `ExtensionArray` and performing significant internal
dispatch work.

## Design

### Background

[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables,
or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g., `Timestamp`)
now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata.
The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`.

There were a few blockers (detailed in the tracking issue
[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)),
but now that those have been resolved, we can move forward.

### Proposed Design

Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place
all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from
`ExtDTypeVTable`).

It will look something like the following:

```rust
// Note: naming should be considered unstable.

/// The public API for defining new extension types.
///
/// This is the non-object-safe trait that plugin authors implement to define a new extension
/// type. It specifies the type's identity, metadata, serialization, and validation.
pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash {
    /// Associated type containing the deserialized metadata for this extension type.
    type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash;

    /// A native Rust value that represents a scalar of the extension type.
    ///
    /// The value only represents non-null values. We denote nullable values as `Option<Value>`.
    type NativeValue<'a>: Display;

    /// Returns the ID for this extension type.
    fn id(&self) -> ExtId;

    // Methods related to the extension `DType`.

    /// Serialize the metadata into a byte vector.
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;

    /// Deserialize the metadata from a byte slice.
    fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>;

    /// Validate that the given storage type is compatible with this extension type.
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;

    // Methods related to the extension scalar values.

    /// Validate that the given storage value is compatible with the extension type.
    ///
    /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the
    /// result.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &DType,
        storage_value: &ScalarValue,
    ) -> VortexResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value)
            .map(|_| ())
    }

    /// Validate and unpack a native value from the storage [`ScalarValue`].
    ///
    /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage
    /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the
    /// storage value is compatible with the storage dtype on construction.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> VortexResult<Self::NativeValue<'a>>;

    // Methods related to extension arrays (`ArrayRef`).

    /// Validate that the given storage array is compatible with this extension type.
    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;

    /// Cast the given array to the `target` dtype (default body elided in this sketch).
    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }

    // Additional compute methods TBD.
}
```

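To show how an implementor might use a trait of this shape, the following sketch defines a trimmed-down analogue of the scalar-related methods and implements it for a made-up "percent" type. `ToyDType`, `ToyScalarValue`, `ToyResult`, `ToyExtVTable`, and `Percent` are all hypothetical stand-ins so that the example compiles on its own; they are not the Vortex types, and the real trait has more methods and stricter bounds.

```rust
use std::fmt::{self, Display};

// Hypothetical stand-ins for Vortex's `DType`, `ScalarValue`, and `VortexResult`.
#[derive(Debug, Clone, PartialEq)]
enum ToyDType {
    I64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
enum ToyScalarValue {
    I64(i64),
    Utf8(String),
}

type ToyResult<T> = Result<T, String>;

/// Trimmed-down analogue of the proposed trait (dtype and scalar methods only).
trait ToyExtVTable {
    type Metadata: 'static;
    type NativeValue<'a>: Display;

    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &ToyDType) -> ToyResult<()>;

    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a ToyDType,
        storage_value: &'a ToyScalarValue,
    ) -> ToyResult<Self::NativeValue<'a>>;

    /// Default: a value is valid exactly when it can be unpacked.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &ToyDType,
        storage_value: &ToyScalarValue,
    ) -> ToyResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value)
            .map(|_| ())
    }
}

/// Hypothetical extension type: i64 storage constrained to `0..=max`.
struct Percent;

struct PercentMeta {
    max: i64,
}

struct PercentValue(i64);

impl Display for PercentValue {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}%", self.0)
    }
}

impl ToyExtVTable for Percent {
    type Metadata = PercentMeta;
    type NativeValue<'a> = PercentValue;

    fn validate_dtype(&self, _metadata: &PercentMeta, storage_dtype: &ToyDType) -> ToyResult<()> {
        match storage_dtype {
            ToyDType::I64 => Ok(()),
            other => Err(format!("percent requires I64 storage, got {other:?}")),
        }
    }

    fn unpack_native<'a>(
        &self,
        metadata: &'a PercentMeta,
        _storage_dtype: &'a ToyDType,
        storage_value: &'a ToyScalarValue,
    ) -> ToyResult<Self::NativeValue<'a>> {
        match storage_value {
            ToyScalarValue::I64(v) if (0..=metadata.max).contains(v) => Ok(PercentValue(*v)),
            other => Err(format!("invalid percent value: {other:?}")),
        }
    }
}
```

Only `unpack_native` carries real logic here; scalar validation falls out of the default method, which mirrors the relationship between `validate_scalar_value` and `unpack_native` in the trait above.
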
Most of the implementation work will be making sure that `ExtDTypeRef` (which we pass around as the
`Extension` variant of `DType`) has the correct methods that access the internal, type-erased
`ExtVTable`.

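One common Rust pattern for putting a non-object-safe trait behind a type-erased handle is to pair it with an object-safe mirror trait and a generic adapter. The sketch below shows that pattern under stated assumptions: `ToyVTable`, `DynToyExt`, `Typed`, `ToyExtDTypeRef`, and `NonNegative` are hypothetical names, and this is not necessarily how `ExtDTypeRef` is implemented in Vortex.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a storage scalar value.
#[derive(Debug, Clone, PartialEq)]
enum ToyScalarValue {
    I64(i64),
    Utf8(String),
}

/// Typed, non-object-safe trait (it is `Sized` and has an associated type).
trait ToyVTable: 'static + Sized + Send + Sync {
    type Metadata: 'static + Send + Sync;
    fn id(&self) -> &'static str;
    fn validate(&self, metadata: &Self::Metadata, value: &ToyScalarValue) -> Result<(), String>;
}

/// Object-safe mirror: no associated types and no generics in its methods.
trait DynToyExt: Send + Sync {
    fn id(&self) -> &'static str;
    fn validate_storage_value(&self, value: &ToyScalarValue) -> Result<(), String>;
}

/// Adapter that pairs a concrete vtable with its metadata.
struct Typed<V: ToyVTable> {
    vtable: V,
    metadata: V::Metadata,
}

impl<V: ToyVTable> DynToyExt for Typed<V> {
    fn id(&self) -> &'static str {
        self.vtable.id()
    }

    fn validate_storage_value(&self, value: &ToyScalarValue) -> Result<(), String> {
        // The metadata is supplied here, so callers of the erased handle never see it.
        self.vtable.validate(&self.metadata, value)
    }
}

/// Type-erased handle that can be stored inside a dtype enum variant.
#[derive(Clone)]
struct ToyExtDTypeRef(Arc<dyn DynToyExt>);

impl ToyExtDTypeRef {
    fn new<V: ToyVTable>(vtable: V, metadata: V::Metadata) -> Self {
        Self(Arc::new(Typed { vtable, metadata }))
    }

    fn id(&self) -> &'static str {
        self.0.id()
    }

    fn validate_storage_value(&self, value: &ToyScalarValue) -> Result<(), String> {
        self.0.validate_storage_value(value)
    }
}

/// Hypothetical extension type: a non-negative i64.
struct NonNegative;

impl ToyVTable for NonNegative {
    type Metadata = ();

    fn id(&self) -> &'static str {
        "toy.non_negative"
    }

    fn validate(&self, _metadata: &(), value: &ToyScalarValue) -> Result<(), String> {
        match value {
            ToyScalarValue::I64(v) if *v >= 0 => Ok(()),
            other => Err(format!("expected non-negative i64, got {other:?}")),
        }
    }
}
```

Under this pattern, every method the erased handle should expose has to be added to the object-safe mirror, which is why the shape of the array API determines how much surface the handle needs.
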
Take extension scalars as an example. The only behavior we need from extension scalars is validating
that they have correct values, displaying them, and unpacking them into native types. So we added
these methods to `ExtDTypeRef`:

```rust
impl ExtDTypeRef {
    /// Formats an extension scalar value using the current dtype for metadata context.
    pub fn fmt_storage_value<'a>(
        &'a self,
        f: &mut fmt::Formatter<'_>,
        storage_value: &'a ScalarValue,
    ) -> fmt::Result { ... }

    /// Validates that the given storage scalar value is valid for this dtype.
    pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... }
}
```

**Open question**: What should the API for extension arrays look like? The answer will determine
what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above.

## Compatibility

This should not break anything because extension types are mostly related to in-memory APIs (since
data is read from and written to disk as the storage type).

## Drawbacks

If forwarding to the storage type turns out to be sufficient for all extension types, the
additional vtable surface area adds complexity without clear benefit.

## Alternatives

We could have many `ExtensionArray` wrappers with custom logic. This approach would be clunky and
may not scale.

## Prior Art

Apache Arrow allows defining
[extension types](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types)
and also provides a
[set of canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html).

## Unresolved Questions

- Is forwarding to the storage type insufficient, and which extension types genuinely need custom
  compute logic?
- What should the `ExtVTable` API for extension arrays look like? What methods beyond
  `validate_array` are needed?
- How should compute expressions be defined and dispatched for extension types?

## Future Possibilities

If we can get extension types working well, we can add all of the following types:

- `DateTimeParts` (`Primitive`)
- Matrix (`FixedSizeList`)
- Tensor (`FixedSizeList`)
- UUID (Do we need to add `FixedSizeBinary` as a canonical type?)
- JSON (`Utf8`)
- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`)
- Union
  - Sparse (`Struct { Primitive, Struct { types } }`)
  - Dense[^1]
- Map (`List<Struct { K, V }>`)
- Tags: See this
  [discussion](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892),
  where we think we can represent this with (`ListView<Utf8>`)
- `Struct` but with protobuf-style field numbers (`Struct`)
- **NOT** Variant: see [RFC 0015 (Variant Type)](../accepted/0015-variant-type.md). Variant cannot
  be an extension type because there is no way to define a storage dtype when the schema is not
  known ahead of time for each row. Instead, Variant will have its own `DType` variant.
- And likely more.

[^1]:
    `Struct` doesn't work here because children can have different lengths, but what we could do
    is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would
    effectively be the exact same but with the overhead of tracking indices for each of the child
    fields. In that case, it might just be better to always use a "sparse" union and let the
    compressor decide what to do.
