Commit 3223fa6 ("RFC 0005: extension"), committed by gatesnclaude
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent: cc5c54d. 1 file changed: rfcs/0005-extension.md (253 additions, 0 deletions)

- Start Date: 2026-02-27
- Authors: Connor Tsui
- RFC PR: [vortex-data/rfcs#5](https://github.com/vortex-data/rfcs/pull/5)

# Extension Types

## Summary

We would like to build a more robust system for extension data types (or `DType`s). This RFC
proposes a direction for extending the `ExtVTable` trait to support richer behavior (beyond
forwarding to the storage type), lays out the completed and in-progress work, and identifies the
open questions that remain.

## Motivation

A limitation of the current type system in Vortex is that we cannot easily add new logical types.
For example, the effort to add `FixedSizeList`
([vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) and to change `List` to
`ListView` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) was very intrusive.
It is much easier to add wrappers around canonical types (treating the canonical dtype as a
"storage type") and implement some additional logic on top than it is to add a new variant to the
`DType` enum.

### Storage DTypes

Extension types work by wrapping an existing canonical `DType`, called the **storage dtype**. The
storage dtype is itself a logical type (e.g., `Primitive`, `Struct`, `List`), and the extension
type is a logical wrapper over it that layers on additional semantics such as validation, display
formatting, and (eventually) custom compute logic.

For example, a `Timestamp` extension type has a `Primitive` storage dtype. Under the hood, a
timestamp array is just a primitive array of integers, but the extension layer knows that those
integers represent microseconds since the Unix epoch. Similarly, a `Union` extension type might
use `Struct` as its storage dtype, wrapping a struct of fields with union-specific dispatch logic.

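To make the storage relationship concrete, here is a minimal, self-contained sketch of the
timestamp interpretation described above. Every name in it (`StorageMicros`, `TimestampMicros`,
`unpack_micros`) is a hypothetical stand-in for illustration, not a Vortex API: the stored data
stays a plain integer, and only the extension layer's interpretation of it changes.

```rust
/// Illustrative stand-in for the storage side: a plain `i64`, exactly what a
/// primitive array would hold on disk.
type StorageMicros = i64;

/// The extension layer's view of the same integer, now understood as
/// microseconds since the Unix epoch.
#[derive(Debug, PartialEq)]
struct TimestampMicros {
    seconds: i64,
    micros: u32,
}

/// Unpack a storage value into the extension-level native value.
/// `div_euclid`/`rem_euclid` keep the sub-second part non-negative for
/// pre-epoch timestamps.
fn unpack_micros(storage: StorageMicros) -> TimestampMicros {
    TimestampMicros {
        seconds: storage.div_euclid(1_000_000),
        micros: storage.rem_euclid(1_000_000) as u32,
    }
}
```

Only the `i64` column ever reaches the compressor or the I/O layer; the struct above exists purely
in memory.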
This separation means that adding a new logical type does not require changes to the core canonical
type system, the compressor, or the I/O layer. Extension types get compression for free because
data is always read from and written to disk as the underlying storage dtype.

### Current State

Vortex provides an `Extension` variant of `DType` for this purpose. Currently, implementors can
add a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`)
and specifying a storage dtype. For example, the time extension types use a primitive storage
dtype, meaning they wrap primitive scalars or primitive arrays with some extra logic on top
(mostly validating that the timestamps are valid).

We would like to add many more extension types. Some notable extension types (and their likely
storage types) include:

- **Matrix / Tensor**: An extension over `FixedSizeList`, where dimensions correspond to levels
  of nesting. There are many open questions about its design, but those are out of scope for
  this RFC.
- **Union**: The sum type of an algebraic data type, like a Rust enum. One approach is to
  implement this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`).
  Vortex is well suited to represent this because it can compress each of the type field arrays
  independently, so we do not need to distinguish between a "Sparse" and a "Dense" union.
- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out
  of scope for this RFC.

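The tag-plus-fields union layout described above can be sketched in a few lines. `UnionColumn` and
its fields are hypothetical stand-ins, not Vortex types; the point is only that a `Primitive` tag
column selects, per row, which field of the inner `Struct { types }` holds that row's value.

```rust
/// Hypothetical stand-in for the `Struct { Primitive, Struct { types } }` layout.
struct UnionColumn {
    /// The `Primitive` tag field: an index into `variants` for each row.
    tags: Vec<u8>,
    /// The inner `Struct { types }` fields, one column per variant. For
    /// simplicity both variants here are i64; each column is row-aligned
    /// (the "sparse" style), with unused slots left as filler.
    variants: Vec<Vec<i64>>,
}

impl UnionColumn {
    /// Resolve row `i` by dispatching through the tag, as a union extension
    /// type would when forwarding reads to its storage struct.
    fn value(&self, i: usize) -> i64 {
        let tag = self.tags[i] as usize;
        self.variants[tag][i]
    }
}
```

Because each variant column compresses independently, the filler slots in unused positions are
cheap, which is why the sparse/dense distinction matters less here than in other formats.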
The issue with the current system is that it only forwards logic to the underlying storage type.
The only other behavior we support is serializing and pretty-printing extension arrays. This means
that we cannot define custom compute logic for extension types.

Take the time extension types as an example of where this limitation does not matter. If we want
to run a `compare` expression over a timestamp array, we just run the `compare` over the
underlying primitive array. For simple types like timestamps, this is sufficient (and it is what
we do today). For types like Tensors (which are simply type aliases over `FixedSizeList`), this
is also fine.

However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is
likely insufficient, as these types need custom compute logic. Given that, we want a more robust
implementation path rather than wrapping `ExtensionArray` and performing significant internal
dispatch work.

## Design

### Background

[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables,
implemented as Rust unit structs with methods) for extension `DType`s. Each extension type (e.g.,
`Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and
metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`.

There were a few blockers (detailed in the tracking issue
[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), but now that those have been
resolved, we can move forward.

### Proposed Design

Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place
all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from
`ExtDTypeVTable`).

It will look something like the following:

```rust
// Note: naming should be considered unstable.

/// The public API for defining new extension types.
///
/// This is the non-object-safe trait that plugin authors implement to define a new extension
/// type. It specifies the type's identity, metadata, serialization, and validation.
pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash {
    /// Associated type containing the deserialized metadata for this extension type.
    type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash;

    /// A native Rust value that represents a scalar of the extension type.
    ///
    /// The value only represents non-null values. We denote nullable values as `Option<Value>`.
    type NativeValue<'a>: Display;

    /// Returns the ID for this extension type.
    fn id(&self) -> ExtId;

    // Methods related to the extension `DType`.

    /// Serialize the metadata into a byte vector.
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;

    /// Deserialize the metadata from a byte slice.
    fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>;

    /// Validate that the given storage type is compatible with this extension type.
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;

    // Methods related to the extension scalar values.

    /// Validate that the given storage value is compatible with the extension type.
    ///
    /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the
    /// result.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &DType,
        storage_value: &ScalarValue,
    ) -> VortexResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value)
            .map(|_| ())
    }

    /// Validate and unpack a native value from the storage [`ScalarValue`].
    ///
    /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage
    /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the
    /// storage value is compatible with the storage dtype on construction.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> VortexResult<Self::NativeValue<'a>>;

    // Methods related to extension arrays (`ArrayRef`).

    /// Validate that the given storage array is compatible with the extension type.
    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;

    /// Cast an extension array to the given target dtype.
    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }

    // Additional compute methods TBD.
}
```

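To illustrate the shape of this contract, here is a compilable sketch of an implementation against
a cut-down version of the trait. Everything below is a simplification for illustration, not the
real Vortex API: `ExtVTableSketch` keeps only the ID and metadata round-trip methods, `String`
errors stand in for `VortexResult`, and the `TimestampVTable`/`TimeUnit` names are hypothetical.

```rust
/// Cut-down stand-in for the proposed `ExtVTable`, keeping only the identity
/// and metadata round-trip methods.
trait ExtVTableSketch {
    type Metadata;

    fn id(&self) -> &'static str;
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> Result<Vec<u8>, String>;
    fn deserialize_metadata(&self, bytes: &[u8]) -> Result<Self::Metadata, String>;
}

/// Hypothetical timestamp extension whose metadata is the time unit,
/// serialized as a single byte.
#[derive(Clone, Debug)]
struct TimestampVTable;

#[derive(Clone, Debug, PartialEq)]
enum TimeUnit {
    Seconds,
    Micros,
}

impl ExtVTableSketch for TimestampVTable {
    type Metadata = TimeUnit;

    fn id(&self) -> &'static str {
        "vortex.timestamp"
    }

    fn serialize_metadata(&self, metadata: &Self::Metadata) -> Result<Vec<u8>, String> {
        Ok(vec![match metadata {
            TimeUnit::Seconds => 0,
            TimeUnit::Micros => 1,
        }])
    }

    fn deserialize_metadata(&self, bytes: &[u8]) -> Result<Self::Metadata, String> {
        match bytes {
            [0] => Ok(TimeUnit::Seconds),
            [1] => Ok(TimeUnit::Micros),
            other => Err(format!("invalid timestamp metadata: {other:?}")),
        }
    }
}
```

The key property the sketch preserves is that serialization and deserialization are inverses, and
that deserialization validates its input rather than trusting it.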
Most of the implementation work will be making sure that `ExtDTypeRef` (which we pass around as
the `Extension` variant of `DType`) has the correct methods for accessing the internal,
type-erased `ExtVTable`.

Take extension scalars as an example. The only behavior we need from extension scalars is
validating that they hold correct values, displaying them, and unpacking them into native types.
So we added these methods to `ExtDTypeRef`:

```rust
impl ExtDTypeRef {
    /// Formats an extension scalar value using the current dtype for metadata context.
    pub fn fmt_storage_value<'a>(
        &'a self,
        f: &mut fmt::Formatter<'_>,
        storage_value: &'a ScalarValue,
    ) -> fmt::Result { ... }

    /// Validates that the given storage scalar value is valid for this dtype.
    pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... }
}
```

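The type-erasure step (a non-object-safe trait with an associated `Metadata` type, hidden behind a
uniform reference) can be sketched as below. All names here are illustrative stand-ins, not
Vortex's: the pattern is an object-safe shim trait that pairs a vtable with its
already-deserialized metadata, so callers never see the associated type.

```rust
use std::sync::Arc;

/// Cut-down stand-in for the non-object-safe extension trait: the associated
/// `Metadata` type is what prevents direct `dyn` use.
trait ExtSketch: 'static + Send + Sync {
    type Metadata;

    fn validate(&self, metadata: &Self::Metadata, storage: i64) -> Result<(), String>;
}

/// Object-safe shim: no associated types, so it can live behind `dyn`.
trait ErasedExt: Send + Sync {
    fn validate_storage(&self, storage: i64) -> Result<(), String>;
}

/// The pairing that makes erasure possible: a vtable bound to its metadata.
struct Bound<V: ExtSketch> {
    vtable: V,
    metadata: V::Metadata,
}

impl<V: ExtSketch> ErasedExt for Bound<V>
where
    V::Metadata: Send + Sync + 'static,
{
    fn validate_storage(&self, storage: i64) -> Result<(), String> {
        self.vtable.validate(&self.metadata, storage)
    }
}

/// Stand-in for `ExtDTypeRef`: a shared, type-erased handle.
type ExtRefSketch = Arc<dyn ErasedExt>;

/// Tiny example vtable: storage values must be non-negative.
struct NonNegative;

impl ExtSketch for NonNegative {
    type Metadata = ();

    fn validate(&self, _: &(), storage: i64) -> Result<(), String> {
        if storage >= 0 {
            Ok(())
        } else {
            Err("negative storage value".to_string())
        }
    }
}
```

This is why the scalar methods above can live on `ExtDTypeRef` directly: the shim forwards each
call to the concrete vtable without exposing its associated types.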
**Open question**: What should the API for extension arrays look like? The answer will determine
what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above.

## Compatibility

This should not break anything, because extension types mostly concern in-memory APIs: data is
read from and written to disk as the storage type.

## Drawbacks

If forwarding to the storage type turns out to be sufficient for all extension types, the
additional vtable surface area adds complexity without clear benefit.

## Alternatives

We could instead have many `ExtensionArray` wrappers with custom logic, but this approach would
be clunky and is unlikely to scale.

## Prior Art

Apache Arrow allows defining
[extension types](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types)
and also provides a
[set of canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html).

## Unresolved Questions

- Is forwarding to the storage type insufficient, and which extension types genuinely need custom
  compute logic?
- What should the `ExtVTable` API for extension arrays look like? What methods beyond
  `validate_array` are needed?
- How should compute expressions be defined and dispatched for extension types?

## Future Possibilities

If we can get extension types working well, we can add all of the following types:

- `DateTimeParts` (`Primitive`)
- Matrix (`FixedSizeList`)
- Tensor (`FixedSizeList`)
- UUID (Do we need to add `FixedSizeBinary` as a canonical type?)
- JSON (`UTF8`)
- PDX: <https://arxiv.org/pdf/2503.04422v1> (`FixedSizeList`)
- Union
  - Sparse (`Struct { Primitive, Struct { types } }`)
  - Dense[^1]
- Map (`List<Struct { K, V }>`)
- Tags: See this
  [discussion](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892),
  where we think we can represent this with `ListView<Utf8>`
- `Struct` with protobuf-style field numbers (`Struct`)
- **NOT** Variant: see [RFC 0015 (Variant Type)](../accepted/0015-variant-type.md). Variant cannot
  be an extension type because there is no way to define a storage dtype when the schema is not
  known ahead of time for each row. Instead, Variant will have its own `DType` variant.
- And likely more.

[^1]:
    `Struct` doesn't work here because children can have different lengths. What we could do
    instead is force the inner `Struct { types }` to hold `SparseArray` fields, which would be
    effectively equivalent but with the overhead of tracking indices for each of the child
    fields. In that case, it might be better to always use a "sparse" union and let the
    compressor decide what to do.