| 1 | +- Start Date: (2026-02-27) |
| 2 | +- RFC PR: [vortex-data/rfcs#5](https://github.com/vortex-data/rfcs/pull/5) |
| 3 | +- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547) |
| 4 | + |
| 5 | +## Summary |
| 6 | + |
| 7 | +We would like to build a more robust system for extension data types (or `DType`s). This RFC |
| 8 | +proposes a direction for extending the `ExtVTable` trait to support richer behavior (beyond |
| 9 | +forwarding to the storage type), lays out the completed and in-progress work, and identifies the |
| 10 | +open questions that remain. |
| 11 | + |
| 12 | +## Motivation |
| 13 | + |
| 14 | +A limitation of the current type system in Vortex is that we cannot easily add new logical types. |
| 15 | +For example, the effort to add `FixedSizeList` |
| 16 | +([vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) and also change `List` to |
| 17 | +`ListView` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) was very intrusive. |
| 18 | +It is much easier to add wrappers around canonical types (treating the canonical dtype as a |
| 19 | +"storage type") and implement some additional logic than to add a new variant to the `DType` enum. |
| 20 | + |
| 21 | +### Storage DTypes |
| 22 | + |
| 23 | +Extension types work by wrapping an existing canonical `DType`, called the **storage dtype**. The |
| 24 | +storage dtype is itself a logical type (e.g., `Primitive`, `Struct`, `List`), and the extension |
| 25 | +type is a logical wrapper over it that layers on additional semantics such as validation, display |
| 26 | +formatting, and (eventually) custom compute logic. |
| 27 | + |
| 28 | +For example, a `Timestamp` extension type has a `Primitive` storage dtype. Under the hood, a |
| 29 | +timestamp array is just a primitive array of integers, but the extension layer knows that those |
| 30 | +integers represent a count of time units (for example, microseconds) since the Unix epoch. Similarly, a `Union` extension type might
| 31 | +use `Struct` as its storage dtype, wrapping a struct of fields with union-specific dispatch logic. |
| 32 | + |
| 33 | +This separation means that adding a new logical type does not require changes to the core canonical |
| 34 | +type system, the compressor, or the I/O layer. Extension types get compression for free because |
| 35 | +data is always read from and written to disk as the underlying storage dtype. |
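
As a rough illustration of the wrapping relationship described above, the following sketch uses simplified stand-in types. The `DType` enum, the `storage_dtype` method, and the `time_dtype` helper here are hypothetical and are not the real Vortex definitions; the sketch also assumes, for simplicity, that extension layers may nest.

```rust
// Simplified stand-ins for illustration only; these are NOT the real Vortex types.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Primitive,
    Struct(Vec<DType>),
    // An extension type: an ID plus the storage dtype it wraps.
    Extension { id: &'static str, storage: Box<DType> },
}

impl DType {
    /// Peel off extension layers until a non-extension (storage) dtype remains.
    /// In this simplified model, that is the dtype that would be written to disk.
    fn storage_dtype(&self) -> &DType {
        match self {
            DType::Extension { storage, .. } => storage.storage_dtype(),
            other => other,
        }
    }
}

/// Hypothetical helper: a time-like extension dtype over a primitive storage dtype.
fn time_dtype() -> DType {
    DType::Extension {
        id: "vortex.time",
        storage: Box::new(DType::Primitive),
    }
}
```

In this model, the compressor and I/O layer would only ever look at `storage_dtype()`, which is why a new extension type needs no changes there.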
| 36 | + |
| 37 | +### Current State |
| 38 | + |
| 39 | +Vortex provides an `Extension` variant of `DType` to help with this. Currently, implementors can |
| 40 | +add a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`) |
| 41 | +and specifying a storage dtype. For example, the time extension types use a primitive storage dtype, |
| 42 | +meaning they wrap the primitive scalars or primitive arrays with some extra logic on top (mostly
| 43 | +checking that the stored values are valid for the type).
| 44 | + |
| 45 | +We would like to add many more extension types. Some notable extension types (and their likely |
| 46 | +storage types) include: |
| 47 | + |
| 48 | +- **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond |
| 49 | + to levels of nesting. There are many open questions about its design, but they are out of
| 50 | + scope for this RFC.
| 51 | +- **Union**: A sum type (in the algebraic data type sense), like a Rust enum. One approach is to implement
| 52 | + this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`). |
| 53 | + Vortex is well suited to represent this because it can compress each of the type field arrays |
| 54 | + independently, so we do not need to distinguish between a "Sparse" or "Dense" Union. |
| 55 | +- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of |
| 56 | + scope for this RFC. |
| 57 | + |
| 58 | +The issue with the current system is that it only forwards logic to the underlying storage type. |
| 59 | +The only other behavior we support is serializing and pretty-printing extension arrays. This means |
| 60 | +that we cannot define custom compute logic for extension types. |
| 61 | + |
| 62 | +Take the time extension types as an example of where this limitation does not matter. If we want to |
| 63 | +run a `compare` expression over a timestamp array, we just run the `compare` over the underlying |
| 64 | +primitive array. For simple types like timestamps, this is sufficient (and this is what we do right |
| 65 | +now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also |
| 66 | +fine. |
| 67 | + |
| 68 | +However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely |
| 69 | +insufficient as these types need custom compute logic. Given that, we want a more robust |
| 70 | +implementation path instead of wrapping `ExtensionArray` and performing significant internal |
| 71 | +dispatch work. |
| 72 | + |
| 73 | +## Design |
| 74 | + |
| 75 | +### Background |
| 76 | + |
| 77 | +[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, |
| 78 | +or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g., `Timestamp`) |
| 79 | +now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. |
| 80 | +The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. |
| 81 | + |
| 82 | +There were a few blockers (detailed in the tracking issue |
| 83 | +[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), |
| 84 | +but now that those have been resolved, we can move forward. |
| 85 | + |
| 86 | +### Proposed Design |
| 87 | + |
| 88 | +Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place |
| 89 | +all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from |
| 90 | +`ExtDTypeVTable`). |
| 91 | + |
| 92 | +It will look something like the following: |
| 93 | + |
| 94 | +```rust |
| 95 | +// Note: naming should be considered unstable. |
| 96 | + |
| 97 | +/// The public API for defining new extension types. |
| 98 | +/// |
| 99 | +/// This is the non-object-safe trait that plugin authors implement to define a new extension |
| 100 | +/// type. It specifies the type's identity, metadata, serialization, and validation. |
| 101 | +pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash { |
| 102 | + /// Associated type containing the deserialized metadata for this extension type. |
| 103 | + type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash; |
| 104 | + |
| 105 | + /// A native Rust value that represents a scalar of the extension type. |
| 106 | + /// |
| 107 | + /// The value only represents non-null values. We denote nullable values as `Option<NativeValue>`.
| 108 | + type NativeValue<'a>: Display; |
| 109 | + |
| 110 | + /// Returns the ID for this extension type. |
| 111 | + fn id(&self) -> ExtId; |
| 112 | + |
| 113 | + // Methods related to the extension `DType`. |
| 114 | + |
| 115 | + /// Serialize the metadata into a byte vector. |
| 116 | + fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>; |
| 117 | + |
| 118 | + /// Deserialize the metadata from a byte slice. |
| 119 | + fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>; |
| 120 | + |
| 121 | + /// Validate that the given storage type is compatible with this extension type. |
| 122 | + fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>; |
| 123 | + |
| 124 | + // Methods related to the extension scalar values. |
| 125 | + |
| 126 | + /// Validate the given storage value is compatible with the extension type. |
| 127 | + /// |
| 128 | + /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the |
| 129 | + /// result. |
| 130 | + /// |
| 131 | + /// # Errors |
| 132 | + /// |
| 133 | + /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type. |
| 134 | + fn validate_scalar_value( |
| 135 | + &self, |
| 136 | + metadata: &Self::Metadata, |
| 137 | + storage_dtype: &DType, |
| 138 | + storage_value: &ScalarValue, |
| 139 | + ) -> VortexResult<()> { |
| 140 | + self.unpack_native(metadata, storage_dtype, storage_value) |
| 141 | + .map(|_| ()) |
| 142 | + } |
| 143 | + |
| 144 | + /// Validate and unpack a native value from the storage [`ScalarValue`]. |
| 145 | + /// |
| 146 | + /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage |
| 147 | + /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the |
| 148 | + /// storage value is compatible with the storage dtype on construction. |
| 149 | + /// |
| 150 | + /// # Errors |
| 151 | + /// |
| 152 | + /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type. |
| 153 | + fn unpack_native<'a>( |
| 154 | + &self, |
| 155 | + metadata: &'a Self::Metadata, |
| 156 | + storage_dtype: &'a DType, |
| 157 | + storage_value: &'a ScalarValue, |
| 158 | + ) -> VortexResult<Self::NativeValue<'a>>; |
| 159 | + |
| 160 | + // `ArrayRef` |
| 161 | + |
| 162 | + fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>; |
| 163 | + fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... } |
| 164 | + // Additional compute methods TBD. |
| 165 | +} |
| 166 | +``` |
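
To make the relationship between `unpack_native` and the default `validate_scalar_value` concrete, here is a minimal, self-contained sketch. Everything in it is a simplified stand-in: `MiniExtVTable`, `TimeUnit`, and `TimeOfDay` are hypothetical names, the storage value is a bare `i64` rather than a `ScalarValue`, and errors are plain `String`s rather than `VortexResult`.

```rust
// Simplified stand-ins for illustration only; not the real Vortex API.

/// Hypothetical metadata: the unit in which the storage integer is expressed.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Seconds,
    Milliseconds,
}

/// Trimmed-down analogue of the scalar-related half of `ExtVTable`.
trait MiniExtVTable {
    type Metadata;
    type NativeValue;

    /// Validate and unpack a native value from a raw storage value.
    fn unpack_native(
        &self,
        metadata: &Self::Metadata,
        storage_value: i64,
    ) -> Result<Self::NativeValue, String>;

    /// Default validation: unpack the value and discard the result.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_value: i64,
    ) -> Result<(), String> {
        self.unpack_native(metadata, storage_value).map(|_| ())
    }
}

/// Hypothetical time-of-day extension type over an integer storage value.
struct TimeOfDay;

impl MiniExtVTable for TimeOfDay {
    type Metadata = TimeUnit;
    /// (hours, minutes, seconds)
    type NativeValue = (u32, u32, u32);

    fn unpack_native(&self, unit: &TimeUnit, v: i64) -> Result<Self::NativeValue, String> {
        // Normalize to whole seconds; `div_euclid` floors, so negative inputs stay negative.
        let secs = match unit {
            TimeUnit::Seconds => v,
            TimeUnit::Milliseconds => v.div_euclid(1000),
        };
        if !(0..86_400).contains(&secs) {
            return Err(format!("{v} is out of range for a time of day"));
        }
        let secs = secs as u32;
        Ok((secs / 3600, (secs % 3600) / 60, secs % 60))
    }
}
```

An implementor only writes `unpack_native`; validation comes for free through the default method, and a type that can validate more cheaply than it can unpack could still override `validate_scalar_value`.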
| 167 | + |
| 168 | +Most of the implementation work will be making sure that `ExtDTypeRef` (which we pass around as the |
| 169 | +`Extension` variant of `DType`) has the correct methods that access the internal, type-erased |
| 170 | +`ExtVTable`. |
| 171 | + |
| 172 | +Take extension scalars as an example. The only behavior we need from extension scalars is validating |
| 173 | +that they have correct values, displaying them, and unpacking them into native types. So we added |
| 174 | +these methods to `ExtDTypeRef`: |
| 175 | + |
| 176 | +```rust |
| 177 | +impl ExtDTypeRef { |
| 178 | + /// Formats an extension scalar value using the current dtype for metadata context. |
| 179 | + pub fn fmt_storage_value<'a>( |
| 180 | + &'a self, |
| 181 | + f: &mut fmt::Formatter<'_>, |
| 182 | + storage_value: &'a ScalarValue, |
| 183 | + ) -> fmt::Result { ... } |
| 184 | + |
| 185 | + /// Validates that the given storage scalar value is valid for this dtype. |
| 186 | + pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... } |
| 187 | +} |
| 188 | +``` |
| 189 | + |
| 190 | +**Open question**: What should the API for extension arrays look like? The answer will determine |
| 191 | +what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above. |
| 192 | + |
| 193 | +## Compatibility |
| 194 | + |
| 195 | +This should not break anything because extension types are mostly related to in-memory APIs (since |
| 196 | +data is read from and written to disk as the storage type). |
| 197 | + |
| 198 | +## Drawbacks |
| 199 | + |
| 200 | +If forwarding to the storage type turns out to be sufficient for all extension types, the |
| 201 | +additional vtable surface area adds complexity without clear benefit. |
| 202 | + |
| 203 | +## Alternatives |
| 204 | + |
| 205 | +We could have many `ExtensionArray` wrappers with custom logic. This approach would be clunky,
| 206 | +since each wrapper must do its own internal dispatch, and may not scale as more extension types are added.
| 207 | + |
| 208 | +## Prior Art |
| 209 | + |
| 210 | +Apache Arrow allows defining |
| 211 | +[extension types](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types) |
| 212 | +and also provides a |
| 213 | +[set of canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html). |
| 214 | + |
| 215 | +## Unresolved Questions |
| 216 | + |
| 217 | +- For which extension types is forwarding to the storage type insufficient, i.e., which types
| 218 | + genuinely need custom compute logic?
| 219 | +- What should the `ExtVTable` API for extension arrays look like? What methods beyond |
| 220 | + `validate_array` are needed? |
| 221 | +- How should compute expressions be defined and dispatched for extension types? |
| 222 | + |
| 223 | +## Future Possibilities |
| 224 | + |
| 225 | +If we can get extension types working well, we can add all of the following types: |
| 226 | + |
| 227 | +- `DateTimeParts` (`Primitive`) |
| 228 | +- Matrix (`FixedSizeList`) |
| 229 | +- Tensor (`FixedSizeList`) |
| 230 | +- UUID (Do we need to add `FixedSizeBinary` as a canonical type?) |
| 231 | +- JSON (`Utf8`)
| 232 | +- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`) |
| 233 | +- Union |
| 234 | + - Sparse (`Struct { Primitive, Struct { types } }`) |
| 235 | + - Dense[^1] |
| 236 | +- Map (`List<Struct { K, V }>`) |
| 237 | +- Tags: See this |
| 238 | + [discussion](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892), |
| 239 | + where we think we can represent this with (`ListView<Utf8>`) |
| 240 | +- `Struct` but with protobuf-style field numbers (`Struct`) |
| 241 | +- **NOT** Variant: see [RFC 0015 (Variant Type)](../accepted/0015-variant-type.md). Variant cannot |
| 242 | + be an extension type because there is no way to define a storage dtype when the schema is not |
| 243 | + known ahead of time for each row. Instead, Variant will have its own `DType` variant. |
| 244 | +- And likely more. |
| 245 | + |
| 246 | +[^1]: |
| 247 | + `Struct` doesn't work here because children can have different lengths, but what we could do |
| 248 | + is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would |
| 249 | + effectively be the exact same but with the overhead of tracking indices for each of the child |
| 250 | + fields. In that case, it might just be better to always use a "sparse" union and let the |
| 251 | + compressor decide what to do. |