rfcs/0057-extension-dtypes.md
Lines changed: 60 additions & 45 deletions
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@
6
6
7
7
## Summary
8
8
9
-
This RFC proposes a redesign of Vortex extension dtypes and extension arrays. Extension arrays should remain a fully type-erased semantic wrapper around a storage array, but their array encoding id should be the extension dtype id rather than the generic `vortex.ext` id. Scalar-function behavior for extensions should be implemented through session-registered scalar kernels, with helper APIs for common storage-delegation behavior, instead of ad hoc hooks on `ExtVTable` or special cases in every builtin scalar function.
9
+
This RFC proposes a redesign of Vortex extension dtypes and extension arrays. Extension arrays should remain a fully type-erased semantic wrapper around a storage array, but their array encoding id should be the extension dtype id rather than the generic `vortex.ext` id. Scalar-function behavior for extensions should be implemented through session-registered `execute_parent` kernels, with helper APIs for common storage-delegation behavior, instead of ad hoc hooks on `ExtVTable` or special cases in every builtin scalar function.
10
10
11
11
The proposal does not require a structural wire-format break. New readers should continue reading old `vortex.ext` arrays and should also read extension arrays encoded under their extension dtype id. A compatibility plugin should deserialize both forms into the same in-memory extension array representation.
12
12
@@ -22,9 +22,9 @@ Extensions are currently represented by a generic `vortex.ext` array encoding. T
22
22
23
23
The design goal is to make extension types first-class semantic wrappers while preserving Vortex's plugin model:
- Extension dtypes describe identity, metadata, storage dtype, validation, and whether the extension is a nominal newtype or storage-preserving refinement.
26
26
- Extension arrays wrap storage arrays and expose the extension id as their array encoding id.
27
-
- Scalar-function behavior is provided by session-registered kernels.
27
+
- Scalar-function behavior is provided by session-registered `execute_parent` kernels.
28
28
- Storage delegation is implemented as a reusable kernel helper, not as a required check in every scalar function.
29
29
30
30
## Design
@@ -102,7 +102,6 @@ Extension dtype vtables should expose a coarse classification:
102
102
```rust
 pub enum ExtensionKind {
     Newtype,
-    Domain,
     Refinement,
 }
@@ -113,54 +112,70 @@ pub trait ExtVTable {
 }
```
115
114
116
-
This is policy metadata, not an execution mechanism.
115
+
This is policy metadata, not a custom execution mechanism. It gives Vortex a conservative default for generated storage-delegate kernels.
117
116
118
-
`Newtype` means a nominal semantic type over storage. UUID over fixed bytes and UserId over `u64` are examples. The default policy should be conservative: do not assume storage operations have extension semantics.
117
+
`Newtype` means a nominal semantic type over storage. UUID over fixed bytes and UserId over `u64` are examples. The default policy is conservative: do not assume storage operations have extension semantics. Newtypes must register session `execute_parent` kernels or explicit storage-delegate kernels for operations they support.
119
118
120
-
`Domain` means a storage type plus constraints. PositiveInt over `i64` or Email over `Utf8` are examples. Operations may be storage-compatible, but results may need validation before being wrapped back into the extension type.
119
+
`Refinement` means the extension represents a subset or refinement of the storage type where storage equality and value identity are still the extension's equality and value identity. Utf8-over-Binary, non-empty-Utf8, and fixed-size-list-as-list are examples.
121
120
122
-
`Refinement` means a storage type plus a representational invariant. Utf8-over-Binary or fixed-size-list-as-list are examples. Operations that preserve existing values, such as filter, take, slice, and dictionary decode, usually preserve the refinement.
121
+
Refinements may get default generated storage-delegate kernels for operations that only observe or preserve existing values. Equality, inequality, hash, filter, take, slice, dictionary decode, and min/max are candidates when the storage operation has the same semantics. Transforming operations, such as arithmetic, casts into the refinement, string transforms, parsing, or functions that construct new values, still need explicit kernels or validation-aware wrapping.
123
122
124
-
The kind is useful for documentation, default validation policy, planner hints, and future diagnostics. It should not replace explicit scalar kernels.
123
+
The kind should not replace explicit session kernels. It is a default-policy input for the storage-delegate helper. If an extension's semantics differ from storage for a particular operation, the extension should be a `Newtype` or should avoid registering that default delegate.
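The default policy described above can be sketched as a small predicate. This is a minimal illustration, not the proposed API: `OpClass` and `default_delegate_allowed` are hypothetical names introduced here, and the two-way split of operations is an assumption drawn from the "observe or preserve" versus "transforming" distinction in the text.

```rust
/// Mirrors the `ExtensionKind` enum proposed in this RFC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtensionKind {
    Newtype,
    Refinement,
}

/// Hypothetical classification of operations, introduced only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpClass {
    /// Only observes or preserves existing values (filter, take, slice, equality).
    ValuePreserving,
    /// Constructs new values (arithmetic, casts into the type, parsing).
    Transforming,
}

/// Returns true when a default storage-delegate kernel may be generated.
/// Newtypes never get one by default; refinements get one only for
/// value-preserving operations.
pub fn default_delegate_allowed(kind: ExtensionKind, op: OpClass) -> bool {
    matches!(
        (kind, op),
        (ExtensionKind::Refinement, OpClass::ValuePreserving)
    )
}
```

Under this reading, a `Newtype` always has to register its kernels explicitly, which matches the conservative default stated above.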
125
124
126
-
### Scalar Function Kernels
125
+
### Session Execute-Parent Kernels
127
126
128
-
Vortex should add a session-level scalar kernel registry. This is the extension point for extension-authored scalar-function behavior and storage delegation.
127
+
Vortex should move `execute_parent` kernels into the session. This is the extension point for extension-authored scalar-function behavior and storage delegation.
128
+
129
+
This is not a new scalar-function execution path. Today many scalar functions already have operation-specific kernels, such as `CastKernel`, `CompareKernel`, `LikeKernel`, and `FillNullKernel`, that are adapted into `ExecuteParentKernel` so a child encoding can execute its `ScalarFnArray` parent. This RFC proposes moving those `execute_parent` kernels from static child-vtable registration into a session registry.
130
+
131
+
The registry should be keyed by parent id and child id:
129
132
130
133
```rust
-pub trait ScalarFnKernel: Send + Sync {
-    fn scalar_fn_id(&self) -> ScalarFnId;
+pub type ParentKernelKey = (Id, ArrayId);
+
+pub trait SessionExecuteParentKernel: Send + Sync {
+    fn parent_id(&self) -> Id;
+    fn child_id(&self) -> ArrayId;

-    fn execute(
+    fn execute_parent(
         &self,
-        scalar_fn: &ScalarFnRef,
-        args: &dyn ExecutionArgs,
+        child: &ArrayRef,
+        parent: &ArrayRef,
+        child_idx: usize,
         ctx: &mut ExecutionCtx,
     ) -> VortexResult<Option<ArrayRef>>;
 }
```
142
149
143
-
The registry is stored in the session:
150
+
The exact Rust signature can be refined during implementation. The important point is that the session stores erased `execute_parent` kernels. Existing typed `ExecuteParentKernel<V>` implementations can remain as an implementation convenience and be adapted into the erased session form.
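One way the typed-to-erased adaptation could look is sketched below. Everything here is a simplified assumption: `TypedKernel`, `ErasedKernel`, `Adapter`, `U64Child`, and `LenKernel` are hypothetical stand-ins, the child is modelled as `&dyn Any` rather than `ArrayRef`, and the result is a plain `Option<usize>` instead of `VortexResult<Option<ArrayRef>>`.

```rust
use std::any::Any;
use std::marker::PhantomData;

/// Typed kernel over a concrete child encoding `V` (stand-in for `ExecuteParentKernel<V>`).
pub trait TypedKernel<V> {
    fn execute_parent(&self, child: &V) -> Option<usize>;
}

/// Erased form that a session registry could store as a trait object.
pub trait ErasedKernel {
    fn execute_parent(&self, child: &dyn Any) -> Option<usize>;
}

/// Wraps a typed kernel and performs the downcast on its behalf.
pub struct Adapter<V, K> {
    kernel: K,
    _marker: PhantomData<fn(V)>,
}

impl<V, K> Adapter<V, K> {
    pub fn new(kernel: K) -> Self {
        Self { kernel, _marker: PhantomData }
    }
}

impl<V: 'static, K: TypedKernel<V>> ErasedKernel for Adapter<V, K> {
    fn execute_parent(&self, child: &dyn Any) -> Option<usize> {
        // A failed downcast means the kernel does not apply to this child.
        let typed = child.downcast_ref::<V>()?;
        self.kernel.execute_parent(typed)
    }
}

/// Example child encoding and kernel, used only to exercise the adapter.
pub struct U64Child(pub Vec<u64>);
pub struct LenKernel;

impl TypedKernel<U64Child> for LenKernel {
    fn execute_parent(&self, child: &U64Child) -> Option<usize> {
        Some(child.0.len())
    }
}
```

Returning `None` on a type mismatch mirrors the `Option` in the proposed signature, where `None` signals that the kernel declined to handle the parent.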
144
151
145
-
```rust
146
-
pub struct ScalarKernelSession {
147
-
    kernels: ScalarKernelRegistry,
148
-
}
152
+
Parent id lookup should follow these rules:
153
+
154
+
- For ordinary array parents, `parent_id = parent.encoding_id()`.
155
+
- For `ScalarFnArray` parents, `parent_id = parent.scalar_fn().id()`, not the generic scalar-function array id.
156
+
- The child id is always `child.encoding_id()`.
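The lookup rules above can be sketched as a key-building function. The `Parent` enum and `kernel_key` function are hypothetical simplifications for illustration, ids are modelled as plain strings, and the example ids used to exercise it (`example.eq`, `example.uuid`) are made up.

```rust
/// Hypothetical, simplified model of a parent array.
pub enum Parent {
    /// An ordinary array parent, identified by its encoding id.
    Array { encoding_id: &'static str },
    /// A `ScalarFnArray` parent, identified by the scalar function it wraps.
    ScalarFn { scalar_fn_id: &'static str },
}

/// Stand-in for `ParentKernelKey = (Id, ArrayId)`.
pub type ParentKernelKey = (&'static str, &'static str);

/// Builds the registry key for one (parent, child) pair.
pub fn kernel_key(parent: &Parent, child_encoding_id: &'static str) -> ParentKernelKey {
    let parent_id = match parent {
        // Ordinary parents are keyed by their own encoding id.
        Parent::Array { encoding_id } => *encoding_id,
        // Scalar-function parents are keyed by the function id,
        // not by the generic scalar-function array id.
        Parent::ScalarFn { scalar_fn_id } => *scalar_fn_id,
    };
    (parent_id, child_encoding_id)
}
```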
157
+
158
+
After this RFC, an extension array's child id is its extension dtype id. That means scalar-function extension behavior can be registered as ordinary parent kernels:
`ScalarFnArray::execute` should check the session scalar-kernel registry before calling the scalar function's default implementation:
166
+
Execution order should be:
152
167
153
168
```text
-1. Try exact/custom scalar kernels.
-2. Try generated storage-delegate kernels.
-3. Fall back to ScalarFnVTable::execute.
+1. For each child slot, try matching session execute_parent kernels.
+2. During migration, fall back to the child's static execute_parent implementation.
+3. If no parent kernel applies, execute the parent normally.
```
158
173
159
-
This centralizes extension dispatch. Individual builtin scalar functions do not all need to remember to check extension-specific flags.
174
+
This centralizes extension dispatch in the existing parent-kernel mechanism. Individual builtin scalar functions do not all need to remember to check extension-specific flags.
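The three-step order can be sketched as a dispatcher. This is one possible reading, not the proposed implementation: it assumes all session kernels are tried across every child slot before any static fallback, and that there is at most one kernel per key. `Session`, `Kernel`, `Array`, `negate`, and the closure parameters are hypothetical stand-ins.

```rust
use std::collections::HashMap;

/// Stand-in for an array; real kernels would operate on `ArrayRef`.
pub type Array = Vec<i64>;
/// A kernel returns `None` when it does not apply.
pub type Kernel = fn(&Array) -> Option<Array>;

/// Hypothetical session holding kernels keyed by (parent id, child id).
pub struct Session {
    pub kernels: HashMap<(String, String), Kernel>,
}

/// Example kernel used to exercise the dispatcher.
pub fn negate(child: &Array) -> Option<Array> {
    Some(child.iter().map(|v| -v).collect())
}

pub fn execute_parent(
    session: &Session,
    parent_id: &str,
    children: &[(String, Array)], // (child encoding id, child data)
    static_kernel: impl Fn(&str, &Array) -> Option<Array>,
    execute_normally: impl Fn() -> Array,
) -> Array {
    // 1. For each child slot, try a matching session kernel.
    for (child_id, child) in children {
        let key = (parent_id.to_string(), child_id.clone());
        if let Some(out) = session.kernels.get(&key).and_then(|kernel| (*kernel)(child)) {
            return out;
        }
    }
    // 2. During migration, fall back to the child's static implementation.
    for (child_id, child) in children {
        if let Some(out) = static_kernel(child_id.as_str(), child) {
            return out;
        }
    }
    // 3. No parent kernel applied; execute the parent normally.
    execute_normally()
}
```

Whether step 2 should instead be interleaved per child slot is left open by the text; the sketch picks the simpler two-pass form.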
160
175
161
176
### Custom Extension Kernels
162
177
163
-
Extensions that need custom semantics register scalar kernels during plugin initialization or default-session construction.
178
+
Extensions that need custom semantics register session `execute_parent` kernels during plugin initialization or default-session construction.
164
179
165
180
Examples:
166
181
@@ -176,13 +191,13 @@ This avoids putting compute behavior on `ExtVTable`.
176
191
177
192
### Storage-Delegate Kernel Helper
178
193
179
-
Many extension functions only need to delegate to storage. This should be easy to register, but still implemented as ordinary scalar kernels.
194
+
Many extension functions only need to delegate to storage. This should be easy to register, but still implemented as ordinary session `execute_parent` kernels.
180
195
181
-
Vortex should provide a helper/builder that creates scalar kernels:
196
+
Vortex should provide a helper/builder that creates session `execute_parent` kernels:
The exact API can be refined during implementation. The important properties are:
207
222
208
-
- it registers a scalar kernel in the session;
223
+
- it registers an `execute_parent` kernel in the session;
209
224
- it is not a method on `ExtVTable`;
210
225
- it does not require every scalar function to check a flag;
211
226
- it can express argument unwrapping, output wrapping, validation, and option matching.
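The properties listed above can be illustrated with a small builder. This is a sketch under stated assumptions, not the helper's actual API: `StorageDelegate`, `ExtArray`, `with_validation`, and the `example.positive_int` id used to exercise it are all hypothetical, and storage is modelled as `Vec<i64>`.

```rust
/// Hypothetical extension array: an extension id plus `i64` storage.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtArray {
    pub ext_id: &'static str,
    pub storage: Vec<i64>,
}

/// Hypothetical builder for a storage-delegate kernel.
pub struct StorageDelegate {
    ext_id: &'static str,
    validate: Option<fn(&[i64]) -> bool>,
}

impl StorageDelegate {
    pub fn new(ext_id: &'static str) -> Self {
        Self { ext_id, validate: None }
    }

    /// Optional validation applied before the result is wrapped again.
    pub fn with_validation(mut self, validate: fn(&[i64]) -> bool) -> Self {
        self.validate = Some(validate);
        self
    }

    /// Returns `Ok(None)` when the kernel does not apply to `input`.
    pub fn execute(
        &self,
        input: &ExtArray,
        storage_op: impl Fn(&[i64]) -> Vec<i64>,
    ) -> Result<Option<ExtArray>, String> {
        if input.ext_id != self.ext_id {
            return Ok(None);
        }
        // Unwrap the argument and run the operation on storage.
        let out = storage_op(&input.storage);
        // Validate the result if a policy was configured.
        if let Some(validate) = self.validate {
            if !validate(&out) {
                return Err(format!("result violates the invariant of {}", self.ext_id));
            }
        }
        // Wrap the storage result back into the extension type.
        Ok(Some(ExtArray { ext_id: self.ext_id, storage: out }))
    }
}
```

Keeping validation optional reflects the split in the text: value-preserving operations can skip it, while transforming operations need validation-aware wrapping.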
@@ -373,19 +388,19 @@ Forward compatibility depends on reader behavior:
373
388
Public Rust APIs will change around extension array construction and extension plugin registration. The migration path is:
374
389
375
390
- replace generic `vortex.ext` construction with `ExtensionArray::try_new(ext_dtype, storage)`;
376
-
- register extension scalar behavior as session scalar kernels;
391
+
- register extension scalar behavior as session `execute_parent` kernels;
377
392
- use storage-delegate helper kernels for common storage-transparent operations;
378
393
- use `extension_unwrap` and `extension_wrap` for explicit representation access.
379
394
380
-
Performance should improve for extension-specific dispatch because the array encoding id now carries the concrete extension id. There is some additional session-kernel lookup cost during scalar-function execution, but this is centralized and should be small compared to actual array execution.
395
+
Performance should improve for extension-specific dispatch because the array encoding id now carries the concrete extension id. There is some additional session-kernel lookup cost during parent-kernel execution, but this is centralized and should be small compared to actual array execution.
381
396
382
397
## Drawbacks
383
398
384
-
This adds a session scalar-kernel registry and a storage-delegate helper API. That is more machinery than direct methods on `ExtVTable`.
399
+
This adds a session `execute_parent` kernel registry and a storage-delegate helper API. That is more machinery than direct methods on `ExtVTable`.
385
400
386
401
The design also changes the meaning of extension array encoding ids. Although this is not a structural wire-format break, it requires compatibility behavior during serde and careful migration of tests and registry setup.
387
402
388
-
The storage-delegate helper must be expressive enough for common cases without becoming a second scalar-function implementation framework. Complex extension semantics should use custom scalar kernels instead of stretching the helper API.
403
+
The storage-delegate helper must be expressive enough for common cases without becoming a second scalar-function implementation framework. Complex extension semantics should use custom session `execute_parent` kernels instead of stretching the helper API.
389
404
390
405
## Alternatives
391
406
@@ -407,7 +422,7 @@ This looks simple but creates a bad contract. Every scalar function would need t
407
422
408
423
### Add `ExtVTable::execute_scalar_fn`
409
424
410
-
This makes the dtype vtable a compute engine and creates arbitration problems for multi-argument functions. For `binary(lhs_ext, rhs_ext)`, it is unclear whether the left extension, right extension, or scalar function owns execution. Session scalar kernels are a cleaner extension point.
425
+
This makes the dtype vtable a compute engine and creates arbitration problems for multi-argument functions. For `binary(lhs_ext, rhs_ext)`, it is unclear whether the left extension, right extension, or scalar function owns execution. Session `execute_parent` kernels are a cleaner extension point.
411
426
412
427
### Add `ExtVTable::register_storage_delegates`
413
428
@@ -417,16 +432,16 @@ This RFC explicitly rejects registration methods on `ExtVTable`. Registration sh
417
432
418
433
Apache Arrow extension types store a regular Arrow storage type plus extension metadata on the field. The storage array remains a normal Arrow array. Vortex should preserve this separation between logical extension type and physical storage representation while giving extensions better runtime dispatch. See the Arrow extension type documentation: <https://arrow.apache.org/docs/format/Columnar.html#extension-types>.
419
434
420
-
Postgres domains are base types with constraints. They are useful prior art for Vortex `ExtensionKind::Domain`. Postgres also has the concept of binary-coercible casts through `CREATE CAST ... WITHOUT FUNCTION`, where no conversion is required because the source and target have the same internal representation. That is related to storage delegation, but Vortex should express it through scalar kernels rather than a closed set of global flags. See <https://www.postgresql.org/docs/current/sql-createcast.html> and <https://www.postgresql.org/docs/current/sql-createdomain.html>.
435
+
Postgres domains are base types with constraints. They are useful prior art for refinement-like types, although this RFC does not model domains as a separate extension kind. Postgres also has the concept of binary-coercible casts through `CREATE CAST ... WITHOUT FUNCTION`, where no conversion is required because the source and target have the same internal representation. That is related to storage delegation, but Vortex should express it through registered kernels rather than a closed set of global flags. See <https://www.postgresql.org/docs/current/sql-createcast.html> and <https://www.postgresql.org/docs/current/sql-createdomain.html>.
421
436
422
-
DuckDB and Postgres both distinguish type identity from function/operator implementations. Operators and casts are registered behavior, not hard-coded methods on the type descriptor. Vortex should follow that separation by putting extension scalar behavior in session kernels.
437
+
DuckDB and Postgres both distinguish type identity from function/operator implementations. Operators and casts are registered behavior, not hard-coded methods on the type descriptor. Vortex should follow that separation by putting extension scalar behavior in session `execute_parent` kernels.
423
438
424
439
## Unresolved Questions
425
440
426
-
- What should the exact `ScalarFnKernel` trait look like?
427
-
- Should scalar kernels be ordered by registration order, specificity, or explicit priority?
428
-
- Should generated storage-delegate kernels be stored in the same registry as custom scalar kernels, or in a separate registry checked by the same dispatcher?
429
-
- How should scalar-kernel dispatch handle multi-extension arguments when multiple kernels match?
441
+
- What should the exact erased session `execute_parent` kernel trait look like?
442
+
- Should session `execute_parent` kernels be ordered by registration order, specificity, or explicit priority?
443
+
- Should generated storage-delegate kernels be stored in the same registry as custom session kernels, or in a separate registry checked by the same dispatcher?
444
+
- How should session `execute_parent` dispatch handle multi-extension arguments when multiple kernels match?
430
445
- What should the exact `extension_wrap` validation-policy API be?
431
446
- Should new writers default to extension-id encoding immediately, or should there be a transition period where `vortex.ext` remains the default?
432
447
- Which built-in extension dtypes should register storage-delegate kernels initially?
@@ -437,8 +452,8 @@ Adding `FixedSizeBinary` is also out of scope. It may be a good storage dtype fo
437
452
438
453
## Future Possibilities
439
454
440
-
The same session scalar-kernel mechanism can eventually replace more static `execute_parent` and `reduce_parent` implementations. The migration does not need to happen as part of this RFC.
455
+
The same session-kernel mechanism can eventually replace more static `execute_parent` implementations beyond scalar functions. Session `reduce_parent` already exists in a limited form; aligning both registries is a natural follow-on.
441
456
442
457
The extension descriptor could eventually include richer documentation metadata for external systems, such as Arrow extension mappings, SQL type names, and display/formatting preferences.
443
458
444
-
The storage-delegate helper may grow convenience constructors for common patterns such as equality-only newtypes, ordered domains, and value-preserving refinements.
459
+
The storage-delegate helper may grow convenience constructors for common patterns such as equality-only newtypes and value-preserving refinements.