Commit 502453f

Update author
Signed-off-by: Nicholas Gates <nick@nickgates.com>
1 parent ca748ca commit 502453f

1 file changed

Lines changed: 60 additions & 45 deletions

rfcs/0057-extension-dtypes.md

@@ -6,7 +6,7 @@
66

77
## Summary
88

9-
This RFC proposes a redesign of Vortex extension dtypes and extension arrays. Extension arrays should remain a fully type-erased semantic wrapper around a storage array, but their array encoding id should be the extension dtype id rather than the generic `vortex.ext` id. Scalar-function behavior for extensions should be implemented through session-registered scalar kernels, with helper APIs for common storage-delegation behavior, instead of ad hoc hooks on `ExtVTable` or special cases in every builtin scalar function.
9+
This RFC proposes a redesign of Vortex extension dtypes and extension arrays. Extension arrays should remain a fully type-erased semantic wrapper around a storage array, but their array encoding id should be the extension dtype id rather than the generic `vortex.ext` id. Scalar-function behavior for extensions should be implemented through session-registered `execute_parent` kernels, with helper APIs for common storage-delegation behavior, instead of ad hoc hooks on `ExtVTable` or special cases in every builtin scalar function.
1010

1111
The proposal does not require a structural wire-format break. New readers should continue reading old `vortex.ext` arrays and should also read extension arrays encoded under their extension dtype id. A compatibility plugin should deserialize both forms into the same in-memory extension array representation.
1212

@@ -22,9 +22,9 @@ Extensions are currently represented by a generic `vortex.ext` array encoding. T
2222

2323
The design goal is to make extension types first-class semantic wrappers while preserving Vortex's plugin model:
2424

25-
- Extension dtypes describe identity, metadata, storage dtype, validation, and high-level kind.
25+
- Extension dtypes describe identity, metadata, storage dtype, validation, and whether the extension is a nominal newtype or storage-preserving refinement.
2626
- Extension arrays wrap storage arrays and expose the extension id as their array encoding id.
27-
- Scalar-function behavior is provided by session-registered kernels.
27+
- Scalar-function behavior is provided by session-registered `execute_parent` kernels.
2828
- Storage delegation is implemented as a reusable kernel helper, not as a required check in every scalar function.
2929

3030
## Design
@@ -102,7 +102,6 @@ Extension dtype vtables should expose a coarse classification:
102102
```rust
103103
pub enum ExtensionKind {
104104
Newtype,
105-
Domain,
106105
Refinement,
107106
}
108107

@@ -113,54 +112,70 @@ pub trait ExtVTable {
113112
}
114113
```
115114

116-
This is policy metadata, not an execution mechanism.
115+
This is policy metadata, not a custom execution mechanism. It gives Vortex a conservative default for generated storage-delegate kernels.
117116

118-
`Newtype` means a nominal semantic type over storage. UUID over fixed bytes and UserId over `u64` are examples. The default policy should be conservative: do not assume storage operations have extension semantics.
117+
`Newtype` means a nominal semantic type over storage. UUID over fixed bytes and UserId over `u64` are examples. The default policy is conservative: do not assume storage operations have extension semantics. Newtypes must register session `execute_parent` kernels or explicit storage-delegate kernels for operations they support.
119118

120-
`Domain` means a storage type plus constraints. PositiveInt over `i64` or Email over `Utf8` are examples. Operations may be storage-compatible, but results may need validation before being wrapped back into the extension type.
119+
`Refinement` means the extension represents a subset or refinement of the storage type where storage equality and value identity are still the extension's equality and value identity. Utf8-over-Binary, non-empty-Utf8, and fixed-size-list-as-list are examples.
121120

122-
`Refinement` means a storage type plus a representational invariant. Utf8-over-Binary or fixed-size-list-as-list are examples. Operations that preserve existing values, such as filter, take, slice, and dictionary decode, usually preserve the refinement.
121+
Refinements may get default generated storage-delegate kernels for operations that only observe or preserve existing values. Equality, inequality, hash, filter, take, slice, dictionary decode, and min/max are candidates when the storage operation has the same semantics. Transforming operations, such as arithmetic, casts into the refinement, string transforms, parsing, or functions that construct new values, still need explicit kernels or validation-aware wrapping.
123122

124-
The kind is useful for documentation, default validation policy, planner hints, and future diagnostics. It should not replace explicit scalar kernels.
123+
The kind should not replace explicit session kernels. It is a default-policy input for the storage-delegate helper. If an extension's semantics differ from storage for a particular operation, the extension should be a `Newtype` or should avoid registering that default delegate.
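
As an illustration of the default policy described in this section, here is a minimal sketch of how a storage-delegate helper might consult the kind. `OpClass` and `default_delegate_allowed` are hypothetical names introduced only for this example; the RFC does not define them, and the real helper may classify operations differently.

```rust
/// Coarse classification of an extension dtype, as proposed in the RFC.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ExtensionKind {
    Newtype,
    Refinement,
}

/// Hypothetical operation classes, assumed here for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OpClass {
    /// Observes or preserves existing values:
    /// equality, hash, filter, take, slice, dictionary decode, min/max.
    ValuePreserving,
    /// Constructs new values:
    /// arithmetic, casts into the extension, string transforms, parsing.
    Transforming,
}

/// Hypothetical default policy: may a storage-delegate kernel be generated
/// without the extension registering one explicitly?
///
/// Only value-preserving operations on refinements qualify. Newtypes and
/// transforming operations always need an explicit kernel.
pub fn default_delegate_allowed(kind: ExtensionKind, op: OpClass) -> bool {
    matches!(
        (kind, op),
        (ExtensionKind::Refinement, OpClass::ValuePreserving)
    )
}
```

Under this sketch, a `Refinement` such as Utf8-over-Binary would get a generated delegate for equality, while a `Newtype` such as UUID would have to register its equality kernel explicitly.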
125124

126-
### Scalar Function Kernels
125+
### Session Execute-Parent Kernels
127126

128-
Vortex should add a session-level scalar kernel registry. This is the extension point for extension-authored scalar-function behavior and storage delegation.
127+
Vortex should move `execute_parent` kernels into the session. This is the extension point for extension-authored scalar-function behavior and storage delegation.
128+
129+
This is not a new scalar-function execution path. Today many scalar functions already have operation-specific kernels, such as `CastKernel`, `CompareKernel`, `LikeKernel`, and `FillNullKernel`, that are adapted into `ExecuteParentKernel` so a child encoding can execute its `ScalarFnArray` parent. This RFC proposes moving those `execute_parent` kernels from static child-vtable registration into a session registry.
130+
131+
The registry should be keyed by parent id and child id:
129132

130133
```rust
131-
pub trait ScalarFnKernel: Send + Sync {
132-
fn scalar_fn_id(&self) -> ScalarFnId;
134+
pub type ParentKernelKey = (Id, ArrayId);
135+
136+
pub trait SessionExecuteParentKernel: Send + Sync {
137+
fn parent_id(&self) -> Id;
138+
fn child_id(&self) -> ArrayId;
133139

134-
fn execute(
140+
fn execute_parent(
135141
&self,
136-
scalar_fn: &ScalarFnRef,
137-
args: &dyn ExecutionArgs,
142+
child: &ArrayRef,
143+
parent: &ArrayRef,
144+
child_idx: usize,
138145
ctx: &mut ExecutionCtx,
139146
) -> VortexResult<Option<ArrayRef>>;
140147
}
141148
```
142149

143-
The registry is stored in the session:
150+
The exact Rust signature can be refined during implementation. The important point is that the session stores erased `execute_parent` kernels. Existing typed `ExecuteParentKernel<V>` implementations can remain as an implementation convenience and be adapted into the erased session form.
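
One possible shape for the erased session registry is sketched below. It is not the Vortex implementation: `Array`, `ErasedKernel`, `FnKernel`, and `KernelRegistry` are stand-in names assumed for this example, ids are modelled as plain strings, and the `ExecutionCtx` and `VortexResult` parts of the real signature are left out to keep the sketch self-contained.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-ins for the Vortex id and array types (assumptions, not real definitions).
pub type Id = &'static str;
pub type ArrayId = &'static str;

#[derive(Clone, Debug, PartialEq)]
pub struct Array {
    pub encoding_id: ArrayId,
    pub values: Vec<i64>,
}

/// Erased kernel: returns `Some(result)` if it executed the parent on behalf
/// of the child, or `None` to decline so that dispatch can continue.
pub trait ErasedKernel: Send + Sync {
    fn execute_parent(&self, child: &Array, parent: &Array, child_idx: usize) -> Option<Array>;
}

/// Adapter that turns a plain function or closure into an erased kernel,
/// in the spirit of adapting typed kernels into the erased session form.
pub struct FnKernel<F>(pub F);

impl<F> ErasedKernel for FnKernel<F>
where
    F: Fn(&Array, &Array, usize) -> Option<Array> + Send + Sync,
{
    fn execute_parent(&self, child: &Array, parent: &Array, child_idx: usize) -> Option<Array> {
        (self.0)(child, parent, child_idx)
    }
}

/// Registry keyed by `(parent_id, child_id)`.
#[derive(Default)]
pub struct KernelRegistry {
    kernels: HashMap<(Id, ArrayId), Vec<Arc<dyn ErasedKernel>>>,
}

impl KernelRegistry {
    /// Appends a kernel under the given key.
    pub fn register(&mut self, parent_id: Id, child_id: ArrayId, kernel: Arc<dyn ErasedKernel>) {
        self.kernels
            .entry((parent_id, child_id))
            .or_default()
            .push(kernel);
    }

    /// Returns matching kernels in registration order. The ordering policy is
    /// an open question in the RFC; registration order is assumed here.
    pub fn lookup(&self, parent_id: Id, child_id: ArrayId) -> &[Arc<dyn ErasedKernel>] {
        self.kernels
            .get(&(parent_id, child_id))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Keeping a `Vec` per key leaves room for several kernels to match the same pair, which is where the ordering question listed under Unresolved Questions would come into play.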
144151

145-
```rust
146-
pub struct ScalarKernelSession {
147-
kernels: ScalarKernelRegistry,
148-
}
152+
Parent id lookup should follow these rules:
153+
154+
- For ordinary array parents, `parent_id = parent.encoding_id()`.
155+
- For `ScalarFnArray` parents, `parent_id = parent.scalar_fn().id()`, not the generic scalar-function array id.
156+
- The child id is always `child.encoding_id()`.
157+
158+
After this RFC, an extension array's child id is its extension dtype id. That means scalar-function extension behavior can be registered as ordinary parent kernels:
159+
160+
```text
161+
(parent_id = vortex.binary, child_id = vortex.uuid)
162+
(parent_id = vortex.cast, child_id = vortex.timestamp)
163+
(parent_id = vortex.get_item, child_id = vortex.json)
149164
```
150165

151-
`ScalarFnArray::execute` should check the session scalar-kernel registry before calling the scalar function's default implementation:
166+
Execution order should be:
152167

153168
```text
154-
1. Try exact/custom scalar kernels.
155-
2. Try generated storage-delegate kernels.
156-
3. Fall back to ScalarFnVTable::execute.
169+
1. For each child slot, try matching session execute_parent kernels.
170+
2. During migration, fall back to the child's static execute_parent implementation.
171+
3. If no parent kernel applies, execute the parent normally.
157172
```
158173

159-
This centralizes extension dispatch. Individual builtin scalar functions do not all need to remember to check extension-specific flags.
174+
This centralizes extension dispatch in the existing parent-kernel mechanism. Individual builtin scalar functions do not all need to remember to check extension-specific flags.
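
The parent-id rules and the three-step execution order can be sketched as follows. This is a toy model under stated assumptions, not the proposed implementation: `Node`, `Session`, `Kernel`, `parent_id`, and `execute` are hypothetical names, results are represented as strings, and the sketch assumes that every child slot is tried against session kernels before any static fallback is consulted, which the RFC does not specify.

```rust
// Stand-in for an array node (an assumption, not the Vortex type).
#[derive(Clone, Debug, PartialEq)]
pub struct Node {
    pub encoding_id: &'static str,
    /// Set only when the node models a `ScalarFnArray` parent.
    pub scalar_fn_id: Option<&'static str>,
    pub children: Vec<Node>,
}

/// A kernel returns `Some(description)` if it handled the parent, else `None`.
pub type Kernel = fn(&Node, &Node, usize) -> Option<String>;

pub struct Session {
    /// `((parent_id, child_id), kernel)` pairs in registration order.
    pub kernels: Vec<((&'static str, &'static str), Kernel)>,
}

/// Scalar-function parents are keyed by the scalar function id;
/// ordinary parents are keyed by their encoding id.
pub fn parent_id(parent: &Node) -> &'static str {
    parent.scalar_fn_id.unwrap_or(parent.encoding_id)
}

/// Example session kernel: handles any (parent, child) pair it is registered for.
pub fn uuid_binary_kernel(child: &Node, parent: &Node, idx: usize) -> Option<String> {
    Some(format!("session:{}({}@{})", parent_id(parent), child.encoding_id, idx))
}

/// Example static fallback that never applies.
pub fn no_static_kernel(_child: &Node, _parent: &Node, _idx: usize) -> Option<String> {
    None
}

pub fn execute(session: &Session, parent: &Node, static_fallback: Kernel) -> String {
    let pid = parent_id(parent);

    // 1. For each child slot, try matching session execute_parent kernels.
    for (idx, child) in parent.children.iter().enumerate() {
        for (key, kernel) in &session.kernels {
            if *key == (pid, child.encoding_id) {
                if let Some(out) = kernel(child, parent, idx) {
                    return out;
                }
            }
        }
    }

    // 2. During migration, fall back to the child's static implementation.
    for (idx, child) in parent.children.iter().enumerate() {
        if let Some(out) = static_fallback(child, parent, idx) {
            return out;
        }
    }

    // 3. No parent kernel applied: execute the parent normally.
    format!("default:{}", pid)
}
```

In this model a kernel registered under `("vortex.binary", "vortex.uuid")` fires for a binary scalar-function parent whose child is a UUID extension array, while a `vortex.cast` parent over the same child falls through to the default path because no kernel is registered for that pair.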
160175

161176
### Custom Extension Kernels
162177

163-
Extensions that need custom semantics register scalar kernels during plugin initialization or default-session construction.
178+
Extensions that need custom semantics register session `execute_parent` kernels during plugin initialization or default-session construction.
164179

165180
Examples:
166181

@@ -176,13 +191,13 @@ This avoids putting compute behavior on `ExtVTable`.
176191

177192
### Storage-Delegate Kernel Helper
178193

179-
Many extension functions only need to delegate to storage. This should be easy to register, but still implemented as ordinary scalar kernels.
194+
Many extension functions only need to delegate to storage. This should be easy to register, but still implemented as ordinary session `execute_parent` kernels.
180195

181-
Vortex should provide a helper/builder that creates scalar kernels:
196+
Vortex should provide a helper/builder that creates session `execute_parent` kernels:
182197

183198
```rust
184-
session.scalar_kernels().register(
185-
StorageDelegateKernel::new(Binary.id())
199+
session.execute_parent_kernels().register(
200+
StorageDelegateExecuteParentKernel::new(Binary.id())
186201
.for_extension(Uuid.id())
187202
.when_options(|options| matches_binary_operator(options, [Eq, NotEq]))
188203
.unwrap_args([0, 1])
@@ -205,7 +220,7 @@ register_storage_delegate(
205220

206221
The exact API can be refined during implementation. The important properties are:
207222

208-
- it registers a scalar kernel in the session;
223+
- it registers an `execute_parent` kernel in the session;
209224
- it is not a method on `ExtVTable`;
210225
- it does not require every scalar function to check a flag;
211226
- it can express argument unwrapping, output wrapping, validation, and option matching.
@@ -373,19 +388,19 @@ Forward compatibility depends on reader behavior:
373388
Public Rust APIs will change around extension array construction and extension plugin registration. The migration path is:
374389

375390
- replace generic `vortex.ext` construction with `ExtensionArray::try_new(ext_dtype, storage)`;
376-
- register extension scalar behavior as session scalar kernels;
391+
- register extension scalar behavior as session `execute_parent` kernels;
377392
- use storage-delegate helper kernels for common storage-transparent operations;
378393
- use `extension_unwrap` and `extension_wrap` for explicit representation access.
379394

380-
Performance should improve for extension-specific dispatch because the array encoding id now carries the concrete extension id. There is some additional session-kernel lookup cost during scalar-function execution, but this is centralized and should be small compared to actual array execution.
395+
Performance should improve for extension-specific dispatch because the array encoding id now carries the concrete extension id. There is some additional session-kernel lookup cost during parent-kernel execution, but this is centralized and should be small compared to actual array execution.
381396

382397
## Drawbacks
383398

384-
This adds a session scalar-kernel registry and a storage-delegate helper API. That is more machinery than direct methods on `ExtVTable`.
399+
This adds a session `execute_parent` kernel registry and a storage-delegate helper API. That is more machinery than direct methods on `ExtVTable`.
385400

386401
The design also changes the meaning of extension array encoding ids. Although this is not a structural wire-format break, it requires compatibility behavior during serde and careful migration of tests and registry setup.
387402

388-
The storage-delegate helper must be expressive enough for common cases without becoming a second scalar-function implementation framework. Complex extension semantics should use custom scalar kernels instead of stretching the helper API.
403+
The storage-delegate helper must be expressive enough for common cases without becoming a second scalar-function implementation framework. Complex extension semantics should use custom session `execute_parent` kernels instead of stretching the helper API.
389404

390405
## Alternatives
391406

@@ -407,7 +422,7 @@ This looks simple but creates a bad contract. Every scalar function would need t
407422

408423
### Add `ExtVTable::execute_scalar_fn`
409424

410-
This makes the dtype vtable a compute engine and creates arbitration problems for multi-argument functions. For `binary(lhs_ext, rhs_ext)`, it is unclear whether the left extension, right extension, or scalar function owns execution. Session scalar kernels are a cleaner extension point.
425+
This makes the dtype vtable a compute engine and creates arbitration problems for multi-argument functions. For `binary(lhs_ext, rhs_ext)`, it is unclear whether the left extension, right extension, or scalar function owns execution. Session `execute_parent` kernels are a cleaner extension point.
411426

412427
### Add `ExtVTable::register_storage_delegates`
413428

@@ -417,16 +432,16 @@ This RFC explicitly rejects registration methods on `ExtVTable`. Registration sh
417432

418433
Apache Arrow extension types store a regular Arrow storage type plus extension metadata on the field. The storage array remains a normal Arrow array. Vortex should preserve this separation between logical extension type and physical storage representation while giving extensions better runtime dispatch. See the Arrow extension type documentation: <https://arrow.apache.org/docs/format/Columnar.html#extension-types>.
419434

420-
Postgres domains are base types with constraints. They are useful prior art for Vortex `ExtensionKind::Domain`. Postgres also has the concept of binary-coercible casts through `CREATE CAST ... WITHOUT FUNCTION`, where no conversion is required because the source and target have the same internal representation. That is related to storage delegation, but Vortex should express it through scalar kernels rather than a closed set of global flags. See <https://www.postgresql.org/docs/current/sql-createcast.html> and <https://www.postgresql.org/docs/current/sql-createdomain.html>.
435+
Postgres domains are base types with constraints. They are useful prior art for refinement-like types, although this RFC does not model domains as a separate extension kind. Postgres also has the concept of binary-coercible casts through `CREATE CAST ... WITHOUT FUNCTION`, where no conversion is required because the source and target have the same internal representation. That is related to storage delegation, but Vortex should express it through registered kernels rather than a closed set of global flags. See <https://www.postgresql.org/docs/current/sql-createcast.html> and <https://www.postgresql.org/docs/current/sql-createdomain.html>.
421436

422-
DuckDB and Postgres both distinguish type identity from function/operator implementations. Operators and casts are registered behavior, not hard-coded methods on the type descriptor. Vortex should follow that separation by putting extension scalar behavior in session kernels.
437+
DuckDB and Postgres both distinguish type identity from function/operator implementations. Operators and casts are registered behavior, not hard-coded methods on the type descriptor. Vortex should follow that separation by putting extension scalar behavior in session `execute_parent` kernels.
423438

424439
## Unresolved Questions
425440

426-
- What should the exact `ScalarFnKernel` trait look like?
427-
- Should scalar kernels be ordered by registration order, specificity, or explicit priority?
428-
- Should generated storage-delegate kernels be stored in the same registry as custom scalar kernels, or in a separate registry checked by the same dispatcher?
429-
- How should scalar-kernel dispatch handle multi-extension arguments when multiple kernels match?
441+
- What should the exact erased session `execute_parent` kernel trait look like?
442+
- Should session `execute_parent` kernels be ordered by registration order, specificity, or explicit priority?
443+
- Should generated storage-delegate kernels be stored in the same registry as custom session kernels, or in a separate registry checked by the same dispatcher?
444+
- How should session `execute_parent` dispatch handle multi-extension arguments when multiple kernels match?
430445
- What should the exact `extension_wrap` validation-policy API be?
431446
- Should new writers default to extension-id encoding immediately, or should there be a transition period where `vortex.ext` remains the default?
432447
- Which built-in extension dtypes should register storage-delegate kernels initially?
@@ -437,8 +452,8 @@ Adding `FixedSizeBinary` is also out of scope. It may be a good storage dtype fo
437452

438453
## Future Possibilities
439454

440-
The same session scalar-kernel mechanism can eventually replace more static `execute_parent` and `reduce_parent` implementations. The migration does not need to happen as part of this RFC.
455+
The same session-kernel mechanism can eventually replace more static `execute_parent` implementations beyond scalar functions. Session `reduce_parent` already exists in a limited form; aligning both registries is a natural follow-on.
441456

442457
The extension descriptor could eventually include richer documentation metadata for external systems, such as Arrow extension mappings, SQL type names, and display/formatting preferences.
443458

444-
The storage-delegate helper may grow convenience constructors for common patterns such as equality-only newtypes, ordered domains, and value-preserving refinements.
459+
The storage-delegate helper may grow convenience constructors for common patterns such as equality-only newtypes and value-preserving refinements.
