pluggable registry for import/export arrow kernels #7824
Conversation
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Merging this PR will degrade performance by 10.84%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | canonicalize_compare[(1000, 4, 4)] | 122 µs | 136.8 µs | -10.84% |

Comparing aduffy/arrow-vtable (bf447fc) with develop (e160125)
Footnotes
- 523 benchmarks were skipped, so the baseline results were used instead. Benchmarks that were deleted from the codebase can be archived to remove them from the performance reports.
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Polar Signals Profiling Results: Latest Run
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.981x ➖
datafusion / vortex-file-compressed (0.981x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.027x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.015x ➖, 0↑ 0↓)
datafusion / parquet (0.994x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.022x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.000x ➖, 0↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 1↓)
Full attributed analysis
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.996x ➖, 0↑ 0↓)
datafusion / parquet (1.000x ➖, 1↑ 1↓)
datafusion / arrow (1.022x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.005x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.988x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 1↑ 0↓)
duckdb / duckdb (0.991x ➖, 0↑ 1↓)
Full attributed analysis
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.108x ❌, 0↑ 55↓)
datafusion / vortex-compact (1.094x ➖, 1↑ 42↓)
datafusion / parquet (1.099x ➖, 1↑ 50↓)
duckdb / vortex-file-compressed (1.096x ➖, 1↑ 47↓)
duckdb / vortex-compact (1.084x ➖, 0↑ 39↓)
duckdb / parquet (1.055x ➖, 0↑ 14↓)
duckdb / duckdb (1.106x ❌, 0↑ 56↓)
Full attributed analysis
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.960x ➖, 0↑ 0↓)
datafusion / parquet (0.922x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.007x ➖, 0↑ 0↓)
duckdb / parquet (0.989x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (1.002x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.020x ➖, 0↑ 0↓)
duckdb / parquet (1.033x ➖, 0↑ 0↓)
Full attributed analysis
File Sizes: Statistical and Population Genetics
No file size changes detected.

Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.006x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.014x ➖, 0↑ 0↓)
datafusion / parquet (0.988x ➖, 0↑ 0↓)
datafusion / arrow (1.126x ❌, 0↑ 14↓)
duckdb / vortex-file-compressed (1.056x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.035x ➖, 0↑ 1↓)
duckdb / parquet (1.013x ➖, 0↑ 0↓)
duckdb / duckdb (1.068x ➖, 0↑ 5↓)
Full attributed analysis
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.

Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.073x ➖, 0↑ 4↓)
datafusion / parquet (1.055x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.089x ➖, 0↑ 11↓)
duckdb / parquet (1.022x ➖, 0↑ 3↓)
duckdb / duckdb (1.026x ➖, 0↑ 2↓)
Full attributed analysis
File Sizes: Clickbench on NVME
File Size Changes (1 file changed, -0.0% overall, 0↑ 1↓)

Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.954x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.975x ➖, 1↑ 0↓)
datafusion / parquet (0.994x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.029x ➖, 0↑ 0↓)
duckdb / parquet (1.010x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.993x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.996x ➖, 1↑ 2↓)
datafusion / parquet (0.973x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.000x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.020x ➖, 0↑ 1↓)
duckdb / parquet (1.038x ➖, 0↑ 0↓)
Full attributed analysis
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Perf seems within margin of error? I still need to fix up variant handling before we can merge.
```rust
    .elements()
    .clone()
    .execute_arrow(Some(elements_field.data_type()), ctx)?;
let elements = ctx.session().clone().arrow().execute_arrow(
```
`ctx.session().clone().arrow()`: how expensive is this?
i think this is unnecessary
actually it is necessary b/c `ExecutionCtx` is held mutably and we need an immutable ref.
this is just an Arc clone so should be cheap compared to execute_arrow
not sure I have an idea of a cleaner way to do this
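To illustrate why the clone is cheap, here is a std-only sketch of the borrow situation (the type names below are hypothetical stand-ins, not the real `ExecutionCtx`/`ArrowSession` from vortex-array): cloning an `Arc` is a refcount bump, and holding the cloned handle frees the context for mutable use.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the session types in the snippet above.
struct ArrowSession;
struct Session { arrow: Arc<ArrowSession> }
struct ExecutionCtx { session: Session, scratch: Vec<u8> }

impl ExecutionCtx {
    fn session(&self) -> &Session { &self.session }
}

fn main() {
    let mut ctx = ExecutionCtx {
        session: Session { arrow: Arc::new(ArrowSession) },
        scratch: Vec::new(),
    };
    // Cloning the Arc is a refcount bump, not a deep copy: we now hold an
    // owned handle to the session while `ctx` stays free for mutable use.
    let arrow: Arc<ArrowSession> = ctx.session().arrow.clone();
    ctx.scratch.push(1); // mutable use of ctx no longer conflicts
    assert_eq!(Arc::strong_count(&arrow), 2); // shared, not copied
}
```

Either way, the refcount bump is constant-time and negligible next to the `execute_arrow` call it enables.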
```rust
pub enum ArrowExport {
    /// The plugin does not handle this input; the session may try another plugin.
    Unsupported(ArrayRef),
    /// A successful export.
    Exported(ArrowArrayRef),
}
```
```rust
/// Outcome of a successful call to [`ArrowImportVTable::from_arrow_array`].
///
/// Plugins that don't handle the supplied array return [`Unsupported`][Self::Unsupported]
/// with ownership of the input so the session can probe the next plugin or fall back to the
/// canonical path. Errors are propagated through [`VortexResult`].
pub enum ArrowImport {
    /// The plugin does not handle this input; the session may try another plugin.
    Unsupported(ArrowArrayRef),
    /// A successful import.
    Imported(ArrayRef),
}
```
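To make the probing contract concrete, here is a std-only sketch of how a session might walk registered import plugins. The `ArrayRef`/`ArrowArrayRef` stubs and the `import` helper are illustrative stand-ins, not the vortex-array API; the point is that `Unsupported` hands ownership back so the next plugin (or the canonical fallback) can take over.

```rust
#[derive(Debug, PartialEq)]
struct ArrowArrayRef(&'static str); // stub: carries only the extension name
#[derive(Debug, PartialEq)]
struct ArrayRef(&'static str); // stub Vortex array

enum ArrowImport {
    Unsupported(ArrowArrayRef), // input handed back for the next plugin
    Imported(ArrayRef),
}

trait ArrowImportVTable {
    fn from_arrow_array(&self, array: ArrowArrayRef) -> ArrowImport;
}

struct UuidPlugin;
impl ArrowImportVTable for UuidPlugin {
    fn from_arrow_array(&self, array: ArrowArrayRef) -> ArrowImport {
        if array.0 == "arrow.uuid" {
            ArrowImport::Imported(ArrayRef("vortex.uuid"))
        } else {
            ArrowImport::Unsupported(array) // defer, returning ownership
        }
    }
}

/// Probe plugins in registration order; the first successful import wins,
/// otherwise fall through to the canonical conversion.
fn import(plugins: &[Box<dyn ArrowImportVTable>], mut array: ArrowArrayRef) -> ArrayRef {
    for plugin in plugins {
        match plugin.from_arrow_array(array) {
            ArrowImport::Imported(out) => return out,
            ArrowImport::Unsupported(back) => array = back,
        }
    }
    ArrayRef("canonical") // canonical fallback path
}

fn main() {
    let plugins: Vec<Box<dyn ArrowImportVTable>> = vec![Box::new(UuidPlugin)];
    assert_eq!(import(&plugins, ArrowArrayRef("arrow.uuid")), ArrayRef("vortex.uuid"));
    assert_eq!(import(&plugins, ArrowArrayRef("arrow.json")), ArrayRef("canonical"));
}
```

Returning the input inside `Unsupported` (rather than borrowing it) is what lets several plugins register against the same key without cloning the array between probes.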
i guess probably not, i just like giving them meaningful names
they return different things in the success enum
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Register `Vector` as an Arrow import/export plugin on the new `ArrowSession`, so a Vortex Vector extension array can go to Arrow and back without losing its identity. Note: If the plugin is not registered, imports fall back to the plain FSL (no error, but the Vector identity is dropped). --------- Signed-off-by: Baris Palaska <barispalaska@gmail.com>
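The registered-vs-fallback behavior described above can be sketched with a std-only lookup keyed by extension name. Everything here (the `ArrowSession` struct, `register`, `import`, and the string encodings) is a hypothetical stand-in for the real session registry, which dispatches on the `ARROW:extension:name` carried by the incoming field.

```rust
use std::collections::HashMap;

// Stub conversion: returns the name of the imported Vortex encoding.
type ImportFn = fn() -> &'static str;

/// Hypothetical sketch of the session registry: import plugins are looked up
/// by extension name; when nothing is registered the array still imports,
/// but only as its plain storage type (the extension identity is dropped).
struct ArrowSession {
    importers: HashMap<&'static str, ImportFn>,
}

impl ArrowSession {
    fn new() -> Self {
        Self { importers: HashMap::new() }
    }
    fn register(&mut self, extension_name: &'static str, f: ImportFn) {
        self.importers.insert(extension_name, f);
    }
    fn import(&self, extension_name: &'static str, storage: &'static str) -> &'static str {
        match self.importers.get(extension_name) {
            Some(f) => f(),  // plugin preserves the extension identity
            None => storage, // fallback: identity dropped, storage survives
        }
    }
}

fn main() {
    let mut session = ArrowSession::new();
    // Without registration, a Vector comes back as its plain FSL storage.
    assert_eq!(session.import("vortex.vector", "FixedSizeList<f32>"), "FixedSizeList<f32>");
    session.register("vortex.vector", || "vortex.vector");
    assert_eq!(session.import("vortex.vector", "FixedSizeList<f32>"), "vortex.vector");
}
```

This mirrors the "no error, but the Vector identity is dropped" note: the fallback is lossy about identity, never about data.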
@palaska I just noticed, for vortex-tensor
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
There's some fundamental weirdness with Parquet Variant. Variant itself is the only canonical DType that does not have an obvious Arrow encoding. The closest is the `arrow.parquet.variant` extension type.

So this leads us to an interesting situation where we need to implement `to_arrow_field` for Variant. According to the Arrow spec, this field type must be a STRUCT with a required binary `metadata` field.

We don't capture information about shredding in the system, so we don't know here what fields our returned struct field should have...
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
No very good reason tbh, I guess Vector is a special case of FixedShapeTensor and we could have used it, but since Vortex already has the distinction between those two, I thought we could mirror it. I was planning to have something like this in Spiral. Lmk if you prefer otherwise, I can put up a PR! |
Summary

Adds a pluggable `ArrowSession` registry on `VortexSession` for round-tripping Vortex extension types in and out of Arrow extension types. Unblocks the Arrow round-trip for `arrow.uuid` today, with `arrow.parquet.variant`, GeoArrow, and tensor types as the next consumers. Part of #7686.
API changes

The session exposes two trait-driven plugin slots:

- `ArrowExportVTable` — dispatched by the target Arrow extension name (`ARROW:extension:name`). Implementations turn a Vortex `ArrayRef` into an `ArrowArrayRef` shaped to the requested `Field`. Also provides `to_arrow_field` for schema inference when only a Vortex `DType` is in hand.
- `ArrowImportVTable` — dispatched by the source Arrow extension name carried on the incoming `Field`. Implementations turn an `ArrowArrayRef` back into a Vortex `ArrayRef`, including any storage re-encoding (e.g. `FixedSizeBinary[16]` → `FixedSizeList<u8; 16>` for UUID).

Both traits return `Unsupported(input)` to defer to the next plugin or to the canonical fallback, so multiple plugins can register against the same key and probe in order.

New session entry points (`vortex-array/src/arrow/session.rs`):

- `ArrowSession::to_arrow_field` / `to_arrow_schema` — Vortex `DType` → Arrow `Field`/`Schema`, recursing into containers so nested extension fields go through the registered plugin.
- `ArrowSession::from_arrow_field` / `from_arrow_schema` — the inverse direction, plugin-aware.
- `ArrowSession::from_arrow_record_batch` / `execute_record_batch` — `RecordBatch` round-trip.
- `ArrowSessionExt` — an extension trait so any `SessionExt` can call `session.arrow().…`.

The default session pre-registers the builtin UUID plugin (`vortex-array/src/extension/uuid/arrow.rs`).

What's not in the plugin layer
`Date`, `Time`, and `Timestamp` are Vortex builtin extensions that map directly to native Arrow temporal types, so they continue to go through the canonical executor (`vortex-array/src/arrow/executor/temporal.rs`) rather than the plugin registry. The plugin layer is reserved for Arrow extension types that the canonical path can't express.

DataFusion wiring
`vortex-datafusion` now goes through the session for schema/array conversion:

- `convert/schema.rs::calculate_physical_schema` uses `ArrowSession::to_arrow_field` so extension metadata survives projection.
- `persistent/format.rs` and `persistent/opener.rs` route schema inference through the session.
- `persistent/sink.rs` uses `from_arrow_record_batch`, passing the original schema separately from `RecordBatch::schema()` to preserve the `ARROW:extension:name` metadata that DataFusion strips at runtime.

Tests
Two new end-to-end tests in `vortex-datafusion/src/persistent/tests.rs`:

- `arrow_uuid_extension_roundtrip` — write an Arrow UUID column to a Vortex file via the session, `SELECT *` it back, and assert the field still carries the `Uuid` extension type and the values match.
- `arrow_uuid_extension_roundtrip_nested_struct` — the same flow with the UUID nested in a top-level `Struct`, exercising recursive session-aware schema inference.
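The nested-struct test exists because schema inference must recurse: a plugin-handled extension leaf can sit arbitrarily deep inside containers. A std-only sketch of that `to_arrow_field`-style recursion (the `DType` stub and string-based field encoding are hypothetical, not the vortex-array types):

```rust
/// Stub Vortex dtype: either a leaf (possibly an extension) or a struct.
enum DType {
    Leaf { name: &'static str, extension: Option<&'static str> },
    Struct(Vec<DType>),
}

/// Hypothetical sketch of session-aware schema inference: recurse into
/// containers so a nested extension leaf still consults the plugin registry.
fn to_arrow_field(dtype: &DType, registered: &[&str]) -> String {
    match dtype {
        DType::Leaf { name, extension: Some(ext) } if registered.contains(ext) => {
            // Plugin hit: the Arrow field keeps its ARROW:extension:name.
            format!("{name}[{ext}]")
        }
        DType::Leaf { name, .. } => (*name).to_string(), // canonical mapping
        DType::Struct(children) => {
            let inner: Vec<String> =
                children.iter().map(|c| to_arrow_field(c, registered)).collect();
            format!("struct<{}>", inner.join(", "))
        }
    }
}

fn main() {
    let uuid = DType::Leaf { name: "FixedSizeBinary(16)", extension: Some("arrow.uuid") };
    let batch = DType::Struct(vec![uuid]);
    // The nested UUID still hits the plugin, mirroring the nested-struct test.
    assert_eq!(
        to_arrow_field(&batch, &["arrow.uuid"]),
        "struct<FixedSizeBinary(16)[arrow.uuid]>"
    );
}
```

The same shape applies in reverse for `from_arrow_field`: walk the container, and at each leaf let the registry decide before falling back to the canonical mapping.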