feat: iterative execution ExecutionStep::AppendChild and Chunked impl #7600
feat: iterative execution ExecutionStep::AppendChild and Chunked impl #7600joseph-isaacs wants to merge 19 commits into
ExecutionStep::AppendChild and Chunked impl #7600Conversation
Introduces a builder-driven execution mode alongside the existing
`ExecuteSlot` / `Done` model. Encodings can now append directly into an
`ArrayBuilder` without materializing intermediate canonical arrays.
New types (in `vortex_array::builder_kernel`):
* `BuilderStep { Done, ExecuteSlot, AppendSlot }`
* `BuilderResult { array, builder, step }`
* `AppendToBuilderKernel<V>` typed kernel trait
* `DynAppendToBuilderKernel` type-erased dispatch trait
* `BuilderKernelSession` session variable keyed by encoding id
New executor surface:
* `ExecutionStep::AppendSlot(usize)` to bootstrap the builder path
* `execute_into_builder(array, builder, ctx)` public entry point
* internal `drive_builder` + `execute_until_predicate` helpers
The `Chunked` encoding gets a builder kernel that scans its chunk slots,
returning `BuilderStep::AppendSlot` for the first still-populated slot.
When a chunk is driven through the builder path the executor takes the
slot, leaving it as `None` — the kernel iterates to the next `Some` on
the next call and returns `Done` once all chunks are consumed.
Tests exercise both `Chunked(Primitive)` and nested
`Chunked(Chunked(Primitive))` through the builder path.
Known issue documented via comment in `drive_builder`: the plan relies on
`take_slot_unchecked` truly leaving the taken slot as `None`, but that
only happens when the `Arc` has refcount 1. The driver now explicitly
drops extra references after `optimize()` and passes arrays by value
through the kernel to keep refcount at one; any future caller that keeps
an outstanding clone of the array they hand to the builder path would
re-trigger the original infinite loop.
Signed-off-by: Claude <noreply@anthropic.com>
Previously `take_slot_unchecked` only actually mutated the parent when the `Arc` had refcount 1; otherwise it cloned the child and left the slot as `Some`. That was fine for the existing take->put-back pattern but broke the builder path, which relies on `slot(i).is_none()` to track consumed children (causing an infinite loop whenever a caller held an outstanding clone of the array). Fix: on the shared-Arc path, clone the child out and construct a fresh parent with that slot set to None via a new unsafe `DynArray::with_slots_unchecked` that skips `V::validate`. The None state is invisible outside the executor — callers still must put the slot back (or drive it to completion via the builder path) before the parent escapes. Update the chunked builder-kernel tests to deliberately hold an extra clone of the array across the `execute_into_builder` call, exercising the previously-broken shared-Arc path. Signed-off-by: Claude <noreply@anthropic.com>
`Executable for ArrayRef::execute` is defined as "remove one node from the execution path." `ExecutionStep::AppendSlot` is an exception: there is no useful partial progress point inside a builder-driven sub-tree, so single-step execute drives the slot all the way to canonical and splices the result back in one call. Call that out on both the trait impl and the `AppendSlot` variant so the collapsed semantics aren't a surprise to callers. Signed-off-by: Claude <noreply@anthropic.com>
…-execution-path-0qvq2 Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> # Conflicts: # vortex-array/src/executor.rs
Merging this PR will degrade performance by 25.62%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_opt_bool_into_canonical[(100, 100)] |
220 µs | 257.1 µs | -14.4% |
| ❌ | Simulation | chunked_opt_bool_into_canonical[(10, 1000)] |
1 ms | 1.4 ms | -25.62% |
| ❌ | Simulation | chunked_varbinview_into_canonical[(10, 1000)] |
1.7 ms | 2 ms | -17.03% |
| ❌ | Simulation | chunked_varbinview_opt_into_canonical[(10, 1000)] |
2.5 ms | 2.8 ms | -12.35% |
| ⚡ | Simulation | varbinview_zip_fragmented_mask |
7.3 ms | 6.5 ms | +11.47% |
| ⚡ | Simulation | varbinview_zip_block_mask |
3.7 ms | 2.9 ms | +27.62% |
| ⚡ | Simulation | decompress[alp_for_bp_f64] |
2.9 ms | 1.8 ms | +58.59% |
| ⚡ | Simulation | alp_rd_decompress_f64 |
2.4 ms | 1.1 ms | ×2.2 |
| ⚡ | Simulation | decompress_rd[f32, (10000, 0.0)] |
166 µs | 85.2 µs | +94.89% |
| ⚡ | Simulation | decompress_rd[f32, (10000, 0.1)] |
165.8 µs | 81.3 µs | ×2 |
| ⚡ | Simulation | decompress_rd[f32, (10000, 0.01)] |
166.1 µs | 81.3 µs | ×2 |
| ⚡ | Simulation | decompress_rd[f32, (100000, 0.1)] |
1,374.6 µs | 582 µs | ×2.4 |
| ⚡ | Simulation | decompress_rd[f32, (100000, 0.01)] |
1,375.1 µs | 582 µs | ×2.4 |
| ⚡ | Simulation | decompress_rd[f32, (100000, 0.0)] |
1,286.9 µs | 495.1 µs | ×2.6 |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.0)] |
2,294.5 µs | 978.3 µs | ×2.3 |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.1)] |
2.3 ms | 1 ms | ×2.3 |
| ⚡ | Simulation | decompress_rd[f64, (10000, 0.1)] |
258.5 µs | 121.5 µs | ×2.1 |
| ⚡ | Simulation | decompress_rd[f64, (10000, 0.01)] |
258.4 µs | 121.4 µs | ×2.1 |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.01)] |
2.3 ms | 1 ms | ×2.3 |
| ⚡ | Simulation | decompress_rd[f64, (10000, 0.0)] |
258 µs | 121.6 µs | ×2.1 |
Comparing claude/add-builder-execution-path-0qvq2 (861457f) with develop (ea31cc2)
Polar Signals Profiling ResultsLatest Run
Previous Runs (7)
Powered by Polar Signals Cloud |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
File Sizes: PolarSignals ProfilingNo baseline file sizes found for base commit. |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
File Sizes: FineWeb NVMeNo file size changes detected. |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
File Sizes: TPC-DS SF=1 on NVMENo file size changes detected. |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
File Sizes: TPC-H SF=10 on NVMENo file size changes detected. |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
Benchmarks: CompressionVortex (geomean): 0.982x ➖ unknown / unknown (0.960x ➖, 24↑ 3↓)
|
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.075x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.070x ➖, 0↑ 1↓)
datafusion / parquet (1.063x ➖, 0↑ 3↓)
datafusion / arrow (1.103x ❌, 0↑ 9↓)
duckdb / vortex-file-compressed (1.064x ➖, 0↑ 4↓)
duckdb / vortex-compact (1.068x ➖, 0↑ 5↓)
duckdb / parquet (1.032x ➖, 0↑ 0↓)
duckdb / duckdb (1.053x ➖, 0↑ 3↓)
Full attributed analysis
|
File Sizes: TPC-H SF=1 on NVMENo baseline file sizes found for base commit. |
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
Benchmarks: Random AccessVortex (geomean): 1.046x ➖ unknown / unknown (1.051x ➖, 0↑ 5↓)
|
…-execution-path-0qvq2 Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> # Conflicts: # vortex-array/src/executor.rs
File Sizes: Clickbench on NVMEFile Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
…-execution-path-0qvq2 Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
ExecutionStep::AppendChild and Chunked impl
Chunked array decompression (in iterative execution) currently requires a copy from a decompressed chunk into the whole decompressed array.
This PR add support for a new
ExecutionStep::AppendChildthat will iteratively request that a child is executed into a Builder.Next steps
VTable::append_to_buildwith a iterative model.