Reverse order scans #7777
Hi, thanks for the PR! Could you create a discussion for this? It's not clear to me that this is how we would want to implement this. I agree it might be nice to have the functionality to reverse scan. However, we might not want to implement this as an array encoding. We also consider Vortex to be a "scalar" query engine, where we essentially always know where values are located in a column (row indices), and thus Regardless, this likely needs some discussion before we can move forward. Let us know if you have any questions!
Merging this PR will degrade performance by 24.99%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | new_bp_prim_test_between[i64, 32768] | 177.3 µs | 236.4 µs | -24.99% |
Comparing ch-sc:reverse-order-scans (70ebbce) with develop (44a6367)
Footnotes
- 138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.
Hi @connortsui20, thanks for the feedback. I created an issue: #7787 to discuss implementation details further.
I totally see where you are coming from. I think this really is a data access optimization which can be applied when data properties line up. It doesn't add sorting capabilities to Vortex. There might be things that I have overlooked though - I'm fairly new to Vortex.
Hi @ch-sc, just a heads up that I converted the issue to a discussion. Also, if you haven't already, please feel free to join the public Vortex Slack! I'm going to post the discussion there since I feel other people might have some thoughts on this. In the meantime, I'm going to make this PR a draft.
Summary
Reverse order scans are an optimization for queries like `ORDER BY timestamp DESC LIMIT n` where the data is ordered by `timestamp ASC`. Such read patterns appear constantly in time-series workloads where callers want the most recent rows. With the current implementation, users have to fall back on naive approaches: either fully scan a Vortex file, buffer all rows, and then reverse the output, or sort all rows of the file. Both are unnecessarily expensive.

If files are already written in sorted order, a scan in the opposite direction can be answered by iterating chunks from last to first and reversing the rows within each chunk, which avoids both sorting and buffering. This PR implements this by reversing ranges in the scan layer and reversing the Vortex array representation.
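As a rough illustration of the idea (not the code in this PR), the sketch below models chunks as plain `Vec<i64>` values written in ascending order. The helper `latest_n` is a hypothetical name and uses only the Rust standard library; it walks chunks last to first, reverses rows within each chunk, and stops as soon as `n` rows have been produced.

```rust
/// Hypothetical helper, not part of Vortex: returns the last `n` values of
/// ascending-sorted chunks in descending order, touching only the chunks
/// that are actually needed.
fn latest_n(chunks: &[Vec<i64>], n: usize) -> Vec<i64> {
    chunks
        .iter()
        .rev() // global reversal: last chunk first
        .flat_map(|chunk| chunk.iter().rev()) // per-chunk reversal
        .take(n) // LIMIT n ends the scan early
        .copied()
        .collect()
}

fn main() {
    // Three chunks of a column written in `timestamp ASC` order.
    let chunks = vec![vec![1, 2, 3], vec![4, 5], vec![6, 7, 8, 9]];

    let newest = latest_n(&chunks, 5);

    // Naive baseline: buffer every row, sort descending, then truncate.
    let mut naive: Vec<i64> = chunks.iter().flatten().copied().collect();
    naive.sort_unstable_by(|a, b| b.cmp(a));
    naive.truncate(5);

    // Both approaches agree, but the reverse scan never read the first chunk.
    assert_eq!(newest, naive);
    println!("{:?}", newest);
}
```

Because the iterator chain is lazy, the first chunk is never visited for `n = 5` in this example, which is the saving the reverse scan is after.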
Closes: #7787
Implementation
The work spans two layers: the scan orchestration layer (`vortex-layout`) and the array encoding layer (`vortex-array`).

Scan layer (`vortex-layout`)
`ScanBuilder` gains a `with_reversed(bool)` builder method. When set:
- `RepeatedScan::execute` collects the chunk ranges and iterates them in reverse order (last chunk first). This is the global reversal: chunk order is flipped for free by reversing a `Vec` of ranges.
- `map_fn` closure wraps the user-supplied function to call `array.reverse()` on each chunk before passing it downstream. This is the per-chunk reversal: row order within each chunk is flipped.

Reversed scans are always ordered (they produce a strict global sequence), so `ordered = true` is implied.

Array layer (`vortex-array`): `ReversedArray`
`ReversedArray` is a new lazy wrapper encoding. It is constructed by `ArrayRef::reverse()` and immediately runs through the optimizer. The optimizer fires structural reduce rules at construction time, before any data is read.
Reduce rules:
- `Reversed(Reversed(x))` → `x`
- `Reversed(Dict(codes, values))` → `Dict(Reversed(codes), values)`
- `Reversed(Chunked([c₀, c₁, …, cₙ]))` → `Chunked([reverse(cₙ), …, reverse(c₁), reverse(c₀)])`
- `Reversed` and re-optimized recursively

The Dict rule is the most important one. Reversing a `Dict` means reversing only the codes, not the values.

Execute kernels:
- `BitBuffer::value_unchecked`: O(n), no intermediate allocation
- `field.reverse()` on each child: per-field optimizer rules still fire
- `take(reversed_indices)`

API Changes
New surface in `vortex-array`:
- `ArrayRef::reverse() -> VortexResult<ArrayRef>`: reverse any array lazily
- `Reversed` / `ReversedArray`: the new encoding type (public, can be pattern-matched)
- `ReverseReduce` trait + `ReverseReduceAdaptor` struct: extension point for custom encodings

New surface in `vortex-layout`:
- `ScanBuilder::with_reversed(bool) -> Self`
- `ScanBuilder::reversed() -> bool`

No breaking changes. All changes are additive.
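To make the reduce rules from the array layer more concrete, here is a toy model using only the Rust standard library. The `Arr` enum and the free functions `reverse` and `execute` are illustrative assumptions; they do not mirror the actual `vortex-array` types, traits, or optimizer, and the final `Reversed` arm is only a stand-in for whatever fallback the real encoding uses.

```rust
// Toy model of the reduce rules; `Arr`, `reverse`, and `execute` are
// illustrative names, not the vortex-array API.
#[derive(Debug, Clone, PartialEq)]
enum Arr {
    Prim(Vec<i64>),
    Dict { codes: Box<Arr>, values: Vec<i64> },
    Chunked(Vec<Arr>),
    Reversed(Box<Arr>),
}

/// Build a lazily reversed array, applying structural rules up front.
fn reverse(a: Arr) -> Arr {
    match a {
        // Reversed(Reversed(x)) => x
        Arr::Reversed(inner) => *inner,
        // Reversed(Dict(codes, values)) => Dict(Reversed(codes), values)
        Arr::Dict { codes, values } => Arr::Dict {
            codes: Box::new(reverse(*codes)),
            values,
        },
        // Reversed(Chunked([c0, .., cn])) => Chunked([reverse(cn), .., reverse(c0)])
        Arr::Chunked(chunks) => Arr::Chunked(chunks.into_iter().rev().map(reverse).collect()),
        // No structural rule applies: keep a lazy wrapper around the child.
        other => Arr::Reversed(Box::new(other)),
    }
}

/// Materialize the logical values of an array.
fn execute(a: &Arr) -> Vec<i64> {
    match a {
        Arr::Prim(v) => v.clone(),
        // Decode each code through the (untouched) dictionary values.
        Arr::Dict { codes, values } => execute(codes)
            .into_iter()
            .map(|c| values[c as usize])
            .collect(),
        Arr::Chunked(chunks) => chunks.iter().flat_map(execute).collect(),
        Arr::Reversed(inner) => {
            let mut v = execute(inner);
            v.reverse();
            v
        }
    }
}

fn main() {
    let dict = Arr::Dict {
        codes: Box::new(Arr::Prim(vec![0, 1, 1, 2])),
        values: vec![10, 20, 30],
    };
    let chunked = Arr::Chunked(vec![Arr::Prim(vec![1, 2]), dict]);

    // Logical contents are 1, 2, 10, 20, 20, 30; reversing flips them.
    let reversed = reverse(chunked.clone());
    assert_eq!(execute(&reversed), vec![30, 20, 20, 10, 2, 1]);

    // Reversing twice restores the original structure via the first rule.
    assert_eq!(reverse(reversed), chunked);
}
```

In this model the dictionary values are moved, never rewritten, when a `Dict` is reversed; only the codes child gains a `Reversed` wrapper, which matches the point made about the Dict rule.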
Testing
`vortex-array/src/arrays/reversed/tests.rs` covers 13 cases for `PrimitiveArray`, `BoolArray`, `DictArray`, `StructArray`, and `ChunkedArray`.