- Start Date: 2026-03-02
- Tracking Issue: TBD
- Draft PR: https://github.com/vortex-data/vortex/pull/6815

# Data-parallel Patched Array

## Summary

Make a backwards-compatible change to the serialization format for `Patches` used by the FastLanes-derived encodings:

- BitPacked
- ALP
- ALP-RD

This enables fully data-parallel patch application inside the CUDA bit-unpacking kernels without impacting
CPU performance.

This relies on introducing a new encoding to represent exception patching, which is a forward-compatibility break,
as is always the case when adding a new default encoding.

---

## Data Layout

Patches have a new layout, influenced by the [G-ALP paper](https://ir.cwi.nl/pub/35205/35205.pdf) from CWI.

The key insight of the paper is that instead of holding the patches sorted by their global offset, we:

- Group patches into 1024-element chunks
- Further group the patches within each chunk by lane, where a patch's lane is whichever lane of the underlying FastLanes operation its position aligns to

For example, say we have an array of 5,000 elements with 32 lanes:

- We'd have $\left\lceil\frac{5{,}000}{1024}\right\rceil = 5$ chunks; each chunk has 32 lanes, and each lane can hold up to 32 patch values
- Indices and values are aligned. Indices are positions within a chunk, so they can be stored as u16; values are whatever the underlying value type is

```text
                chunk 0      chunk 0      chunk 0      chunk 0      chunk 0      chunk 0
                 lane 0       lane 1       lane 2       lane 3       lane 4       lane 5
             ┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
lane_offsets │     0      │     0      │     2      │     2      │     3      │     5      │ ...
             └─────┬──────┴─────┬──────┴─────┬──────┴──────┬─────┴──────┬─────┴──────┬─────┘
                   │            │            │             │           │            │
                   │            │            │             │           │            │
             ┌─────┴────────────┘            └──────┬──────┘     ┌──────┘           └──────┐
             │                                      │            │                         │
             │                                      │            │                         │
             │                                      │            │                         │
             ▼────────────┬────────────┬────────────▼────────────▼────────────┬────────────▼
     indices │            │            │            │            │            │            │
             │            │            │            │            │            │            │
             ├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
      values │            │            │            │            │            │            │
             │            │            │            │            │            │            │
             └────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```
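The chunk/lane arithmetic above can be sketched as follows. Note that the `% n_lanes` round-robin lane mapping below is an illustrative assumption; the real assignment follows the lane order of the underlying FastLanes kernel.

```rust
/// Chunk size used by the FastLanes-derived encodings.
const CHUNK_SIZE: usize = 1024;

/// Map a global element index to (chunk, lane, index-within-chunk).
/// NOTE: the `% n_lanes` lane mapping here is an illustrative assumption;
/// the real assignment follows the underlying FastLanes lane order.
fn locate(global_idx: usize, n_lanes: usize) -> (usize, usize, u16) {
    let chunk = global_idx / CHUNK_SIZE;
    let in_chunk = global_idx % CHUNK_SIZE;
    let lane = in_chunk % n_lanes;
    (chunk, lane, in_chunk as u16)
}

fn main() {
    // 5,000 elements span ceil(5000 / 1024) = 5 chunks.
    assert_eq!(5_000usize.div_ceil(CHUNK_SIZE), 5);
    // Element 4,100 falls in chunk 4 at intra-chunk index 4.
    let (chunk, lane, idx) = locate(4_100, 32);
    println!("chunk={chunk} lane={lane} idx={idx}");
}
```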

This layout has a few benefits:

- For GPU operations, each warp handles a single chunk and each thread handles a single lane. Through the `lane_offsets`, each thread of execution gets fast random access to an iterator over its patch values.
- Patches can be trivially sliced to a chunk range simply by slicing into the `lane_offsets`.
- Bulk operations can be executed efficiently per chunk by loading all patches for a chunk and applying them in a loop, as before.
- Point lookups are still efficient: convert the target index into a chunk and lane, then do a linear scan for the index. There will be at most `1024 / N_LANES` patches per lane, which in our current implementation is 64. A linear search with loop unrolling should execute extremely fast on hardware with SIMD registers.

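A point lookup can be sketched over flat slices like this. All names are illustrative rather than the real Vortex API, and for simplicity it assumes a round-robin lane mapping and that `lane_offsets` is a prefix sum with one trailing total entry.

```rust
/// Point-lookup sketch over flattened buffers (illustrative names only).
/// Assumes `lane_offsets` is a prefix sum over patch counts per
/// (chunk, lane) slot, with one trailing total entry, and a round-robin
/// lane mapping for illustration.
fn get_patch(
    lane_offsets: &[u32],
    indices: &[u16],
    values: &[i64],
    n_lanes: usize,
    global_idx: usize,
) -> Option<i64> {
    let in_chunk = (global_idx % 1024) as u16;
    let slot = (global_idx / 1024) * n_lanes + (in_chunk as usize) % n_lanes;
    let (start, end) = (lane_offsets[slot] as usize, lane_offsets[slot + 1] as usize);
    // At most 1024 / n_lanes candidates: a short linear scan the compiler
    // can unroll and vectorize.
    indices[start..end]
        .iter()
        .position(|&i| i == in_chunk)
        .map(|hit| values[start + hit])
}

fn main() {
    // One chunk, two lanes: lane 0 patches index 4, lane 1 patches index 7.
    let lane_offsets = [0u32, 1, 2];
    let (indices, values) = ([4u16, 7], [100i64, 200]);
    assert_eq!(get_patch(&lane_offsets, &indices, &values, 2, 4), Some(100));
    assert_eq!(get_patch(&lane_offsets, &indices, &values, 2, 5), None);
    println!("ok");
}
```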
---

## Array Structure

```rust
/// An array that partially "patches" another array with new values.
///
/// Used by the FastLanes-derived encodings (BitPacked, ALP, ALP-RD) to
/// represent their exception values.
#[derive(Debug, Clone)]
pub struct PatchedArray {
    /// The inner array that is being patched. This is the zeroth child.
    pub(super) inner: ArrayRef,

    /// Number of 1024-element chunks. Pre-computed for convenience.
    pub(super) n_chunks: usize,

    /// Number of lanes the patch indices and values have been split into. Each of the `n_chunks`
    /// of 1024 values is split into `n_lanes` lanes horizontally, each lane having `1024 / n_lanes`
    /// values that might be patched.
    pub(super) n_lanes: usize,

    /// Offset into the first chunk.
    pub(super) offset: usize,
    /// Total length.
    pub(super) len: usize,

    /// Lane offsets. The PType of these MUST be u32.
    pub(super) lane_offsets: ArrayRef,
    /// Indices within a 1024-element chunk. The PType of these MUST be u16.
    pub(super) indices: ArrayRef,
    /// Patch values corresponding to the indices. The PType is specified by `values_ptype`.
    pub(super) values: ArrayRef,

    pub(super) stats_set: ArrayStats,
}
```

The PatchedArray holds a `lane_offsets` child which provides chunk/lane-level random indexing
into the patch `indices` and `values`. Like all arrays, these can live in device or host memory.

The only operation performed at planning time is slicing, which means that all of its reduce rules can run
without issue in CUDA or on the CPU.

---

# Operations

## Slicing

We look at the slice indices, align them to chunk boundaries, slice both the child and the patches to those boundaries, and preserve the offset and length so that the final intra-chunk slice can be applied at execution time.
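The boundary arithmetic can be sketched as follows; the function name and tuple return are illustrative, not the real API.

```rust
/// Slicing sketch: widen a requested [start, stop) range outward to
/// 1024-element chunk boundaries, remembering the residual intra-chunk
/// offset and logical length to apply at execution time.
/// Illustrative names, not the real Vortex API.
fn slice_bounds(start: usize, stop: usize) -> (usize, usize, usize, usize) {
    let chunk_start = (start / 1024) * 1024; // round down to a chunk boundary
    let chunk_stop = stop.div_ceil(1024) * 1024; // round up to a chunk boundary
    (chunk_start, chunk_stop, start - chunk_start, stop - start)
}

fn main() {
    // Slicing [1500, 3000) keeps chunks covering [1024, 3072),
    // with offset 476 into the first kept chunk and logical length 1500.
    assert_eq!(slice_bounds(1_500, 3_000), (1_024, 3_072, 476, 1_500));
    println!("ok");
}
```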

## Filter

We can do some limited optimization of Filter in a reducer. First, we find the start and end indices of the filter mask, rounded to the nearest chunk boundary (1024 elements).

We then slice the underlying array to those boundaries. We can also slice the `lane_offsets` by multiples of `n_lanes` to trim to only in-bounds chunks.

Then we re-wrap in a FilterArray with the mask sliced to the same chunk boundaries. When the filter is sparse and clustered, this greatly reduces the number of chunks
that need to be decoded.
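Finding the in-bounds chunk range can be sketched like this; a plain `bool` slice stands in for Vortex's mask type, and the names are illustrative.

```rust
/// Filter-reduction sketch: find the chunk range [start_chunk, end_chunk)
/// containing any set position of the mask. A bool slice stands in for
/// the real mask type; names are illustrative.
fn selected_chunks(mask: &[bool]) -> Option<(usize, usize)> {
    let first = mask.iter().position(|&b| b)?; // first set bit
    let last = mask.iter().rposition(|&b| b)?; // last set bit
    Some((first / 1024, last / 1024 + 1))
}

fn main() {
    // A mask over 4 chunks with a single hit at position 2050
    // only requires decoding chunk range [2, 3).
    let mut mask = vec![false; 4096];
    mask[2050] = true;
    assert_eq!(selected_chunks(&mask), Some((2, 3)));
    println!("ok");
}
```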

## `ScalarFn`s

The behavior of some scalar functions may be undefined over the placeholder values that exist in the inner array. For example, integer addition may overflow.

To avoid this, only scalar functions where `ScalarFnVTable::is_fallible()` is `false` can be kernelized.

Currently, this only applies to the `CompareKernel`, which pushes the comparison down to the inner array and then performs it on the patches as well.
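The pushdown shape can be sketched on plain vectors; the function and its arguments are illustrative, not the real kernel API.

```rust
/// Sketch of the compare pushdown: evaluate the comparison on the inner
/// array (placeholder values included), then overwrite the results at
/// patched positions with the comparison over the true patch values.
/// This is only safe because comparison is infallible over placeholders.
fn compare_patched_lt(
    inner: &[i64],
    patch_positions: &[usize],
    patch_values: &[i64],
    rhs: i64,
) -> Vec<bool> {
    let mut out: Vec<bool> = inner.iter().map(|&v| v < rhs).collect();
    for (&pos, &v) in patch_positions.iter().zip(patch_values) {
        out[pos] = v < rhs; // patched slots are simply rewritten
    }
    out
}

fn main() {
    // Position 1 holds a placeholder (0) in the inner array; its real value is 99.
    let out = compare_patched_lt(&[1, 0, 3], &[1], &[99], 10);
    assert_eq!(out, vec![true, false, true]);
    println!("ok");
}
```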

---

## Compatibility

BitPackedArray and ALPArray both hold a `Patches` internally, which we'd like to replace by wrapping them in a `PatchedArray`.

To do this without breaking backward compatibility, we modify the `VTable::build` function to return `ArrayRef`. This makes it easy to do encoding migrations on read in the future. The alternative is adding a new BitPackedArray and ALPArray that get migrated to on write.

This requires executing the Patches at read time. From scanning a handful of our tables, this is unlikely to cause any issues, as patches are generally not compressed. We only apply constant compression to patch values, and I would expect that to be rare in practice.

## Drawbacks

This will be a forward-compatibility break: old clients will not be able to read files written with the new encoding.
The potential break surface is large given how ubiquitous bitpacked arrays and patches are in our encoding trees.
This will cause friction, as users of Vortex with separate writer and reader pipelines will need to upgrade their
clients on both sides in lockstep.

> Does this add complexity that could be avoided?

IMO this centralizes complexity that was previously spread across multiple encodings.

## Alternatives

> Transpose the patches within GPU execution

This was found to perform poorly. The time spent on the D2H copy, patch transposition, and H2D copy far exceeded the cost of executing
the bitpacking kernel, which puts a serious limit on our GPU scan performance. Combined with how ubiquitous `BitPackedArray`s with
patches are in our encoding trees, this would be a permanent bottleneck on throughput.

> What is the cost of **not** doing this?

Our GPU scan performance would be permanently limited by patching overhead, which in TPC-H lineitem scans was shown to be the biggest bottleneck after string decoding.

> Is there a simpler approach that gets us most of the way there?

I don't think so.

## Prior Art

The original FastLanes GPU paper did not attempt to implement data-parallel patching within the FastLanes unpacking
kernels.

The G-ALP paper was published later and implemented patching for ALP values _after_ unpacking.

We use a data layout that closely matches the one described in _G-ALP_ and apply it to bit-unpacking as well.

## Unresolved Questions

- What parts of the design need to be resolved during the RFC process?
- What is explicitly out of scope for this RFC?
- Are there open questions that can be deferred to implementation?

## Future Possibilities

It would be nice to use this to replace the SparseArray.

We also need a plan for extending this to non-primitive types; we would need to pick a lane count for those types.