Commit 0cd32ff

PatchedArray (#27)
Distilling some thoughts from the initial implementation work into RFC so we can all get on the same page before we go any further. --------- Signed-off-by: Andrew Duffy <andrew@a10y.dev>
1 parent 48fac27 commit 0cd32ff

3 files changed: +201 -1 lines changed


accepted/0027-patches-format.md

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
- Start Date: 2026-03-02
- Tracking Issue: TBD
- Draft PR: https://github.com/vortex-data/vortex/pull/6815

# Data-parallel Patched Array

## Summary

Make a backwards-compatible change to the serialization format for `Patches` used by the FastLanes-derived encodings:

- BitPacked
- ALP
- ALP-RD

enabling fully data-parallel patch application inside the CUDA bit-unpacking kernels, while not impacting CPU performance.

This relies on introducing a new encoding to represent exception patching, which would be a forward-compatibility break, as is always the case when adding a new default encoding.

---

## Data Layout

Patches have a new layout, influenced by the [G-ALP paper](https://ir.cwi.nl/pub/35205/35205.pdf) from CWI.

The key insight of the paper is that, instead of holding the patches sorted by their global offset, we:

- Group patches into 1024-element chunks
- Further group the patches within each chunk by their "lanes", where a patch's lane is whichever lane of the underlying operation being patched it aligns to

For example, say we have an array of 5,000 elements, with 32 lanes.

- We'd have $\left\lceil\frac{5,000}{1024}\right\rceil = 5$ chunks, and each chunk has 32 lanes. Each lane can have up to 32 patch values.
- Indices and values are aligned. Indices are relative to their chunk, so they can be stored as `u16`. Values are whatever the underlying value type is.

```text
                 chunk 0      chunk 0      chunk 0      chunk 0      chunk 0      chunk 0
                 lane 0       lane 1       lane 2       lane 3       lane 4       lane 5
             ┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
lane_offsets │     0      │     0      │     2      │     2      │     3      │     5      │ ...
             └─────┬──────┴─────┬──────┴─────┬──────┴──────┬─────┴──────┬─────┴──────┬─────┘
                   │            │            │             │            │            │
                   │            │            │             │            │            │
             ┌─────┴────────────┘            └──────┬──────┘     ┌──────┘            └─────┐
             │                                      │            │                         │
             │                                      │            │                         │
             │                                      │            │                         │
             ▼────────────┬────────────┬────────────▼────────────▼────────────┬────────────▼
     indices │            │            │            │            │            │            │
             │            │            │            │            │            │            │
             ├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
      values │            │            │            │            │            │            │
             │            │            │            │            │            │            │
             └────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```
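
The addressing arithmetic implied by this layout can be sketched in a few lines. This is illustrative only and makes two assumptions not spelled out in the RFC: `lane_offsets` is a flat array of `n_chunks * n_lanes + 1` cumulative offsets, and an element's lane within its chunk is the simple interleaved `pos % n_lanes` (the real FastLanes lane order may differ).

```rust
const CHUNK_SIZE: usize = 1024;

/// Map a global element index to its (chunk, lane, index-within-chunk).
fn locate(global_idx: usize, n_lanes: usize) -> (usize, usize, u16) {
    let chunk = global_idx / CHUNK_SIZE;
    let in_chunk = (global_idx % CHUNK_SIZE) as u16;
    let lane = in_chunk as usize % n_lanes; // assumed lane assignment
    (chunk, lane, in_chunk)
}

/// The half-open range of `indices`/`values` holding the patches for one
/// (chunk, lane) pair, assuming a trailing end-offset entry.
fn patch_range(lane_offsets: &[u32], chunk: usize, lane: usize, n_lanes: usize) -> std::ops::Range<usize> {
    let k = chunk * n_lanes + lane;
    lane_offsets[k] as usize..lane_offsets[k + 1] as usize
}

fn main() {
    assert_eq!(locate(4097, 32), (4, 1, 1));
    // Toy offsets matching the figure (six lanes of chunk 0): lane 4 holds
    // the patches at positions 3..5 of `indices`/`values`.
    let lane_offsets = [0u32, 0, 2, 2, 3, 5, 7];
    assert_eq!(patch_range(&lane_offsets, 0, 4, 6), 3..5);
    println!("ok");
}
```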

This layout has a few benefits:

- For GPU operations, each warp handles a single chunk, and each thread handles a single lane. Through `lane_offsets`, each thread of execution has quick random access to an iterator over its lane's values.
- Patches can be trivially sliced to a specific chunk range simply by slicing into `lane_offsets`.
- Bulk operations can be executed efficiently per-chunk by loading all patches for a chunk and applying them in a loop, as before.
- Point lookups are still efficient: convert the target index into its chunk/lane, then do a linear scan for the index. There will be at most `1024 / N_LANES` patches per lane, which in our current implementation is 64. A linear search with loop unrolling should execute extremely fast on hardware with SIMD registers.
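
The point-lookup path can be sketched as follows. This is a hypothetical illustration, not the Vortex API: the function names, the `i64` value type, and the `pos % n_lanes` lane mapping are all assumptions.

```rust
fn lookup(
    inner: &[i64],
    lane_offsets: &[u32],
    indices: &[u16],
    values: &[i64],
    n_lanes: usize,
    global_idx: usize,
) -> i64 {
    let chunk = global_idx / 1024;
    let in_chunk = (global_idx % 1024) as u16;
    let lane = in_chunk as usize % n_lanes; // assumed lane mapping
    let k = chunk * n_lanes + lane;
    let (start, end) = (lane_offsets[k] as usize, lane_offsets[k + 1] as usize);
    // At most 1024 / n_lanes entries: a short, unrollable linear scan.
    for p in start..end {
        if indices[p] == in_chunk {
            return values[p]; // patched
        }
    }
    inner[global_idx] // not patched
}

fn main() {
    // Two chunks, two lanes, one patch at global index 1025 (chunk 1, lane 1).
    let inner = vec![0i64; 2048];
    let lane_offsets = [0u32, 0, 0, 0, 1];
    let (indices, values) = (vec![1u16], vec![42i64]);
    assert_eq!(lookup(&inner, &lane_offsets, &indices, &values, 2, 1025), 42);
    assert_eq!(lookup(&inner, &lane_offsets, &indices, &values, 2, 1024), 0);
}
```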
---

## Array Structure

```rust
/// An array that partially "patches" another array with new values.
///
/// At each patched index, the patch value takes precedence over the value in
/// the inner array.
#[derive(Debug, Clone)]
pub struct PatchedArray {
    /// The inner array that is being patched. This is the zeroth child.
    pub(super) inner: ArrayRef,

    /// Number of 1024-element chunks. Pre-computed for convenience.
    pub(super) n_chunks: usize,

    /// Number of lanes the patch indices and values have been split into. Each of the `n_chunks`
    /// chunks of 1024 values is split into `n_lanes` lanes horizontally, each lane having
    /// `1024 / n_lanes` values that might be patched.
    pub(super) n_lanes: usize,

    /// Offset into the first chunk.
    pub(super) offset: usize,
    /// Total length.
    pub(super) len: usize,

    /// Lane offsets. The PType of these MUST be `u32`.
    pub(super) lane_offsets: ArrayRef,
    /// Indices within a 1024-element chunk. The PType of these MUST be `u16`.
    pub(super) indices: ArrayRef,
    /// Patch values corresponding to the indices. The PType is specified by `values_ptype`.
    pub(super) values: ArrayRef,

    pub(super) stats_set: ArrayStats,
}
```

The `PatchedArray` holds a `lane_offsets` child which provides chunk/lane-level random indexing
into the patch `indices` and `values`. Like all arrays, these can live in device or host memory.

The only operation performed at planning time is slicing, which means that all of its reduce rules can run
without issue in CUDA or on CPU.

---

# Operations

## Slicing

We look at the slice indices, align them to chunk boundaries, slice both the child and the patches to those boundaries, and preserve the offset and length to apply the final intra-chunk slice at execution time.
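
A minimal sketch of that planning step, under the assumption (stated above) that only chunk-boundary alignment happens at planning time while the intra-chunk offset and length are applied at execution time; the function name and return shape are illustrative.

```rust
const CHUNK_SIZE: usize = 1024;

/// Plan a slice [start, stop): returns (first chunk, one-past-last chunk,
/// intra-chunk offset, length). The offset/len are applied at execution time.
fn plan_slice(start: usize, stop: usize) -> (usize, usize, usize, usize) {
    let chunk_start = start / CHUNK_SIZE;                 // round start down
    let chunk_end = (stop + CHUNK_SIZE - 1) / CHUNK_SIZE; // round stop up
    let offset = start - chunk_start * CHUNK_SIZE;
    let len = stop - start;
    (chunk_start, chunk_end, offset, len)
}

fn main() {
    // Slicing [1500, 3000) keeps chunks 1..3 and remembers offset 476, len 1500.
    assert_eq!(plan_slice(1500, 3000), (1, 3, 476, 1500));
}
```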

## Filter

We can do some limited optimization of Filter in a reducer. First, we find the start and end indices of the filter mask, rounded to the nearest chunk boundary (1024 elements).

We then slice the underlying array to those boundaries. We can also slice the `lane_offsets` by multiples of `n_lanes` to trim to only in-bounds chunks.

Then we re-wrap in a `FilterArray` with the mask sliced to the same chunk boundaries. When the filter is sparse and clustered, this greatly reduces the number of chunks that need to be decoded.
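
The trimming step can be sketched as below. Illustrative only: the function name is made up, and the `+ 1` in the offsets range assumes `lane_offsets` carries a trailing end-offset entry.

```rust
const CHUNK_SIZE: usize = 1024;

/// Find the chunk range covering all set positions in the mask, plus the
/// slice of `lane_offsets` (whole multiples of `n_lanes`) that survives.
fn trim_to_chunks(mask: &[bool], n_lanes: usize) -> Option<(usize, usize, std::ops::Range<usize>)> {
    let first = mask.iter().position(|&b| b)?;  // None if the mask is empty
    let last = mask.iter().rposition(|&b| b)?;
    let chunk_start = first / CHUNK_SIZE;
    let chunk_end = last / CHUNK_SIZE + 1;
    let offsets = chunk_start * n_lanes..chunk_end * n_lanes + 1;
    Some((chunk_start, chunk_end, offsets))
}

fn main() {
    // A 4-chunk mask with hits only in chunks 1 and 2 trims to chunks 1..3.
    let mut mask = vec![false; 4 * CHUNK_SIZE];
    mask[1100] = true;
    mask[2050] = true;
    assert_eq!(trim_to_chunks(&mask, 2), Some((1, 3, 2..7)));
}
```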

## `ScalarFn`s

The behavior of some scalar functions may be undefined over the placeholder values that exist in the inner array. For example, integer addition may overflow.

To avoid this, only scalar functions where `ScalarFnVTable::is_fallible()` is `false` can be kernelized.

Currently, this only applies to the `CompareKernel`, which pushes down to the inner array and then performs the comparison on the patches as well.
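
A sketch of how such a pushdown could look, assuming (for illustration only; this is not the Vortex API) patches addressed by their global indices and an `i64` value type:

```rust
/// Kernelize `> rhs` over a patched array: compare the inner array
/// (placeholder values included -- a comparison is infallible, so this is
/// safe), then overwrite the result at patched positions with comparisons
/// of the patch values.
fn compare_gt_patched(
    inner: &[i64],
    patch_global_indices: &[usize],
    patch_values: &[i64],
    rhs: i64,
) -> Vec<bool> {
    let mut out: Vec<bool> = inner.iter().map(|&v| v > rhs).collect();
    for (&i, &v) in patch_global_indices.iter().zip(patch_values) {
        out[i] = v > rhs;
    }
    out
}

fn main() {
    // Position 1 holds a placeholder 5 in `inner`; its true (patched) value is 0.
    assert_eq!(compare_gt_patched(&[1, 5, 3], &[1], &[0], 2), vec![false, false, true]);
}
```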
---

## Compatibility

`BitPackedArray` and `ALPArray` both hold a `Patches` internally, which we'd like to replace by wrapping them in a `PatchedArray`.

To do this without breaking backward compatibility, we modify the `VTable::build` function to return `ArrayRef`. This makes it easy to do encoding migrations on read in the future. The alternative is adding a new `BitPackedArray` and `ALPArray` that get migrated to on write.

This requires executing the `Patches` at read time. From scanning a handful of our tables, this is unlikely to cause any issues, as patches are generally not compressed. We only apply constant compression for patch values, and I would expect that to be rare in practice.

## Drawbacks

This will be a forward-compatibility break: old clients will not be able to read files written with the new encoding.
Moreover, the potential break surface is huge given how ubiquitous bitpacked arrays and patches are in our encoding trees.
This will cause friction, as users of Vortex who have separate writer/reader pipelines will need to upgrade their Vortex
clients across both in lockstep.

> Does this add complexity that could be avoided?

IMO this centralizes some complexity that was previously shared across multiple encodings.

## Alternatives

> Transpose the patches within GPU execution

This was found to be not very performant. The time spent on the D2H copy, patch transpose, and H2D copy far exceeded the cost of executing the bit-unpacking kernel, which puts a serious
limit on our GPU scan performance. Combined with how ubiquitous `BitPackedArray`s with patches are in our encoding trees, this would be a permanent bottleneck on throughput.

> What is the cost of **not** doing this?

Our GPU scan performance would be permanently limited by patching overhead, which in TPC-H lineitem scans was shown to be the biggest bottleneck after string decoding.

> Is there a simpler approach that gets us most of the way there?

I don't think so.

## Prior Art

The original FastLanes GPU paper did not attempt to implement data-parallel patching within the FastLanes unpacking kernels.

The G-ALP paper was published later and implemented patching for ALP values _after_ unpacking.

We use a data layout that closely matches the one described in _G-ALP_ and apply it to bit-unpacking as well.

## Unresolved Questions

- What parts of the design need to be resolved during the RFC process?
- What is explicitly out of scope for this RFC?
- Are there open questions that can be deferred to implementation?

## Future Possibilities

It would be nice to use this to replace the `SparseArray`.

We also need a plan for extending this to non-primitive types; we would need to pick a lane count for the other types.

index.ts

Lines changed: 11 additions & 1 deletion
@@ -442,7 +442,7 @@ async function getHighlighter(): Promise<Highlighter> {
   if (!highlighter) {
     highlighter = await createHighlighter({
       themes: ["github-light", "github-dark"],
-      langs: ["rust", "python", "markdown"],
+      langs: ["rust", "python", "markdown", "cpp", "c"],
     });
   }
   return highlighter;
@@ -554,6 +554,16 @@ async function build(liveReload: boolean = false): Promise<number> {
     await Bun.write("dist/vortex_logo.svg", await logo.text());
   }

+  // Copy all static assets to dist/static/
+  await $`mkdir -p dist/static`.quiet();
+  const staticGlob = new Bun.Glob("*");
+  for await (const filename of staticGlob.scan("./static")) {
+    const src = Bun.file(`static/${filename}`);
+    const dest = `dist/static/${filename}`;
+    await Bun.write(dest, src);
+    console.log(`Copied static/${filename} -> ${dest}`);
+  }
+
   // Generate index page
   const indexHTML = indexPage(rfcs, repoUrl, liveReload);
   await Bun.write("dist/index.html", indexHTML);

static/galp-fig1.png

75.3 KB