Commit 53c3ea4
OnPair: filter shares dict (TPC-H Q22 SF=10 fix) + token-aware predicates + memchr contains
Three connected changes that remove the TPC-H Q22 SF=10 regression and
speed up predicate pushdown.
OnPair::filter — share the dictionary (was the SF=10 cause)
-----------------------------------------------------------
The previous implementation decoded the whole array, filtered the
canonical bytes, and re-trained a brand-new OnPair dictionary on the
surviving rows. TPC-H Q22 customer.c_phone goes through two consecutive
filters (`SUBSTRING(c_phone,1,2) IN (...)` and `c_acctbal > avg`), each
of which paid full `Column::compress` training overhead — a ~50–100 ms
constant cost per call that vanishes below noise at SF=1 but dominates
at SF=10.
The rewrite is FSST-shape: keep `dict_bytes` + `dict_offsets` byte-
identical to the input; rebuild only `codes`, `codes_offsets`,
`uncompressed_lengths`, and validity by walking the mask. No decode,
no retrain, no C++ on the read path. New unit test
`test_onpair_filter_shares_dict` asserts the dict is byte-identical
post-filter.
Bench (UrlLog 1 M, --sample-count 30, release):
  filter_share_dict  4.8 ms median
  (vs. ~70 ms estimated for the old recompress path)
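
For illustration, the shared-dictionary filter described above can be sketched as follows. This is a minimal sketch and not the code in this commit: the `OnPairSketch` struct, its field types, and the use of `Arc` for sharing are assumptions; only the field names come from the description.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for an OnPair-encoded string array.
/// Row `i` owns `codes[codes_offsets[i]..codes_offsets[i + 1]]`.
#[derive(Debug, Clone, PartialEq)]
pub struct OnPairSketch {
    pub dict_bytes: Arc<[u8]>,
    pub dict_offsets: Arc<[u32]>,
    pub codes: Vec<u16>,
    pub codes_offsets: Vec<u32>,
    pub uncompressed_lengths: Vec<u32>,
    pub validity: Vec<bool>,
}

impl OnPairSketch {
    /// Keep the rows where `mask` is true. The dictionary is shared as-is;
    /// only the per-row buffers are rebuilt. Nothing is decoded or retrained.
    pub fn filter(&self, mask: &[bool]) -> OnPairSketch {
        assert_eq!(mask.len(), self.validity.len(), "mask must cover every row");
        let mut codes = Vec::new();
        let mut codes_offsets = vec![0u32];
        let mut uncompressed_lengths = Vec::new();
        let mut validity = Vec::new();
        for (row, &keep) in mask.iter().enumerate() {
            if !keep {
                continue;
            }
            let lo = self.codes_offsets[row] as usize;
            let hi = self.codes_offsets[row + 1] as usize;
            // Copy the surviving row's token codes verbatim.
            codes.extend_from_slice(&self.codes[lo..hi]);
            codes_offsets.push(codes.len() as u32);
            uncompressed_lengths.push(self.uncompressed_lengths[row]);
            validity.push(self.validity[row]);
        }
        OnPairSketch {
            // Reference-counted clones: the dictionary bytes are not copied.
            dict_bytes: Arc::clone(&self.dict_bytes),
            dict_offsets: Arc::clone(&self.dict_offsets),
            codes,
            codes_offsets,
            uncompressed_lengths,
            validity,
        }
    }
}
```

In this sketch the cost of `filter` is proportional to the number of surviving codes, with no per-call training step, which is the property the SF=10 fix relies on.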
Token-aware Eq pushdown (no row decode)
---------------------------------------
New `lpm.rs` greedy longest-prefix-match tokeniser. OnPair's dictionary
is sorted lexicographically, so a 257-entry first-byte index gives
O(1) bucket lookup per byte; the inner loop scans the small bucket
to pick the longest matching dict entry. Two byte strings have equal
LPM token sequences iff they have equal bytes (LPM is deterministic
under the same dict), so `compute/compare.rs::compare(Eq)` LPM-tokenises
the needle once and then for each row compares `codes[lo..hi]` against
the tokenised needle as `&[u16]` — direct slice eq, no decode at all.
If the needle contains a byte that has no dict entry, no row can match
(every row was compressed against the same dict) — we leave the
bitmap zeroed and `NotEq` inverts.
Bench (UrlLog 1 M):
  eq_constant  6.8 ms median
  (mostly `OwnedDecodeInputs::collect`; the actual token compare is
  sub-millisecond)
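
The token-aware Eq idea above can be illustrated with a small, self-contained sketch. It is not the contents of `lpm.rs`: the `LpmDict` type, its constructor, and the `eq_constant` signature are hypothetical, and the dictionary is held as a `Vec<Vec<u8>>` for readability rather than as packed `dict_bytes` + `dict_offsets`.

```rust
/// Hypothetical dictionary: entries sorted lexicographically, so an entry's
/// position is its token code and entries sharing a first byte are contiguous.
pub struct LpmDict {
    entries: Vec<Vec<u8>>,
    /// Bucket for first byte `b` is `first_byte[b]..first_byte[b + 1]`.
    first_byte: [usize; 257],
}

impl LpmDict {
    pub fn new(mut entries: Vec<Vec<u8>>) -> Self {
        entries.retain(|e| !e.is_empty());
        entries.sort();
        entries.dedup();
        assert!(entries.len() <= u16::MAX as usize + 1, "codes must fit in u16");
        let mut first_byte = [0usize; 257];
        // Count entries per first byte, then prefix-sum into bucket starts.
        for e in &entries {
            first_byte[e[0] as usize + 1] += 1;
        }
        for b in 0..256 {
            first_byte[b + 1] += first_byte[b];
        }
        Self { entries, first_byte }
    }

    /// Greedy longest-prefix-match tokenisation. Returns `None` when some
    /// byte of `input` is not covered by any dictionary entry.
    pub fn tokenise(&self, input: &[u8]) -> Option<Vec<u16>> {
        let mut out = Vec::new();
        let mut pos = 0;
        while pos < input.len() {
            let rest = &input[pos..];
            let b = rest[0] as usize;
            let bucket = self.first_byte[b]..self.first_byte[b + 1];
            // Scan the small bucket for the longest entry that prefixes `rest`.
            let best = bucket
                .filter(|&code| rest.starts_with(&self.entries[code]))
                .max_by_key(|&code| self.entries[code].len())?;
            out.push(best as u16);
            pos += self.entries[best].len();
        }
        Some(out)
    }
}

/// Eq kernel: tokenise the needle once, then compare code slices per row.
/// Assumes every row was encoded with the same greedy LPM over `dict`.
pub fn eq_constant(
    dict: &LpmDict,
    codes: &[u16],
    codes_offsets: &[u32],
    needle: &[u8],
) -> Vec<bool> {
    let rows = codes_offsets.len().saturating_sub(1);
    let mut hits = vec![false; rows];
    let needle_codes = match dict.tokenise(needle) {
        Some(c) => c,
        None => return hits, // untokenisable needle: no row can match
    };
    for row in 0..rows {
        let lo = codes_offsets[row] as usize;
        let hi = codes_offsets[row + 1] as usize;
        hits[row] = codes[lo..hi] == needle_codes[..];
    }
    hits
}
```

With the entries `/`, `a`, `ab`, `abc`, `b`, `c` (codes 0 to 5 after sorting), `abcab` tokenises to `[3, 2]`, and a needle such as `abx` yields an all-false bitmap because `x` has no dictionary entry.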
LIKE pushdown
-------------
* `'literal'`: same token-aware path as Eq.
* `'prefix%'`: byte-streaming via `for_each_dict_slice`. The naive
  "tokenise the prefix and compare token prefix" trick is **wrong** for
  LIKE: the LPM of the row's leading bytes may merge tokens past the
  literal prefix's boundary. Streaming dict slices and comparing
  prefix-wise is the correct minimum-work option.
* `'%substring%'`: `memchr::memmem::Finder` (SSE2/AVX2 on x86_64, NEON
  on aarch64, Two-Way underneath). Built once per kernel call, reused
  across every row.
Everything else (escapes, `_`, mid-pattern wildcards,
case-insensitive) returns `None` so the framework decompresses + runs
the scalar `LIKE`.
Bench (UrlLog 1 M):
like_prefix 14.8 ms median
like_contains 36.4 ms median
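
To make the `'prefix%'` pitfall concrete, here is a hedged sketch of prefix matching by streaming dictionary slices. The helper names `dict_slice` and `row_starts_with` are hypothetical and stand in for whatever `for_each_dict_slice` does internally; the packed `dict_bytes` + `dict_offsets` layout is assumed from the description above.

```rust
/// Bytes of the dictionary entry for `code` (hypothetical packed layout:
/// entry `c` is `dict_bytes[dict_offsets[c]..dict_offsets[c + 1]]`).
fn dict_slice<'a>(dict_bytes: &'a [u8], dict_offsets: &[u32], code: u16) -> &'a [u8] {
    let lo = dict_offsets[code as usize] as usize;
    let hi = dict_offsets[code as usize + 1] as usize;
    &dict_bytes[lo..hi]
}

/// Does the row encoded as `row_codes` start with `prefix`?
/// Walks the row's dictionary slices and stops as soon as the prefix is
/// consumed or a byte differs, so the row is never fully decoded.
pub fn row_starts_with(
    dict_bytes: &[u8],
    dict_offsets: &[u32],
    row_codes: &[u16],
    prefix: &[u8],
) -> bool {
    let mut remaining = prefix;
    for &code in row_codes {
        if remaining.is_empty() {
            return true;
        }
        let slice = dict_slice(dict_bytes, dict_offsets, code);
        // Compare only the overlap between this slice and what is left.
        let n = slice.len().min(remaining.len());
        if slice[..n] != remaining[..n] {
            return false;
        }
        remaining = &remaining[n..];
    }
    // True only if the row supplied enough bytes to cover the prefix.
    remaining.is_empty()
}
```

With a dictionary holding `/`, `a`, `ab`, `abc`, `b`, `c` (codes 0 to 5), the row `abc/` is stored as `[3, 0]` while the prefix `ab` tokenises to `[2]`. A token-prefix comparison would reject the row, whereas the streaming comparison accepts it after reading two bytes of the first slice.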
Bench surface
-------------
* New corpus shapes: `UrlLog`, `Short`, `Long`, `HighCard` × 2 row
counts (100 K, 1 M).
* New compute benches: `eq_constant`, `like_prefix`, `like_contains`,
`filter_share_dict`.
Verified
--------
* `cargo test -p vortex-onpair`: 19 / 19
* `cargo test -p vortex-btrblocks`: 35 / 35
* `cargo test -p vortex-file --features onpair,tokio --test test_onpair_string_roundtrip`: 5 / 5
* `cargo clippy -p vortex-onpair --all-targets`: clean
Signed-off-by: Claude <noreply@anthropic.com>

1 parent adeda19 commit 53c3ea4
9 files changed
Lines changed: 573 additions & 103 deletions
File tree
- encodings/onpair
- benches
- src
- compute