Commit 87f217f
committed
Refactor OnPair to FSST-shape: dict-as-blob, u16 codes child, Rust decode
Replaces the previous opaque-blob layout with one that mirrors how FSST
splits its symbols-as-buffer / codes-as-child encoding, and shifts every
read path off the C++ FFI.
Layout
------
Buffer 0 dict_bytes — dictionary blob built by C++ training
Slot 0 dict_offsets u32[] — len = dict_size + 1
Slot 1 codes u16[] — one token id per element, low `bits`
bits populated (FastLanes-bit-packable)
Slot 2 codes_offsets u32[] — per-row token offsets, len = n + 1
Slot 3 uncompressed_lengths — i32[], len = n
Slot 4 validity — optional Bool child
metadata = { bits: u32, uncompressed_lengths_ptype: i32 }
Decode path
-----------
At compress time we call OnPair's C++ trainer to produce the dictionary
and bit-packed token stream, then immediately unpack the stream into u16
codes in Rust (`vortex_onpair_sys::unpack_codes_to_u16`) and drop the
C++ column. After that, nothing on the read path touches C++:
decode_row(r):
for c in codes[codes_offsets[r] .. codes_offsets[r+1]]:
out.extend_from_slice(
dict_bytes[dict_offsets[c] .. dict_offsets[c+1]]
)
`canonicalize`, `scalar_at`, and the compute kernels all share a
`DecodeView` over the materialised children.
Compute kernels (pure Rust, no C++ scan)
----------------------------------------
* compare (Eq / NotEq): streams dict slices per row, short-circuits on
the first mismatch.
* like ('lit', 'pre%', '%sub%'): same streaming approach for prefix; a
full row decode + memmem for contains.
* filter: canonical round-trip + recompress (unchanged).
* slice: zero-copy — narrows codes_offsets / uncompressed_lengths /
validity and shares the dict blob + codes child.
* cast: identity rewrap, no payload touched.
Tests
-----
All 7 unit tests + the 100 000-row big_data smoke test pass. On the
smoke corpus (release): compress 147 ms, full canonicalize 7.5 ms,
equals / starts_with / contains pushdown counts match a brute-force
reference exactly.
Signed-off-by: Claude <noreply@anthropic.com>1 parent 0fb5929 commit 87f217f
17 files changed
Lines changed: 806 additions & 381 deletions
File tree
- encodings
- onpair-sys
- cxx
- src
- onpair
- goldenfiles
- src
- compute
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
| 17 | + | |
| 18 | + | |
17 | 19 | | |
18 | 20 | | |
19 | | - | |
| 21 | + | |
20 | 22 | | |
21 | 23 | | |
22 | 24 | | |
| |||
351 | 353 | | |
352 | 354 | | |
353 | 355 | | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
| 362 | + | |
| 363 | + | |
| 364 | + | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
| 373 | + | |
| 374 | + | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
354 | 395 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
124 | 124 | | |
125 | 125 | | |
126 | 126 | | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
127 | 150 | | |
128 | 151 | | |
129 | 152 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
115 | 115 | | |
116 | 116 | | |
117 | 117 | | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
118 | 138 | | |
119 | 139 | | |
120 | 140 | | |
| |||
322 | 342 | | |
323 | 343 | | |
324 | 344 | | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
| 350 | + | |
| 351 | + | |
| 352 | + | |
| 353 | + | |
| 354 | + | |
| 355 | + | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
| 362 | + | |
| 363 | + | |
| 364 | + | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
325 | 370 | | |
326 | 371 | | |
327 | 372 | | |
328 | 373 | | |
329 | 374 | | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
| 395 | + | |
| 396 | + | |
| 397 | + | |
| 398 | + | |
| 399 | + | |
| 400 | + | |
| 401 | + | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| 405 | + | |
| 406 | + | |
| 407 | + | |
| 408 | + | |
| 409 | + | |
| 410 | + | |
| 411 | + | |
| 412 | + | |
| 413 | + | |
| 414 | + | |
| 415 | + | |
| 416 | + | |
| 417 | + | |
| 418 | + | |
| 419 | + | |
| 420 | + | |
| 421 | + | |
| 422 | + | |
| 423 | + | |
| 424 | + | |
| 425 | + | |
| 426 | + | |
| 427 | + | |
| 428 | + | |
| 429 | + | |
| 430 | + | |
| 431 | + | |
| 432 | + | |
| 433 | + | |
| 434 | + | |
| 435 | + | |
| 436 | + | |
| 437 | + | |
| 438 | + | |
| 439 | + | |
| 440 | + | |
| 441 | + | |
| 442 | + | |
| 443 | + | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
0 commit comments