Commit a51c8e9
encodings/onpair-rs: pure-Rust port of OnPair training + encoding
New crate `onpair-lib` at `encodings/onpair-rs/` that mirrors the subset of
`vortex-onpair-sys` actually consumed by `vortex-onpair`: BPE-style
dictionary training plus LSB-first bit-packed token encoding, exposed via
`Column::compress` and `Column::parts` with the same shape as the FFI
crate. Decode, LIKE, and EQ remain in `vortex-onpair` (already pure Rust)
and read the same `(dict_bytes, dict_offsets, codes_packed,
codes_boundaries, bits)` layout.
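As a rough illustration of what LSB-first bit-packing of token codes means, here is a minimal sketch. The function names `pack_lsb_first` and `unpack_lsb_first` are hypothetical and are not taken from `onpair-lib`; the sketch assumes codes fit in a `u16` and that the width is between 1 and 16 bits.

```rust
/// Pack `codes` into bytes using `bits` bits per code, least-significant
/// bit first. Hypothetical helper, not the crate's actual API.
fn pack_lsb_first(codes: &[u16], bits: u32) -> Vec<u8> {
    assert!((1..=16).contains(&bits));
    let mut out = Vec::with_capacity((codes.len() * bits as usize + 7) / 8);
    let mut acc: u64 = 0; // bit accumulator; the lowest bits are the oldest
    let mut filled: u32 = 0; // number of valid bits currently in `acc`
    for &code in codes {
        debug_assert!((code as u32) < (1u32 << bits));
        acc |= (code as u64) << filled;
        filled += bits;
        // Flush every complete byte, lowest byte first.
        while filled >= 8 {
            out.push(acc as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    // Flush the final partial byte; unused high bits stay zero.
    if filled > 0 {
        out.push(acc as u8);
    }
    out
}

/// Inverse of `pack_lsb_first`: read `count` codes of `bits` bits each.
fn unpack_lsb_first(packed: &[u8], bits: u32, count: usize) -> Vec<u16> {
    assert!((1..=16).contains(&bits));
    let mask: u64 = (1u64 << bits) - 1;
    let mut out = Vec::with_capacity(count);
    let mut acc: u64 = 0;
    let mut filled: u32 = 0;
    let mut bytes = packed.iter();
    for _ in 0..count {
        // Refill until one whole code is available.
        while filled < bits {
            let b = *bytes.next().expect("packed buffer too short");
            acc |= (b as u64) << filled;
            filled += 8;
        }
        out.push((acc & mask) as u16);
        acc >>= bits;
        filled -= bits;
    }
    out
}
```

With 12-bit codes, `[1, 2, 3]` occupies 36 bits and therefore five bytes, the last of which is half padding.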
Modules ported from `gargiulofrancesco/onpair_cpp`:
* types, dict, store, bit_writer, bit_unpack
* lpm (flat HashMap keyed by (u128, u8); behaviourally equivalent
  replacement for the C++ short/long bucket split)
* trainer (BPE pair-discovery + DynamicThresholdController + sort)
* parser, column
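The flat-map approach to longest-prefix matching mentioned for `lpm` can be sketched as follows. This is only an illustration under assumptions: the type `LongestPrefixMatcher`, the helper `pack_key`, and the 16-byte token limit (chosen because a `u128` holds 16 bytes) are hypothetical and may not match the crate's actual code.

```rust
use std::collections::HashMap;

/// Assumed maximum token length: a u128 can hold at most 16 bytes.
const MAX_TOKEN_LEN: usize = 16;

/// Pack up to 16 bytes into a (u128, length) key. The length is part of
/// the key so that "a" and "a\0" do not collide after zero padding.
fn pack_key(bytes: &[u8]) -> (u128, u8) {
    debug_assert!(!bytes.is_empty() && bytes.len() <= MAX_TOKEN_LEN);
    let mut buf = [0u8; 16];
    buf[..bytes.len()].copy_from_slice(bytes);
    (u128::from_le_bytes(buf), bytes.len() as u8)
}

/// Hypothetical longest-prefix matcher backed by a single flat HashMap.
struct LongestPrefixMatcher {
    table: HashMap<(u128, u8), u16>,
}

impl LongestPrefixMatcher {
    fn new() -> Self {
        Self { table: HashMap::new() }
    }

    fn insert(&mut self, token: &[u8], id: u16) {
        self.table.insert(pack_key(token), id);
    }

    /// Return (token id, matched length) for the longest token that is a
    /// prefix of `input`, probing from the longest candidate downwards.
    fn longest_match(&self, input: &[u8]) -> Option<(u16, usize)> {
        let max_len = input.len().min(MAX_TOKEN_LEN);
        for len in (1..=max_len).rev() {
            if let Some(&id) = self.table.get(&pack_key(&input[..len])) {
                return Some((id, len));
            }
        }
        None
    }
}
```

A single map keeps the lookup path uniform at the cost of up to 16 probes per position; a split into short and long buckets trades that uniformity for fewer probes on short tokens.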
Tests:
* 162 unit tests ported from the C++ GoogleTest suite (types,
  dictionary, store, bit_writer, lpm, trainer, parser, column
  round-trip across all 8 bit widths).
* 8 cross-impl tests in `tests/cross_impl.rs` against
  `vortex-onpair-sys`: structural parity, decompression equivalence,
  eq / starts_with / contains predicate equivalence on a shared decode
  loop, and dictionary invariants (covers all 256 bytes, lex-sorted).
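The dictionary invariants named above could be checked along these lines. The sketch makes two assumptions that the commit message does not spell out: `dict_offsets` holds one more entry than there are tokens, with token `i` spanning `dict_bytes[dict_offsets[i]..dict_offsets[i + 1]]`, and "covers all 256 bytes" means every byte value appears as a single-byte token. The function name is hypothetical.

```rust
/// Check that the dictionary is lexicographically sorted and contains a
/// single-byte token for every byte value. Hypothetical helper; the
/// offsets layout is an assumption, not confirmed by the source.
fn check_dictionary_invariants(dict_bytes: &[u8], dict_offsets: &[u32]) -> Result<(), String> {
    if dict_offsets.is_empty() {
        return Err("offsets must contain at least one entry".to_string());
    }
    let mut seen = [false; 256];
    let mut prev: Option<&[u8]> = None;
    for (i, w) in dict_offsets.windows(2).enumerate() {
        let (start, end) = (w[0] as usize, w[1] as usize);
        if start > end || end > dict_bytes.len() {
            return Err(format!("token {i} has invalid range {start}..{end}"));
        }
        let token = &dict_bytes[start..end];
        // Byte slices compare lexicographically.
        if let Some(p) = prev {
            if p > token {
                return Err(format!("token {i} is out of lexicographic order"));
            }
        }
        if token.len() == 1 {
            seen[token[0] as usize] = true;
        }
        prev = Some(token);
    }
    match seen.iter().position(|&s| !s) {
        Some(b) => Err(format!("missing single-byte token {b:#04x}")),
        None => Ok(()),
    }
}
```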
Known divergence from C++: bit-exact dictionary equality is not asserted
because the two implementations use different RNGs (`std::mt19937_64` vs
Rust's `StdRng`). Every observable downstream operation matches.
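To see why two different dictionaries can still agree on every downstream result, consider this small decode sketch. The function `decode_row` is hypothetical, takes already-unpacked codes, and assumes the same offsets layout as the tuple above (token `i` spans `dict_bytes[dict_offsets[i]..dict_offsets[i + 1]]`).

```rust
/// Decode one row by concatenating the dictionary tokens its codes refer
/// to. Hypothetical helper; operates on codes that are already unpacked.
fn decode_row(dict_bytes: &[u8], dict_offsets: &[u32], codes: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    for &code in codes {
        let start = dict_offsets[code as usize] as usize;
        let end = dict_offsets[code as usize + 1] as usize;
        out.extend_from_slice(&dict_bytes[start..end]);
    }
    out
}
```

For example, a dictionary with tokens `a`, `ab`, `b` and codes `[1, 0]` decodes to `aba`, and so does a dictionary with tokens `a`, `b`, `ba` and codes `[0, 2]`. The dictionaries differ, but the decoded bytes, and therefore any predicate evaluated on them, are the same.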
Signed-off-by: Claude <noreply@anthropic.com>
1 parent a1ba67f commit a51c8e9
16 files changed
Lines changed: 3660 additions & 0 deletions