You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Boncz / Gargiulo would ask: at the 12-bit parity that this RFC's
default settles on, is OnPair-at-12-bit faster than FSST12 (the
relevant comparison) — not the FSST8-vs-OnPair16 framing the doc
had?
The answer (prior, since no public head-to-head measurement exists):
no, OnPair-at-12-bit is probably modestly slower than FSST12 at
AVX2 parity, ~50–60% of FSST12's speed. At 12-bit parity two of
the three structural reasons OnPair16 is slower than FSST8
collapse (dict size and average symbol length both equalize); the
remaining reason — OnPair's heavier per-LPM-probe data structures
(robin_hood hash + bucketed long-pattern struct) vs FSST's lossy
perfect hash — persists by design (OnPair trades speed for ratio).
Compression throughput is parsing-dominated for typical Vortex chunks
(OnPair paper Table 5: <10% training), so the OnPair single-pass
training advantage helps only on very small inputs.
Changes:
- Add a "Note on the apples-to-apples comparison" paragraph in the
Compression section spelling out the FSST12-vs-OnPair-at-12-bit
estimate and reasoning.
- Update the Drawbacks "compression speed gap" bullet to mention the
12-bit parity narrows the gap.
- Update validation sub-benchmark 9 to explicitly call out the
FSST12-vs-OnPair-at-12-bit head-to-head as the relevant measurement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: proposed/0059-onpair-gpu-string-encoding.md
+4-2Lines changed: 4 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -313,6 +313,8 @@ All automata operate on the token-id stream, not the decoded byte stream. Each q
313
313
314
314
OnPair16's single-pass LPM training; the on-disk format inherits the upstream wire format directly. Compression speed (paper Table 3, single-core AVX2, no AVX-512) is **137–229 MiB/s** for OnPair16 across the five public workloads, vs **325–504 MiB/s** for FSST AVX2 — about ½× FSST AVX2. The paper's hardware doesn't have AVX-512, so the OnPair16 numbers are "no AVX-512 yet" rather than a fundamental ceiling; the FSST AVX-512 path reports ~2 GB/s (FSST paper §3.2). An AVX-512-vectorized OnPair16 should close most of the 2× gap to FSST at parity — the hot path is longest-prefix matching against a hash map plus bytewise copies, both of which have known SIMD treatments (the OnPair paper §3.2.2 explicitly notes AVX-512's 64-bytes-per-instruction copy capability). The engineering isn't done in the public reference impl yet.
315
315
316
+
**A note on the apples-to-apples comparison: FSST12 vs OnPair-at-12-bit.** The numbers cited above are FSST8 vs OnPair16. At the 12-bit default, both encodings have a 4K-token dictionary, which collapses two of the three structural reasons OnPair16 is slower than FSST8: (a) larger dictionary → more LPM probe cost (collapses at parity) and (b) longer average symbol length → more bytes touched per match (also largely collapses; OnPair at 4K tokens has avg symbol ~3 bytes per paper Table 1, comparable to FSST12's expected average). The remaining factor is (c) OnPair's heavier per-probe data structure (`robin_hood` hash for ≤8-byte patterns, bucketed 8-byte-prefix-keyed structure for longer patterns, per paper §3.4) vs FSST's lossy perfect hash on the 3-byte prefix (FSST paper §4.3) — a deliberate ratio-for-speed trade in OnPair's design. There is no published FSST12-vs-OnPair-at-12-bit head-to-head measurement in the public literature; the validation campaign sub-benchmark 9 measures it directly. Our prior is that OnPair-at-12-bit is **~50–60% of FSST12's compression speed at AVX2 parity**, a meaningfully smaller gap than the OnPair16-vs-FSST8 ratio but still measurable. With AVX-512 on both, the gap narrows further; with AVX-512 on neither (the public state today), the gap is largest. Compression throughput is parsing-dominated for typical-sized inputs (OnPair paper Table 5: training is <10% of total compression time on most workloads); the smaller training cost of OnPair's single pass vs FSST's 5-round evolutionary algorithm wins meaningfully only on very small inputs where training fixed cost matters.
317
+
316
318
In Tier-2 (shared-dict) mode, training is a one-time corpus-wide cost amortized across all chunks; per-chunk compression is then a parsing pass (which dominates OnPair16's total compression time per paper Table 5) and should run ~2–5× faster than per-chunk training.
317
319
318
320
### Why this is the right design
@@ -429,7 +431,7 @@ The format-stability commitments for the OnPair16 encoding:
429
431
430
432
-**Decoder dictionary footprint.**~76 KiB on-disk for 4K-token dicts; up to ~1 MiB for 64K-token dicts. Per-thread decoder scratch is similar. Tier 2 amortizes across all chunks sharing the dict.
431
433
-**CUDA-only GPU.** This RFC scopes GPU decode to NVIDIA. Targeting AMD (ROCm/HIP) or vendor-neutral compute (SYCL, Vulkan) is out of scope — see Future Possibilities.
432
-
-**Compression speed gap to FSST.** Reference OnPair16 is ~½× FSST's AVX2-only compression speed (137–229 MiB/s vs 325–504 MiB/s) and ~⅙× FSST's AVX-512 compression speed (~150–230 MiB/s vs ~1–3 GB/s). The structural reasons are a larger dictionary, longer average symbols, and real hash maps vs lossy perfect hashing. An AVX-512 OnPair16 path is plausible and would close most of the gap; that engineering work isn't done in the public reference impl.
434
+
-**Compression speed gap to FSST.** Reference OnPair16 is ~½× FSST8's AVX2-only compression speed (137–229 MiB/s vs 325–504 MiB/s) and ~⅙× FSST8's AVX-512 compression speed (~150–230 MiB/s vs ~1–3 GB/s). At 12-bit parity (this RFC's default vs FSST12), our prior is a smaller gap — OnPair-at-12-bit ≈ 50–60% of FSST12's speed at AVX2 parity (no public head-to-head exists yet; validation sub-benchmark 9 measures). The structural reasons after dict-size and symbol-length collapse at parity: OnPair's `robin_hood`-style hash + bucketed long-pattern structure is heavier per LPM probe than FSST's lossy perfect hash. An AVX-512 OnPair16 path is plausible and would close most of the residual gap; that engineering work isn't done in the public reference impl.
433
435
-**Tier-1 metadata is larger than FSST8's.** The OnPair16 dictionary (a few tens of KiB to 1 MiB) is larger than FSST8's ~2 KiB. Negligible at array-scale but worth noting for very-small arrays. Tier 2 makes this a one-time corpus-level cost.
434
436
-**Predicate pushdown precompute is larger than FSST8's.** OnPair16's compressed-domain automata operate on a larger dict and use specialized data structures (Aho–Corasick trie, KMP failure tables, prefix range lookups). Precompute is O(pattern_length + dict_size) per query; for short patterns and 4K-token dicts this is sub-millisecond. For point-lookup queries (decode one row), the precompute is unjustified — fall back to decode-then-filter.
435
437
-**More moving pieces than a single fixed-width encoding.**`code_width_bits` is a real configuration knob; `codes_per_split`, the coalesced-format flag, the Tier-1-vs-Tier-2 choice, the optional `splits` buffer — each must be picked by the writer or by encoder heuristics. The validation campaign recommends defaults.
@@ -475,7 +477,7 @@ Every design choice above is a prior, not a conclusion. Before any production co
475
477
6.**Per-string decode latency.** OnPair16 vs FSST for short strings (names, URLs, UUIDs). Verify SIMD ramp-up doesn't make short-string decode worse than FSST's tight scalar loop. The OnPair paper Table 3 random-access column already shows OnPair16 at 156–335 ns per random access vs FSST at 175–359 ns — comparable. Vortex-specific re-measurement on representative columns is the validation.
476
478
7.**Predicate pushdown end-to-end.** Vortex's existing FSST DFA vs OnPair16's compressed-domain automata (KMP, Aho–Corasick, prefix), on representative LIKE / equality / substring workloads (e.g., ClickBench Q19/Q20). The OnPair paper Section 4 reports per-pattern throughput; we measure the Vortex-integration overhead.
477
479
8.**Tier-2 ratio uplift.** Train one dict on each whole dataset and Arc-share across chunks; measure ratio improvement vs. per-array (Tier-1) training. Compare against the OnPair paper Table 3 ratios for sanity.
478
-
9.**Compression speed.** Per-array LPM training cost, including cold-cache effects. Also measure an AVX-512-vectorized hot path (a focused engineering investment, ~1–2 weeks) to confirm the predicted ~½× → ~1× FSST parity at AVX-512.
480
+
9.**Compression speed.** Per-array LPM training cost, including cold-cache effects. **The apples-to-apples comparison is FSST12 vs OnPair-at-12-bit**, both at the 4K-token dictionary capacity — there's no public head-to-head measurement of this, and the structural argument in §[Compression](#compression) predicts ~50–60% of FSST12 at AVX2 parity. Also measure an AVX-512-vectorized OnPair16 hot path (a focused engineering investment, ~1–2 weeks) to confirm the predicted gap-narrowing.
479
481
10.**Tier-2 dict materialization cost.** Confirm `OnceLock`-cached dict reuse pays for itself across realistic query patterns. Decide if `OnPair16Layout` needs a different caching policy than `DictLayout`'s default.
0 commit comments