RFC 0059: address FSST12-vs-OnPair12 compression-speed question

lwwmanning · claude · lwwmanning · commit 242f4ae9a1c6 · 2026-05-12T12:11:44.000-04:00
Boncz / Gargiulo would ask: at the 12-bit parity that this RFC's
default settles on, is OnPair-at-12-bit faster than FSST12 (the
relevant comparison) — not the FSST8-vs-OnPair16 framing the doc
had?

The answer (prior, since no public head-to-head measurement exists):
no, OnPair-at-12-bit is probably modestly slower than FSST12 at
AVX2 parity, ~50–60% of FSST12's speed. At 12-bit parity two of
the three structural reasons OnPair16 is slower than FSST8
collapse (dict size and average symbol length both equalize); the
remaining reason — OnPair's heavier per-LPM-probe data structures
(robin_hood hash + bucketed long-pattern struct) vs FSST's lossy
perfect hash — persists by design (OnPair trades speed for ratio).
Compression throughput is parsing-dominated for typical Vortex chunks
(OnPair paper Table 5: &lt;10% training), so the OnPair single-pass
training advantage helps only on very small inputs.

Changes:
- Add a "Note on the apples-to-apples comparison" paragraph in the
  Compression section spelling out the FSST12-vs-OnPair-at-12-bit
  estimate and reasoning.
- Update the Drawbacks "compression speed gap" bullet to mention the
  12-bit parity narrows the gap.
- Update validation sub-benchmark 9 to explicitly call out the
  FSST12-vs-OnPair-at-12-bit head-to-head as the relevant measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/proposed/0059-onpair-gpu-string-encoding.md b/proposed/0059-onpair-gpu-string-encoding.md
@@ -313,6 +313,8 @@ All automata operate on the token-id stream, not the decoded byte stream. Each q
 
 OnPair16's single-pass LPM training; the on-disk format inherits the upstream wire format directly. Compression speed (paper Table 3, single-core AVX2, no AVX-512) is **137–229 MiB/s** for OnPair16 across the five public workloads, vs **325–504 MiB/s** for FSST AVX2 — about ½× FSST AVX2. The paper's hardware doesn't have AVX-512, so the OnPair16 numbers are "no AVX-512 yet" rather than a fundamental ceiling; the FSST AVX-512 path reports ~2 GB/s (FSST paper §3.2). An AVX-512-vectorized OnPair16 should close most of the 2× gap to FSST at parity — the hot path is longest-prefix matching against a hash map plus bytewise copies, both of which have known SIMD treatments (the OnPair paper §3.2.2 explicitly notes AVX-512's 64-bytes-per-instruction copy capability). The engineering isn't done in the public reference impl yet.
 
+**A note on the apples-to-apples comparison: FSST12 vs OnPair-at-12-bit.** The numbers cited above are FSST8 vs OnPair16. At the 12-bit default, both encodings have a 4K-token dictionary, which collapses two of the three structural reasons OnPair16 is slower than FSST8: (a) larger dictionary → more LPM probe cost (collapses at parity) and (b) longer average symbol length → more bytes touched per match (also largely collapses; OnPair at 4K tokens has avg symbol ~3 bytes per paper Table 1, comparable to FSST12's expected average). The remaining factor is (c) OnPair's heavier per-probe data structure (`robin_hood` hash for ≤8-byte patterns, bucketed 8-byte-prefix-keyed structure for longer patterns, per paper §3.4) vs FSST's lossy perfect hash on the 3-byte prefix (FSST paper §4.3) — a deliberate ratio-for-speed trade in OnPair's design. There is no published FSST12-vs-OnPair-at-12-bit head-to-head measurement in the public literature; the validation campaign sub-benchmark 9 measures it directly. Our prior is that OnPair-at-12-bit is **~50–60% of FSST12's compression speed at AVX2 parity**, a meaningfully smaller gap than the OnPair16-vs-FSST8 ratio but still measurable. With AVX-512 on both, the gap narrows further; with AVX-512 on neither (the public state today), the gap is largest. Compression throughput is parsing-dominated for typical-sized inputs (OnPair paper Table 5: training is <10% of total compression time on most workloads); the smaller training cost of OnPair's single pass vs FSST's 5-round evolutionary algorithm wins meaningfully only on very small inputs where training fixed cost matters.
+
 In Tier-2 (shared-dict) mode, training is a one-time corpus-wide cost amortized across all chunks; per-chunk compression is then a parsing pass (which dominates OnPair16's total compression time per paper Table 5) and should run ~2–5× faster than per-chunk training.
 
 ### Why this is the right design
@@ -429,7 +431,7 @@ The format-stability commitments for the OnPair16 encoding:
 
 - **Decoder dictionary footprint.** ~76 KiB on-disk for 4K-token dicts; up to ~1 MiB for 64K-token dicts. Per-thread decoder scratch is similar. Tier 2 amortizes across all chunks sharing the dict.
 - **CUDA-only GPU.** This RFC scopes GPU decode to NVIDIA. Targeting AMD (ROCm/HIP) or vendor-neutral compute (SYCL, Vulkan) is out of scope — see Future Possibilities.
-- **Compression speed gap to FSST.** Reference OnPair16 is ~½× FSST's AVX2-only compression speed (137–229 MiB/s vs 325–504 MiB/s) and ~⅙× FSST's AVX-512 compression speed (~150–230 MiB/s vs ~1–3 GB/s). The structural reasons are a larger dictionary, longer average symbols, and real hash maps vs lossy perfect hashing. An AVX-512 OnPair16 path is plausible and would close most of the gap; that engineering work isn't done in the public reference impl.
+- **Compression speed gap to FSST.** Reference OnPair16 is ~½× FSST8's AVX2-only compression speed (137–229 MiB/s vs 325–504 MiB/s) and ~⅙× FSST8's AVX-512 compression speed (~150–230 MiB/s vs ~1–3 GB/s). At 12-bit parity (this RFC's default vs FSST12), our prior is a smaller gap — OnPair-at-12-bit ≈ 50–60% of FSST12's speed at AVX2 parity (no public head-to-head exists yet; validation sub-benchmark 9 measures). The structural reasons after dict-size and symbol-length collapse at parity: OnPair's `robin_hood`-style hash + bucketed long-pattern structure is heavier per LPM probe than FSST's lossy perfect hash. An AVX-512 OnPair16 path is plausible and would close most of the residual gap; that engineering work isn't done in the public reference impl.
 - **Tier-1 metadata is larger than FSST8's.** The OnPair16 dictionary (a few tens of KiB to 1 MiB) is larger than FSST8's ~2 KiB. Negligible at array-scale but worth noting for very-small arrays. Tier 2 makes this a one-time corpus-level cost.
 - **Predicate pushdown precompute is larger than FSST8's.** OnPair16's compressed-domain automata operate on a larger dict and use specialized data structures (Aho–Corasick trie, KMP failure tables, prefix range lookups). Precompute is O(pattern_length + dict_size) per query; for short patterns and 4K-token dicts this is sub-millisecond. For point-lookup queries (decode one row), the precompute is unjustified — fall back to decode-then-filter.
 - **More moving pieces than a single fixed-width encoding.** `code_width_bits` is a real configuration knob; `codes_per_split`, the coalesced-format flag, the Tier-1-vs-Tier-2 choice, the optional `splits` buffer — each must be picked by the writer or by encoder heuristics. The validation campaign recommends defaults.
@@ -475,7 +477,7 @@ Every design choice above is a prior, not a conclusion. Before any production co
 6. **Per-string decode latency.** OnPair16 vs FSST for short strings (names, URLs, UUIDs). Verify SIMD ramp-up doesn't make short-string decode worse than FSST's tight scalar loop. The OnPair paper Table 3 random-access column already shows OnPair16 at 156–335 ns per random access vs FSST at 175–359 ns — comparable. Vortex-specific re-measurement on representative columns is the validation.
 7. **Predicate pushdown end-to-end.** Vortex's existing FSST DFA vs OnPair16's compressed-domain automata (KMP, Aho–Corasick, prefix), on representative LIKE / equality / substring workloads (e.g., ClickBench Q19/Q20). The OnPair paper Section 4 reports per-pattern throughput; we measure the Vortex-integration overhead.
 8. **Tier-2 ratio uplift.** Train one dict on each whole dataset and Arc-share across chunks; measure ratio improvement vs. per-array (Tier-1) training. Compare against the OnPair paper Table 3 ratios for sanity.
-9. **Compression speed.** Per-array LPM training cost, including cold-cache effects. Also measure an AVX-512-vectorized hot path (a focused engineering investment, ~1–2 weeks) to confirm the predicted ~½× → ~1× FSST parity at AVX-512.
+9. **Compression speed.** Per-array LPM training cost, including cold-cache effects. **The apples-to-apples comparison is FSST12 vs OnPair-at-12-bit**, both at the 4K-token dictionary capacity — there's no public head-to-head measurement of this, and the structural argument in §[Compression](#compression) predicts ~50–60% of FSST12 at AVX2 parity. Also measure an AVX-512-vectorized OnPair16 hot path (a focused engineering investment, ~1–2 weeks) to confirm the predicted gap-narrowing.
 10. **Tier-2 dict materialization cost.** Confirm `OnceLock`-cached dict reuse pays for itself across realistic query patterns. Decide if `OnPair16Layout` needs a different caching policy than `DictLayout`'s default.
 
 ## Future Possibilities