Commit 2ed9284
RFC 0059: review-pass v3 — major revision after expert-grade review
After running the rfc-review skill end-to-end and cross-checking every
claim against the OnPair paper (arXiv:2508.02280v1), the GSST thesis
(Vonk 2024), the FSST paper (Boncz/Neumann/Leis 2020), the cwida/fsst
source, and the onpair_cpp source, the following substantive changes:
Naming and attribution:
- Rename throughout from "OnPair" to "OnPair16" — the encoding caps
symbols at 16 bytes, which is OnPair16's defining property per paper
§3.4.2, not generic OnPair. Array/Layout types become OnPair16Array /
OnPair16Layout.
- Drop the novel "merge-pair list" on-disk dict representation. Use
OnPair16's actual format (lexicographically-ordered flat bytes +
Arrow-style u32 offsets, per paper §3.5 and onpair_cpp's
Dictionary struct). This preserves the lex-order property that
enables prefix-range queries, matches upstream, and eliminates a
decode-time reconstruction pass that didn't exist in published
OnPair16.
- Predicate pushdown section rewritten to cite OnPair's published
compressed-domain automata (KMP / Aho-Corasick / prefix / equality /
boolean composition) at onpair_cpp/include/onpair/search/automata/
rather than describing the technique as a generalization of Vortex's
FSST DFA. Vortex integration is "adopt the upstream automata as
compute kernels", not "reinvent for OnPair".
Quantitative corrections:
- Fix decompression-throughput numbers: 6.5–7.8 GB/s for OnPair16
(Table 3), 5.4 GB/s for OnPair-unbounded on Book Titles (Table 1).
- Fix compression-throughput framing: OnPair16 is ~½× FSST AVX2 at
parity (137–229 vs 325–504 MiB/s); the apparent ~10× gap to FSST's
AVX-512 1–3 GB/s is an apples-to-oranges artifact (OnPair paper
hardware doesn't have AVX-512). An AVX-512-vectorized OnPair16
encoder is plausible engineering and would close most of the gap.
- Drop the "2× more compact" merge-pair-list claim (was actually ~5×
on disk, and dropped now that we use OnPair's wire format).
- Soften the predicted 250–300 GB/s H100 throughput — flag offsetting
factors (12-bit bit-unpacking, lower occupancy at 68 KiB SMEM).
- Mark the per-warp-iter 25–40 ns estimate as a prior to be measured.
Citation hygiene:
- Pin every code URL to a commit SHA. No more main/develop/master refs.
cwida/fsst@e638d4cf, onpair_cpp@ae590713, onpair_rs@ac663abe,
fsst-gpu@a0b639c3, vortex@6f54d3d6.
- Pin arXiv to v1 throughout.
- Add ADMS 2025 paper URL alongside the timanema/fsst-gpu source link.
- Add a TU Delft repository pointer for the Vonk thesis.
- Add an explicit caveat that onpair_cpp is pre-1.0 with no tagged
releases ("pin to a specific commit hash if you embed the library"
per the upstream README).
Substantive content additions:
- Engage with FSST paper §4.1 (the "dependency issue") in the Training
section. Boncz et al. tried single-pass suffix-array and DP
approaches and found them worse than the 5-round evolutionary
algorithm. Explain why OnPair's pair-merging sidesteps this via a
different mechanism, and flag the empirical question for validation.
- Engage with OnPair paper §3.6's "16 bits is the recommended default"
finding. The RFC's 12-bit default is driven specifically by GPU
SMEM; CPU-only Stage 1 ships 16-bit (matching the paper).
- Engage with Vonk thesis §4.2.2 / §3.3's one-block-per-SM kernel
recommendation. Make GSST's mapping the reference variant;
persistent threads is a measured alternative.
- Add a 12-column comparison table (FSST8 / FSST12 / OnPair16 / this
RFC) in Motivation.
- Add a "Current Vortex state" subsection grounding the proposal in
the existing FSSTArray code.
- Add a "Staging" subsection breaking the RFC into three independently-
mergeable stages: (1) OnPair16Array at 16-bit CPU-only, (2)
parametric width + splits + Mode A GPU, (3) Tier 2 + Mode B +
coalesced format.
- Add an "Edge cases" subsection covering empty strings, nulls, zero
rows, very small arrays, low-entropy data, Tier-2 dict mismatches,
and single very long strings.
- Add ASCII diagrams of the on-disk format and the GPU split-parallel
decode flow.
- State the assumed Vortex chunk size (~500K codes) explicitly in the
GPU split-size analysis, with sensitivity notes.
Smaller cleanups:
- Drop the moot "bitpacked alignment at non-byte-aligned widths"
unresolved question — at codes_per_split=1024 every supported width
produces a byte-aligned split start.
- Disambiguate "block" usage (GSST compressed block vs CUDA thread
block).
- Note the "dan"/"than" typo in the cwida/fsst fsst12.h source comment
when quoting it verbatim.
- Resolve the Tier-2 dict-layout-type question: use BinaryView so the
upstream automata can be implemented as Vortex compute kernels.
Net: 308 insertions, 191 deletions. The RFC's core design is unchanged;
the attribution, citations, and engagement with the cited works are
substantially improved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 2dad93b commit 2ed9284
1 file changed
Lines changed: 308 additions & 191 deletions
0 commit comments