Skip to content

Commit 2ed9284

Browse files
lwwmanningclaude
andcommitted
RFC 0059: review-pass v3 — major revision after expert-grade review
After running the rfc-review skill end-to-end and cross-checking every claim against the OnPair paper (arXiv:2508.02280v1), the GSST thesis (Vonk 2024), the FSST paper (Boncz/Neumann/Leis 2020), the cwida/fsst source, and the onpair_cpp source, the following substantive changes: Naming and attribution: - Rename throughout from "OnPair" to "OnPair16" — the encoding caps symbols at 16 bytes, which is OnPair16's defining property per paper §3.4.2, not generic OnPair. Array/Layout types become OnPair16Array / OnPair16Layout. - Drop the novel "merge-pair list" on-disk dict representation. Use OnPair16's actual format (lexicographically-ordered flat bytes + Arrow-style u32 offsets, per paper §3.5 and onpair_cpp's Dictionary struct). This preserves the lex-order property that enables prefix-range queries, matches upstream, and eliminates a decode-time reconstruction pass that didn't exist in published OnPair16. - Predicate pushdown section rewritten to cite OnPair's published compressed-domain automata (KMP / Aho-Corasick / prefix / equality / boolean composition) at onpair_cpp/include/onpair/search/automata/ rather than describing the technique as a generalization of Vortex's FSST DFA. Vortex integration is "adopt the upstream automata as compute kernels", not "reinvent for OnPair". Quantitative corrections: - Fix decompression-throughput numbers: 6.5–7.8 GB/s for OnPair16 (Table 3), 5.4 GB/s for OnPair-unbounded on Book Titles (Table 1). - Fix compression-throughput framing: OnPair16 is ~½× FSST AVX2 at parity (137–229 vs 325–504 MiB/s); the apparent ~10× gap to FSST's AVX-512 1–3 GB/s is an apples-to-oranges artifact (OnPair paper hardware doesn't have AVX-512). An AVX-512-vectorized OnPair16 encoder is plausible engineering and would close most of the gap. - Drop the "2× more compact" merge-pair-list claim (was actually ~5× on disk, and dropped now that we use OnPair's wire format). - Soften the predicted 250–300 GB/s H100 throughput — flag offsetting factors (12-bit bit-unpacking, lower occupancy at 68 KiB SMEM). - Mark the per-warp-iter 25–40 ns estimate as a prior to be measured. Citation hygiene: - Pin every code URL to a commit SHA. No more main/develop/master refs. cwida/fsst@e638d4cf, onpair_cpp@ae590713, onpair_rs@ac663abe, fsst-gpu@a0b639c3, vortex@6f54d3d6. - Pin arXiv to v1 throughout. - Add ADMS 2025 paper URL alongside the timanema/fsst-gpu source link. - Add a TU Delft repository pointer for the Vonk thesis. - Add an explicit caveat that onpair_cpp is pre-1.0 with no tagged releases ("pin to a specific commit hash if you embed the library" per the upstream README). Substantive content additions: - Engage with FSST paper §4.1 (the "dependency issue") in the Training section. Boncz et al. tried single-pass suffix-array and DP approaches and found them worse than the 5-round evolutionary algorithm. Explain why OnPair's pair-merging sidesteps this via a different mechanism, and flag the empirical question for validation. - Engage with OnPair paper §3.6's "16 bits is the recommended default" finding. The RFC's 12-bit default is driven specifically by GPU SMEM; CPU-only Stage 1 ships 16-bit (matching the paper). - Engage with Vonk thesis §4.2.2 / §3.3's one-block-per-SM kernel recommendation. Make GSST's mapping the reference variant; persistent threads is a measured alternative. - Add a 12-column comparison table (FSST8 / FSST12 / OnPair16 / this RFC) in Motivation. - Add a "Current Vortex state" subsection grounding the proposal in the existing FSSTArray code. - Add a "Staging" subsection breaking the RFC into three independently- mergeable stages: (1) OnPair16Array at 16-bit CPU-only, (2) parametric width + splits + Mode A GPU, (3) Tier 2 + Mode B + coalesced format. - Add an "Edge cases" subsection covering empty strings, nulls, zero rows, very small arrays, low-entropy data, Tier-2 dict mismatches, and single very long strings. - Add ASCII diagrams of the on-disk format and the GPU split-parallel decode flow. - State the assumed Vortex chunk size (~500K codes) explicitly in the GPU split-size analysis, with sensitivity notes. Smaller cleanups: - Drop the moot "bitpacked alignment at non-byte-aligned widths" unresolved question — at codes_per_split=1024 every supported width produces a byte-aligned split start. - Disambiguate "block" usage (GSST compressed block vs CUDA thread block). - Note the "dan"/"than" typo in the cwida/fsst fsst12.h source comment when quoting it verbatim. - Resolve the Tier-2 dict-layout-type question: use BinaryView so the upstream automata can be implemented as Vortex compute kernels. Net: 308 insertions, 191 deletions. The RFC's core design is unchanged; the attribution, citations, and engagement with the cited works are substantially improved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2dad93b commit 2ed9284

1 file changed

Lines changed: 308 additions & 191 deletions

File tree

0 commit comments

Comments
 (0)