Skip to content

Commit 1b141bf

Browse files
lwwmanningclaude
andcommitted
RFC 0059: apply v4 review nits before sharing
- Performance budget: pin compiler floor (LLVM ≥ 14 / GCC ≥ 11) and add assembly-level auto-vectorization verification command. - Memory layout: fix undefined `FlatTable` reference; use `Arc<[u8]>` matching the Public API section. - Encoder algorithm: clarify training-termination phrasing notes that the 256 single-byte tokens are part of the `2^code_width_bits` count. - GPU decoder: add `unpack_12bit` CUDA pseudocode body handling byte-boundary spans. - Encoder algorithm: name the small-array fallback constants (MIN_STRINGS_FOR_ONPAIR16, MIN_CODES_FOR_ONPAIR16) and cross-reference from Edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6c3bda8 commit 1b141bf

1 file changed

Lines changed: 25 additions & 6 deletions

File tree

proposed/0059-onpair-gpu-string-encoding.md

Lines changed: 25 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -311,7 +311,7 @@ A reasonable reader will ask: why a new array? Why not just extend `FSSTArray`,
311311
- **Empty strings.** `code_offsets[i] == code_offsets[i+1]`. Decoder writes zero bytes. Trivially handled.
312312
- **Null strings.** Handled by Vortex's standard validity bitmap at the Array level; the OnPair16 encoding never sees a null. The encoding's per-string offsets index into the non-null subset.
313313
- **Zero rows.** `n_strings == 0`, `n_codes == 0`. The dictionary may still be present (Tier 2) or absent (Tier 1 — encoder may emit an empty dict trivially). Decoder is a no-op.
314-
- **Very small arrays** (e.g. ≤16 strings, ≤64 codes). Dictionary materialization dominates decode time. The recommended fallback: for arrays below a writer-configurable threshold (default ~256 strings or ~1 KiB compressed), the encoder falls back to FSST or to plain (unencoded) bytes — OnPair16's ratio advantage is lost on tiny arrays and the materialization overhead is wasted. This is an encoder heuristic, not a format requirement.
314+
- **Very small arrays** (e.g. ≤16 strings, ≤64 codes). Dictionary materialization dominates decode time. The recommended fallback: for arrays below the `MIN_STRINGS_FOR_ONPAIR16` / `MIN_CODES_FOR_ONPAIR16` thresholds defined in §[Encoder algorithm](#encoder-algorithm), the encoder falls back to `FSSTArray` or plain bytes — OnPair16's ratio advantage is lost on tiny arrays and the materialization overhead is wasted. This is an encoder heuristic, not a format requirement; the wire format is identical regardless of which encoding was chosen.
315315
- **Extreme low entropy** (e.g., every string is the same value). OnPair16's training produces a single learned token for the repeated value; compression ratio is excellent. No special-case logic needed.
316316
- **Mismatch between Tier-2 dict and a chunk's data.** OnPair16 always covers all 256 bytes (codes 0–255 are reserved single-byte tokens), so any byte sequence is representable. A poorly-matched dict produces a longer code stream; compression ratio degrades but the format remains well-defined.
317317
- **Single very long string** (e.g., a 10 MB blob in one row). `code_offsets[1] - code_offsets[0]` is the whole code stream; decode is one big sequential operation. The split-parallel GPU decoder treats the codes the same way regardless of how they're partitioned into strings.
@@ -706,7 +706,8 @@ The training algorithm follows the OnPair paper §3.2 unchanged. Hyperparameter
706706

707707
- **Pair-frequency threshold:** `max(2, ⌊log₂(S)⌋)` where `S` is sample size in MiB, per OnPair paper §3.2.1 ("The threshold is set as a slowly growing function of dataset size"). Override via `TrainingConfig::pair_frequency_threshold`.
708708
- **Sample size:** 5% of input for inputs >10 MiB; full input for smaller. Override via `TrainingConfig::sample_fraction`. The OnPair paper notes the sample is "randomly selected … shuffled" — implementers should use a deterministic seeded shuffle (`TrainingConfig::seed`) for reproducibility.
709-
- **Training termination:** dictionary fills to `2^code_width_bits` tokens, OR no pair in the sample exceeds the frequency threshold, OR sample exhausted. Whichever comes first.
709+
- **Training termination:** dictionary reaches `2^code_width_bits` total tokens (the first 256 of which are always the reserved single-byte tokens; learnable merge slots are therefore `2^code_width_bits − 256`), OR no pair in the sample exceeds the frequency threshold, OR sample exhausted. Whichever comes first.
710+
- **Small-array fallback threshold:** `MIN_STRINGS_FOR_ONPAIR16 = 256` (constant). For arrays with fewer than `MIN_STRINGS_FOR_ONPAIR16` strings OR `MIN_CODES_FOR_ONPAIR16 = 1024` codes, the encoder skips OnPair16 training and falls back to `FSSTArray` or plain bytes — dictionary materialization dominates decode on tiny arrays and the ratio advantage is lost. This is an encoder-side heuristic; the wire format is unchanged. (See §[Edge cases](#edge-cases).)
710711
- **Token sorting:** after training completes, the dictionary is sorted lexicographically by byte sequence. Token IDs are reassigned to match sorted position. This is the OnPair paper §3.5 invariant and is load-bearing for the prefix automaton.
711712

712713
The encoder's parsing phase (paper §3.3) runs longest-prefix-matching against the final sorted dictionary for every input string. Implementation note: the LPM data structures (paper §3.4) — short-pattern hash + long-pattern bucket structure — are the hot path; reuse from `onpair_cpp` if the dependency decision lands that way (see Drawbacks).
@@ -775,12 +776,30 @@ for (uint32_t k = code_start; k < code_end; k++) {
775776
}
776777
```
777778

778-
The `unpack_12bit` function implements the two-codes-per-three-bytes layout from the Wire format section.
779+
The `unpack_12bit` function implements the two-codes-per-three-bytes layout from the Wire format section. Per-thread CUDA body (handles byte-boundary spans; matches the CPU layout at lines 522–527):
780+
781+
```cuda
782+
__device__ __forceinline__ uint32_t unpack_12bit(const uint8_t* codes, uint32_t k) {
783+
uint32_t byte_off = (k / 2) * 3;
784+
uint32_t b0 = codes[byte_off];
785+
uint32_t b1 = codes[byte_off + 1];
786+
if ((k & 1) == 0) {
787+
// Even code: low 8 bits in b0, high 4 bits in low nibble of b1
788+
return b0 | ((b1 & 0x0F) << 8);
789+
} else {
790+
// Odd code: low 4 bits in high nibble of b1, high 8 bits in b2
791+
uint32_t b2 = codes[byte_off + 2];
792+
return (b1 >> 4) | (b2 << 4);
793+
}
794+
}
795+
```
796+
797+
Warp-cooperative variants that amortize the byte-stream load across 32 codes (8 × 12 bits = 96 bits = 12 bytes per 8 codes per lane) are an implementation optimization, not a format requirement.
779798

780799
### Memory layout and allocation
781800

782801
- **Decoder output buffer:** caller-allocated. Caller must allocate at least `uncompressed_bytes + DECOMPRESS_BUFFER_PADDING` (= `uncompressed_bytes + 16`) bytes so the unconditional 16-byte over-copy on the final token is in-bounds.
783-
- **CPU decode-time symbol-table cache:** allocated by `OnPair16Array` lazily on first decode call into an `Arc<[u8]>` scratch buffer of size `16 * n_tokens + n_tokens` (symbol bytes + lengths). Cached for the lifetime of the Array. For Tier 2 (`OnPair16Layout`), the cache lives in the layout reader's `OnceLock<Arc<FlatTable>>` and is shared across all chunks referencing the same dict.
802+
- **CPU decode-time symbol-table cache:** allocated by `OnPair16Array` lazily on first decode call into an `Arc<[u8]>` scratch buffer of size `16 * n_tokens + n_tokens` (symbol bytes + lengths). Cached for the lifetime of the Array. For Tier 2 (`OnPair16Layout`), the cache lives in the layout reader's `OnceLock<Arc<[u8]>>` (same scratch layout) and is shared across all chunks referencing the same dict.
784803
- **GPU SMEM:** sized as above; per-block, kernel-local.
785804

786805
### Registry and dispatch
@@ -814,8 +833,8 @@ These are the targets an implementer can measure during development to confirm t
814833

815834
| Budget | Target | Verifiable with |
816835
|---|---|---|
817-
| CPU decode throughput at 12-bit | ≥4 GB/s on Zen 4 / Sapphire Rapids, scalar (compiler auto-vectorized) | criterion benchmark with the OnPair-paper corpora |
818-
| CPU decode throughput at 16-bit | ≥5 GB/s on Zen 4 / Sapphire Rapids, scalar | criterion |
836+
| CPU decode throughput at 12-bit | ≥4 GB/s on Zen 4 / Sapphire Rapids, scalar (compiler auto-vectorized: LLVM ≥ 14 or GCC ≥ 11) | criterion benchmark with the OnPair-paper corpora; verify the inner loop emits a 16-byte vector load+store (e.g., `movdqu` / `vmovdqu`) via `cargo asm` or `objdump -d` |
837+
| CPU decode throughput at 16-bit | ≥5 GB/s on Zen 4 / Sapphire Rapids, scalar (same compiler floors as above) | criterion; same assembly-verification step |
819838
| CPU decode L1 hit rate | ≥95% for the symbol table at 12-bit; ≥90% at 16-bit | `perf stat -e L1-dcache-load-misses,L1-dcache-loads` |
820839
| GPU decode throughput at 12-bit (H100) | ≥200 GB/s | nsight compute, runtime divided by uncompressed bytes |
821840
| GPU kernel launch overhead | ≤50 µs per chunk | nsight systems |

0 commit comments

Comments
 (0)