fsst opt DO NOT MERGE by joseph-isaacs · Pull Request #6996 · vortex-data/vortex

joseph-isaacs · 2026-03-17T14:14:45Z

Inline the fsst-rs crate (v0.5.6) directly into vortex-fsst to enable
direct modification of the symbol table training algorithm. Key changes:

- Inline fsst-rs source into encodings/fsst/src/fsst_rs/ module
- Remove external fsst-rs dependency
- Re-export Compressor, CompressorBuilder, Decompressor, Symbol from crate root
- Add benchmark comparing FSST vs zstd/snappy on static datasets

Symbol table training optimizations:

- Increase training generations from 5 to 7 with more granular early rounds
- Double the sample target (16KB -> 32KB) for better pattern coverage
- Improve single-byte symbol gain boost (8x -> 12x) to reduce escape overhead
- Add length bonus to merged symbol scoring to favor longer symbols
- Account for escape overhead in gain heuristic

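Results on static datasets (10K strings each). Before and After are compression ratios; the last column is the change in compressed size, so negative is better:
Dataset | Before | After | Compressed size change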
random_binary | 1.36x | 1.36x | -0.3%
urls | 2.50x | 2.91x | -14.0%
log_lines | 2.78x | 3.06x | -9.0%
json | 3.99x | 4.40x | -9.3%
emails | 3.54x | 4.06x | -12.8%
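
The percentages in the last column are negative even though the ratios went up. They appear to be the relative change in compressed size rather than in ratio. A minimal sketch of that arithmetic, assuming a ratio is original size divided by compressed size (the function name is hypothetical and not part of the PR):

```rust
/// Relative change in compressed size, in percent, when the compression
/// ratio moves from `before` to `after`.
///
/// Assumption: ratio = original_size / compressed_size, so the compressed
/// size is proportional to 1 / ratio.
fn size_change_percent(before: f64, after: f64) -> f64 {
    // new_size / old_size = (1 / after) / (1 / before) = before / after
    (before / after - 1.0) * 100.0
}
```

For the urls row, `size_change_percent(2.50, 2.91)` comes out close to -14%. Small differences from the table are expected because the printed ratios are rounded.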

Compression throughput also improved 9-16% due to fewer escape codes.
Training time increased ~50% but remains under 2ms.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G


Key changes to the symbol table training algorithm:

- Progressive single-byte boost: 8x in early rounds (cover byte space to prevent escapes) tapering to 3x in final round (let longer symbols win)
- Superlinear length bonus for merged symbols: len^2/4 bonus makes 8-byte symbols much more competitive vs many short symbols
- Length bonus for existing multi-byte symbols: same superlinear scoring ensures the best long symbols are retained across generations
- Lower candidate threshold for long symbols (2x vs 5x minimum count)
- Allow merges in final training round (previously skipped)
- More training generations (7 vs 5) with doubled sample sizes

Results on 8 diverse datasets (10K items each):

- JSON beats snappy: 4.55x vs 4.50x
- Emails beat snappy: 4.02x vs 3.20x
- File paths beat snappy: 4.97x vs 4.21x
- URLs improved from ~2.5x to 3.13x
- Log lines improved from ~2.6x to 3.29x

Also adds diagnostic methods (count_escapes, length_histogram) and expands benchmark with structured binary, file paths, and CSV datasets.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G
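
The progressive boost and the len^2/4 bonus described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the commit states only the endpoints of the taper (8x and 3x) and the shape of the bonus. The linear taper, the `count * len` base gain, the way the bonus is added, and all function names are hypothetical.

```rust
/// Single-byte boost that starts at 8x in the first generation and ends at
/// 3x in the last one. The linear interpolation is an assumption; only the
/// two endpoints come from the commit message.
fn single_byte_boost(generation: usize, total_generations: usize) -> f64 {
    if total_generations <= 1 {
        return 3.0; // a single round is also the final round
    }
    let t = generation as f64 / (total_generations - 1) as f64;
    8.0 - 5.0 * t
}

/// Superlinear length bonus of len^2 / 4, as named in the commit message.
fn length_bonus(len: usize) -> f64 {
    (len * len) as f64 / 4.0
}

/// Hypothetical gain for a candidate symbol. The base gain `count * len`
/// and the way boost and bonus are combined are assumptions made for
/// illustration only.
fn candidate_gain(count: u64, len: usize, generation: usize, total_generations: usize) -> f64 {
    let base = count as f64 * len as f64;
    if len == 1 {
        base * single_byte_boost(generation, total_generations)
    } else {
        base + count as f64 * length_bonus(len)
    }
}
```

Under these assumptions an 8-byte symbol gets a bonus of 16 per occurrence and a 2-byte symbol gets 1, which shows why long symbols become much more competitive.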

…ilization pass

Key changes:

1. **Stop first-byte recording after early generations** (biggest win): compress_count previously always recorded multi-byte symbols' first bytes as separate single-byte counts. This biased the symbol table toward short symbols. Now only done in early rounds (sample_frac<=28), letting later rounds build much longer symbols.
2. **Stabilization pass (generation 129)**: Final training round uses full sample but skips merges, allowing the table to settle with the best existing symbols from the previous round.
3. **9 training generations** (was 7): [4, 12, 28, 48, 68, 98, 128, 128, 129] gives more refinement with two full-sample merge passes before the stabilization pass.
4. **Tolerant PHT rebuild**: rebuild_from no longer asserts on PHT collision, improving robustness.
5. **13 benchmark datasets** (was 8): Added SQL queries, XML fragments, repeated binary, config key-value, timestamped logs from test_utils generators.

Results on 13 datasets (10K items each):

- URLs: 5.88x (was ~2.5x baseline, snappy=7.79x), a 2.4x improvement!
- Log lines: 4.40x (was ~2.6x, snappy=5.00x)
- JSON: 4.96x beats snappy 4.50x
- SQL queries: 4.33x beats snappy 3.15x
- XML: 4.55x beats snappy 3.09x
- File paths: 5.59x beats snappy 4.21x
- Timestamp logs: 3.73x beats snappy 3.02x
- FSST now beats snappy on 12 of 13 datasets

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G
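
The per-generation behaviour described in items 1 to 3 can be summarised as a small table of flags. The sketch below is illustrative only: the schedule values and the `sample_frac<=28` cutoff come from the commit message, while the struct, its field names, and the functions are hypothetical.

```rust
/// Training schedule from the commit message. 129 marks the stabilization
/// pass; it uses the full sample like 128 but skips merges.
const SCHEDULE: [u32; 9] = [4, 12, 28, 48, 68, 98, 128, 128, 129];

/// Hypothetical description of what one training generation does.
#[derive(Debug, PartialEq)]
struct GenerationPlan {
    sample_frac: u32,
    /// Count the first byte of multi-byte symbols as a single-byte symbol.
    record_first_bytes: bool,
    /// Consider merged (concatenated) symbols as candidates.
    allow_merges: bool,
}

fn plan(sample_frac: u32) -> GenerationPlan {
    GenerationPlan {
        sample_frac,
        // Only early rounds record first bytes (sample_frac <= 28).
        record_first_bytes: sample_frac <= 28,
        // The stabilization pass (129) keeps the existing symbols only.
        allow_merges: sample_frac != 129,
    }
}

fn plans() -> Vec<GenerationPlan> {
    SCHEDULE.iter().map(|&frac| plan(frac)).collect()
}
```

With this schedule, the first three generations record first bytes, the two `128` entries are the full-sample merge passes, and the last entry is the merge-free stabilization pass.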

Add 12 new dataset generators to test_utils.rs for comprehensive FSST benchmarking across different data domains: Network/protocol: HTTP headers, IPv4 CIDR rules, Prometheus metrics, DNS wire format binary. Analytics/data: Parquet footer binary, Spark query plans, Arrow IPC binary, JSONL event streams. Text/document: Markdown fragments, stack traces, CSS rules, shell commands. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

Add new dataset generators to test_utils.rs including network-oriented (HTTP headers, IPv4 CIDR rules, Prometheus metrics, DNS wire binary), analytics (Spark plan strings, Arrow IPC binary, JSON lines), and text-heavy (markdown fragments, stack traces, CSS rules, shell commands) datasets for comprehensive FSST vs baseline benchmarking. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

Add 12 new datasets to the FSST vs baselines benchmark: HTTP headers, IPv4 CIDR rules, Prometheus metrics, DNS wire binary, Parquet footer binary, Spark plan strings, Arrow IPC binary, JSON lines, markdown fragments, stack traces, CSS rules, and shell commands. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

Add FSST, zstd, and snappy decompression benchmarks for URLs, log lines, JSON, and random binary datasets to enable throughput comparison across all three codecs for both compress and decompress paths. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

Replace template-based generators that drew from ~10-element arrays with procedural generators using random words, random hex strings, and varied structures. This dramatically increases inter-row diversity to better represent real-world data. Also increase N from 10k to 50k rows.

Key changes:

- URLs: random domains/paths/params instead of 20 fixed domains
- Log lines: random IPs, random path segments, varied user agents
- JSON: random field names/values, varied schemas per row
- SQL: random table/column names, JOINs, more query types
- Stack traces: random package/class/method names
- All other generators similarly diversified

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

… more generations

Two key changes to the FSST symbol table training algorithm:

1. Stratified sampling with per-generation resampling: Instead of creating a single random sample that is reused across all generations, each generation now draws a fresh sample using a different seed. The sampling strategy uses evenly-spaced string selection (stratified) rather than random hash jumps, ensuring better coverage of diverse datasets.
2. More training generations with larger final samples: Increased from 9 to 11 generations (adding 2 extra full-sample passes before stabilization). Final generations use a 4x larger sample (128KB vs 32KB) for better frequency estimation without the cost of processing the entire corpus.

Compression ratio improvements across 25 benchmark datasets vs baseline:

- stack_traces: 1.52 → 1.61 (+5.9%)
- log_lines: 2.38 → 2.48 (+4.2%)
- parquet_ftr: 1.01 → 1.05 (+4.0%)
- repeat_binary: 4.23 → 4.42 (+4.5%)
- spark_plans: 1.93 → 1.97 (+2.1%)
- markdown: 1.40 → 1.43 (+2.1%)
- http_headers: 1.98 → 2.04 (+3.0%)
- All 25 datasets improved (no regressions)

Training speed: ~2.5-3.5x slower (5-11ms vs 2-3ms), negligible in practice since training is a one-time cost per dataset.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G
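
Evenly-spaced selection with a per-generation seed could look roughly like the sketch below. It is a simplification under stated assumptions: it picks a fixed number of strings rather than filling a byte budget such as 32KB or 128KB, and the function name, parameters, and use of the seed as a plain offset are hypothetical rather than taken from the PR.

```rust
/// Pick `n_samples` string indices out of `n_strings` at an even stride.
/// The seed shifts the starting offset, so each generation can draw a
/// different but equally well-spread sample.
fn stratified_indices(n_strings: usize, n_samples: usize, seed: u64) -> Vec<usize> {
    if n_strings == 0 || n_samples == 0 {
        return Vec::new();
    }
    // Never ask for more samples than there are strings.
    let n_samples = n_samples.min(n_strings);
    // The stride is at least 1 because n_samples <= n_strings.
    let stride = n_strings / n_samples;
    // The offset stays below the stride, which keeps every index in range.
    let offset = (seed % stride as u64) as usize;
    (0..n_samples).map(|i| offset + i * stride).collect()
}
```

Calling it with a different seed per generation, for example the generation number, yields samples that cover the whole input instead of clustering where a random walk happens to land.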

codspeed-hq · 2026-03-17T14:19:18Z

Merging this PR will degrade performance by 84.28%

⚡ 14 improved benchmarks
❌ 17 regressed benchmarks
✅ 978 untouched benchmarks
🆕 44 new benchmarks
⏩ 1515 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`chunked_canonicalize_into[(1000, 100, 16, 64)]`	23.9 ms	19.1 ms	+25.34%
⚡	Simulation	`chunked_canonicalize_into[(1000, 100, 16, 16)]`	20.5 ms	14.8 ms	+38.14%
⚡	Simulation	`chunked_into_canonical[(1000, 100, 16, 16)]`	20.8 ms	15.2 ms	+36.88%
⚡	Simulation	`chunked_canonicalize_into[(1000, 50, 8, 16)]`	8.9 ms	6.7 ms	+33.22%
⚡	Simulation	`chunked_canonicalize_into[(1000, 50, 8, 4)]`	7.8 ms	6.8 ms	+13.71%
⚡	Simulation	`chunked_canonicalize_into[(1000, 50, 8, 64)]`	10.5 ms	7.3 ms	+43.18%
⚡	Simulation	`chunked_into_canonical[(1000, 50, 8, 16)]`	9.1 ms	6.9 ms	+31.87%
⚡	Simulation	`chunked_into_canonical[(1000, 100, 16, 64)]`	24.2 ms	19.3 ms	+25.14%
⚡	Simulation	`chunked_into_canonical[(1000, 50, 8, 4)]`	8 ms	7.1 ms	+13.19%
⚡	Simulation	`chunked_into_canonical[(1000, 50, 8, 64)]`	10.8 ms	7.6 ms	+41.51%
❌	Simulation	`train_compressor[(1000, 16, 4)]`	4.2 ms	9.6 ms	-56.22%
❌	Simulation	`train_compressor[(1000, 4, 4)]`	1.9 ms	3.2 ms	-41.58%
❌	Simulation	`train_compressor[(1000, 16, 8)]`	4.2 ms	8 ms	-47.02%
❌	Simulation	`train_compressor[(1000, 64, 4)]`	4.2 ms	17.1 ms	-75.31%
❌	Simulation	`train_compressor[(10000, 4, 4)]`	4.9 ms	20.7 ms	-76.4%
❌	Simulation	`train_compressor[(10000, 16, 4)]`	4.8 ms	30.1 ms	-84.06%
❌	Simulation	`train_compressor[(10000, 16, 8)]`	4.8 ms	30.8 ms	-84.28%
❌	Simulation	`train_compressor[(1000, 64, 8)]`	4.3 ms	16.7 ms	-74.17%
❌	Simulation	`train_compressor[(1000, 4, 8)]`	2 ms	3.6 ms	-43.91%
⚡	Simulation	`eq_canonicalize_low_match`	30.9 ms	26.4 ms	+17.26%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Comparing claude/optimize-fsst-compression-KgJdu (280a766) with develop (b921999)

¹ 1515 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
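
The Efficiency column in the table above appears consistent with `BASE / HEAD - 1`, so a benchmark that becomes several times slower approaches -100% rather than exceeding it. A minimal sketch of that reading (the function name is hypothetical, and the formula is inferred from the displayed numbers rather than taken from CodSpeed's documentation):

```rust
/// Relative efficiency change in percent, assuming it is computed as
/// base / head - 1. Positive means HEAD is faster than BASE.
fn efficiency_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms / head_ms - 1.0) * 100.0
}
```

For `train_compressor[(10000, 16, 8)]`, `efficiency_percent(4.8, 30.8)` is close to the reported -84.28%. Exact agreement is not expected because the displayed timings are rounded.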

…press_word Move the lossy PHT lookup before the 2-byte code check so both memory accesses can execute simultaneously on the CPU's out-of-order engine. Previously the PHT probe was deferred until after the 2-byte check, serializing the two independent lookups. Benchmarked improvement: 2-9% faster compression throughput across datasets, with zero change to compression ratios. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

gatesn · 2026-03-17T14:29:30Z

Why are we inlining the crate?

Replace the training schedule [4, 12, 28, 48, 68, 98, 128x4, 129] with a smoother ramp [4, 8, 16, 24, 36, 48, 64, 80, 98, 128x3, 129]. The smoother progression between sample fractions reduces wasted work per generation (smaller incremental changes) and produces better symbol tables for most datasets. Despite having 13 generations vs 11, training is slightly faster due to reduced thrashing.

Notable improvements vs prior schedule:

- json: 2.31 → 2.35 (+1.7%)
- spark_plans: 1.97 → 2.01 (+2.0%)
- json_lines: 1.74 → 1.77 (+1.7%)
- ts_logs: 1.45 → 1.46

All datasets remain improved vs the original pre-optimization baseline.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G

Replace the scaled min_count threshold (2-5 * sample_frac/128) with a simple zero-count filter. The gain-based priority queue already ranks candidates by value, making the pre-filter redundant. Removing it gives the priority queue a larger candidate pool, which helps datasets with diverse byte distributions (binary formats with many escape bytes). Improvements: arrow_ipc -5KB, stack_traces -15KB, dns_binary -2KB. No regressions on any of the 25 benchmark datasets. Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01CbsKijtazMRRV6qsoM6e2G
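
The idea of dropping the pre-filter and letting the priority queue rank candidates can be sketched as follows. This is an illustration under assumptions: the `count * len` gain is the classic FSST heuristic and may differ from the scoring in this PR, and the type and function names are hypothetical.

```rust
use std::collections::BinaryHeap;

/// A candidate symbol ordered by gain first, then by its bytes as a
/// deterministic tie-break (derived ordering compares fields in order).
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Candidate {
    gain: u64,
    symbol: Vec<u8>,
}

/// Select up to `capacity` symbols by descending gain. Only candidates
/// that never occurred are dropped; there is no minimum-count threshold.
fn select_symbols(counts: &[(Vec<u8>, u64)], capacity: usize) -> Vec<Vec<u8>> {
    let mut heap = BinaryHeap::new();
    for (symbol, count) in counts {
        if *count == 0 {
            continue; // zero-count filter
        }
        // Assumed gain: bytes covered by this symbol in the sample.
        let gain = *count * symbol.len() as u64;
        heap.push(Candidate { gain, symbol: symbol.clone() });
    }
    let mut selected = Vec::new();
    while selected.len() < capacity {
        match heap.pop() {
            Some(candidate) => selected.push(candidate.symbol),
            None => break,
        }
    }
    selected
}
```

A rare but long symbol still enters the heap here and is only left out if better candidates fill the table first, which is the larger candidate pool the commit message refers to.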

joseph-isaacs · 2026-03-17T15:19:08Z

silly claude

claude added 9 commits March 16, 2026 23:23

claude added 2 commits March 17, 2026 14:32

joseph-isaacs closed this Mar 17, 2026

joseph-isaacs changed the title ~~Inline fsst-rs and optimize FSST symbol table training~~ fsst opt DO NOT MERGE Mar 17, 2026

joseph-isaacs added the do not merge Pull requests that are not intended to merge label Mar 17, 2026

joseph-isaacs wants to merge 12 commits into develop from claude/optimize-fsst-compression-KgJdu
3 participants