Use approximate cardinality to decide whether to use dict compression #7759
robert3005 wants to merge 1 commit into develop from rk/cardinality-estimator
Merging this PR will degrade performance by 24.9%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | varbinview_zip_fragmented_mask | 6.5 ms | 7.3 ms | -10.29% |
| ❌ | Simulation | decode_primitives[u8, (1000, 2)] | 15.8 µs | 17.6 µs | -10.29% |
| ⚡ | Simulation | encode_varbinview[(1000, 2)] | 243.8 µs | 168.1 µs | +45.05% |
| ⚡ | Simulation | encode_varbin[(10000, 512)] | 1,029.1 µs | 934.6 µs | +10.11% |
| ❌ | Simulation | varbinview_zip_block_mask | 2.9 ms | 3.7 ms | -21.65% |
| ❌ | Simulation | new_bp_prim_test_between[i64, 32768] | 177.3 µs | 236.1 µs | -24.9% |
| ⚡ | Simulation | dict_compress_string | 8.2 ms | 7.4 ms | +10.54% |

Comparing rk/cardinality-estimator (7fa9309) with develop (5e5572b)
Note: 138 benchmarks were skipped, so the baseline results were used instead.
Benchmarks: Compression
Vortex (geomean): 1.002x ➖
unknown / unknown (0.992x ➖, 4↑ 5↓)

Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.933x ➖
datafusion / vortex-file-compressed (0.933x ➖, 2↑ 0↓)

Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.983x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.006x ➖, 0↑ 0↓)
datafusion / parquet (0.981x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.978x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.056x ➖, 0↑ 1↓)
duckdb / parquet (0.965x ➖, 1↑ 0↓)

File Sizes: FineWeb NVMe
No file size changes detected.

Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.917x ➖, 42↑ 0↓)
datafusion / vortex-compact (0.955x ➖, 11↑ 0↓)
datafusion / parquet (0.925x ➖, 32↑ 0↓)
duckdb / vortex-file-compressed (1.003x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.008x ➖, 1↑ 1↓)
duckdb / parquet (1.006x ➖, 0↑ 3↓)
duckdb / duckdb (0.991x ➖, 3↑ 1↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.995x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.990x ➖, 0↑ 0↓)
duckdb / parquet (0.986x ➖, 0↑ 0↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.001x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.983x ➖, 0↑ 0↓)
datafusion / parquet (0.985x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.979x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.015x ➖, 0↑ 0↓)
duckdb / parquet (0.948x ➖, 0↑ 0↓)
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.997x ➖, 1↑ 0↓)
datafusion / parquet (1.002x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.975x ➖, 2↑ 0↓)
duckdb / parquet (0.996x ➖, 0↑ 0↓)
duckdb / duckdb (0.967x ➖, 6↑ 0↓)

File Sizes: Clickbench on NVME
File Size Changes (105 files changed, +0.0% overall, 95↑ 10↓)

Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.014x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.014x ➖, 0↑ 0↓)
datafusion / parquet (0.979x ➖, 2↑ 0↓)
datafusion / arrow (1.007x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.016x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.003x ➖, 0↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 1↓)
duckdb / duckdb (1.006x ➖, 0↑ 1↓)

File Sizes: TPC-H SF=1 on NVME
File Size Changes (195 files changed, -98.4% overall, 0↑ 195↓)

Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.000x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.997x ➖, 0↑ 0↓)
datafusion / parquet (0.994x ➖, 0↑ 0↓)
datafusion / arrow (0.958x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.997x ➖, 0↑ 0↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
duckdb / duckdb (0.998x ➖, 0↑ 0↓)

File Sizes: TPC-H SF=10 on NVME
File Size Changes (2 files changed, +0.0% overall, 2↑ 0↓)

Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.946x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.877x ➖, 0↑ 0↓)
datafusion / parquet (0.887x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.963x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.968x ➖, 0↑ 0↓)
duckdb / parquet (0.948x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.962x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.000x ➖, 0↑ 1↓)
datafusion / parquet (0.909x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (0.976x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.033x ➖, 0↑ 0↓)
duckdb / parquet (0.998x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.

Replace the exact `HashMap`/`HashSet` previously used to compute distinct-value counts during compression stats generation with Cloudflare's `cardinality-estimator` crate. The estimator gives us a bounded-memory approximation (exact up to ~128 distinct values, then HyperLogLog++), so high-cardinality arrays no longer require an O(n) auxiliary hash table to answer the single question "how many unique values does this have?".

- Integer stats swap the hash map for a `CardinalityEstimator` and track the most frequent value via a Boyer-Moore majority candidate plus a second-pass exact count. Sparse/dict schemes only care about the heavy hitter (>= 90% threshold) or a rough distinct ratio, so this is behaviourally equivalent for the decisions they make.
- Float and string stats likewise drop their hash sets in favor of the estimator.
- The integer and float dictionary encoders now rebuild the exact set of distinct values from the source array at compress time, since they need the values themselves and the stats layer no longer retains them.
- `SequenceScheme`'s fast-path check for "all values are distinct" now tolerates the estimator's small approximation error; the deferred callback still validates sequences exactly.

Signed-off-by: Robert Kruszewski <github@robertk.io>

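The heavy-hitter tracking described in the integer stats bullet can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in this PR: `heavy_hitter` and `exceeds_threshold` are hypothetical names, and the input is assumed to be a plain `&[i64]` rather than a Vortex array.

```rust
/// Returns the Boyer-Moore majority candidate together with its exact
/// count from a second pass, or `None` for an empty slice.
/// (Hypothetical helper, not part of the Vortex API.)
fn heavy_hitter(values: &[i64]) -> Option<(i64, usize)> {
    let mut candidate = *values.first()?;
    let mut votes = 0usize;

    // First pass: matching values add a vote, differing values cancel one.
    for &v in values {
        if votes == 0 {
            candidate = v;
            votes = 1;
        } else if v == candidate {
            votes += 1;
        } else {
            votes -= 1;
        }
    }

    // Second pass: the vote counter is not a frequency, so count exactly.
    let count = values.iter().filter(|&&v| v == candidate).count();
    Some((candidate, count))
}

/// True when the candidate covers at least `threshold` of the values
/// (for example 0.9 for the 90% heavy-hitter rule).
fn exceeds_threshold(values: &[i64], threshold: f64) -> bool {
    match heavy_hitter(values) {
        Some((_, count)) => count as f64 >= threshold * values.len() as f64,
        None => false,
    }
}
```

Boyer-Moore only guarantees the right candidate when one value occupies more than half of the input. Any value that meets a 90% threshold is therefore always found, which is why no hash map is needed for this particular decision.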
fc38e36 to 7fa9309
File Sizes: TPC-DS SF=1 on NVME
File Size Changes (4 files changed, +0.0% overall, 2↑ 2↓)

File Sizes: Statistical and Population Genetics
File Size Changes (2 files changed, +0.0% overall, 1↑ 1↓)

Benchmarks: Random Access
Vortex (geomean): 0.878x ✅
unknown / unknown (0.920x ➖, 13↑ 0↓)

Since we no longer precompute all unique values, the layout and compressor dict
encoding logic is unified.
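As a rough sketch of what encoding without precomputed unique values can look like, the snippet below first decides from an estimated distinct count whether a dictionary looks worthwhile, then rebuilds the exact distinct values while encoding. All names here (`dict_looks_worthwhile`, `dict_encode`, `max_ratio`) are hypothetical, the input is assumed to be a plain `&[i64]`, and the estimate is assumed to come from some cardinality estimator; none of this is taken from the PR's diff.

```rust
use std::collections::HashMap;

/// Decide from an approximate distinct count whether dictionary encoding
/// is worth attempting. `max_ratio` is an illustrative tuning knob.
fn dict_looks_worthwhile(estimated_distinct: usize, len: usize, max_ratio: f64) -> bool {
    len > 0 && (estimated_distinct as f64) <= max_ratio * len as f64
}

/// Build the dictionary and the codes in one pass over the source values.
/// The exact distinct set is rebuilt here, at encode time, because the
/// stats step only kept an estimate of how many distinct values exist.
fn dict_encode(values: &[i64]) -> (Vec<i64>, Vec<u32>) {
    let mut index: HashMap<i64, u32> = HashMap::new();
    let mut dict: Vec<i64> = Vec::new();
    let mut codes: Vec<u32> = Vec::with_capacity(values.len());

    for &v in values {
        // Assign the next free code the first time a value is seen.
        let code = *index.entry(v).or_insert_with(|| {
            dict.push(v);
            (dict.len() - 1) as u32
        });
        codes.push(code);
    }
    (dict, codes)
}
```

Codes are assigned in order of first appearance, so decoding is a lookup of `dict[code as usize]` for each code. The hash map in this sketch lives only for the duration of the encode call and is only built for arrays that passed the estimate-based check.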