Add layered pco encoding stack: P1-P5 array implementations #7926
joseph-isaacs wants to merge 21 commits into develop
Conversation
Sketches a decomposition of pco's pipeline (recast → mode → delta → bin partition → tANS) into ~12 standalone Vortex arrays plus a layered compressor with two profiles (Fast-RA, High-ratio) so we can measure which layer earns each fold of compression and where random access degrades. No code yet. Signed-off-by: Claude <noreply@anthropic.com>
The first stage of pco's pipeline as a standalone Vortex array: a bijective, order-preserving cast from any primitive type to an unsigned latent. Element-level random access; no compression on its own — it's the foundation the rest of the layered pco stack will sit on.
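As a sketch of what an order-preserving cast to unsigned latents looks like (function names here are illustrative, not the crate's actual API; pco calls the corresponding conversions `to_latent_ordered`/`from_latent_ordered`):

```rust
/// i64 -> u64: flipping the sign bit maps i64::MIN..=i64::MAX onto
/// 0..=u64::MAX in sorted order.
fn i64_to_latent(x: i64) -> u64 {
    (x as u64) ^ (1 << 63)
}

fn latent_to_i64(l: u64) -> i64 {
    (l ^ (1 << 63)) as i64
}

/// f64 -> u64: negatives get all bits flipped, non-negatives get only the
/// sign bit flipped, so latent order matches numeric order (IEEE-754 total
/// order) and the map is a bijection on bit patterns.
fn f64_to_latent(x: f64) -> u64 {
    let b = x.to_bits();
    if b >> 63 == 1 { !b } else { b ^ (1 << 63) }
}

fn latent_to_f64(l: u64) -> f64 {
    f64::from_bits(if l >> 63 == 1 { l ^ (1 << 63) } else { !l })
}
```

Because the cast is bijective and branch-light, `scalar_at` is a single O(1) bit manipulation.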
Measures encode, decode, and scalar_at throughput per primitive type against memcpy and full PcoArray reference baselines. Results captured in encodings/ordered-latent/benches/RESULTS.md.
Pco's IntMult mode as a standalone Vortex array. Decomposes an ordered-latent stream into (primary, secondary) such that n = base*primary + secondary in wrapping arithmetic. Element-level random access; metadata is just the base.
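A minimal sketch of the split (illustrative names, not the crate's API): for a known base, primary is the quotient and secondary the remainder, so reconstruction under wrapping arithmetic is exact:

```rust
/// Split n into (primary, secondary) so that
/// n == base * primary + secondary, with 0 <= secondary < base.
fn int_mult_encode(n: u64, base: u64) -> (u64, u64) {
    (n / base, n % base)
}

/// Wrapping reconstruction: exact inverse of the split above.
fn int_mult_decode(primary: u64, secondary: u64, base: u64) -> u64 {
    base.wrapping_mul(primary).wrapping_add(secondary)
}
```

For integer-scaled data (e.g. cents stored as base-100 multiples plus noise), primary carries most of the information in log2(base) fewer bits than n.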
Measures encode/decode/scalar_at throughput against the full PcoArray baseline on IntMult-favorable input.
Empty FloatMult / FloatQuant / PcoDict crates so parallel agents can fill them in without racing on the workspace manifest.
Pco's FloatMult mode as a standalone Vortex array. Decomposes f64 input into (primary, secondary) where primary is round(x/base) and secondary is the signed ULP offset of the approximation; round-trip is bit-exact for all f64 including NaN/inf. Element-level random access.
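One way to realize "signed ULP offset" is to measure distance on the order-preserving latent line from the commit above. This is an illustrative sketch (function names are hypothetical, not the crate's API); the round-trip is exact by construction because the wrapping latent subtraction is inverted by a wrapping addition:

```rust
fn f64_latent(x: f64) -> u64 {
    let b = x.to_bits();
    if b >> 63 == 1 { !b } else { b ^ (1 << 63) }
}

fn latent_f64(l: u64) -> f64 {
    f64::from_bits(if l >> 63 == 1 { l ^ (1 << 63) } else { !l })
}

/// primary = round(x / base); secondary = signed distance in ULPs
/// (on the order-preserving latent line) from primary * base to x.
fn float_mult_encode(x: f64, base: f64) -> (i64, i64) {
    let primary = (x / base).round() as i64;
    let approx = primary as f64 * base;
    let secondary = f64_latent(x).wrapping_sub(f64_latent(approx)) as i64;
    (primary, secondary)
}

fn float_mult_decode(primary: i64, secondary: i64, base: f64) -> f64 {
    let approx = primary as f64 * base;
    latent_f64(f64_latent(approx).wrapping_add(secondary as u64))
}
```

When the data really is a noisy multiple of base, secondary stays within a few ULPs of zero and compresses well downstream.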
Measures encode/decode/scalar_at throughput against full PcoArray on FloatMult-favorable f64 input.
Pco's FloatQuant mode as a standalone Vortex array. Splits f64 bits at quantization boundary k into (primary = high 64-k bits, secondary = low k bits). Bit-exact round-trip for any f64. Element-level random access.
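The split itself is pure bit surgery. A minimal sketch (illustrative names; assumes 1 <= k <= 63):

```rust
/// Split an f64 bit pattern at boundary k: primary keeps the high
/// 64-k bits, secondary the low k bits. Lossless for any f64,
/// including NaN payloads, since only raw bits are moved.
fn float_quant_encode(x: f64, k: u32) -> (u64, u64) {
    let b = x.to_bits();
    (b >> k, b & ((1u64 << k) - 1))
}

fn float_quant_decode(primary: u64, secondary: u64, k: u32) -> f64 {
    f64::from_bits((primary << k) | secondary)
}
```

The payoff is that for data quantized to roughly 64-k significant bits, the low-k-bit stream is near-constant (often zero) and the high bits form a well-behaved integer stream for the layers below.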
Measures encode/decode/scalar_at throughput against full PcoArray on FloatQuant-favorable f64 input with k=16.
Pco's Dict mode as a standalone Vortex array. Each value is represented as an index into a small first-occurrence-ordered dictionary; index width narrows automatically to u8/u16/u32 based on dict cardinality. Integer input only in this phase. Element-level random access.
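A sketch of first-occurrence-ordered dictionary encoding (illustrative; codes are left at u32 here, whereas the crate narrows them to u8/u16/u32 by dictionary cardinality):

```rust
use std::collections::HashMap;

/// Build a dictionary in order of first occurrence and emit one code
/// per input value.
fn dict_encode(values: &[i64]) -> (Vec<i64>, Vec<u32>) {
    let mut dict = Vec::new();
    let mut index: HashMap<i64, u32> = HashMap::new();
    let mut codes = Vec::with_capacity(values.len());
    for &v in values {
        let code = *index.entry(v).or_insert_with(|| {
            dict.push(v);
            (dict.len() - 1) as u32
        });
        codes.push(code);
    }
    (dict, codes)
}

/// scalar_at(i) is just dict[codes[i]]: one indirection, O(1).
fn dict_decode(dict: &[i64], codes: &[u32]) -> Vec<i64> {
    codes.iter().map(|&c| dict[c as usize]).collect()
}
```

First-occurrence ordering keeps encode single-pass and deterministic without a sort over the dictionary.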
Measures encode/decode/scalar_at throughput against full PcoArray on dict-favorable i64 input (256 unique values cycled).
First-order consecutive integer differences as a standalone Vortex array (i64 only, non-nullable). Establishes the delta layer of the layered pco stack. scalar_at is O(i) by design — the random-access cliff is intentional and measured in the bench.
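The seed-plus-wrapping-diffs scheme can be sketched as follows (illustrative names); the O(i) `scalar_at` cost is visible directly in the decode loop, since recovering element i requires summing all deltas before it:

```rust
/// First-order delta: keep the first value as a seed plus wrapping
/// consecutive differences (wrapping so i64::MIN/MAX gaps round-trip).
fn delta_encode(values: &[i64]) -> (i64, Vec<i64>) {
    let seed = values[0];
    let deltas = values.windows(2).map(|w| w[1].wrapping_sub(w[0])).collect();
    (seed, deltas)
}

/// Decode is a running prefix sum over the deltas.
fn delta_decode(seed: i64, deltas: &[i64]) -> Vec<i64> {
    let mut acc = seed;
    let mut out = vec![acc];
    for &d in deltas {
        acc = acc.wrapping_add(d);
        out.push(acc);
    }
    out
}
```

On monotone data such as timestamps, the deltas are small and near-constant, which is what the bin-partition layer below exploits.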
Measures encode/decode/scalar_at throughput against full PcoArray on delta-favorable monotone timestamps and on random i64 control. Surfaces the random-access cliff: scalar_at is O(i) for the layered delta but O(page) for monolithic Pco.
The first layer of the layered pco stack that actually shrinks bytes. BinPartition decomposes an i64 stream into (bin_idx, offset) with bin boundaries chosen by sampled quantiles; VarWidthBitPacked stores the offsets at per-bin widths in a single packed bit buffer with a batch-indexed prefix-sum for O(64) random access.
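The (bin_idx, offset) decomposition can be sketched like this (illustrative names; the bit packing and prefix-sum index are omitted, and bounds here stand in for the quantile-sampled bin lower bounds with bounds[0] at or below the minimum):

```rust
/// Assign each value to the bin whose lower bound it falls in, and
/// store its offset from that lower bound. Offsets within a bin need
/// only ceil(log2(bin width)) bits, which is what VarWidthBitPacked
/// exploits with per-bin widths.
fn bin_partition(values: &[i64], bounds: &[i64]) -> Vec<(usize, u64)> {
    values
        .iter()
        .map(|&v| {
            // partition_point: index of the first bound greater than v.
            let bin = bounds.partition_point(|&b| b <= v) - 1;
            (bin, (v - bounds[bin]) as u64)
        })
        .collect()
}

fn bin_restore(pairs: &[(usize, u64)], bounds: &[i64]) -> Vec<i64> {
    pairs.iter().map(|&(bin, off)| bounds[bin] + off as i64).collect()
}
```

Quantile-chosen boundaries keep the bins roughly equally populated, so the bin_idx stream is high-entropy-flat for the Fast-RA profile and a good tANS target in the high-ratio profile.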
Three input scenarios (skewed-low, uniform random, quasi-monotone) over i64 input. Measures encode/decode/scalar_at throughput and compression ratio against full PcoArray. This is the first phase where the layered stack's byte savings are directly comparable to pco's.
tANS entropy code over a u8 symbol stream. Implemented from scratch because pco's `ans` module is private (`mod ans;` in pco's lib.rs); the algorithm (table layout, weight quantization, renormalization cutoff) mirrors pco so the produced bit stream is structurally compatible. Single-state (no 4-way interleave) — the SIMD perf optimization is a P6 concern; compression ratio is the same. This is the entropy layer that pairs with BinPartition's bin_idx stream in the high-ratio profile. Decode is sequential (batch-granular random access is a P6 concern).
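This is not the PR's table-based implementation, but the state arithmetic that tANS tabulates can be illustrated with a minimal single-state range-ANS sketch (wide u128 state, no renormalization, quantized weights summing to a power of two; all names are illustrative). Skewed symbols have large counts, so encoding them grows the state slowly — that is where the compression comes from:

```rust
const M_LOG: u32 = 4; // total quantized weight M = 16 (power of two)

/// Encode symbols into one wide state. counts[s] is symbol s's
/// quantized weight, cum[s] its exclusive prefix sum; counts must sum
/// to 1 << M_LOG. Symbols are consumed in reverse because ANS is LIFO.
fn ans_encode(symbols: &[usize], counts: &[u128], cum: &[u128]) -> u128 {
    let mut x: u128 = 1;
    for &s in symbols.iter().rev() {
        x = ((x / counts[s]) << M_LOG) | (cum[s] + x % counts[s]);
    }
    x
}

/// Decode is strictly sequential: each step pops one symbol and
/// rewinds the state, exactly inverting one encode step.
fn ans_decode(mut x: u128, counts: &[u128], cum: &[u128], n: usize) -> Vec<usize> {
    let m = 1u128 << M_LOG;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let slot = x & (m - 1); // low bits locate the symbol's slot
        let s = cum.partition_point(|&c| c <= slot) - 1;
        out.push(s);
        x = counts[s] * (x >> M_LOG) + (slot - cum[s]);
    }
    out
}
```

tANS replaces the divisions and multiplications here with a precomputed state-transition table, and a streaming implementation renormalizes bits out of a small fixed-width state instead of letting it grow.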
Two scenarios (Zipf-skewed alphabet, uniform random). Reports compression ratio and encode/decode throughput. No scalar_at — tANS decode is sequential by design.
A measure-and-compare script comparing five compressors on two representative datasets: full pco (default + 1024-value pages), vanilla btrblocks, our pco-style hybrid (OrderedLatent -> Mode -> Delta with btrblocks compressing each leaf), and our plain layered stack with no entropy bottom. Captures compression ratio, decode throughput, and scalar_at latency.
Fuse validate + pack + per-batch prefix sum into a single pass over the encode input, and replace the binary-search bin assignment with a branchless cascade for n_bins <= 16. Decode writes directly into the output's spare capacity instead of using `BufferMut::push` per element. Encode is now 1.39-1.71x faster across A/B/C (skewed/uniform/quasi-monotone); decode is 2.89x faster on uniform random and 1.10x on skewed-low but regresses to 0.78x on quasi-monotone, with a geomean of 1.35x. The narrow-width C regression is the price of changing the output write pattern; a fastlanes-style fixed-width path would recover it.
Extends the P6 measurement harness with real-data columns (TPC-H at SF=0.1, generated in-process by tpchgen-arrow). Reports all 5 compressor variants on each column. ClickBench and NYC taxi were skipped (no in-tree loader, would require multi-GB downloads and a Parquet read path).
Merging this PR will improve performance by 20.36%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | new_alp_prim_test_between[f64, 16384] | 148.8 µs | 126.8 µs | +17.33% |
| ⚡ | Simulation | new_bp_prim_test_between[i16, 32768] | 134.1 µs | 120.1 µs | +11.67% |
| ⚡ | Simulation | new_bp_prim_test_between[i32, 16384] | 109.1 µs | 94.7 µs | +15.25% |
| ⚡ | Simulation | new_bp_prim_test_between[i32, 32768] | 169.9 µs | 141 µs | +20.54% |
| ⚡ | Simulation | new_bp_prim_test_between[i64, 16384] | 144.4 µs | 115 µs | +25.56% |
| ⚡ | Simulation | new_bp_prim_test_between[i64, 32768] | 236.7 µs | 177.9 µs | +33.05% |
Comparing claude/analyze-pco-schemes-91DAI (7d44c51) with develop (7349cd6)
Footnotes

1. 24 benchmarks were skipped, so the baseline results were used instead.
Polar Signals Profiling Results (latest run)

Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.997x ➖
datafusion / vortex-file-compressed (0.997x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.018x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.177x ❌, 0↑ 7↓)
datafusion / parquet (1.004x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.003x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.992x ➖, 1↑ 0↓)
duckdb / parquet (1.000x ➖, 0↑ 0↓)
Full attributed analysis

File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVMe
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.938x ➖, 3↑ 0↓)
datafusion / vortex-compact (0.958x ➖, 0↑ 0↓)
datafusion / parquet (0.954x ➖, 3↑ 2↓)
datafusion / arrow (0.918x ➖, 7↑ 0↓)
duckdb / vortex-file-compressed (0.935x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.941x ➖, 1↑ 0↓)
duckdb / parquet (0.978x ➖, 1↑ 1↓)
duckdb / duckdb (0.947x ➖, 2↑ 0↓)
Full attributed analysis

File Sizes: TPC-H SF=1 on NVMe
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.050x ➖, 1↑ 10↓)
datafusion / vortex-compact (1.037x ➖, 1↑ 6↓)
datafusion / parquet (1.036x ➖, 0↑ 5↓)
duckdb / vortex-file-compressed (1.040x ➖, 1↑ 12↓)
duckdb / vortex-compact (1.035x ➖, 0↑ 7↓)
duckdb / parquet (1.023x ➖, 0↑ 4↓)
duckdb / duckdb (1.026x ➖, 0↑ 2↓)
Full attributed analysis

File Sizes: TPC-DS SF=1 on NVMe
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.056x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.133x ➖, 0↑ 1↓)
datafusion / parquet (0.944x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.132x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.974x ➖, 1↑ 0↓)
duckdb / parquet (1.010x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.972x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.008x ➖, 0↑ 0↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
Full attributed analysis

File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.015x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.024x ➖, 0↑ 0↓)
datafusion / parquet (1.006x ➖, 0↑ 0↓)
datafusion / arrow (1.002x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.090x ➖, 0↑ 9↓)
duckdb / vortex-compact (1.089x ➖, 0↑ 6↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
duckdb / duckdb (1.001x ➖, 0↑ 0↓)
Full attributed analysis

File Sizes: TPC-H SF=10 on NVMe
No file size changes detected.
Benchmarks: Random Access
Vortex (geomean): 0.682x ✅
unknown / unknown (0.745x ✅, 53↑ 0↓)
Benchmarks: ClickBench on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.976x ➖, 1↑ 0↓)
datafusion / parquet (0.976x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.974x ➖, 3↑ 1↓)
duckdb / parquet (0.989x ➖, 0↑ 0↓)
duckdb / duckdb (0.955x ➖, 5↑ 0↓)
Full attributed analysis

File Sizes: ClickBench on NVMe
File size changes: 1 file changed, -0.0% overall, 0↑ 1↓
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.966x ➖, 1↑ 2↓)
datafusion / vortex-compact (1.020x ➖, 0↑ 0↓)
datafusion / parquet (1.024x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (0.994x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.973x ➖, 0↑ 0↓)
duckdb / parquet (1.023x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Compression
Vortex (geomean): 1.003x ➖
unknown / unknown (1.024x ➖, 2↑ 13↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.996x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.956x ➖, 0↑ 0↓)
datafusion / parquet (1.001x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (0.979x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.975x ➖, 0↑ 0↓)
duckdb / parquet (0.972x ➖, 0↑ 0↓)
Full attributed analysis
Summary
Closes: #000
This PR implements phases P1 through P5 of the layered pco encoding stack, decomposing the monolithic PcoArray into composable, first-class Vortex arrays. Each phase corresponds to one stage of the pco algorithm and supports element-level random access where the layer permits.

Changes

New encoding crates:

- vortex-ordered-latent (P1): Order-preserving unsigned latent transformation. Converts primitive types (int, float) to unsigned latents that preserve sort order, matching pco's to_latent_ordered/from_latent_ordered.
- vortex-int-mult (P2a): Integer multiplier decomposition. Decomposes unsigned latents into (primary, secondary) pairs where value = base * primary + secondary, shaving log2(base) bits off typical primary values for integer-scaled data.
- vortex-float-mult (P2b): Float multiplier decomposition. Decomposes f64 into (primary, secondary) pairs using a fixed base, with bit-exact round-trip reconstruction.
- vortex-float-quant (P2c): Float quantization. Splits f64 bit patterns at a fixed quantization boundary k, separating high and low bits for independent compression.
- vortex-pco-dict (P2d): Dictionary encoding. Represents values as indices into a small dictionary of unique values, supporting integer primitives.
- vortex-consecutive-delta (P3): First-order delta encoding. Stores a seed and consecutive differences, decoded via prefix sum with wrapping arithmetic.
- vortex-bin-partition (P4): Bin partitioning with variable-width bit packing. Decomposes i64 into (bin_idx, offset) pairs; offsets are stored in a packed bit buffer with per-bin widths and batch-indexed prefix sums for O(64) random access.
- vortex-ans (P5): tANS entropy coding. Single-state table-based asymmetric numeral systems for u8 symbol streams, with a from-scratch implementation compatible with pco's byte stream structure.

Design documentation:

- encodings/pco/DESIGN.md: Comprehensive specification of the layered pco decomposition, including two compression profiles (Fast-RA and High-ratio), random-access characteristics, and rationale for the layer boundaries.

Benchmarking infrastructure:

- benchmarks/layered-pco-bench/: Measurement harness comparing full pco, vanilla BtrBlocks, and a hybrid compressor (pco-style structural top + BtrBlocks entropy bottom) on synthetic and TPC-H data.
- encodings/*/benches/ with detailed RESULTS.md files documenting throughput and compression ratios.

Public API locks:

- public-api.lock files for all new crates documenting their public surface.

Implementation highlights

- VTable with scalar_at support where feasible (O(1) for most layers, O(64) for bin-partition, O(N) for tANS).
- ValidityVTableFromChild where applicable.

Testing

Existing tests pass. Microbenchmarks in each encoding crate validate encode/decode/scalar_at throughput and compression ratios on synthetic and real data. The layered-pco-bench binary provides end-to-end measurement against full pco and BtrBlocks baselines.
https://claude.ai/code/session_01Gvadeq4qgLPGr74kM8zNjy