Commit 7b3d772
parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)
## Rationale
The `klauspost/compress/zstd` encoder currently disables
`AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to
zstd levels 1–4). Klauspost's encoder short-circuits to storing literals
uncompressed when no LZ matches are found, skipping the entropy-coding
stage. This is a good tradeoff for genuinely incompressible data (random
bytes), but it leaves significant compression on the table for
real-world columnar data where LZ match density is low but byte
distributions are highly skewed — e.g. parquet INT32 decimal columns
whose values cluster in a small range (so the high bytes are mostly
zero).
Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on
literals even without LZ matches, matching the behavior of the C
reference implementation (`facebook/zstd`) at the same nominal levels.
## Impact
Measured on a real-world parquet workload — TPC-DS `store_sales`, 7
Trino-written files, ~9.5M rows, 23 columns including high-cardinality
`Decimal(7,2)` columns — going through Apache Iceberg's compaction path
at ZSTD level 3:
| Config | Output size vs input |
|---|---|
| klauspost (current default) | +6.11% (inflation) |
| **klauspost + WithAllLitEntropyCompression(true)** | **-1.84% (reduction)** |
| DataDog/zstd (CGo wrapper around C zstd) level 3 | -2.23% (reduction) |
| Trino (JNI, C zstd level 3), reference | -3.99% (reduction) |
Per-blob benchmark (161 page blobs compressed directly by both
implementations at level 3):
- klauspost current default: 346,287 KB (66.60% of raw)
- klauspost + this fix: 329,249 KB (63.32% of raw)
- DataDog/zstd: 329,648 KB (63.40% of raw)
In the per-blob benchmark, klauspost with this change matches (and
slightly beats) the C implementation as wrapped by DataDog/zstd. In the
end-to-end compaction numbers above, it closes most of the gap.
After discussing this with @klauspost, we concluded that enabling
`AllLitEntropyCompression` is the intended way to close this gap. This
PR applies that setting to arrow-go's zstd codec.
## Trade-off
Slightly slower compression on genuinely incompressible data (the case
`AllLitEntropyCompression` was disabled for). For parquet workloads,
this is typically a non-issue since columns with no structure are rare.
Co-authored-by: Varun Venkatesh <varunvenkatesh@Varuns-MacBook-Pro.local>
1 parent 4a0b225 commit 7b3d772
1 file changed
Lines changed: 2 additions & 2 deletions
Diff body not captured. Two hunks: line 67 replaced (context lines 64–70) and line 95 replaced (context lines 92–98).