Commit 7b3d772

varun0630 and Varun Venkatesh authored
parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)
## Rationale

The `klauspost/compress/zstd` encoder currently disables `AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to zstd levels 1–4). Klauspost's encoder short-circuits to storing literals uncompressed when no LZ matches are found, skipping the entropy-coding stage. This is a good tradeoff for genuinely incompressible data (random bytes), but it leaves significant compression on the table for real-world columnar data where LZ match density is low but byte distributions are highly skewed, e.g. parquet INT32 decimal columns whose values cluster in a small range (so the high bytes are mostly zero).

Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on literals even without LZ matches, matching the behavior of the C reference implementation (`facebook/zstd`) at the same nominal levels.

## Impact

Measured on a real-world parquet workload (TPC-DS `store_sales`, 7 Trino-written files, ~9.5M rows, 23 columns including high-cardinality `Decimal(7,2)` columns) going through Apache Iceberg's compaction path at ZSTD level 3:

| Config | Output vs input |
|---|---|
| klauspost (current default) | +6.11% inflation |
| **klauspost + WithAllLitEntropyCompression(true)** | **-1.84% reduction** |
| DataDog/zstd (CGo wrapper around C zstd) level 3 | -2.23% reduction |
| Trino (JNI, C zstd level 3), reference | -3.99% reduction |

Per-blob benchmark (161 page blobs compressed directly by both implementations at level 3):

- klauspost current default: 346,287 KB (66.60% of raw)
- klauspost + this fix: 329,249 KB (63.32% of raw)
- DataDog/zstd: 329,648 KB (63.40% of raw)

With this one-line change, klauspost matches (and slightly beats) the C reference implementation on this workload.

Discussing with @klauspost we concluded that enabling `AllLitEntropyCompression` is the intended way to close this gap. This PR applies that setting to arrow-go's zstd codec.

## Trade-off

Slightly slower compression on genuinely incompressible data (the case `AllLitEntropyCompression` was disabled for). For parquet workloads, this is typically a non-issue since columns with no structure are rare.

Co-authored-by: Varun Venkatesh <varunvenkatesh@Varuns-MacBook-Pro.local>
1 parent 4a0b225 commit 7b3d772

1 file changed

Lines changed: 2 additions & 2 deletions

parquet/compress/zstd.go
@@ -64,7 +64,7 @@ func (p *zstdEncoderPool) getEncoderFromPool(level zstd.EncoderLevel) *zstd.Enco
 	if !ok {
 		pool = &sync.Pool{
 			New: func() interface{} {
-				enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1))
+				enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1), zstd.WithAllLitEntropyCompression(true))
 				return enc
 			},
 		}
@@ -92,7 +92,7 @@ func (p *zstdEncoderPool) putEncoderToPool(enc *zstd.Encoder, level zstd.Encoder
 
 func getencoder() *zstd.Encoder {
 	initEncoder.Do(func() {
-		enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true))
+		enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithAllLitEntropyCompression(true))
 	})
 	return enc
 }
