parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)

varun0630 · Varun Venkatesh · web-flow · commit 7b3d772a9a75 · 2026-04-24T12:56:44.000-04:00
## Rationale

The `klauspost/compress/zstd` encoder currently disables `AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to zstd levels 1–4). Klauspost's encoder short-circuits to storing literals uncompressed when no LZ matches are found, skipping the entropy-coding stage.

This is a good tradeoff for genuinely incompressible data (random bytes), but it leaves significant compression on the table for real-world columnar data where LZ match density is low but byte distributions are highly skewed, e.g. parquet INT32 decimal columns whose values cluster in a small range (so the high bytes are mostly zero).

Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on literals even without LZ matches, matching the behavior of the C reference implementation (`facebook/zstd`) at the same nominal levels.

## Impact

Measured on a real-world parquet workload (TPC-DS `store_sales`, 7 Trino-written files, ~9.5M rows, 23 columns including high-cardinality `Decimal(7,2)` columns) going through Apache Iceberg's compaction path at ZSTD level 3:

| Config | Output vs input |
|---|---|
| klauspost (current default) | +6.11% inflation |
| **klauspost + WithAllLitEntropyCompression(true)** | **-1.84% reduction** |
| DataDog/zstd (CGo wrapper around C zstd) level 3 | -2.23% reduction |
| Trino (JNI, C zstd level 3), reference | -3.99% reduction |

Per-blob benchmark (161 page blobs compressed directly by both implementations at level 3):

- klauspost current default: 346,287 KB (66.60% of raw)
- klauspost + this fix: 329,249 KB (63.32% of raw)
- DataDog/zstd: 329,648 KB (63.40% of raw)

With this single-option change, klauspost matches (and, in the per-blob benchmark, slightly beats) the C reference implementation on this workload.

After discussing with @klauspost, we concluded that enabling `AllLitEntropyCompression` is the intended way to close this gap. This PR applies that setting to arrow-go's zstd codec.

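To make the "skewed byte distribution" argument from the Rationale concrete, here is a minimal stdlib-only sketch. It is not part of this PR and does not call zstd at all: it builds a synthetic page of little-endian INT32 values that all fall below 100,000 and reports the zero-order Shannon entropy of the resulting bytes. The helper names `clusteredInt32Bytes` and `byteEntropy` are hypothetical, the data is synthetic rather than taken from TPC-DS, and the entropy figure is only a rough estimate of what an entropy coder could save on literals, not a prediction of actual zstd output size.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// clusteredInt32Bytes returns n little-endian INT32 values, all below limit,
// stored back to back (4 bytes per value). The values are deterministic
// pseudo-random numbers produced by a multiplicative hash of the index.
// Hypothetical helper for illustration only.
func clusteredInt32Bytes(n int, limit uint32) []byte {
	out := make([]byte, 4*n)
	for i := 0; i < n; i++ {
		v := (uint32(i) * 2654435761) % limit
		binary.LittleEndian.PutUint32(out[4*i:], v)
	}
	return out
}

// byteEntropy returns the zero-order Shannon entropy of data in bits per
// byte: 0 for a single repeated byte, 8 for a perfectly uniform distribution.
func byteEntropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	total := float64(len(data))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / total
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// Values below 100,000 fit in 17 bits, so the top byte of every value
	// is zero and the second-highest byte is only ever 0 or 1.
	data := clusteredInt32Bytes(100000, 100000)
	h := byteEntropy(data)
	fmt.Printf("entropy: %.2f bits/byte (about %.1f%% of raw if coded at that rate)\n", h, 100*h/8)
}
```

Anything well below 8 bits per byte suggests that entropy coding the literals is worthwhile even when the match finder comes up empty, which is the situation `WithAllLitEntropyCompression(true)` is meant to cover.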
## Trade-off

Slightly slower compression on genuinely incompressible data (the case `AllLitEntropyCompression` was disabled for). For parquet workloads, this is typically a non-issue since columns with no structure are rare.

Co-authored-by: Varun Venkatesh <varunvenkatesh@Varuns-MacBook-Pro.local>
diff --git a/parquet/compress/zstd.go b/parquet/compress/zstd.go
@@ -64,7 +64,7 @@ func (p *zstdEncoderPool) getEncoderFromPool(level zstd.EncoderLevel) *zstd.Enco
 		if !ok {
 			pool = &sync.Pool{
 				New: func() interface{} {
-					enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1))
+					enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1), zstd.WithAllLitEntropyCompression(true))
 					return enc
 				},
 			}
@@ -92,7 +92,7 @@ func (p *zstdEncoderPool) putEncoderToPool(enc *zstd.Encoder, level zstd.Encoder
 
 func getencoder() *zstd.Encoder {
 	initEncoder.Do(func() {
-		enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true))
+		enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithAllLitEntropyCompression(true))
 	})
 	return enc
 }