perf(compression) - hardcode Zstd decode concurrency to 1 (#2648)

levb · web-flow · commit fb759bc63778 · 2026-05-14T08:43:43.000-07:00
When enough callers decode concurrently, the decoder's built-in
concurrency on our 2 MiB frames hurt throughput.
Setting the decoder concurrency to 1 also saves on allocations.

Findings:
- Sequential: default beats concurrency=1 by ~25% (1535 vs 1235 MB/s)
- Parallel (16): concurrency=1 beats default by ~30% (10328 vs 7920 MB/s aggregate)
- Allocations: concurrency=1 has 1 alloc/op vs 18 (-94%)
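
The percentages above can be re-derived from the quoted figures. The following is a minimal sketch of that arithmetic only; `pctChange` is a hypothetical helper name, not something from the codebase, and the inputs are the rounded numbers from the list.

```go
package main

import "fmt"

// pctChange reports how much larger (or smaller) a is than b,
// as a percentage of b. Hypothetical helper for illustration.
func pctChange(a, b float64) float64 {
	return (a - b) / b * 100
}

func main() {
	// Throughput in MB/s and allocs/op, copied from the findings list.
	fmt.Printf("sequential, default vs concurrency=1:    %+.1f%%\n", pctChange(1535, 1235))
	fmt.Printf("parallel (16), concurrency=1 vs default: %+.1f%%\n", pctChange(10328, 7920))
	fmt.Printf("allocs/op, concurrency=1 vs default:     %+.1f%%\n", pctChange(1, 18))
}
```

The sequential gap comes out slightly under the rounded ~25%, while the parallel and allocation figures match the list.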

The benchmark compares zstd.NewReader(r) against
zstd.NewReader(r, zstd.WithDecoderConcurrency(1)) under both sequential
and parallel patterns. Decoders are pulled from a sync.Pool and reused
via Reset, mirroring production's getZstdDecoder. Source data is real
binaries (3.5× compression ratio), split into 40 chunks of 2 MiB each to
match DefaultCompressFrameSize. The benchmark skips on systems without
the candidate data files.
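
The pool-with-Reset pattern described here can be sketched roughly as follows. This is not the benchmark or the production code: to keep the sketch runnable with the standard library alone it swaps in compress/gzip, whose Reader also exposes Reset, as a stand-in for the zstd decoder. The names `decoderPool`, `getDecoder`, `putDecoder`, `decodeFrame`, and `compressFrame` are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sync"
)

// decoderPool keeps idle decoders so repeated frame decodes can reuse them.
var decoderPool sync.Pool

// getDecoder returns a pooled decoder reset onto r, or builds a new one
// when the pool is empty. gzip.Reader stands in for *zstd.Decoder here.
func getDecoder(r io.Reader) (*gzip.Reader, error) {
	if v := decoderPool.Get(); v != nil {
		dec := v.(*gzip.Reader)
		if err := dec.Reset(r); err != nil {
			return nil, err
		}
		return dec, nil
	}
	return gzip.NewReader(r)
}

// putDecoder hands a decoder back to the pool for later reuse.
func putDecoder(dec *gzip.Reader) {
	decoderPool.Put(dec)
}

// decodeFrame decompresses a single frame with a pooled decoder.
func decodeFrame(frame []byte) ([]byte, error) {
	dec, err := getDecoder(bytes.NewReader(frame))
	if err != nil {
		return nil, err
	}
	defer putDecoder(dec)
	return io.ReadAll(dec)
}

// compressFrame builds test input; it is only needed to feed the sketch.
func compressFrame(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	frame, err := compressFrame(bytes.Repeat([]byte("chunk "), 1024))
	if err != nil {
		panic(err)
	}
	// Decode the same frame several times; later iterations may reuse
	// the decoder released by the earlier ones.
	for i := 0; i < 3; i++ {
		out, err := decodeFrame(frame)
		if err != nil {
			panic(err)
		}
		fmt.Println("decoded bytes:", len(out))
	}
}
```

With zstd, the fallback path is where the concurrency option would be passed, since a decoder's options are fixed when it is constructed and pooled decoders keep whatever setting they were created with.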

Benchmark output captured on AMD Ryzen 7 8845HS (16 logical CPUs,
GOMAXPROCS=16), 80 MiB raw → 22.78 MiB compressed (avg frame 583 KiB):

=== source data ===
  source: /home/lev/dev/infra/packages/orchestrator/orchestrator
  chunks: 40 (raw=80 MiB, comp=22.78 MiB)
  ratio:  3.511x (raw/comp)
comp size: min=139715 B (136 KiB), max=1497120 B (1462 KiB), avg=597239 B (583 KiB)

goos: linux
goarch: amd64
pkg: zstdbench
cpu: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics
BenchmarkDecodeDefault-16 2562 1367280 ns/op 1533.81 MB/s 9887 B/op 18 allocs/op
BenchmarkDecodeDefault-16 2553 1364731 ns/op 1536.68 MB/s 9900 B/op 18 allocs/op
BenchmarkDecodeDefault-16 2497 1365620 ns/op 1535.68 MB/s 10091 B/op 18 allocs/op
BenchmarkDecodeConcurrency1-16 2079 1703633 ns/op 1230.99 MB/s 5172 B/op 1 allocs/op
BenchmarkDecodeConcurrency1-16 2070 1696848 ns/op 1235.91 MB/s 5185 B/op 1 allocs/op
BenchmarkDecodeConcurrency1-16 2020 1697646 ns/op 1235.33 MB/s 5322 B/op 1 allocs/op
BenchmarkDecodeDefault_Parallel-16 13562 264797 ns/op 7919.85 MB/s 27322 B/op 18 allocs/op
BenchmarkDecodeDefault_Parallel-16 13591 264851 ns/op 7918.24 MB/s 27964 B/op 18 allocs/op
BenchmarkDecodeDefault_Parallel-16 13576 264678 ns/op 7923.40 MB/s 28114 B/op 18 allocs/op
BenchmarkDecodeConcurrency1_Parallel-16 17623 203340 ns/op 10313.51 MB/s 9827 B/op 1 allocs/op
BenchmarkDecodeConcurrency1_Parallel-16 17707 203043 ns/op 10328.59 MB/s 9795 B/op 1 allocs/op
BenchmarkDecodeConcurrency1_Parallel-16 17697 202858 ns/op 10338.02 MB/s 9816 B/op 1 allocs/op
PASS
ok      zstdbench       57.959s
diff --git a/packages/shared/pkg/storage/compress_decode.go b/packages/shared/pkg/storage/compress_decode.go
@@ -29,6 +29,8 @@ func putLZ4Decoder(dec *lz4.Reader) {
 
 // zstd concurrency is hardcoded to 1: benchmarks show higher values hurt
 // throughput for single 2MiB frame decodes.
+const zstdDecoderConcurrency = 1
+
 var zstdDecoderPool sync.Pool
 
 func getZstdDecoder(r io.Reader) (*zstd.Decoder, error) {
@@ -43,7 +45,7 @@ func getZstdDecoder(r io.Reader) (*zstd.Decoder, error) {
 		return dec, nil
 	}
 
-	return zstd.NewReader(r)
+	return zstd.NewReader(r, zstd.WithDecoderConcurrency(zstdDecoderConcurrency))
 }
 
 func putZstdDecoder(dec *zstd.Decoder) {
