perf(compression) - hardcode Zstd decode concurrency to 1 #2648
Compares zstd.NewReader(r) vs zstd.NewReader(r, zstd.WithDecoderConcurrency(1))
under both sequential and parallel patterns. Decoders are pulled from a
sync.Pool with Reset reuse, mirroring production's getZstdDecoder. Source
data is real binaries (3.5× compression ratio), 40 chunks of 2 MiB each
to match DefaultCompressFrameSize. Skips on systems without the candidate
data files.

Benchmark output captured on AMD Ryzen 7 8845HS (16 logical CPUs, GOMAXPROCS=16),
80 MiB raw → 22.78 MiB compressed (avg frame 583 KiB):

=== source data ===
source: /home/lev/dev/infra/packages/orchestrator/orchestrator
chunks: 40 (raw=80 MiB, comp=22.78 MiB)
ratio: 3.511x (raw/comp)
comp size: min=139715 B (136 KiB), max=1497120 B (1462 KiB), avg=597239 B (583 KiB)

goos: linux
goarch: amd64
pkg: zstdbench
cpu: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics
BenchmarkDecodeDefault-16 2562 1367280 ns/op 1533.81 MB/s 9887 B/op 18 allocs/op
BenchmarkDecodeDefault-16 2553 1364731 ns/op 1536.68 MB/s 9900 B/op 18 allocs/op
BenchmarkDecodeDefault-16 2497 1365620 ns/op 1535.68 MB/s 10091 B/op 18 allocs/op
BenchmarkDecodeConcurrency1-16 2079 1703633 ns/op 1230.99 MB/s 5172 B/op 1 allocs/op
BenchmarkDecodeConcurrency1-16 2070 1696848 ns/op 1235.91 MB/s 5185 B/op 1 allocs/op
BenchmarkDecodeConcurrency1-16 2020 1697646 ns/op 1235.33 MB/s 5322 B/op 1 allocs/op
BenchmarkDecodeDefault_Parallel-16 13562 264797 ns/op 7919.85 MB/s 27322 B/op 18 allocs/op
BenchmarkDecodeDefault_Parallel-16 13591 264851 ns/op 7918.24 MB/s 27964 B/op 18 allocs/op
BenchmarkDecodeDefault_Parallel-16 13576 264678 ns/op 7923.40 MB/s 28114 B/op 18 allocs/op
BenchmarkDecodeConcurrency1_Parallel-16 17623 203340 ns/op 10313.51 MB/s 9827 B/op 1 allocs/op
BenchmarkDecodeConcurrency1_Parallel-16 17707 203043 ns/op 10328.59 MB/s 9795 B/op 1 allocs/op
BenchmarkDecodeConcurrency1_Parallel-16 17697 202858 ns/op 10338.02 MB/s 9816 B/op 1 allocs/op
PASS
ok zstdbench 57.959s

Findings:
- Sequential: default beats concurrency=1 by ~25% (1535 vs 1235 MB/s)
- Parallel: concurrency=1 beats default by ~30% (10328 vs 7920 MB/s aggregate)
- Allocations: concurrency=1 has 1 alloc/op vs 18 (-94%)
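
The percentages in the findings can be re-derived from the MB/s columns of the benchmark output. The sketch below averages the three runs of each benchmark and computes the relative differences; the helper names `mean`, `pctGain`, and `pctDrop` are made up for this illustration and are not part of the benchmark code.

```go
package main

import "fmt"

// mean returns the arithmetic mean of the given samples.
func mean(xs ...float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// pctGain reports how much larger a is than b, in percent.
func pctGain(a, b float64) float64 {
	return (a/b - 1) * 100
}

// pctDrop reports how much smaller after is than before, in percent.
func pctDrop(before, after float64) float64 {
	return (1 - after/before) * 100
}

func main() {
	// MB/s values copied from the three runs of each benchmark.
	seqDefault := mean(1533.81, 1536.68, 1535.68)
	seqOne := mean(1230.99, 1235.91, 1235.33)
	parDefault := mean(7919.85, 7918.24, 7923.40)
	parOne := mean(10313.51, 10328.59, 10338.02)

	fmt.Printf("sequential: default is %.1f%% faster\n", pctGain(seqDefault, seqOne))
	fmt.Printf("parallel:   concurrency=1 is %.1f%% faster\n", pctGain(parOne, parDefault))
	fmt.Printf("allocs/op:  %.1f%% fewer with concurrency=1\n", pctDrop(18, 1))
}
```

The results land at roughly 24%, 30%, and 94%, in line with the rounded figures quoted in the findings.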
The comment at compress_decode.go:30-31 said
// zstd concurrency is hardcoded to 1: benchmarks show higher values hurt
// throughput for single 2MiB frame decodes.
but the code called zstd.NewReader(r) with no options, leaving the library's
default decoder concurrency (more than 1 on multi-core machines) in effect,
so each decoder used several internal worker goroutines. The previous commit's
benchmark confirms the comment's intent: under the production workload
(many concurrent decoders sharing a sync.Pool), concurrency=1 wins
~30% on aggregate throughput and reduces allocations 18→1 per decode.
With sufficiently many concurrent callers, using the decoder's built-in concurrency on our 2 MiB frames hurt throughput. Setting the decoder concurrency to 1 also reduces allocations.