Commit fe836a6
GH-3530: Bypass Hadoop codec abstraction to optimize compression performance
Some of the Parquet compression codecs rely on Hadoop's CompressionCodec.
Performance tests that isolate CPU utilization show that the Hadoop
abstraction introduces considerable overhead.
This PR removes that overhead for Snappy, LZ4_RAW, ZSTD, GZIP, LZO, and BROTLI.
It also migrates Brotli from jbrotli to brotli4j.
Bypass Hadoop CompressionCodec for Snappy (xerial JNI), LZ4_RAW (airlift),
ZSTD (zstd-jni), GZIP (JDK), LZO (airlift), and BROTLI (brotli4j) in both
CodecFactory and DirectCodecFactory, eliminating per-page codec pool lookups,
stream wrapper allocation, and unnecessary buffer copies.
ZSTD: replace streaming ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer
with reusable ZstdCompressCtx/ZstdDecompressCtx single-call APIs.
GZIP: bypass Hadoop's GzipCodec and its codec-pool/stream-wrapper overhead
with direct JDK GZIPOutputStream/GZIPInputStream. Compression level is
read from the existing "zlib.compress.level" Hadoop configuration key.
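The direct JDK path for GZIP can be sketched as below. This is a minimal illustration, not the code in this commit: the class and method names are hypothetical, the level is passed in as a plain int (how it is derived from "zlib.compress.level" is not shown), and the subclass exists only because GZIPOutputStream has no level parameter of its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: GZIP through the JDK streams only, no Hadoop codec or codec pool.
public class GzipDirectSketch {

  // GZIPOutputStream exposes no level argument; a subclass can set it on the
  // protected Deflater field 'def' before any data is written.
  static final class LevelGzipOutputStream extends GZIPOutputStream {
    LevelGzipOutputStream(OutputStream out, int level) throws IOException {
      super(out);
      def.setLevel(level);
    }
  }

  public static byte[] compress(byte[] input, int level) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new LevelGzipOutputStream(sink, level)) {
      gz.write(input);
    }
    return sink.toByteArray();
  }

  // The caller is assumed to know the uncompressed size, so the output
  // array is allocated once and filled in place.
  public static byte[] decompress(byte[] compressed, int uncompressedSize) throws IOException {
    byte[] out = new byte[uncompressedSize];
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      int off = 0;
      while (off < out.length) {
        int n = gz.read(out, off, out.length - off);
        if (n < 0) {
          throw new IOException("gzip stream ended before " + uncompressedSize + " bytes");
        }
        off += n;
      }
    }
    return out;
  }
}
```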
LZO: bypass the GPL-licensed com.hadoop.compression.lzo.LzoCodec entirely
using aircompressor's LzoHadoopStreams (Apache 2.0). The framing format
(big-endian length-prefixed blocks) is wire-compatible with Hadoop's LzoCodec,
so existing LZO Parquet files remain readable. Removes the GPL dependency
for LZO support. Uncomment previously disabled LZO benchmarks and tests.
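To make "big-endian length-prefixed blocks" concrete, here is a rough sketch of such a frame. The exact layout is an assumption for illustration (a 4-byte uncompressed length, then a 4-byte compressed length, then the payload); the authoritative layout is whatever aircompressor's LzoHadoopStreams and Hadoop's LzoCodec actually emit. All names are hypothetical and no LZO compression is performed here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a big-endian length-prefixed block.
// Assumed layout: [uncompressedLength:int32][compressedLength:int32][payload].
public class LengthPrefixedFrameSketch {

  public static final class Frame {
    public final int uncompressedLength;
    public final byte[] payload;

    Frame(int uncompressedLength, byte[] payload) {
      this.uncompressedLength = uncompressedLength;
      this.payload = payload;
    }
  }

  // Writes both length prefixes in big-endian order, followed by the payload bytes.
  public static byte[] frame(int uncompressedLength, byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(8 + payload.length).order(ByteOrder.BIG_ENDIAN);
    buf.putInt(uncompressedLength);
    buf.putInt(payload.length);
    buf.put(payload);
    return buf.array();
  }

  // Reads the two prefixes back and validates the payload length against the input.
  public static Frame parse(byte[] framed) {
    ByteBuffer buf = ByteBuffer.wrap(framed).order(ByteOrder.BIG_ENDIAN);
    int uncompressedLength = buf.getInt();
    int compressedLength = buf.getInt();
    if (compressedLength < 0 || compressedLength > buf.remaining()) {
      throw new IllegalArgumentException("bad block length: " + compressedLength);
    }
    byte[] payload = new byte[compressedLength];
    buf.get(payload);
    return new Frame(uncompressedLength, payload);
  }
}
```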
BROTLI: migrate from abandoned brotli-codec (jbrotli, 2016, x86-only) to
brotli4j 1.23.0 (com.aayushatharva.brotli4j) which supports 10 platforms
including linux/darwin/windows aarch64. brotli4j is a runtime-only optional
dependency accessed via reflection (Encoder.compress and Decoder.decompress)
to avoid a compile-time dependency. Uses Decoder.decompress(byte[], int, int)
instead of DirectDecompress to avoid loading classes that reference Netty.
Remove non-aarch64 Maven profile guards and aarch64 test skips.
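The reflection-based access to an optional runtime dependency might look roughly like the following. This is a generic sketch with hypothetical names, not the lookup code in this commit; the class name is supplied by the caller, and a missing class or method yields an empty result instead of a link-time failure.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

// Hypothetical sketch: resolve a static method of an optional dependency at runtime,
// so the calling code compiles without that dependency on the classpath.
public class OptionalDependencySketch {

  public static Optional<Method> findStatic(
      String className, String methodName, Class<?>... parameterTypes) {
    try {
      Class<?> cls = Class.forName(className);
      Method method = cls.getMethod(methodName, parameterTypes);
      return Modifier.isStatic(method.getModifiers()) ? Optional.of(method) : Optional.empty();
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Dependency absent, incompatible, or failed to load: report "not available".
      return Optional.empty();
    }
  }
}
```

A caller would resolve the method once, cache the Method object, and invoke it with `method.invoke(null, args...)` for each page, falling back to an "unsupported codec" error when the Optional is empty.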
ByteBuffer decompressors use native APIs with slice + manual position
advancement pattern (matching DirectCodecFactory.BaseDecompressor):
- Snappy: Snappy.uncompress(slice, slice)
- ZSTD: Zstd.decompress(slice, slice)
- LZ4_RAW: decompressor.decompress(slice, slice)
- GZIP: ByteBufferInputStream.wrap(slice) -> GZIPInputStream
- LZO: ByteBufferInputStream.wrap(slice) -> LzoHadoopInputStream
- BROTLI: byte[] copy through Decoder.decompress (no direct ByteBuffer API)
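The slice + manual position advancement pattern listed above can be sketched as follows. The JDK Inflater stands in for the native codec calls (it is not one of the codecs in this commit), the class and method names are hypothetical, and the ByteBuffer overloads of Inflater require Java 11 or later.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: decompress through bounded slices, then advance the
// caller's buffers by hand, since operations on a slice never move the original.
public class SliceDecompressSketch {

  public static void decompress(
      ByteBuffer input, int compressedSize, ByteBuffer output, int uncompressedSize)
      throws DataFormatException {
    // Slices start at the current positions and are capped to exactly one page.
    ByteBuffer in = input.slice();
    in.limit(compressedSize);
    ByteBuffer out = output.slice();
    out.limit(uncompressedSize);

    Inflater inflater = new Inflater();
    try {
      inflater.setInput(in);
      while (out.hasRemaining() && !inflater.finished()) {
        int n = inflater.inflate(out);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new DataFormatException("compressed page is truncated or needs a dictionary");
        }
      }
      if (out.hasRemaining()) {
        throw new DataFormatException("page decompressed to fewer bytes than expected");
      }
    } finally {
      inflater.end();
    }

    // Manual advancement: the originals are still where they were before the call.
    input.position(input.position() + compressedSize);
    output.position(output.position() + uncompressedSize);
  }

  // Helper used only to produce input for the sketch above.
  public static byte[] deflate(byte[] raw) {
    Deflater deflater = new Deflater();
    try {
      deflater.setInput(raw);
      deflater.finish();
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      byte[] chunk = new byte[256];
      while (!deflater.finished()) {
        int n = deflater.deflate(chunk);
        sink.write(chunk, 0, n);
      }
      return sink.toByteArray();
    } finally {
      deflater.end();
    }
  }
}
```

Capping the slices keeps the decoder from reading into the next page or writing past the expected output, whatever the codec does with the positions of the buffers it is handed.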
Add BytesInput.toByteArray() zero-copy override in ByteArrayBytesInput.
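One way a zero-copy toByteArray() could work is sketched below. This is an assumption about the idea, not the actual ByteArrayBytesInput code: the class is hypothetical, and it returns the backing array directly only when the view covers the whole array, copying otherwise.

```java
import java.util.Arrays;

// Hypothetical sketch of a zero-copy toByteArray() for an array-backed byte view.
public class ZeroCopyBytesSketch {

  public static final class ArrayBackedBytes {
    private final byte[] array;
    private final int offset;
    private final int length;

    public ArrayBackedBytes(byte[] array, int offset, int length) {
      if (offset < 0 || length < 0 || offset > array.length - length) {
        throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
      }
      this.array = array;
      this.offset = offset;
      this.length = length;
    }

    public byte[] toByteArray() {
      // Whole-array view: hand back the backing array itself, no copy.
      // The caller then shares storage with this object and must not mutate it.
      if (offset == 0 && length == array.length) {
        return array;
      }
      // Partial view: a copy is the only way to return an exactly sized array.
      return Arrays.copyOfRange(array, offset, offset + length);
    }
  }
}
```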
Add benchmarks: CompressionBenchmark, CpuReadBenchmark, CpuWriteBenchmark,
FileReadBenchmark, FileWriteBenchmark, InMemoryInputFile, InMemoryOutputFile,
ConcurrentReadWriteBenchmark. Remove encoding/row-group benchmarks.
Add 15 new tests in TestDirectCodecFactory, 3 new tests in TestBytesInput.
1 parent c7e7acb commit fe836a6
22 files changed
Lines changed: 2441 additions & 120 deletions
File tree
- parquet-benchmarks
  - src/main/java/org/apache/parquet/benchmarks
- parquet-common/src
  - main/java/org/apache/parquet/bytes
  - test/java/org/apache/parquet/bytes
- parquet-hadoop
  - src
    - main/java/org/apache/parquet/hadoop
    - test/java/org/apache/parquet/hadoop