
GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode)#3523

Open
iemejia wants to merge 4 commits into apache:master from iemejia:perf-rle-buffer-reuse

Conversation


@iemejia iemejia commented Apr 21, 2026

Summary

Closes #3522.

RunLengthBitPackingHybridDecoder allocates a new int[] and byte[] on every PACKED run during decode. The code itself flagged this with a // TODO: reuse a buffer comment. This PR resolves the TODO by reusing the buffers across runs within the same decoder instance, growing them lazily only when a larger run is encountered.

Also adds a currentBufferLength field to track the logical active-region length in packedValuesBuffer (since packedValuesBuffer.length may now exceed the current run's size after a prior larger run grew it).
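
A minimal sketch of the reuse-and-grow pattern described above (illustrative names only; the real decoder's fields are packedValuesBuffer and currentBufferLength):

```java
// Sketch of the lazy-grow buffer reuse pattern described above.
// Names are illustrative, not the actual Parquet code.
final class ReusableIntBuffer {
  private int[] buffer = new int[0]; // grown lazily, never shrunk
  private int length;                // logical length of the active region

  // Ensure capacity for `required` values; reuse the old array when it fits.
  int[] acquire(int required) {
    if (buffer.length < required) {
      buffer = new int[required];    // allocate only when a larger run arrives
    }
    length = required;               // capacity may exceed the logical size
    return buffer;
  }

  int length() {
    return length;
  }
}
```

Because the array can outlive a run that grew it, the logical length must be tracked separately from buffer.length; that is exactly what currentBufferLength does in the PR.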

Benchmark

RleDictionaryIndexDecodingBenchmark (added in #3512) isolates the RLE/bit-packed dictionary-id decode path. 100k INT32 dictionary IDs, BIT_WIDTH=10, JMH -wi 5 -i 10 -f 2 (20 measurement iterations):

Pattern         | master (ops/s) | optimized (ops/s) | Improvement
SEQUENTIAL      | 93,061,521     | 113,856,860       | +22.3%
RANDOM          | 92,929,824     | 114,238,638       | +22.9%
LOW_CARDINALITY | 92,813,229     | 115,271,347       | +24.2%

End-to-end FileReadBenchmark sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages; the isolated micro-benchmark shows the true magnitude on the affected code path.

Validation

  • parquet-column: 573 tests pass
  • TestRunLengthBitPackingHybridEncoder: 9 tests pass (these round-trip values through the decoder)
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Scope

A 17-LOC change to a single file; self-contained and straightforward to verify (it resolves the existing TODO).

Related

Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit (ba52f82c3) have already been submitted as #3504 and #3506.

How to reproduce

The benchmark is added in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    'RleDictionaryIndexDecodingBenchmark' -wi 5 -i 10 -f 2

Contributor

@arouel arouel left a comment


Have you seen the PR #3467 which addresses the same issue? I would prefer if we help each other in reviewing instead of creating more PRs of the same kind.

Contributor

@Fokko Fokko left a comment


Thanks for fixing this outstanding TODO @iemejia 🙌

I left one comment because I think it would be good to update to the non-deprecated API of parquet-encoding while at it. WDYT?

int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
bytesToRead = Math.min(bytesToRead, in.available());
new DataInputStream(in).readFully(bytes, 0, bytesToRead);
new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
Contributor


I think this resource is intentionally left unclosed; should we add a comment explaining that?

valueIndex < currentCount;
valueIndex += 8, byteIndex += bitWidth) {
packer.unpack8Values(bytes, byteIndex, currentBuffer, valueIndex);
packer.unpack8Values(packedBytesBuffer, byteIndex, currentBuffer, valueIndex);
Contributor


I noticed that unpack8Values(final byte[] input, ...) is deprecated, should we switch to the ByteBuffer overload while we're reworking this?

diff --git a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
index da9db00c7..14ec1a87d 100644
--- a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
+++ b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
@@ -21,6 +21,7 @@ package org.apache.parquet.column.values.rle;
 import java.io.DataInputStream;
 import java.io.IOException;
 import java.io.InputStream;
+import java.nio.ByteBuffer;
 import org.apache.parquet.Preconditions;
 import org.apache.parquet.bytes.BytesUtils;
 import org.apache.parquet.column.values.bitpacking.BytePacker;
@@ -52,7 +53,7 @@ public class RunLengthBitPackingHybridDecoder {
 
   // Reusable buffers to avoid per-run allocation in PACKED mode
   private int[] packedValuesBuffer = new int[0];
-  private byte[] packedBytesBuffer = new byte[0];
+  private ByteBuffer packedBytesBuffer = ByteBuffer.allocate(0);
 
   public RunLengthBitPackingHybridDecoder(int bitWidth, InputStream in) {
     LOG.debug("decoding bitWidth {}", bitWidth);
@@ -102,13 +103,13 @@ public class RunLengthBitPackingHybridDecoder {
         }
         currentBuffer = packedValuesBuffer;
         int bytesRequired = numGroups * bitWidth;
-        if (packedBytesBuffer.length < bytesRequired) {
-          packedBytesBuffer = new byte[bytesRequired];
+        if (packedBytesBuffer.capacity() < bytesRequired) {
+          packedBytesBuffer = ByteBuffer.allocate(bytesRequired);
         }
         // At the end of the file RLE data though, there might not be that many bytes left.
         int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
         bytesToRead = Math.min(bytesToRead, in.available());
-        new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
+        new DataInputStream(in).readFully(packedBytesBuffer.array(), 0, bytesToRead);
         for (int valueIndex = 0, byteIndex = 0;
             valueIndex < currentCount;
             valueIndex += 8, byteIndex += bitWidth) {

iemejia added 3 commits May 1, 2026 23:10
…dDecoder PACKED path

Allocate the int[] values buffer and byte[] read-staging buffer once per
decoder and grow them lazily, instead of allocating fresh arrays on every
PACKED run. Resolves the existing "TODO: reuse a buffer" comment.

A new currentBufferLength field tracks the logical length of the active
region in packedValuesBuffer (which may now exceed the current run's
size after a prior larger run grew it).

Benchmark (RleDictionaryIndexDecodingBenchmark, 100k INT32, BIT_WIDTH=10,
JMH -wi 5 -i 10 -f 2):

  Pattern         | master ops/s | optimized ops/s | Improvement
  SEQUENTIAL      |  93,061,521  |   113,856,860   |   +22.3%
  RANDOM          |  92,929,824  |   114,238,638   |   +22.9%
  LOW_CARDINALITY |  92,813,229  |   115,271,347   |   +24.2%

End-to-end FileReadBenchmark sees ~2% improvement (RLE decoding is a
small fraction of full file reads).

Validation: 573 parquet-column tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
…eam to direct ByteBuffer

Replace the InputStream-based I/O in RunLengthBitPackingHybridDecoder with
direct ByteBuffer access using LITTLE_ENDIAN byte order. This eliminates the
InputStream/DataInputStream abstraction layer and enables more efficient
reads via buffer.get(), buffer.getShort(), buffer.getInt() intrinsics.

Changes:
- Add ByteBuffer overloads for BytesUtils.readUnsignedVarInt() and
  readIntLittleEndianPaddedOnBitWidth()
- Update all callers (DictionaryValuesReader,
  RunLengthBitPackingHybridValuesReader, ColumnReaderBase) to pass
  ByteBuffer instead of InputStream
- Remove IOException from readInt() since ByteBuffer operations don't throw
  checked exceptions, simplifying all call sites
- Update tests to use ByteBuffer API directly

All 573 parquet-column tests pass.
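For illustration, this is the general shape of an unsigned varint read over a ByteBuffer, as mentioned for the BytesUtils.readUnsignedVarInt overload above. This is a standalone sketch, not the Parquet implementation; note that ByteBuffer access throws no checked exceptions, which is what lets the call sites drop IOException:

```java
import java.nio.ByteBuffer;

final class VarIntSketch {
  // Read an unsigned LEB128-style varint: 7 payload bits per byte,
  // high bit set on every byte except the last.
  static int readUnsignedVarInt(ByteBuffer in) {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = in.get() & 0xFF;            // plain buffer read, no IOException
      value |= (b & 0x7F) << shift;   // splice in the low 7 bits
      shift += 7;
    } while ((b & 0x80) != 0);        // continuation bit clear => done
    return value;
  }
}
```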
Buffer four 8-value groups (32 values) before packing, then use the
packer's pack32Values entry point instead of calling pack8Values four
separate times. This reduces per-group overhead and enables the packer's
optimized 32-value code path.

When fewer than 32 values remain at run end, flushBitPackedValues falls
back to per-8 packing via pack8Values.

All 573 parquet-column tests pass.
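The batching structure the commit describes can be sketched as follows. This is a simplified stand-in (the real code goes through parquet-encoding's BytePacker and its pack32Values entry point; here the 32-value batch is just four per-8 calls, to show only the loop shape and the residual fallback):

```java
final class PackSketch {
  // Pack 8 values of `bitWidth` bits each into `bitWidth` bytes (LSB-first),
  // mirroring the per-8 unit that pack8Values operates on.
  static void pack8(int[] in, int inPos, int bitWidth, byte[] out, int outPos) {
    long acc = 0;
    int bits = 0;
    int outIdx = outPos;
    for (int i = 0; i < 8; i++) {
      acc |= ((long) in[inPos + i] & ((1L << bitWidth) - 1)) << bits;
      bits += bitWidth;
      while (bits >= 8) {             // flush whole bytes as they fill up
        out[outIdx++] = (byte) acc;
        acc >>>= 8;
        bits -= 8;
      }
    }
  }

  // Batch four 8-value groups (32 values) per iteration, then fall back to
  // per-8 packing for the residual tail -- the structure of the change above.
  static void packAll(int[] in, int bitWidth, byte[] out) {
    int i = 0, o = 0;
    for (; i + 32 <= in.length; i += 32, o += 4 * bitWidth) {
      for (int g = 0; g < 4; g++) {   // stand-in for a single pack32Values call
        pack8(in, i + 8 * g, bitWidth, out, o + g * bitWidth);
      }
    }
    for (; i + 8 <= in.length; i += 8, o += bitWidth) {
      pack8(in, i, bitWidth, out, o); // residual per-8 fallback at run end
    }
  }
}
```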
@iemejia iemejia force-pushed the perf-rle-buffer-reuse branch from ba5d027 to 38a48ed Compare May 1, 2026 21:11
…KED branch

Symmetric to the encoder's pack32Values fast path, the decoder's PACKED
branch now batches 4 groups (32 values) into a single unpack32Values call
instead of looping unpack8Values four times. Falls back to unpack8Values
for residual <4-group tails.

This benefits long PACKED runs (>=32 values) by reducing loop overhead
and enabling the packer's optimized 32-value code path.

Combined benchmark for the full par9 branch (all 4 commits: buffer reuse +
ByteBuffer conversion + pack32Values encoder + unpack32Values decoder):

RleDictionaryIndexDecodingBenchmark (100k dictionary IDs, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL        603,445,362     698,066,810     +16% (1.16x)
  RANDOM            613,691,096     681,685,407     +11% (1.11x)
  LOW_CARDINALITY   611,963,736     686,200,341     +12% (1.12x)

IntEncodingBenchmark.decodeDictionary (100k INT32 values, full dictionary
decode path including RLE index decode):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL        418,357,276     539,458,940     +29% (1.29x)
  RANDOM            417,041,197     527,231,831     +26% (1.26x)
  LOW_CARDINALITY   605,354,083     628,283,691     +4%
  HIGH_CARDINALITY  416,731,808     535,763,242     +29% (1.29x)

All 573 parquet-column tests pass.
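The decoder-side loop is symmetric; a simplified sketch of the batched unpack with the residual per-8 tail (again a stand-in for BytePacker's unpack32Values, shown here as four per-8 steps purely to illustrate the loop structure):

```java
final class UnpackSketch {
  // Unpack 8 values of `bitWidth` bits each from `bitWidth` bytes (LSB-first),
  // mirroring the per-8 unit that unpack8Values operates on.
  static void unpack8(byte[] in, int inPos, int bitWidth, int[] out, int outPos) {
    long acc = 0;
    int bits = 0;
    int inIdx = inPos;
    for (int i = 0; i < 8; i++) {
      while (bits < bitWidth) {       // pull in bytes until a value fits
        acc |= ((long) in[inIdx++] & 0xFF) << bits;
        bits += 8;
      }
      out[outPos + i] = (int) (acc & ((1L << bitWidth) - 1));
      acc >>>= bitWidth;
      bits -= bitWidth;
    }
  }

  // Decode in 32-value batches, then per-8 for the residual tail.
  static void unpackAll(byte[] in, int bitWidth, int[] out) {
    int i = 0, o = 0;
    for (; o + 32 <= out.length; o += 32, i += 4 * bitWidth) {
      for (int g = 0; g < 4; g++) {   // stand-in for a single unpack32Values call
        unpack8(in, i + g * bitWidth, bitWidth, out, o + 8 * g);
      }
    }
    for (; o + 8 <= out.length; o += 8, i += bitWidth) {
      unpack8(in, i, bitWidth, out, o); // residual <4-group tail
    }
  }
}
```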
@iemejia iemejia force-pushed the perf-rle-buffer-reuse branch from 38a48ed to af02aef Compare May 1, 2026 22:07


Development

Successfully merging this pull request may close these issues.

Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode)
