
GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode)#3523

Open
iemejia wants to merge 4 commits into apache:master from iemejia:perf-rle-buffer-reuse

Conversation


@iemejia iemejia commented Apr 21, 2026

Summary

Closes #3522.

RunLengthBitPackingHybridDecoder allocates a new int[] and byte[] on every PACKED run during decode. The code itself flagged this with a // TODO: reuse a buffer comment. This PR resolves the TODO by reusing the buffers across runs within the same decoder instance, growing them lazily only when a larger run is encountered.

Also adds a currentBufferLength field to track the logical active-region length in packedValuesBuffer (since packedValuesBuffer.length may now exceed the current run's size after a prior larger run grew it).
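
A minimal sketch of the reuse-and-grow pattern described above (illustrative names only; the real decoder's fields are packedValuesBuffer and currentBufferLength):

```java
// Sketch of the lazy-grow buffer reuse pattern described above.
// Names are illustrative, not the actual Parquet code.
final class ReusableIntBuffer {
  private int[] buffer = new int[0]; // grown lazily, never shrunk
  private int length;                // logical length of the active region

  // Ensure capacity for `required` values; reuse the old array when it fits.
  int[] acquire(int required) {
    if (buffer.length < required) {
      buffer = new int[required];    // allocate only when a larger run arrives
    }
    length = required;               // capacity may exceed the logical size
    return buffer;
  }

  int length() {
    return length;
  }
}
```

Because the array can outlive a run that grew it, the logical length must be tracked separately from buffer.length; that is exactly what currentBufferLength does in the PR.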

Benchmark

RleDictionaryIndexDecodingBenchmark (added in #3512) isolates the RLE/bit-packed dictionary-id decode path. 100k INT32 dictionary IDs, BIT_WIDTH=10, JMH -wi 5 -i 10 -f 2 (20 measurement iterations):

Pattern         | master (ops/s) | optimized (ops/s) | Improvement
SEQUENTIAL      | 93,061,521     | 113,856,860       | +22.3%
RANDOM          | 92,929,824     | 114,238,638       | +22.9%
LOW_CARDINALITY | 92,813,229     | 115,271,347       | +24.2%

End-to-end FileReadBenchmark sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages; the isolated micro-benchmark shows the true magnitude on the affected code path.

Validation

  • parquet-column: 573 tests pass
  • TestRunLengthBitPackingHybridEncoder: 9 tests pass (these round-trip values through the decoder)
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Scope

A 17-LOC change to a single file; self-contained and straightforward to verify (it resolves the existing TODO).

Related

Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit (ba52f82c3) have already been submitted as #3504 and #3506.

How to reproduce

The benchmark is added in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    'RleDictionaryIndexDecodingBenchmark' -wi 5 -i 10 -f 2

Contributor

@arouel arouel left a comment


Have you seen the PR #3467 which addresses the same issue? I would prefer if we help each other in reviewing instead of creating more PRs of the same kind.

Contributor

@Fokko Fokko left a comment


Thanks for fixing this outstanding TODO @iemejia 🙌

I left one comment because I think it would be good to update to the non-deprecated API of parquet-encoding while at it. WDYT?

int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
bytesToRead = Math.min(bytesToRead, in.available());
new DataInputStream(in).readFully(bytes, 0, bytesToRead);
new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
Contributor


I think this resource is intentionally left unclosed; should we add a comment explaining that?

valueIndex < currentCount;
valueIndex += 8, byteIndex += bitWidth) {
packer.unpack8Values(bytes, byteIndex, currentBuffer, valueIndex);
packer.unpack8Values(packedBytesBuffer, byteIndex, currentBuffer, valueIndex);
Contributor


I noticed that unpack8Values(final byte[] input, ...) is deprecated, should we switch to the ByteBuffer overload while we're reworking this?

diff --git a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
index da9db00c7..14ec1a87d 100644
--- a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
+++ b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
@@ -21,6 +21,7 @@ package org.apache.parquet.column.values.rle;
 import java.io.DataInputStream;
 import java.io.IOException;
 import java.io.InputStream;
+import java.nio.ByteBuffer;
 import org.apache.parquet.Preconditions;
 import org.apache.parquet.bytes.BytesUtils;
 import org.apache.parquet.column.values.bitpacking.BytePacker;
@@ -52,7 +53,7 @@ public class RunLengthBitPackingHybridDecoder {
 
   // Reusable buffers to avoid per-run allocation in PACKED mode
   private int[] packedValuesBuffer = new int[0];
-  private byte[] packedBytesBuffer = new byte[0];
+  private ByteBuffer packedBytesBuffer = ByteBuffer.allocate(0);
 
   public RunLengthBitPackingHybridDecoder(int bitWidth, InputStream in) {
     LOG.debug("decoding bitWidth {}", bitWidth);
@@ -102,13 +103,13 @@ public class RunLengthBitPackingHybridDecoder {
         }
         currentBuffer = packedValuesBuffer;
         int bytesRequired = numGroups * bitWidth;
-        if (packedBytesBuffer.length < bytesRequired) {
-          packedBytesBuffer = new byte[bytesRequired];
+        if (packedBytesBuffer.capacity() < bytesRequired) {
+          packedBytesBuffer = ByteBuffer.allocate(bytesRequired);
         }
         // At the end of the file RLE data though, there might not be that many bytes left.
         int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
         bytesToRead = Math.min(bytesToRead, in.available());
-        new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
+        new DataInputStream(in).readFully(packedBytesBuffer.array(), 0, bytesToRead);
         for (int valueIndex = 0, byteIndex = 0;
             valueIndex < currentCount;
             valueIndex += 8, byteIndex += bitWidth) {

iemejia added 3 commits May 1, 2026 23:10
…dDecoder PACKED path

Allocate the int[] values buffer and byte[] read-staging buffer once per
decoder and grow them lazily, instead of allocating fresh arrays on every
PACKED run. Resolves the existing "TODO: reuse a buffer" comment.

A new currentBufferLength field tracks the logical length of the active
region in packedValuesBuffer (which may now exceed the current run's
size after a prior larger run grew it).

Benchmark (RleDictionaryIndexDecodingBenchmark, 100k INT32, BIT_WIDTH=10,
JMH -wi 5 -i 10 -f 2):

  Pattern         | master ops/s | optimized ops/s | Improvement
  SEQUENTIAL      |  93,061,521  |   113,856,860   |   +22.3%
  RANDOM          |  92,929,824  |   114,238,638   |   +22.9%
  LOW_CARDINALITY |  92,813,229  |   115,271,347   |   +24.2%

End-to-end FileReadBenchmark sees ~2% improvement (RLE decoding is a
small fraction of full file reads).

Validation: 573 parquet-column tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
…eam to direct ByteBuffer

Replace the InputStream-based I/O in RunLengthBitPackingHybridDecoder with
direct ByteBuffer access using LITTLE_ENDIAN byte order. This eliminates the
InputStream/DataInputStream abstraction layer and enables more efficient
reads via buffer.get(), buffer.getShort(), buffer.getInt() intrinsics.

Changes:
- Add ByteBuffer overloads for BytesUtils.readUnsignedVarInt() and
  readIntLittleEndianPaddedOnBitWidth()
- Update all callers (DictionaryValuesReader,
  RunLengthBitPackingHybridValuesReader, ColumnReaderBase) to pass
  ByteBuffer instead of InputStream
- Remove IOException from readInt() since ByteBuffer operations don't throw
  checked exceptions, simplifying all call sites
- Update tests to use ByteBuffer API directly

All 573 parquet-column tests pass.
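For illustration, this is the general shape of an unsigned varint read over a ByteBuffer, as mentioned for the BytesUtils.readUnsignedVarInt overload above. This is a standalone sketch, not the Parquet implementation; note that ByteBuffer access throws no checked exceptions, which is what lets the call sites drop IOException:

```java
import java.nio.ByteBuffer;

final class VarIntSketch {
  // Read an unsigned LEB128-style varint: 7 payload bits per byte,
  // high bit set on every byte except the last.
  static int readUnsignedVarInt(ByteBuffer in) {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = in.get() & 0xFF;            // plain buffer read, no IOException
      value |= (b & 0x7F) << shift;   // splice in the low 7 bits
      shift += 7;
    } while ((b & 0x80) != 0);        // continuation bit clear => done
    return value;
  }
}
```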
Buffer four 8-value groups (32 values) before packing, then use the
packer's pack32Values entry point instead of calling pack8Values four
separate times. This reduces per-group overhead and enables the packer's
optimized 32-value code path.

When fewer than 32 values remain at run end, flushBitPackedValues falls
back to per-8 packing via pack8Values.

All 573 parquet-column tests pass.
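The batching structure the commit describes can be sketched as follows. This is a simplified stand-in (the real code goes through parquet-encoding's BytePacker and its pack32Values entry point; here the 32-value batch is just four per-8 calls, to show only the loop shape and the residual fallback):

```java
final class PackSketch {
  // Pack 8 values of `bitWidth` bits each into `bitWidth` bytes (LSB-first),
  // mirroring the per-8 unit that pack8Values operates on.
  static void pack8(int[] in, int inPos, int bitWidth, byte[] out, int outPos) {
    long acc = 0;
    int bits = 0;
    int outIdx = outPos;
    for (int i = 0; i < 8; i++) {
      acc |= ((long) in[inPos + i] & ((1L << bitWidth) - 1)) << bits;
      bits += bitWidth;
      while (bits >= 8) {             // flush whole bytes as they fill up
        out[outIdx++] = (byte) acc;
        acc >>>= 8;
        bits -= 8;
      }
    }
  }

  // Batch four 8-value groups (32 values) per iteration, then fall back to
  // per-8 packing for the residual tail -- the structure of the change above.
  static void packAll(int[] in, int bitWidth, byte[] out) {
    int i = 0, o = 0;
    for (; i + 32 <= in.length; i += 32, o += 4 * bitWidth) {
      for (int g = 0; g < 4; g++) {   // stand-in for a single pack32Values call
        pack8(in, i + 8 * g, bitWidth, out, o + g * bitWidth);
      }
    }
    for (; i + 8 <= in.length; i += 8, o += bitWidth) {
      pack8(in, i, bitWidth, out, o); // residual per-8 fallback at run end
    }
  }
}
```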
@iemejia iemejia force-pushed the perf-rle-buffer-reuse branch from ba5d027 to 38a48ed Compare May 1, 2026 21:11
…KED branch

Symmetric to the encoder's pack32Values fast path, the decoder's PACKED
branch now batches 4 groups (32 values) into a single unpack32Values call
instead of looping unpack8Values four times. Falls back to unpack8Values
for residual <4-group tails.

This benefits long PACKED runs (>=32 values) by reducing loop overhead
and enabling the packer's optimized 32-value code path.

Combined benchmark for the full par9 branch (all 4 commits: buffer reuse +
ByteBuffer conversion + pack32Values encoder + unpack32Values decoder):

RleDictionaryIndexDecodingBenchmark (100k dictionary IDs, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL        603,445,362     698,066,810     +16% (1.16x)
  RANDOM            613,691,096     681,685,407     +11% (1.11x)
  LOW_CARDINALITY   611,963,736     686,200,341     +12% (1.12x)

IntEncodingBenchmark.decodeDictionary (100k INT32 values, full dictionary
decode path including RLE index decode):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL        418,357,276     539,458,940     +29% (1.29x)
  RANDOM            417,041,197     527,231,831     +26% (1.26x)
  LOW_CARDINALITY   605,354,083     628,283,691     +4%
  HIGH_CARDINALITY  416,731,808     535,763,242     +29% (1.29x)

All 573 parquet-column tests pass.
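The decoder-side loop is symmetric; a simplified sketch of the batched unpack with the residual per-8 tail (again a stand-in for BytePacker's unpack32Values, shown here as four per-8 steps purely to illustrate the loop structure):

```java
final class UnpackSketch {
  // Unpack 8 values of `bitWidth` bits each from `bitWidth` bytes (LSB-first),
  // mirroring the per-8 unit that unpack8Values operates on.
  static void unpack8(byte[] in, int inPos, int bitWidth, int[] out, int outPos) {
    long acc = 0;
    int bits = 0;
    int inIdx = inPos;
    for (int i = 0; i < 8; i++) {
      while (bits < bitWidth) {       // pull in bytes until a value fits
        acc |= ((long) in[inIdx++] & 0xFF) << bits;
        bits += 8;
      }
      out[outPos + i] = (int) (acc & ((1L << bitWidth) - 1));
      acc >>>= bitWidth;
      bits -= bitWidth;
    }
  }

  // Decode in 32-value batches, then per-8 for the residual tail.
  static void unpackAll(byte[] in, int bitWidth, int[] out) {
    int i = 0, o = 0;
    for (; o + 32 <= out.length; o += 32, i += 4 * bitWidth) {
      for (int g = 0; g < 4; g++) {   // stand-in for a single unpack32Values call
        unpack8(in, i + g * bitWidth, bitWidth, out, o + 8 * g);
      }
    }
    for (; o + 8 <= out.length; o += 8, i += bitWidth) {
      unpack8(in, i, bitWidth, out, o); // residual <4-group tail
    }
  }
}
```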
@iemejia iemejia force-pushed the perf-rle-buffer-reuse branch from 38a48ed to af02aef Compare May 1, 2026 22:07


Development

Successfully merging this pull request may close these issues.

Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode)
