GH-3522: Use unpack32Values fast path in RLE hybrid decoder PACKED branch

iemejia · iemejia · commit af02aefceb82 · 2026-05-02T00:06:17.000+02:00
Symmetric to the encoder's pack32Values fast path, the decoder's PACKED
branch now batches 4 groups (32 values) into a single unpack32Values call
instead of looping unpack8Values four times. Falls back to unpack8Values
for residual <4-group tails.

This benefits long PACKED runs (>=32 values) by reducing loop overhead
and enabling the packer's optimized 32-value code path.
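
The batching described above can be sketched in isolation. This is a minimal
illustration, not the Parquet implementation: the class PackedRunSketch, the
generic bit-by-bit unpack helper, and the decodePackedBy8 baseline are
hypothetical stand-ins for the generated per-width packers. It assumes values
are packed LSB-first, so one group of 8 values occupies exactly bitWidth bytes
and 32 values occupy 4 * bitWidth bytes.

```java
import java.util.Arrays;

class PackedRunSketch {
  // Hypothetical stand-in for a packer: reads `count` values of `bitWidth` bits
  // each, LSB-first, starting at byte offset inPos.
  static void unpack(byte[] in, int inPos, int[] out, int outPos, int bitWidth, int count) {
    long bitPos = (long) inPos * 8;
    for (int i = 0; i < count; i++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++, bitPos++) {
        int bit = (in[(int) (bitPos >>> 3)] >>> (int) (bitPos & 7)) & 1;
        value |= bit << b;
      }
      out[outPos + i] = value;
    }
  }

  static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    unpack(in, inPos, out, outPos, bitWidth, 8);
  }

  static void unpack32Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    unpack(in, inPos, out, outPos, bitWidth, 32);
  }

  // Batched decode: 4 groups (32 values) per call while possible, then a
  // per-group loop for the remaining 0 to 3 groups.
  static int[] decodePacked(byte[] packed, int bitWidth, int numGroups) {
    int[] out = new int[numGroups * 8];
    int groupIdx = 0;
    int byteIndex = 0;
    final int step32 = bitWidth * 4; // bytes consumed by 32 values
    while (groupIdx + 4 <= numGroups) {
      unpack32Values(packed, byteIndex, out, groupIdx * 8, bitWidth);
      groupIdx += 4;
      byteIndex += step32;
    }
    for (; groupIdx < numGroups; groupIdx++, byteIndex += bitWidth) {
      unpack8Values(packed, byteIndex, out, groupIdx * 8, bitWidth);
    }
    return out;
  }

  // Baseline: one unpack8Values call per group, as in the loop being replaced.
  static int[] decodePackedBy8(byte[] packed, int bitWidth, int numGroups) {
    int[] out = new int[numGroups * 8];
    for (int g = 0, byteIndex = 0; g < numGroups; g++, byteIndex += bitWidth) {
      unpack8Values(packed, byteIndex, out, g * 8, bitWidth);
    }
    return out;
  }

  public static void main(String[] args) {
    // 6 groups at bitWidth 3: one 32-value call, then two 8-value calls.
    byte[] packed = new byte[6 * 3];
    for (int i = 0; i < packed.length; i++) {
      packed[i] = (byte) (i * 37 + 11);
    }
    int[] batched = decodePacked(packed, 3, 6);
    int[] baseline = decodePackedBy8(packed, 3, 6);
    System.out.println(Arrays.equals(batched, baseline));
  }
}
```

The sketch only demonstrates that the index arithmetic of the two loops agrees;
any speedup in the real decoder comes from the packer's specialized 32-value
code, which this generic helper does not model.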

Combined benchmark for the full par9 branch (all 4 commits: buffer reuse +
ByteBuffer conversion + pack32Values encoder + unpack32Values decoder):

RleDictionaryIndexDecodingBenchmark (100k dictionary IDs, JMH -wi 3 -i 5 -f 1):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL        603,445,362     698,066,810     +16% (1.16x)
  RANDOM            613,691,096     681,685,407     +11% (1.11x)
  LOW_CARDINALITY   611,963,736     686,200,341     +12% (1.12x)

IntEncodingBenchmark.decodeDictionary (100k INT32 values, full dictionary
decode path including RLE index decode):

  Pattern           Before (ops/s)   After (ops/s)   Improvement
  SEQUENTIAL        418,357,276     539,458,940     +29% (1.29x)
  RANDOM            417,041,197     527,231,831     +26% (1.26x)
  LOW_CARDINALITY   605,354,083     628,283,691     +4% (1.04x)
  HIGH_CARDINALITY  416,731,808     535,763,242     +29% (1.29x)

All 573 parquet-column tests pass.
diff --git a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
@@ -114,10 +114,18 @@ private void readNext() {
         int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
         bytesToRead = Math.min(bytesToRead, buffer.remaining());
         buffer.get(packedBytesBuffer, 0, bytesToRead);
-        for (int valueIndex = 0, byteIndex = 0;
-            valueIndex < currentCount;
-            valueIndex += 8, byteIndex += bitWidth) {
-          packer.unpack8Values(packedBytesBuffer, byteIndex, currentBuffer, valueIndex);
+        // Unpack 32 values (4 groups) at a time when possible — symmetric to the encoder's
+        // pack32Values fast path. Falls back to unpack8Values for any residual groups.
+        int groupIdx = 0;
+        int byteIndex = 0;
+        final int step32 = bitWidth * 4;
+        while (groupIdx + 4 <= numGroups) {
+          packer.unpack32Values(packedBytesBuffer, byteIndex, currentBuffer, groupIdx * 8);
+          groupIdx += 4;
+          byteIndex += step32;
+        }
+        for (; groupIdx < numGroups; groupIdx++, byteIndex += bitWidth) {
+          packer.unpack8Values(packedBytesBuffer, byteIndex, currentBuffer, groupIdx * 8);
         }
         break;
       default: