Commit 75bd3c7
GH-3499: Cache hashCode() for non-reused Binary instances
PLAIN_DICTIONARY encoding of BINARY columns repeatedly hashes Binary keys
during dictionary map lookups, but the existing Binary.hashCode()
implementations (in ByteArraySliceBackedBinary, ByteArrayBackedBinary, and
ByteBufferBackedBinary) recompute the hash byte-by-byte on every call. For
columns with many repeated values this is the dominant cost of
encodeDictionary -- we observed up to 73x slowdown vs. the cached version on
the existing JMH benchmark.
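The cost pattern described above can be illustrated with a minimal sketch. It is not Parquet code: CountingKey and DictionaryHashCost are hypothetical names, and java.util.HashMap stands in for whatever map the dictionary writer actually uses. The point is only that a hash-map lookup calls hashCode() on the probe key every time, so an uncached implementation walks the key's bytes once per value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the Parquet dictionary writer.
final class DictionaryHashCost {

  /** A key whose hashCode() is recomputed byte-by-byte on every call. */
  static final class CountingKey {
    final byte[] bytes;
    int hashCalls; // instrumentation for this example only

    CountingKey(byte[] bytes) {
      this.bytes = bytes;
    }

    @Override
    public int hashCode() {
      hashCalls++;
      int h = 1;
      for (byte b : bytes) {
        h = 31 * h + b;
      }
      return h;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof CountingKey && Arrays.equals(bytes, ((CountingKey) o).bytes);
    }
  }

  /** Dictionary-encodes count values drawn from pool; returns total hashCode() calls. */
  static int encode(CountingKey[] pool, int count) {
    Map<CountingKey, Integer> dict = new HashMap<>();
    for (int i = 0; i < count; i++) {
      CountingKey key = pool[i % pool.length];
      Integer id = dict.get(key);      // hashes the key on every lookup
      if (id == null) {
        dict.put(key, dict.size());    // hashes it again on first insertion
      }
    }
    int calls = 0;
    for (CountingKey k : pool) {
      calls += k.hashCalls;
    }
    return calls;
  }
}
```

With a small pool of distinct keys and many values, the number of hashCode() calls tracks the number of values rather than the number of distinct keys, which is the work a per-instance cache avoids.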
Cache the hash code in a single int field on Binary. Reused Binary instances
(those whose backing array can be mutated by the producer between calls) do
not cache, preserving the existing mutable-buffer semantics.
Thread safety follows the java.lang.String.hashCode() idiom: the cache is a
single int field with sentinel value 0 meaning "not yet computed". Two
threads racing on the first hashCode() call may both compute and write the
same deterministic value, which is benign. A Binary whose true hash equals 0
is recomputed on every call (acceptably rare and still correct). No volatile
or synchronization is needed; both the field load and the field store are
atomic per JLS, and the value is deterministic given the immutable byte
content.
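The idiom described above can be sketched as follows. This is a minimal, hypothetical stand-in (CachedHashBytes is not Parquet's Binary, and the field names and hash function are assumptions made for illustration); it shows the sentinel-0 cache, the single field read on the fast path, and the skip-caching rule for reused instances.

```java
// Hypothetical stand-in for a Binary-like value; not Parquet's actual class.
final class CachedHashBytes {
  private final byte[] bytes;
  private final boolean reused; // true if the producer may mutate bytes between calls
  int cachedHash;               // 0 means "not yet computed"; deliberately not volatile
  int computeCount;             // instrumentation for this example only

  CachedHashBytes(byte[] bytes, boolean reused) {
    this.bytes = bytes;
    this.reused = reused;
  }

  private int computeHash() {
    computeCount++;
    int h = 1;
    for (byte b : bytes) {
      h = 31 * h + b;
    }
    return h;
  }

  @Override
  public int hashCode() {
    int h = cachedHash;         // read the field once into a local
    if (h != 0) {
      return h;                 // fast path: already computed
    }
    h = computeHash();          // racing threads compute the same value
    if (!reused) {
      cachedHash = h;           // never cache when the bytes may change
    }
    return h;
  }
}
```

Reading the field into a local before testing it mirrors the String.hashCode() pattern: a second read of the field could otherwise observe a different value than the one that was tested.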
Implementation notes:
- The cache field is package-private (not private) so the three nested
Binary subclasses can read it directly in their hashCode() hot path.
Private fields are not inherited, so the subclasses could not refer to a
private field by its simple name, and going through an accessor would add
an extra method-call layer.
- A package-private cacheHashCode(int) helper centralises the
isBackingBytesReused check on the slow path.
- New tests in TestBinary cover (a) cached-and-stable hashCode for the three
constant Binary impls, and (b) reused Binary not returning a stale hash
after the backing buffer is replaced.
Benchmark (BinaryEncodingBenchmark.encodeDictionary, 100k BINARY values per
invocation, JMH -wi 5 -i 10 -f 3, 30 samples per row):
Cardinality / length   Before (ops/s)   After (ops/s)   Improvement
LOW / 10                   13,170,110      20,203,480   +53% (1.53x)
LOW / 100                   2,955,460      18,048,610   +511% (6.11x)
LOW / 1000                    300,693      21,933,470   +7193% (72.9x)
HIGH / 10                     847,657       1,336,238   +58% (1.58x)
HIGH / 100                    418,327       1,323,284   +216% (3.16x)
HIGH / 1000                    72,553       1,296,679   +1687% (17.9x)
The relative gain grows with string length because the per-value hash cost
(byte-loop length) grows linearly while the cached lookup is O(1). LOW
cardinality benefits even more because each unique key is hashed many more
times (once per insertion check across the 100k values).
Negative control: BinaryEncodingBenchmark.encodePlain (which writes Binary
without dictionary lookups, so does not exercise hashCode) is unchanged
within +/- 2.5% across all parameter combinations.
Allocation rate per operation is identical between baseline and optimized
(7.36 B/op for LOW/10, etc.), confirming the speedup comes from CPU saved
on hashing rather than reduced allocations.
All 575 parquet-column tests pass (was 573; +2 new tests for the cache).
1 parent 53d7842 commit 75bd3c7
2 files changed
Lines changed: 88 additions & 4 deletions
File tree
- parquet-column/src
- main/java/org/apache/parquet/io/api
- test/java/org/apache/parquet/io/api
Lines changed: 43 additions & 4 deletions
Lines changed: 45 additions & 0 deletions