Parquet: Fix variant BINARY upper bound to truncate up#16880
Open
vishnuprakaz wants to merge 2 commits into
Open
Parquet: Fix variant BINARY upper bound to truncate up#16880vishnuprakaz wants to merge 2 commits into
vishnuprakaz wants to merge 2 commits into
Conversation
nssalian
reviewed
Jun 19, 2026
| case BINARY: | ||
| ByteBuffer truncatedBuffer = | ||
| BinaryUtil.truncateBinaryMin((ByteBuffer) value.asPrimitive().get(), 16); | ||
| BinaryUtil.truncateBinaryMax((ByteBuffer) value.asPrimitive().get(), 16); |
Collaborator
There was a problem hiding this comment.
Now that BINARY uses truncateBinaryMax, the all-0xFF overflow case (returns null) is reachable now. Worth a case with an all-0xFF value longer than 16 bytes asserting the upper bound is absent?
Author
|
Thanks for the review @nssalian good catch, added a test for all-0xFF overflow case asserting the upper bound is absent. |
wombatu-kun
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
For each column, Iceberg records the smallest and largest value in a file so queries can
skip files that can't contain a match. Long values are shortened to 16 bytes to save space
but when shortening the largest value, it has to round up, so the stored "max" stays
at least as big as the real maximum. (If it rounded down, the stored max could be smaller
than a real value, which breaks file skipping.)
For
BINARYvalues inside a Variant column, this rounds the wrong way: the upper bound isshortened with
truncateBinaryMin(rounds down) instead oftruncateBinaryMax(rounds up).So the stored max can come out smaller than a value actually in the file, and a query like
binaryColumn > Xmay decide the file can't match and skip it silently returning wrong(missing) results. The lower bound is correct.
This also disagrees with the code around it: the
STRINGbranch usestruncateStringMax,the lower-bound method uses
truncateBinaryMin, and the non-variantParquetMetrics.truncateUpperBoundusestruncateBinaryMax. Introduced in #12496.What changes are included in this PR?
Use
truncateBinaryMaxfor the BINARY upper bound. One-line change.Are these changes tested?
Yes.
TestVariantMetrics.testShreddedBinaryBoundsTruncationwrites a binary Variant valuelonger than 16 bytes and checks the upper bound rounds up. Fails before the fix (
…0F10,smaller than the value), passes after. Existing tests only used a 4-byte value (under the
16-byte threshold), so the path was untested.
Are there any user-facing changes?
No API changes.