Skip to content

Parquet: Fix variant BINARY upper bound to truncate up#16880

Open
vishnuprakaz wants to merge 2 commits into
apache:mainfrom
vishnuprakaz:parquet-variant-binary-upper-bound
Open

Parquet: Fix variant BINARY upper bound to truncate up#16880
vishnuprakaz wants to merge 2 commits into
apache:mainfrom
vishnuprakaz:parquet-variant-binary-upper-bound

Conversation

@vishnuprakaz

Copy link
Copy Markdown

Rationale for this change

For each column, Iceberg records the smallest and largest value in a file so queries can
skip files that can't contain a match. Long values are shortened to 16 bytes to save space
but when shortening the largest value, it has to round up, so the stored "max" stays
at least as big as the real maximum. (If it rounded down, the stored max could be smaller
than a real value, which breaks file skipping.)

For BINARY values inside a Variant column, this rounds the wrong way: the upper bound is
shortened with truncateBinaryMin (rounds down) instead of truncateBinaryMax (rounds up).
So the stored max can come out smaller than a value actually in the file, and a query like
binaryColumn > X may decide the file can't match and skip it silently returning wrong
(missing) results. The lower bound is correct.

This also disagrees with the code around it: the STRING branch uses truncateStringMax,
the lower-bound method uses truncateBinaryMin, and the non-variant
ParquetMetrics.truncateUpperBound uses truncateBinaryMax. Introduced in #12496.

What changes are included in this PR?

Use truncateBinaryMax for the BINARY upper bound. One-line change.

Are these changes tested?

Yes. TestVariantMetrics.testShreddedBinaryBoundsTruncation writes a binary Variant value
longer than 16 bytes and checks the upper bound rounds up. Fails before the fix (…0F10,
smaller than the value), passes after. Existing tests only used a 4-byte value (under the
16-byte threshold), so the path was untested.

Are there any user-facing changes?

No API changes.

case BINARY:
ByteBuffer truncatedBuffer =
BinaryUtil.truncateBinaryMin((ByteBuffer) value.asPrimitive().get(), 16);
BinaryUtil.truncateBinaryMax((ByteBuffer) value.asPrimitive().get(), 16);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that BINARY uses truncateBinaryMax, the all-0xFF overflow case (returns null) is reachable now. Worth a case with an all-0xFF value longer than 16 bytes asserting the upper bound is absent?

@vishnuprakaz

Copy link
Copy Markdown
Author

Thanks for the review @nssalian good catch, added a test for all-0xFF overflow case asserting the upper bound is absent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants