Skip to content

Core, Data: Validate deletion-vector offset and length#16500

Open
wombatu-kun wants to merge 3 commits into
apache:mainfrom
wombatu-kun:issue/16475-validate-dv-offset-length
Open

Core, Data: Validate deletion-vector offset and length#16500
wombatu-kun wants to merge 3 commits into
apache:mainfrom
wombatu-kun:issue/16475-validate-dv-offset-length

Conversation

@wombatu-kun

@wombatu-kun wombatu-kun commented May 21, 2026

Copy link
Copy Markdown
Contributor

Closes #16475

Hostile or corrupted manifest metadata can claim a negative or near-2 GiB content_offset / content_size_in_bytes on a deletion-vector DeleteFile. The existing precondition in BaseDeleteLoader.validateDV only checks non-null and size <= Integer.MAX_VALUE, so a negative size slips through to new byte[length] and throws NegativeArraySizeException; a negative offset reaches IOUtil.readFully and produces an invalid seek. The same unguarded allocation pattern lives in DVUtil.readDV on the rewrite / DV-merge path used by mergeAndWriteDVsIfRequired.

Changes

  • Add DVUtil.validateDV as the canonical read-path validator (null offset, null size, offset >= 0, size >= 0, size < Integer.MAX_VALUE); DVUtil is now public so the data module can call it.
  • Have both BaseDeleteLoader.validateDV (data) and DVUtil.readDV (core) delegate to it. BaseDeleteLoader keeps its loader-specific referencedDataFile check.
  • For defense-in-depth, also reject negative contentOffset, negative contentSizeInBytes, and contentSizeInBytes >= Integer.MAX_VALUE in FileMetadata.Builder.build() for PUFFIN, mirroring the existing fileSizeInBytes >= 0 / recordCount >= 0 checks at the same site. Producers using the builder can no longer accidentally emit bad metadata; the read-side check still defends against corrupted on-disk manifests where BaseFile.internalSet hydrates fields directly via Avro reflection.

Tests

  • TestDVUtil (new) covers each precondition of the new validator: null offset/size, negative offset/size, the strict < Integer.MAX_VALUE boundary (rejected at Integer.MAX_VALUE, accepted at Integer.MAX_VALUE - 1), and valid zero / typical values.
  • TestFileMetadata (new) verifies the builder rejects negative DV offset, negative DV size, contentSizeInBytes >= Integer.MAX_VALUE, and still accepts valid values.
  • TestBaseDeleteLoader (new, in package org.apache.iceberg.data) uses a Mockito mock that bypasses the FileMetadata builder (which now rejects invalid values at construction), and asserts BaseDeleteLoader.loadPositionDeletes throws before any I/O is attempted.
  • Existing TestRowDelta (324 tests), TestDeletionVectorStruct, TestContentFileParser, TestDeleteFiles regression suites continue to pass.

Out of scope

PuffinReader / BlobMetadata similarly don't range-check blob offsets and lengths today, but Puffin is also used for non-DV blobs (statistics) and "valid" at that layer is a separate discussion of footer-authoritative vs. caller-supplied bounds. Leaving Puffin-level hardening as a follow-up.

Hostile or corrupted manifest metadata can claim a negative or near-2 GiB `content_offset`/`content_size_in_bytes` on a deletion-vector `DeleteFile`. The existing precondition in `BaseDeleteLoader.validateDV` only checked non-null and `size <= Integer.MAX_VALUE`, so a negative size slipped through to `new byte[length]` and threw `NegativeArraySizeException`; a negative offset reached `IOUtil.readFully` and produced an invalid seek. The same unguarded allocation pattern lives in `DVUtil.readDV` on the rewrite/DV-merge path used by `mergeAndWriteDVsIfRequired`.

Add `ContentFileUtil.validateDV` as the canonical read-path validator (null offset, null size, offset >= 0, size >= 0, size <= Integer.MAX_VALUE) and delegate from both `BaseDeleteLoader.validateDV` and `DVUtil.readDV`. `BaseDeleteLoader` keeps its loader-specific `referencedDataFile` check.

For defense-in-depth, also reject negative `contentOffset`/`contentSizeInBytes` in `FileMetadata.Builder.build()` for `PUFFIN`, mirroring the existing `fileSizeInBytes >= 0` / `recordCount >= 0` checks. Producers using the builder can no longer accidentally emit bad metadata; the read-side check still defends against corrupted on-disk manifests where `BaseFile.internalSet` hydrates fields directly via Avro reflection.

Tests cover the new validator end-to-end, the builder rejection, and an integration check that `BaseDeleteLoader.loadPositionDeletes` rejects a tampered DV before any I/O attempt.

Closes apache#16475

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preconditions.checkArgument(
dv.contentSizeInBytes() >= 0, "Invalid DV, length must be non-negative: %s", dvDesc(dv));
Preconditions.checkArgument(
dv.contentSizeInBytes() <= Integer.MAX_VALUE,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need to be strictly less than?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 30256ab.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, probably something to add in the AGENTS.md code, no emdash, no non-ascii unicode except for UTF-related experiments. separate PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #16653 to add the no-emdash / ASCII-only rule to AGENTS.md.

ContentFileUtil.isDV(deleteFile),
"Cannot read, not a deletion vector: %s",
deleteFile.location());
ContentFileUtil.validateDV(deleteFile);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this class already handles DeleteFile instances, I'd probably not modify ContentFileUtil and put the new method here. People aren't going to go looking for a DV validation method in ContentFileUtil.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 30256ab. DVUtil is now public so BaseDeleteLoader (data module) can call it.

Preconditions.checkArgument(
contentOffset >= 0, "Content offset must be non-negative for DV: %s", contentOffset);
Preconditions.checkArgument(
contentSizeInBytes >= 0,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check < Integer.MAX_VALUE here as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 30256ab.

// Wraps a well-formed DV but reports tampered (offset, size). Simulates a corrupted manifest,
// since the FileMetadata builder now rejects negative values at construction time.
private static DeleteFile tamperedDV(Long offset, Long size) {
return new Delegates.DelegatingDeleteFile(VALID_DV) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too clever and it's odd that the same PR has multiple ways to inject a delete with an invalid offset or size. I think this should pick one strategy. I think it probably makes more sense to use a mock, but I don't have a strong opinion.

Also, it would be reasonable (and straightforward) to just use the DeleteFile constructor and bypass the builder rather than doing things like this. The actual class allows these values because we don't validate and fail while filling the objects with data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 30256ab (Mockito mock).

};
}

// The validator must fire before any I/O attempt — if it does not, the loader will call this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove AI artifacts, like

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 30256ab. Removed across the whole branch diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun requested a review from rdblue May 28, 2026 02:48
* before they are consumed by a reader. Hostile or corrupted manifest metadata may otherwise
* trigger a {@link NegativeArraySizeException}, an invalid seek, or a multi-gigabyte allocation
* when the DV blob is read.
*/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, add IllegalArgumentException to the javadocs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 00d7783.

* when the DV blob is read.
*/
public static void validateDV(DeleteFile dv) {
Preconditions.checkArgument(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's enough repetition here to make factoring stuff out something to consider, though I'll leave that to others to consider.
Everything looks for non-null, positive, and sizes must be < max_int.

split it out and you've got less to test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left the five checks inline - each carries a distinct message (offset vs length, null vs negative vs 2GB) and they run through the public validateDV, so factoring wouldn't reduce the test surface. Happy to revisit if others prefer the split.

.withReferencedDataFile("/tmp/data.parquet")
.withContentOffset(0L)
.withContentSizeInBytes(Integer.MAX_VALUE)
.build())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that all FileMetadata.Builder.Builder() setters are valid, you could factor these out to have a builder which creates a valid file, then each test overrides with the invalid value, e.g

   assertThatThrownBy(() -> 
      deleteFile()
       .withContentSizeInBytes(Integer.MAX_VALUE)
       .build()))
   ...

It'd make it clearer what is being invalid in each test.

again, personal opinion and I'll yield to others

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 00d7783. Added a validDvBuilder() returning a valid builder; each test now overrides only the single invalid field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jul 3, 2026
@wombatu-kun

Copy link
Copy Markdown
Contributor Author

no stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deletion-vector length and offset are not validated enough to stop crash or huge allocation attacks

3 participants