Core, Data: Validate deletion-vector offset and length#16500
Core, Data: Validate deletion-vector offset and length#16500wombatu-kun wants to merge 3 commits into
Conversation
Hostile or corrupted manifest metadata can claim a negative or near-2 GiB `content_offset`/`content_size_in_bytes` on a deletion-vector `DeleteFile`. The existing precondition in `BaseDeleteLoader.validateDV` only checked non-null and `size <= Integer.MAX_VALUE`, so a negative size slipped through to `new byte[length]` and threw `NegativeArraySizeException`; a negative offset reached `IOUtil.readFully` and produced an invalid seek. The same unguarded allocation pattern lives in `DVUtil.readDV` on the rewrite/DV-merge path used by `mergeAndWriteDVsIfRequired`. Add `ContentFileUtil.validateDV` as the canonical read-path validator (null offset, null size, offset >= 0, size >= 0, size <= Integer.MAX_VALUE) and delegate from both `BaseDeleteLoader.validateDV` and `DVUtil.readDV`. `BaseDeleteLoader` keeps its loader-specific `referencedDataFile` check. For defense-in-depth, also reject negative `contentOffset`/`contentSizeInBytes` in `FileMetadata.Builder.build()` for `PUFFIN`, mirroring the existing `fileSizeInBytes >= 0` / `recordCount >= 0` checks. Producers using the builder can no longer accidentally emit bad metadata; the read-side check still defends against corrupted on-disk manifests where `BaseFile.internalSet` hydrates fields directly via Avro reflection. Tests cover the new validator end-to-end, the builder rejection, and an integration check that `BaseDeleteLoader.loadPositionDeletes` rejects a tampered DV before any I/O attempt. Closes apache#16475 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Preconditions.checkArgument( | ||
| dv.contentSizeInBytes() >= 0, "Invalid DV, length must be non-negative: %s", dvDesc(dv)); | ||
| Preconditions.checkArgument( | ||
| dv.contentSizeInBytes() <= Integer.MAX_VALUE, |
There was a problem hiding this comment.
Doesn't this need to be strictly less than?
There was a problem hiding this comment.
FWIW, probably something to add in the AGENTS.md code, no emdash, no non-ascii unicode except for UTF-related experiments. separate PR.
There was a problem hiding this comment.
Opened #16653 to add the no-emdash / ASCII-only rule to AGENTS.md.
| ContentFileUtil.isDV(deleteFile), | ||
| "Cannot read, not a deletion vector: %s", | ||
| deleteFile.location()); | ||
| ContentFileUtil.validateDV(deleteFile); |
There was a problem hiding this comment.
Since this class already handles DeleteFile instances, I'd probably not modify ContentFileUtil and put the new method here. People aren't going to go looking for a DV validation method in ContentFileUtil.
There was a problem hiding this comment.
Done in 30256ab. DVUtil is now public so BaseDeleteLoader (data module) can call it.
| Preconditions.checkArgument( | ||
| contentOffset >= 0, "Content offset must be non-negative for DV: %s", contentOffset); | ||
| Preconditions.checkArgument( | ||
| contentSizeInBytes >= 0, |
There was a problem hiding this comment.
Why not check < Integer.MAX_VALUE here as well?
| // Wraps a well-formed DV but reports tampered (offset, size). Simulates a corrupted manifest, | ||
| // since the FileMetadata builder now rejects negative values at construction time. | ||
| private static DeleteFile tamperedDV(Long offset, Long size) { | ||
| return new Delegates.DelegatingDeleteFile(VALID_DV) { |
There was a problem hiding this comment.
I think this is too clever and it's odd that the same PR has multiple ways to inject a delete with an invalid offset or size. I think this should pick one strategy. I think it probably makes more sense to use a mock, but I don't have a strong opinion.
Also, it would be reasonable (and straightforward) to just use the DeleteFile constructor and bypass the builder rather than doing things like this. The actual class allows these values because we don't validate and fail while filling the objects with data.
| }; | ||
| } | ||
|
|
||
| // The validator must fire before any I/O attempt — if it does not, the loader will call this |
There was a problem hiding this comment.
Please remove AI artifacts, like —
There was a problem hiding this comment.
Done in 30256ab. Removed across the whole branch diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| * before they are consumed by a reader. Hostile or corrupted manifest metadata may otherwise | ||
| * trigger a {@link NegativeArraySizeException}, an invalid seek, or a multi-gigabyte allocation | ||
| * when the DV blob is read. | ||
| */ |
There was a problem hiding this comment.
nit, add IllegalArgumentException to the javadocs
| * when the DV blob is read. | ||
| */ | ||
| public static void validateDV(DeleteFile dv) { | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
there's enough repetition here to make factoring stuff out something to consider, though I'll leave that to others to consider.
Everything looks for non-null, positive, and sizes must be < max_int.
split it out and you've got less to test
There was a problem hiding this comment.
Left the five checks inline - each carries a distinct message (offset vs length, null vs negative vs 2GB) and they run through the public validateDV, so factoring wouldn't reduce the test surface. Happy to revisit if others prefer the split.
| .withReferencedDataFile("/tmp/data.parquet") | ||
| .withContentOffset(0L) | ||
| .withContentSizeInBytes(Integer.MAX_VALUE) | ||
| .build()) |
There was a problem hiding this comment.
given that all FileMetadata.Builder.Builder() setters are valid, you could factor these out to have a builder which creates a valid file, then each test overrides with the invalid value, e.g
assertThatThrownBy(() ->
deleteFile()
.withContentSizeInBytes(Integer.MAX_VALUE)
.build()))
...
It'd make it clearer what is being invalid in each test.
again, personal opinion and I'll yield to others
There was a problem hiding this comment.
Done 00d7783. Added a validDvBuilder() returning a valid builder; each test now overrides only the single invalid field.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
no stale |
Closes #16475
Hostile or corrupted manifest metadata can claim a negative or near-2 GiB
content_offset/content_size_in_byteson a deletion-vectorDeleteFile. The existing precondition inBaseDeleteLoader.validateDVonly checks non-null andsize <= Integer.MAX_VALUE, so a negative size slips through tonew byte[length]and throwsNegativeArraySizeException; a negative offset reachesIOUtil.readFullyand produces an invalid seek. The same unguarded allocation pattern lives inDVUtil.readDVon the rewrite / DV-merge path used bymergeAndWriteDVsIfRequired.Changes
DVUtil.validateDVas the canonical read-path validator (null offset, null size,offset >= 0,size >= 0,size < Integer.MAX_VALUE);DVUtilis now public so the data module can call it.BaseDeleteLoader.validateDV(data) andDVUtil.readDV(core) delegate to it.BaseDeleteLoaderkeeps its loader-specificreferencedDataFilecheck.contentOffset, negativecontentSizeInBytes, andcontentSizeInBytes >= Integer.MAX_VALUEinFileMetadata.Builder.build()forPUFFIN, mirroring the existingfileSizeInBytes >= 0/recordCount >= 0checks at the same site. Producers using the builder can no longer accidentally emit bad metadata; the read-side check still defends against corrupted on-disk manifests whereBaseFile.internalSethydrates fields directly via Avro reflection.Tests
TestDVUtil(new) covers each precondition of the new validator: null offset/size, negative offset/size, the strict< Integer.MAX_VALUEboundary (rejected atInteger.MAX_VALUE, accepted atInteger.MAX_VALUE - 1), and valid zero / typical values.TestFileMetadata(new) verifies the builder rejects negative DV offset, negative DV size,contentSizeInBytes >= Integer.MAX_VALUE, and still accepts valid values.TestBaseDeleteLoader(new, in packageorg.apache.iceberg.data) uses a Mockito mock that bypasses theFileMetadatabuilder (which now rejects invalid values at construction), and assertsBaseDeleteLoader.loadPositionDeletesthrows before any I/O is attempted.TestRowDelta(324 tests),TestDeletionVectorStruct,TestContentFileParser,TestDeleteFilesregression suites continue to pass.Out of scope
PuffinReader/BlobMetadatasimilarly don't range-check blob offsets and lengths today, but Puffin is also used for non-DV blobs (statistics) and "valid" at that layer is a separate discussion of footer-authoritative vs. caller-supplied bounds. Leaving Puffin-level hardening as a follow-up.