Filter NewSubfileType / SubIFDs from extra_tags leakage (#1657)#1660
Merged
Conversation
Reading a TIFF overview level or any TIFF carrying tag 254 (NewSubfileType) or 330 (SubIFDs) previously leaked those tags into the returned DataArray's attrs['extra_tags'] because neither was in _MANAGED_TAGS. A subsequent to_geotiff or write_geotiff_gpu call then re-emitted the leaked tags on the output IFD, producing two distinct corruption modes: - NewSubfileType=1 written onto the primary IFD, so GDAL / rasterio filter the output away when looking for the primary image. - SubIFDs with stale absolute byte offsets pointing into the new file's pixel data, crashing readers that follow the chain. All four backends (numpy, dask, cupy, dask+cupy) leaked identically since they share _populate_attrs_from_geo_info. Fix: add TAG_NEW_SUBFILE_TYPE and TAG_SUB_IFDS to _MANAGED_TAGS so neither is collected into GeoInfo.extra_tags on read. Add the same two IDs to a new _DANGEROUS_EXTRA_TAG_IDS frozenset in _writer.py and skip them in both extra_tags emission loops (_assemble_tiff and write_streaming) as a belt-and-braces guard against caller-supplied dangerous attrs['extra_tags']. Add 11 regression tests covering: read overview on all four backends, SubIFDs leak from a multi-page source, round-trip overview->write on three backends, and writer-side filtering of caller-supplied dangerous extra_tags (with a benign Software (305) canary to confirm the filter is not too aggressive).
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes GeoTIFF metadata corruption caused by NewSubfileType (254) and SubIFDs (330) leaking through attrs['extra_tags'] on read and being re-emitted on write, which could mis-mark the primary IFD as an overview and/or write stale absolute SubIFD offsets.
Changes:
- Filter
NewSubfileTypeandSubIFDsout of read-sideextra_tagsby adding them to_MANAGED_TAGS. - Add a writer-side “belt-and-braces” denylist (
_DANGEROUS_EXTRA_TAG_IDS) to prevent caller-suppliedextra_tagsfrom emitting these IDs. - Add regression tests covering overview reads across backends, SubIFDs leak prevention, and writer-side filtering behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
xrspatial/geotiff/tests/test_extra_tags_safe_filter_1657.py |
New regression coverage to ensure tags 254/330 don’t leak into attrs['extra_tags'] and aren’t re-emitted on write. |
xrspatial/geotiff/_writer.py |
Filters dangerous tag IDs from extra_tags emission in _assemble_tiff and write_streaming. |
xrspatial/geotiff/_header.py |
Adds TAG_SUB_IFDS = 330 constant for shared tag ID usage. |
xrspatial/geotiff/_geotags.py |
Treats TAG_NEW_SUBFILE_TYPE and TAG_SUB_IFDS as managed so they don’t flow into GeoInfo.extra_tags. |
.claude/sweep-metadata-state.csv |
Updates audit state record to reflect issue #1657 as inspected/fixed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
839
to
+848
| # Extra tags (pass-through from source file) | ||
| if extra_tags is not None: | ||
| for etag_id, etype_id, ecount, evalue in extra_tags: | ||
| # Skip any tag we already wrote to avoid duplicates | ||
| # Skip any tag we already wrote to avoid duplicates, | ||
| # and skip dangerous tags (NewSubfileType, SubIFDs) | ||
| # that would mis-mark the IFD or carry stale offsets. | ||
| # See issue #1657. | ||
| existing_ids = {t[0] for t in tags} | ||
| if etag_id not in existing_ids: | ||
| if (etag_id not in existing_ids | ||
| and etag_id not in _DANGEROUS_EXTRA_TAG_IDS): |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1657. Reading a TIFF overview level or any TIFF carrying tag 254 (NewSubfileType) or 330 (SubIFDs) used to leak those tags into the returned DataArray's
attrs['extra_tags']because neither was in_MANAGED_TAGS. A subsequentto_geotifforwrite_geotiff_gpucall then re-emitted them onto the output IFD, producing two distinct corruption modes:NewSubfileType=1written onto the primary IFD, so GDAL / rasterio filter the output away when looking for the primary image.All four backends (numpy, dask, cupy, dask+cupy) leaked identically since they share
_populate_attrs_from_geo_info.Fix
TAG_NEW_SUBFILE_TYPEandTAG_SUB_IFDSto_MANAGED_TAGSin_geotags.pyso neither is collected intoGeoInfo.extra_tagson read._DANGEROUS_EXTRA_TAG_IDSfrozenset in_writer.pyand skip the same two IDs in bothextra_tagsemission loops (_assemble_tiffandwrite_streaming). This is a belt-and-braces guard against caller-supplied dangerousattrs['extra_tags'].Test plan
xrspatial/geotiff/tests/suite still passes (1477 passed, 7 skipped, no regressions)to_geotiffproduces a primary IFD with no NewSubfileType and no SubIFDsContext
Caught by the
/sweep-metadataaudit (Cat 1 attrs + Cat 5 cross-backend consistency). The leak was identical on all four backends -- numpy, dask, cupy, dask+cupy -- because theextra_tagsfield travels through the shared_populate_attrs_from_geo_infohelper.