Skip to content

Filter NewSubfileType / SubIFDs from extra_tags leakage (#1657)#1660

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-metadata-geotiff-2026-05-12
May 12, 2026
Merged

Filter NewSubfileType / SubIFDs from extra_tags leakage (#1657)#1660
brendancol merged 2 commits into
mainfrom
deep-sweep-metadata-geotiff-2026-05-12

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Fixes #1657. Reading a TIFF overview level or any TIFF carrying tag 254 (NewSubfileType) or 330 (SubIFDs) used to leak those tags into the returned DataArray's attrs['extra_tags'] because neither was in _MANAGED_TAGS. A subsequent to_geotiff or write_geotiff_gpu call then re-emitted them onto the output IFD, producing two distinct corruption modes:

  • NewSubfileType=1 written onto the primary IFD, so GDAL / rasterio filter the output away when looking for the primary image.
  • SubIFDs with stale absolute byte offsets pointing into the new file's pixel data, crashing readers that follow the chain.

All four backends (numpy, dask, cupy, dask+cupy) leaked identically since they share _populate_attrs_from_geo_info.

Fix

  • Add TAG_NEW_SUBFILE_TYPE and TAG_SUB_IFDS to _MANAGED_TAGS in _geotags.py so neither is collected into GeoInfo.extra_tags on read.
  • Add a new _DANGEROUS_EXTRA_TAG_IDS frozenset in _writer.py and skip the same two IDs in both extra_tags emission loops (_assemble_tiff and write_streaming). This is a belt-and-braces guard against caller-supplied dangerous attrs['extra_tags'].
  • 11 regression tests covering read overview on all four backends, SubIFDs leak from a multi-page source, round-trip overview -> write on three backends, and writer-side filtering of caller-supplied dangerous extra_tags (with a benign Software (305) canary to confirm the filter is not too aggressive).

Test plan

  • All 11 new regression tests pass on numpy / dask / cupy / dask+cupy
  • Full xrspatial/geotiff/tests/ suite still passes (1477 passed, 7 skipped, no regressions)
  • End-to-end probe: read overview level 1 then to_geotiff produces a primary IFD with no NewSubfileType and no SubIFDs

Context

Caught by the /sweep-metadata audit (Cat 1 attrs + Cat 5 cross-backend consistency). The leak was identical on all four backends -- numpy, dask, cupy, dask+cupy -- because the extra_tags field travels through the shared _populate_attrs_from_geo_info helper.

Reading a TIFF overview level or any TIFF carrying tag 254
(NewSubfileType) or 330 (SubIFDs) previously leaked those tags into
the returned DataArray's attrs['extra_tags'] because neither was in
_MANAGED_TAGS. A subsequent to_geotiff or write_geotiff_gpu call then
re-emitted the leaked tags on the output IFD, producing two distinct
corruption modes:

- NewSubfileType=1 written onto the primary IFD, so GDAL / rasterio
  filter the output away when looking for the primary image.
- SubIFDs with stale absolute byte offsets pointing into the new
  file's pixel data, crashing readers that follow the chain.

All four backends (numpy, dask, cupy, dask+cupy) leaked identically
since they share _populate_attrs_from_geo_info.

Fix: add TAG_NEW_SUBFILE_TYPE and TAG_SUB_IFDS to _MANAGED_TAGS so
neither is collected into GeoInfo.extra_tags on read. Add the same
two IDs to a new _DANGEROUS_EXTRA_TAG_IDS frozenset in _writer.py and
skip them in both extra_tags emission loops (_assemble_tiff and
write_streaming) as a belt-and-braces guard against caller-supplied
dangerous attrs['extra_tags'].

Add 11 regression tests covering: read overview on all four backends,
SubIFDs leak from a multi-page source, round-trip overview->write
on three backends, and writer-side filtering of caller-supplied
dangerous extra_tags (with a benign Software (305) canary to confirm
the filter is not too aggressive).
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 12:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes GeoTIFF metadata corruption caused by NewSubfileType (254) and SubIFDs (330) leaking through attrs['extra_tags'] on read and being re-emitted on write, which could mis-mark the primary IFD as an overview and/or write stale absolute SubIFD offsets.

Changes:

  • Filter NewSubfileType and SubIFDs out of read-side extra_tags by adding them to _MANAGED_TAGS.
  • Add a writer-side “belt-and-braces” denylist (_DANGEROUS_EXTRA_TAG_IDS) to prevent caller-supplied extra_tags from emitting these IDs.
  • Add regression tests covering overview reads across backends, SubIFDs leak prevention, and writer-side filtering behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
xrspatial/geotiff/tests/test_extra_tags_safe_filter_1657.py New regression coverage to ensure tags 254/330 don’t leak into attrs['extra_tags'] and aren’t re-emitted on write.
xrspatial/geotiff/_writer.py Filters dangerous tag IDs from extra_tags emission in _assemble_tiff and write_streaming.
xrspatial/geotiff/_header.py Adds TAG_SUB_IFDS = 330 constant for shared tag ID usage.
xrspatial/geotiff/_geotags.py Treats TAG_NEW_SUBFILE_TYPE and TAG_SUB_IFDS as managed so they don’t flow into GeoInfo.extra_tags.
.claude/sweep-metadata-state.csv Updates audit state record to reflect issue #1657 as inspected/fixed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 839 to +848
# Extra tags (pass-through from source file)
if extra_tags is not None:
for etag_id, etype_id, ecount, evalue in extra_tags:
# Skip any tag we already wrote to avoid duplicates
# Skip any tag we already wrote to avoid duplicates,
# and skip dangerous tags (NewSubfileType, SubIFDs)
# that would mis-mark the IFD or carry stale offsets.
# See issue #1657.
existing_ids = {t[0] for t in tags}
if etag_id not in existing_ids:
if (etag_id not in existing_ids
and etag_id not in _DANGEROUS_EXTRA_TAG_IDS):
@brendancol brendancol merged commit 932e9c5 into main May 12, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: NewSubfileType and SubIFDs tags leak via extra_tags, corrupt write output

2 participants