geotiff: grow HTTP COG header prefetch past 64 KiB#1727
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the GeoTIFF HTTP COG reader by making _parse_cog_http_meta dynamically grow its initial header/metadata prefetch beyond the previous 64 KiB limit, so COGs with deeper IFD chains or large out-of-line tag payloads can be parsed correctly over HTTP range requests.
Changes:
- Replace the fixed 16 KiB + one-time 64 KiB retry logic with a doubling “grow loop” capped by
MAX_HTTP_HEADER_BYTES. - Add helper logic to decide whether the parsed IFD chain is fully resolved (vs truncated by the current buffer).
- Add a new test suite covering the fast path, grow path, end-to-end HTTP reads, and cap/exception behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
xrspatial/geotiff/_reader.py |
Adds grow-loop prefetch logic (with initial/cap constants and an extent helper) for HTTP COG metadata parsing. |
xrspatial/geotiff/tests/test_http_meta_buffer_1718.py |
Introduces tests that reproduce the >64 KiB metadata/IFD-chain scenario and validate the new grow behavior and cap error. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+1596
to
+1611
| while True: | ||
| try: | ||
| ifds = parse_all_ifds(header_bytes, header) | ||
| required = _ifd_required_extent(ifds, header, len(header_bytes)) | ||
| # Chain is fully resolved when every IFD parsed cleanly and | ||
| # the tail next_ifd_offset is reachable within the buffer | ||
| # (required == 0 means end-of-chain). | ||
| if ifds and required <= len(header_bytes): | ||
| break | ||
| except ValueError: | ||
| # parse_ifd raises when an out-of-line tag points past the | ||
| # buffer. Treat it the same as a truncated chain: grow and | ||
| # retry. If we are already at the cap and still failing, let | ||
| # the next iteration's cap check raise a clear error. | ||
| ifds = [] | ||
|
|
Comment on lines
+1626
to
+1627
| except ValueError: | ||
| ifds = [] |
Comment on lines
+1536
to
+1548
| """Return the highest byte offset the parsed IFDs reference. | ||
|
|
||
| Used to decide whether the prefetch buffer is large enough to hold the | ||
| entire IFD chain plus every out-of-line tag value. We compare this | ||
| against ``len(data)`` in :func:`_parse_cog_http_meta`; if it exceeds the | ||
| buffer, the chain is truncated and the caller must grow and retry. | ||
|
|
||
| The walk re-derives each tag's value-area placement directly from the | ||
| IFD layout (entry table base + entry slot) rather than re-parsing the | ||
| raw bytes. For out-of-line tags ``parse_ifd`` already resolved the | ||
| pointer and validated ``ptr + size <= data_len``; the *interesting* | ||
| extent for the grow loop is the next-IFD pointer of the chain tail, | ||
| plus an "is there a next IFD we have not yet seen" probe. |
Comment on lines
+216
to
+223
| import tempfile | ||
| with tempfile.NamedTemporaryFile(suffix='_cap_1718.tif', delete=False) as f: | ||
| path = f.name | ||
| write(arr, path, compression='deflate', tiled=True, tile_size=16, | ||
| cog=True, overview_levels=[1]) | ||
| with open(path, 'rb') as f: | ||
| payload = bytearray(f.read()) | ||
|
|
Comment on lines
+229
to
+236
| # far-off value that no buffer growth will ever satisfy. | ||
| bo = header.byte_order | ||
| first_ifd_off = header.first_ifd_offset | ||
| import struct as _struct | ||
| num_entries = _struct.unpack_from(f'{bo}H', payload, first_ifd_off)[0] | ||
| next_off_pos = first_ifd_off + 2 + num_entries * 12 | ||
| far = 10**12 # 1 TB, well past any cap | ||
| _struct.pack_into(f'{bo}I', payload, next_off_pos, far & 0xFFFFFFFF) |
_parse_cog_http_meta used to fetch 16 KiB and retry once with 64 KiB. COGs whose IFD chain or out-of-line tag arrays (TileOffsets, GeoAsciiParams, GDAL_METADATA) sat past 64 KiB silently lost IFDs or raised from parse_ifd's bounds checks. Replace the two-shot with a grow loop that doubles the buffer until the IFD chain resolves, capped at MAX_HTTP_HEADER_BYTES (4 MiB). Fast path is unchanged. Closes #1718.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_parse_cog_http_metaused to fetch 16 KiB, parse IFDs, and retry once with 64 KiB if nothing parsed. Files whose IFD chain or out-of-line tag arrays (TileOffsets, GeoAsciiParams, GDAL_METADATA) sat past 64 KiB silently lost IFDs or raisedValueErrorfromparse_ifd's bounds checks. Local-file reads of the same files worked because they aren't bounded by the prefetch.Fix
Replace the 16 KiB / 64 KiB two-shot with a grow loop:
INITIAL_HTTP_HEADER_BYTES = 16 KiB. Fast path is unchanged.parse_all_ifdsraised because an out-of-line tag pointer landed past the buffer, treat it as truncation.next_ifd_offset;parse_ifdalready validates tag value pointers againstlen(data)before returning).MAX_HTTP_HEADER_BYTES = 4 MiBand raise a clearValueErrorwhen exceeded.Repro
A COG written with a large
gdal_metadata_xmland several overview levels places IFD #2 past 65536 bytes. Before this PR,_read_cog_httpeither lost the overview IFDs or raised. After, it grows the prefetch and reads correctly.Tests
xrspatial/geotiff/tests/test_http_meta_buffer_1718.py:test_small_cog_uses_single_initial_read: small COG fires exactly one 16 KiB read.test_ifd_chain_past_64kib_resolves: direct_parse_cog_http_metaagainst an in-memory source returns the full IFD list when the chain extends past 64 KiB.test_end_to_end_http_read_with_big_metadata: full_read_cog_httpround-trip via a localhttp.server, plus an overview-level read.test_cap_raises_clear_error_on_excessive_chain: patched cap + crafted next-IFD pointer triggers the clearValueError.Test plan
pytest xrspatial/geotiff/tests/test_http_meta_buffer_1718.py -xvs(4 passed)pytest xrspatial/geotiff/tests/test_cog_http_concurrent.py -x(6 passed)xrspatial/geotiff/tests/suite (1819 passed; 4 pre-existing failures unrelated to this change)Closes #1718.