geotiff: add attrs['masked_nodata'] alongside attrs['nodata'] (#1988)#2008
Merged
Conversation
Contributor
Author
Self-reviewBlockersNone. Suggestions addressed in follow-up commit f7e6cb6
Nits (left as-is)
What looks good
Final test counts
|
brendancol
added a commit
that referenced
this pull request
May 17, 2026
Second review pass on PR #2008 surfaced two items: * The chunked VRT call site (``_backends/vrt.py:_read_vrt_chunked``) pointed at ``declared_dtype`` "below" but the dtype block sits earlier in the function and ``final_dtype`` is the value actually threaded into ``_set_nodata_attrs``. Rephrase the inline comment so the cross-reference matches the control flow. * ``TestDaskNumpy`` had no out-of-range integer-sentinel case parallel to the existing eager-path coverage. Add one so the dask backend's ``effective_dtype = file_dtype`` fallback branch (out-of-range sentinel cannot fit the source dtype, masking skipped, graph stays integer) is asserted to land with ``masked_nodata=False``. No production-code behaviour change. The new test exercises an already-handled code path; the comment edit is non-functional.
Add ``attrs['masked_nodata']`` alongside ``attrs['nodata']`` on every read path. ``nodata`` carries the declared file sentinel; the new boolean ``masked_nodata`` records whether the in-memory array has been NaN-masked. The two were previously fused into the presence of ``attrs['nodata']``, which left downstream code unable to tell a float-with-NaN array from an int-with-sentinel array. Read paths (numpy, dask, gpu, dask+gpu, vrt eager, vrt chunked) now delegate to a new ``_set_nodata_attrs`` helper in ``_attrs.py``. The helper writes ``masked_nodata`` based on the final array dtype: float dtype means the sentinel-to-NaN promotion ran (or the array is already NaN-aware); integer dtype means the literal sentinel is still in-band. The writer side keeps reading ``attrs['nodata']`` through the existing ``_resolve_nodata_attr`` helper, so this change is purely additive: no write-path code branches on ``masked_nodata`` yet, no existing nodata test changes its expectation. The new ``test_nodata_semantics_split_1988.py`` covers the matrix of source-has-sentinel x backend x masked-or-not.
Address review feedback: the new test file did not exercise the VRT or GPU read paths even though the PR modified five sites across ``_backends/vrt.py`` and ``_backends/gpu.py``. Add direct coverage: * ``TestVRTEager`` -- float-typed VRT over int source with a hit, int-typed VRT with no hit, and a VRT band missing ``<NoDataValue>``. * ``TestVRTChunked`` -- declared-float64 graph for in-range int sentinel. * ``TestGPU`` (guarded by ``_gpu_only``) -- ``read_geotiff_gpu`` on int source with hit, int source no-hit (stays int), and the dask + GPU path with in-range sentinel. Drop the unused ``import xarray as xr`` while in this file.
Second review pass on PR #2008 surfaced two items: * The chunked VRT call site (``_backends/vrt.py:_read_vrt_chunked``) pointed at ``declared_dtype`` "below" but the dtype block sits earlier in the function and ``final_dtype`` is the value actually threaded into ``_set_nodata_attrs``. Rephrase the inline comment so the cross-reference matches the control flow. * ``TestDaskNumpy`` had no out-of-range integer-sentinel case parallel to the existing eager-path coverage. Add one so the dask backend's ``effective_dtype = file_dtype`` fallback branch (out-of-range sentinel cannot fit the source dtype, masking skipped, graph stays integer) is asserted to land with ``masked_nodata=False``. No production-code behaviour change. The new test exercises an already-handled code path; the comment edit is non-functional.
The CI environment doesn't ship ``rasterio``, so a top-level ``import rasterio`` aborts the entire pytest collection step. Guard the import with ``pytest.importorskip`` (matches the pattern already used in ``test_attrs_contract_version_1984.py``) so the module skips cleanly when rasterio isn't installed and runs in full when it is.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the foundational chunk of #1988: split the declared file sentinel from the masked-state signal in DataArray attrs on every geotiff read path.
attrs['nodata']-- declared file sentinel as a scalar of the source dtype. Set whenever the source declared one. Independent of whether masking ran.attrs['masked_nodata']-- new boolean.Trueiff the reader replaced sentinel pixels with NaN in the in-memory array (final dtype is float).Falseiff the array still carries the literal integer sentinel.The two signals were previously fused:
attrs['nodata']doubled as "the file declared this sentinel" and "the reader already replaced sentinel pixels with NaN." Downstream code branching on the second meaning had to use the first as a proxy and lost the ability to tell a float-NaN array from an int-with-sentinel array.Changes
_set_nodata_attrs(attrs, nodata, *, array_dtype)in_attrs.py. Sets both keys atomically from the final array dtype.open_geotiffeager numpy (__init__.py)read_geotiff_dask(_backends/dask.py) -- uses the declared graph dtyperead_geotiff_gpu-- three sites (eager tiled, eager stripped, chunked dask+gpu) in_backends/gpu.pyread_vrteager + chunked in_backends/vrt.pytest_nodata_semantics_split_1988.py: source-has-sentinel x backend x {float-NaN-masked, int-sentinel-preserved}, plus direct unit coverage of_set_nodata_attrs.What this does not change
This PR is additive. The writer still reads
attrs['nodata']through_resolve_nodata_attrand ignoresmasked_nodata; no existing nodata test changes its expectation. Migrating writers and downstream xrspatial code to branch onmasked_nodatais the follow-up scope on #1988.Test plan
pytest xrspatial/geotiff/tests/test_nodata_semantics_split_1988.py(12 passed)test_nodata_nan_int_1774,test_nodata_int64_precision_1847,test_nodata_attr_aliases_1582,test_miniswhite_nodata_1809,test_vrt_band_nodata_1598,test_overview_nodata_inheritance_1739,test_vrt_int_source_float_dtype_1616,test_vrt_multiband_int_nodata_1611(88 passed)xrspatial/geotiff/suite (excluding the pre-existing fuzz failure ontest_fuzz_hypothesis_1661.py, which reproduces onmain): 952 passed