Skip to content

geotiff: add attrs['masked_nodata'] alongside attrs['nodata'] (#1988)#2008

Merged
brendancol merged 4 commits into
mainfrom
issue-1988
May 18, 2026
Merged

geotiff: add attrs['masked_nodata'] alongside attrs['nodata'] (#1988)#2008
brendancol merged 4 commits into
mainfrom
issue-1988

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Closes the foundational chunk of #1988: split the declared file sentinel from the masked-state signal in DataArray attrs on every geotiff read path.

  • attrs['nodata'] -- declared file sentinel as a scalar of the source dtype. Set whenever the source declared one. Independent of whether masking ran.
  • attrs['masked_nodata'] -- new boolean. True iff the reader replaced sentinel pixels with NaN in the in-memory array (final dtype is float). False iff the array still carries the literal integer sentinel.

The two signals were previously fused: attrs['nodata'] doubled as "the file declared this sentinel" and "the reader already replaced sentinel pixels with NaN." Downstream code branching on the second meaning had to use the first as a proxy and lost the ability to tell a float-NaN array from an int-with-sentinel array.

Changes

  • New _set_nodata_attrs(attrs, nodata, *, array_dtype) in _attrs.py. Sets both keys atomically from the final array dtype.
  • Every read backend calls it from one site, after any masking and any user-requested dtype cast:
    • open_geotiff eager numpy (__init__.py)
    • read_geotiff_dask (_backends/dask.py) -- uses the declared graph dtype
    • read_geotiff_gpu -- three sites (eager tiled, eager stripped, chunked dask+gpu) in _backends/gpu.py
    • read_vrt eager + chunked in _backends/vrt.py
  • Test matrix in test_nodata_semantics_split_1988.py: source-has-sentinel x backend x {float-NaN-masked, int-sentinel-preserved}, plus direct unit coverage of _set_nodata_attrs.

What this does not change

This PR is additive. The writer still reads attrs['nodata'] through _resolve_nodata_attr and ignores masked_nodata; no existing nodata test changes its expectation. Migrating writers and downstream xrspatial code to branch on masked_nodata is the follow-up scope on #1988.

Test plan

  • pytest xrspatial/geotiff/tests/test_nodata_semantics_split_1988.py (12 passed)
  • All existing nodata regression suites pass: test_nodata_nan_int_1774, test_nodata_int64_precision_1847, test_nodata_attr_aliases_1582, test_miniswhite_nodata_1809, test_vrt_band_nodata_1598, test_overview_nodata_inheritance_1739, test_vrt_int_source_float_dtype_1616, test_vrt_multiband_int_nodata_1611 (88 passed)
  • Full xrspatial/geotiff/ suite (excluding the pre-existing fuzz failure on test_fuzz_hypothesis_1661.py, which reproduces on main): 952 passed

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 17, 2026
@brendancol
Copy link
Copy Markdown
Contributor Author

Self-review

Blockers

None.

Suggestions addressed in follow-up commit f7e6cb6

  • xrspatial/geotiff/tests/test_nodata_semantics_split_1988.py:29 had an unused import xarray as xr. Dropped.
  • The original test file had no GPU or VRT coverage despite the PR touching five sites across _backends/gpu.py (three) and _backends/vrt.py (two). Added:
    • TestVRTEager (3 tests): float-typed VRT over int source with hit, int-typed VRT with no hit, VRT band with no <NoDataValue>.
    • TestVRTChunked (1 test): declared-float64 graph for in-range int sentinel.
    • TestGPU (3 tests, guarded by _gpu_only): read_geotiff_gpu on int source with hit, int source no-hit (stays int), and the dask + GPU path with in-range sentinel.

Nits (left as-is)

  • xrspatial/geotiff/tests/test_nodata_semantics_split_1988.py:189 -- the hand-crafted uint16 TIFF builder duplicates _build_uint16_tiff from test_nodata_nan_int_1774.py:54. Local self-documenting duplication, leave for a future sweep.
  • xrspatial/geotiff/_attrs.py:108 -- bool(np.dtype(array_dtype).kind == 'f') -- the bool(...) wrapper is technically redundant (== on strings already yields a Python bool) but harmless and self-documenting.

What looks good

  • Single helper _set_nodata_attrs in _attrs.py keeps the contract in one place; every read site delegates to it.
  • The masked_nodata signal is derived from the final array dtype, which is the right invariant to anchor on.
  • Cross-backend divergence is intentional and documented: eager numpy stays int when the sentinel does not hit (masked_nodata=False), while dask declares float64 up front for any in-range sentinel (masked_nodata=True). Both reflect their actual array dtype, so the contract masked_nodata == (dtype is float) holds end-to-end.
  • The change is additive. No existing nodata test had to be modified. The writer side is untouched and the next geotiff: split declared nodata from masked-data state in DataArray attrs #1988 follow-up can migrate _resolve_nodata_attr / to_geotiff to consult masked_nodata.

Final test counts

  • New file: 19 tests, all passing locally (incl. 3 GPU tests on a cupy+CUDA env).
  • Existing nodata regression suites: 88 passing.
  • Full xrspatial/geotiff/ suite (excluding pre-existing test_fuzz_hypothesis_1661.py failure that reproduces on main): 952 passing.

brendancol added a commit that referenced this pull request May 17, 2026
Second review pass on PR #2008 surfaced two items:

* The chunked VRT call site (``_backends/vrt.py:_read_vrt_chunked``)
  pointed at ``declared_dtype`` "below" but the dtype block sits
  earlier in the function and ``final_dtype`` is the value actually
  threaded into ``_set_nodata_attrs``. Rephrase the inline comment so
  the cross-reference matches the control flow.
* ``TestDaskNumpy`` had no out-of-range integer-sentinel case parallel
  to the existing eager-path coverage. Add one so the dask backend's
  ``effective_dtype = file_dtype`` fallback branch (out-of-range
  sentinel cannot fit the source dtype, masking skipped, graph stays
  integer) is asserted to land with ``masked_nodata=False``.

No production-code behaviour change. The new test exercises an
already-handled code path; the comment edit is non-functional.
Add ``attrs['masked_nodata']`` alongside ``attrs['nodata']`` on every
read path. ``nodata`` carries the declared file sentinel; the new
boolean ``masked_nodata`` records whether the in-memory array has
been NaN-masked. The two were previously fused into the presence of
``attrs['nodata']``, which left downstream code unable to tell a
float-with-NaN array from an int-with-sentinel array.

Read paths (numpy, dask, gpu, dask+gpu, vrt eager, vrt chunked) now
delegate to a new ``_set_nodata_attrs`` helper in ``_attrs.py``. The
helper writes ``masked_nodata`` based on the final array dtype:
float dtype means the sentinel-to-NaN promotion ran (or the array
is already NaN-aware); integer dtype means the literal sentinel is
still in-band.

The writer side keeps reading ``attrs['nodata']`` through the
existing ``_resolve_nodata_attr`` helper, so this change is purely
additive: no write-path code branches on ``masked_nodata`` yet, no
existing nodata test changes its expectation. The new
``test_nodata_semantics_split_1988.py`` covers the matrix of
source-has-sentinel x backend x masked-or-not.
Address review feedback: the new test file did not exercise the VRT
or GPU read paths even though the PR modified five sites across
``_backends/vrt.py`` and ``_backends/gpu.py``. Add direct coverage:

* ``TestVRTEager`` -- float-typed VRT over int source with a hit,
  int-typed VRT with no hit, and a VRT band missing ``<NoDataValue>``.
* ``TestVRTChunked`` -- declared-float64 graph for in-range int
  sentinel.
* ``TestGPU`` (guarded by ``_gpu_only``) -- ``read_geotiff_gpu`` on
  int source with hit, int source no-hit (stays int), and the dask
  + GPU path with in-range sentinel.

Drop the unused ``import xarray as xr`` while in this file.
Second review pass on PR #2008 surfaced two items:

* The chunked VRT call site (``_backends/vrt.py:_read_vrt_chunked``)
  pointed at ``declared_dtype`` "below" but the dtype block sits
  earlier in the function and ``final_dtype`` is the value actually
  threaded into ``_set_nodata_attrs``. Rephrase the inline comment so
  the cross-reference matches the control flow.
* ``TestDaskNumpy`` had no out-of-range integer-sentinel case parallel
  to the existing eager-path coverage. Add one so the dask backend's
  ``effective_dtype = file_dtype`` fallback branch (out-of-range
  sentinel cannot fit the source dtype, masking skipped, graph stays
  integer) is asserted to land with ``masked_nodata=False``.

No production-code behaviour change. The new test exercises an
already-handled code path; the comment edit is non-functional.
The CI environment doesn't ship ``rasterio``, so a top-level
``import rasterio`` aborts the entire pytest collection step. Guard
the import with ``pytest.importorskip`` (matches the pattern already
used in ``test_attrs_contract_version_1984.py``) so the module skips
cleanly when rasterio isn't installed and runs in full when it is.
@brendancol brendancol merged commit bc9ee54 into main May 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant