Skip to content

Detect mismatched CF encoding attrs on concat#1005

Draft
TomNicholas wants to merge 5 commits into
zarr-developers:mainfrom
TomNicholas:fix-concat-cf-encoding
Draft

Detect mismatched CF encoding attrs on concat#1005
TomNicholas wants to merge 5 commits into
zarr-developers:mainfrom
TomNicholas:fix-concat-cf-encoding

Conversation

@TomNicholas

Copy link
Copy Markdown
Member

Summary

  • Fix silent data corruption when xr.concat (and any other path through np.concatenate on ManifestArray) is called on virtual datasets whose source files were CF-packed with different scale_factor / add_offset / _FillValue / missing_value. Previously only the first array's attrs survived, and the surviving values were applied to every chunk after decode_cf — silently mis-decoding everything sourced from non-first files.
  • Route CF decoding attrs onto the inner ManifestArray.metadata.attributes (where np.concatenate dispatch can see them) rather than the wrapping xr.Variable.attrs; have check_combinable_zarr_arrays raise on any mismatch. Arbitrary attrs continue to live on xr.Variable.attrs.
  • Writers (icechunk, kerchunk) gain a small helper extract_cf_encoding_attrs(var) so they can pull the CF attrs back out when materializing the destination store's metadata, keeping the round-trip identical.

Closes #1004.

Test plan

  • New TestConcatMismatchedCFEncoding (in tests/test_parsers/test_hdf/test_hdf.py) builds real HDF5 files with mismatched scale_factor / add_offset and asserts xr.concat raises ValueError. Positive control with matching attrs concatenates cleanly. Committed first as a failing-by-design test so the diff makes the fix easy to read.
  • Manual round-trip reproducer (HDF parse → xr.concat of matching files → to_icechunkxr.open_zarr + decode_cf) produces correctly decoded values matching xr.open_dataset on the source.
  • Full local suite passes: 485 tests + manifests, writers, parsers, xarray. Only pre-existing lithops failures (Python 3.11/3.13 runtime mismatch) skipped — confirmed on main too.
  • CI

Notes for reviewers

  • Relationship to the pre-existing check_compatible_encodings in writers/icechunk.py: that check fires only during to_icechunk(..., append_dim=...) and inspects var.encoding, which is the xarray-level slot CF attrs live in after decode_cf or when manually set with encoding={...}. For parser-produced virtual datasets (which never go through decode_cf) var.encoding is empty, so the existing check is a no-op on the most common real-world flow. The new check covers ManifestArray.metadata.attributes. They overlap (both fire on icechunk-append with mismatched parser data) but each covers a flow the other doesn't.
  • to_virtual_variable now splits attrs by role. User-visible side effect: on a virtual variable, vds["foo"].attrs no longer contains _FillValue / scale_factor / etc. — they are available via vds["foo"].variable.data.metadata.attributes (and end up on var.attrs after writing + re-opening through xarray as normal).
  • Kerchunk's MultiZarrToZarr.second_pass has the same bug pattern (see issue xr.concat of virtual datasets silently drops mismatched CF encoding (scale_factor, add_offset, _FillValue) #1004 "Related"). Worth flagging upstream too — not addressed here.

TomNicholas and others added 4 commits May 26, 2026 15:51
Two virtual datasets whose source files were CF-packed with different
scale_factor / add_offset attributes are currently silently concatenable
- np.concatenate keeps only the first array's attrs, corrupting decoded
values for every chunk that came from a later file. Encode the expected
behaviour (raise ValueError) so the fix can target it.
Two virtual datasets whose source files were CF-packed with different
scale_factor / add_offset / _FillValue / missing_value silently
concatenated, with np.concatenate keeping only the first array's attrs.
After the next write + xr.open_zarr + decode_cf, the surviving attrs
were applied to every chunk - including chunks that were packed with a
different value, silently corrupting decoded values for everything
sourced from non-first files.

Move CF decoding attributes from the wrapping xr.Variable onto the
inner ManifestArray.metadata.attributes (where np.concatenate dispatch
can inspect them), and have check_combinable_zarr_arrays raise on any
mismatch across inputs. Arbitrary metadata still lives on
xr.Variable.attrs. Writers (icechunk, kerchunk) gain a small helper
extract_cf_encoding_attrs(var) so they can pull the CF attrs back out
when materializing the destination store's metadata, keeping the
round-trip identical.
@codecov

codecov Bot commented May 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.93%. Comparing base (ac60cf7) to head (7e812f8).
⚠️ Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
virtualizarr/manifests/utils.py 85.71% 2 Missing ⚠️
virtualizarr/xarray.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1005      +/-   ##
==========================================
- Coverage   89.97%   89.93%   -0.04%     
==========================================
  Files          36       36              
  Lines        2224     2246      +22     
==========================================
+ Hits         2001     2020      +19     
- Misses        223      226       +3     
Files with missing lines Coverage Δ
virtualizarr/manifests/array.py 85.98% <100.00%> (+0.13%) ⬆️
virtualizarr/writers/icechunk.py 91.71% <100.00%> (+0.04%) ⬆️
virtualizarr/writers/kerchunk.py 73.11% <100.00%> (+0.29%) ⬆️
virtualizarr/xarray.py 86.17% <80.00%> (-0.27%) ⬇️
virtualizarr/manifests/utils.py 91.02% <85.71%> (-1.17%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

xr.concat of virtual datasets silently drops mismatched CF encoding (scale_factor, add_offset, _FillValue)

1 participant