Detect mismatched CF encoding attrs on concat#1005
Draft
TomNicholas wants to merge 5 commits into
Draft
Conversation
Two virtual datasets whose source files were CF-packed with different scale_factor / add_offset attributes are currently silently concatenable - np.concatenate keeps only the first array's attrs, corrupting decoded values for every chunk that came from a later file. Encode the expected behaviour (raise ValueError) so the fix can target it.
Two virtual datasets whose source files were CF-packed with different scale_factor / add_offset / _FillValue / missing_value silently concatenated, with np.concatenate keeping only the first array's attrs. After the next write + xr.open_zarr + decode_cf, the surviving attrs were applied to every chunk - including chunks that were packed with a different value, silently corrupting decoded values for everything sourced from non-first files. Move CF decoding attributes from the wrapping xr.Variable onto the inner ManifestArray.metadata.attributes (where np.concatenate dispatch can inspect them), and have check_combinable_zarr_arrays raise on any mismatch across inputs. Arbitrary metadata still lives on xr.Variable.attrs. Writers (icechunk, kerchunk) gain a small helper extract_cf_encoding_attrs(var) so they can pull the CF attrs back out when materializing the destination store's metadata, keeping the round-trip identical.
for more information, see https://pre-commit.ci
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1005 +/- ##
==========================================
- Coverage 89.97% 89.93% -0.04%
==========================================
Files 36 36
Lines 2224 2246 +22
==========================================
+ Hits 2001 2020 +19
- Misses 223 226 +3
🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
xr.concat(and any other path throughnp.concatenateonManifestArray) is called on virtual datasets whose source files were CF-packed with differentscale_factor/add_offset/_FillValue/missing_value. Previously only the first array's attrs survived, and the surviving values were applied to every chunk afterdecode_cf— silently mis-decoding everything sourced from non-first files.ManifestArray.metadata.attributes(wherenp.concatenatedispatch can see them) rather than the wrappingxr.Variable.attrs; havecheck_combinable_zarr_arraysraise on any mismatch. Arbitrary attrs continue to live onxr.Variable.attrs.icechunk,kerchunk) gain a small helperextract_cf_encoding_attrs(var)so they can pull the CF attrs back out when materializing the destination store's metadata, keeping the round-trip identical.Closes #1004.
Test plan
TestConcatMismatchedCFEncoding(intests/test_parsers/test_hdf/test_hdf.py) builds real HDF5 files with mismatchedscale_factor/add_offsetand assertsxr.concatraisesValueError. Positive control with matching attrs concatenates cleanly. Committed first as a failing-by-design test so the diff makes the fix easy to read.xr.concatof matching files →to_icechunk→xr.open_zarr+decode_cf) produces correctly decoded values matchingxr.open_dataseton the source.maintoo.Notes for reviewers
check_compatible_encodingsinwriters/icechunk.py: that check fires only duringto_icechunk(..., append_dim=...)and inspectsvar.encoding, which is the xarray-level slot CF attrs live in afterdecode_cfor when manually set withencoding={...}. For parser-produced virtual datasets (which never go throughdecode_cf)var.encodingis empty, so the existing check is a no-op on the most common real-world flow. The new check coversManifestArray.metadata.attributes. They overlap (both fire on icechunk-append with mismatched parser data) but each covers a flow the other doesn't.to_virtual_variablenow splits attrs by role. User-visible side effect: on a virtual variable,vds["foo"].attrsno longer contains_FillValue/scale_factor/ etc. — they are available viavds["foo"].variable.data.metadata.attributes(and end up onvar.attrsafter writing + re-opening through xarray as normal).MultiZarrToZarr.second_passhas the same bug pattern (see issue xr.concat of virtual datasets silently drops mismatched CF encoding (scale_factor, add_offset, _FillValue) #1004 "Related"). Worth flagging upstream too — not addressed here.