Detect mismatched CF scale/offset on concat via codecs#1010
Draft
TomNicholas wants to merge 2 commits into
Draft
Detect mismatched CF scale/offset on concat via codecs#1010TomNicholas wants to merge 2 commits into
TomNicholas wants to merge 2 commits into
Conversation
Add TestConcatMismatchedCFEncoding covering xr.concat of virtual datasets whose source files were CF-packed with different scale_factor / add_offset. These tests fail on main, demonstrating the silent data-corruption bug in zarr-developers#1004 before the fix lands.
Convert CF scale_factor/add_offset packing into the zarr scale_offset + cast_value codecs in the HDF parser, stripping those attributes. Files packed with different parameters then have different codec pipelines, so the existing check_same_codecs refuses to concatenate them instead of silently keeping only the first file's parameters and corrupting decoded values for chunks sourced from other files. No concatenation code was changed. - filters.py: cf_codecs_from_dataset builds the scale_offset + cast_value codecs and reports the decoded float dtype and the scale/offset. - hdf.py: wire the codecs into the array metadata, decode the fill value into the float domain, and drop scale_factor/add_offset from attrs. - kerchunk.py: the zarr v2 writer can't represent these codecs, so revert them to scale_factor/add_offset attributes (and the integer dtype) on write, where xarray's decode_cf reapplies them on read. - pyproject: add cast-value-rs to the hdf extra (needed by cast_value). Codec-based decoding (packed / scale) agrees with xarray's decode_cf (packed * scale_factor) only to within ~1 ULP, so two loadable-variable tests compare CF-packed variables with assert_allclose. Closes zarr-developers#1004 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1010 +/- ##
==========================================
- Coverage 89.97% 89.94% -0.03%
==========================================
Files 36 36
Lines 2224 2278 +54
==========================================
+ Hits 2001 2049 +48
- Misses 223 229 +6
🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Closes #1004.
Concatenating virtual datasets whose source files were CF-packed with different
scale_factor/add_offsetpreviously succeeded silently, keeping only the first array's attributes and corrupting decoded values for every chunk sourced from a non-first file.This fixes it entirely within the HDF parser by representing CF scale/offset packing as the new zarr
scale_offset+cast_valuecodecs (from zarr-developers/zarr-python#3874) and stripping the corresponding attributes. Because the packing now lives in the codec pipeline, files packed with different parameters produce different codec pipelines, and the existingcheck_same_codecscheck refuses to concatenate them. No concatenation code was changed.How
parsers/hdf/filters.py—cf_codecs_from_datasetbuildsScaleOffset(scale=1/scale_factor, offset=add_offset)+CastValue(<int dtype>)and reports the decoded float dtype.parsers/hdf/hdf.py— wires the codecs into the array metadata (before the byte-level filters), decodes the fill value into the float domain, and dropsscale_factor/add_offsetfrom attrs.writers/kerchunk.py— kerchunk targets zarr v2/numcodecs, which can't represent these codecs, so the writer recognises them and reverts toscale_factor/add_offsetattributes + the integer dtype on write, where xarray'sdecode_cfreapplies them on read.pyproject.toml— addscast-value-rsto thehdfextra (required bycast_valueat parse time).Tests
TestConcatMismatchedCFEncoding(HDF): mismatchedscale_factor/add_offsetraise on concat (surfaced as theScaleOffsetcodec mismatch fromcheck_same_codecs); matching encoding round-trips through icechunk with correct decoded values.Known caveat (feedback wanted)
Codec-based decoding computes
packed / scale, whereas xarray'sdecode_cfcomputespacked * scale_factor. These agree to within ~1 ULP but are not bit-identical for non-power-of-two scale factors (thescale_offsetcodec only offers a divide on decode). This is below the quantization floor that packing already imposes, but it means two loadable-variable tests now compare CF-packed variables withassert_allcloserather thanassert_identical. See related discussion in zarr-developers/zarr-python#3874.TODO before un-drafting
decode_cfdivergence is acceptable as-is or warrants raising upstream on zarr-python#3874.pixi.lockfor the newcast-value-rsdep.