Skip to content

Detect mismatched CF scale/offset on concat via codecs#1010

Draft
TomNicholas wants to merge 2 commits into
zarr-developers:mainfrom
TomNicholas:detect-mismatched-cf-encoding
Draft

Detect mismatched CF scale/offset on concat via codecs#1010
TomNicholas wants to merge 2 commits into
zarr-developers:mainfrom
TomNicholas:detect-mismatched-cf-encoding

Conversation

@TomNicholas

Copy link
Copy Markdown
Member

What

Closes #1004.

Concatenating virtual datasets whose source files were CF-packed with different scale_factor/add_offset previously succeeded silently, keeping only the first array's attributes and corrupting decoded values for every chunk sourced from a non-first file.

This fixes it entirely within the HDF parser by representing CF scale/offset packing as the new zarr scale_offset + cast_value codecs (from zarr-developers/zarr-python#3874) and stripping the corresponding attributes. Because the packing now lives in the codec pipeline, files packed with different parameters produce different codec pipelines, and the existing check_same_codecs check refuses to concatenate them. No concatenation code was changed.

How

  • parsers/hdf/filters.pycf_codecs_from_dataset builds ScaleOffset(scale=1/scale_factor, offset=add_offset) + CastValue(<int dtype>) and reports the decoded float dtype.
  • parsers/hdf/hdf.py — wires the codecs into the array metadata (before the byte-level filters), decodes the fill value into the float domain, and drops scale_factor/add_offset from attrs.
  • writers/kerchunk.py — kerchunk targets zarr v2/numcodecs, which can't represent these codecs, so the writer recognises them and reverts to scale_factor/add_offset attributes + the integer dtype on write, where xarray's decode_cf reapplies them on read.
  • pyproject.toml — adds cast-value-rs to the hdf extra (required by cast_value at parse time).

Tests

  • TestConcatMismatchedCFEncoding (HDF): mismatched scale_factor/add_offset raise on concat (surfaced as the ScaleOffset codec mismatch from check_same_codecs); matching encoding round-trips through icechunk with correct decoded values.
  • Full suite green.

Known caveat (feedback wanted)

Codec-based decoding computes packed / scale, whereas xarray's decode_cf computes packed * scale_factor. These agree to within ~1 ULP but are not bit-identical for non-power-of-two scale factors (the scale_offset codec only offers a divide on decode). This is below the quantization floor that packing already imposes, but it means two loadable-variable tests now compare CF-packed variables with assert_allclose rather than assert_identical. See related discussion in zarr-developers/zarr-python#3874.

TODO before un-drafting

  • Decide whether the ~1 ULP decode_cf divergence is acceptable as-is or warrants raising upstream on zarr-python#3874.
  • Regenerate pixi.lock for the new cast-value-rs dep.
  • Consider whether DMR++ / other parsers should share the CF→codec path.

TomNicholas and others added 2 commits June 5, 2026 15:21
Add TestConcatMismatchedCFEncoding covering xr.concat of virtual
datasets whose source files were CF-packed with different
scale_factor / add_offset. These tests fail on main, demonstrating
the silent data-corruption bug in zarr-developers#1004 before the fix lands.
Convert CF scale_factor/add_offset packing into the zarr scale_offset +
cast_value codecs in the HDF parser, stripping those attributes. Files
packed with different parameters then have different codec pipelines, so
the existing check_same_codecs refuses to concatenate them instead of
silently keeping only the first file's parameters and corrupting decoded
values for chunks sourced from other files. No concatenation code was
changed.

- filters.py: cf_codecs_from_dataset builds the scale_offset + cast_value
  codecs and reports the decoded float dtype and the scale/offset.
- hdf.py: wire the codecs into the array metadata, decode the fill value
  into the float domain, and drop scale_factor/add_offset from attrs.
- kerchunk.py: the zarr v2 writer can't represent these codecs, so revert
  them to scale_factor/add_offset attributes (and the integer dtype) on
  write, where xarray's decode_cf reapplies them on read.
- pyproject: add cast-value-rs to the hdf extra (needed by cast_value).

Codec-based decoding (packed / scale) agrees with xarray's decode_cf
(packed * scale_factor) only to within ~1 ULP, so two loadable-variable
tests compare CF-packed variables with assert_allclose.

Closes zarr-developers#1004

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.94%. Comparing base (496fd0f) to head (09d374b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1010      +/-   ##
==========================================
- Coverage   89.97%   89.94%   -0.03%     
==========================================
  Files          36       36              
  Lines        2224     2278      +54     
==========================================
+ Hits         2001     2049      +48     
- Misses        223      229       +6     
Files with missing lines Coverage Δ
virtualizarr/parsers/hdf/filters.py 98.43% <100.00%> (+0.45%) ⬆️
virtualizarr/parsers/hdf/hdf.py 95.80% <100.00%> (+0.21%) ⬆️
virtualizarr/writers/kerchunk.py 77.27% <100.00%> (+4.44%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

xr.concat of virtual datasets silently drops mismatched CF encoding (scale_factor, add_offset, _FillValue)

1 participant