Detect mismatched CF scale/offset on concat via codecs by TomNicholas · Pull Request #1010 · zarr-developers/VirtualiZarr

TomNicholas · 2026-06-10T14:48:11Z

What

Closes #1004.

Concatenating virtual datasets whose source files were CF-packed with different scale_factor/add_offset previously succeeded silently, keeping only the first array's attributes and corrupting decoded values for every chunk sourced from a non-first file.

This fixes it entirely within the HDF parser by representing CF scale/offset packing as the new zarr scale_offset + cast_value codecs (from zarr-developers/zarr-python#3874) and stripping the corresponding attributes. Because the packing now lives in the codec pipeline, files packed with different parameters produce different codec pipelines, and the existing check_same_codecs check refuses to concatenate them. No concatenation code was changed.

How

parsers/hdf/filters.py — cf_codecs_from_dataset builds ScaleOffset(scale=1/scale_factor, offset=add_offset) + CastValue(<int dtype>) and reports the decoded float dtype.
parsers/hdf/hdf.py — wires the codecs into the array metadata (before the byte-level filters), decodes the fill value into the float domain, and drops scale_factor/add_offset from attrs.
writers/kerchunk.py — kerchunk targets zarr v2/numcodecs, which can't represent these codecs, so the writer recognises them and reverts to scale_factor/add_offset attributes + the integer dtype on write, where xarray's decode_cf reapplies them on read.
pyproject.toml — adds cast-value-rs to the hdf extra (required by cast_value at parse time).

Tests

TestConcatMismatchedCFEncoding (HDF): mismatched scale_factor/add_offset raise on concat (surfaced as the ScaleOffset codec mismatch from check_same_codecs); matching encoding round-trips through icechunk with correct decoded values.
Full suite green.

Known caveat (feedback wanted)

Codec-based decoding computes packed / scale, whereas xarray's decode_cf computes packed * scale_factor. These agree to within ~1 ULP but are not bit-identical for non-power-of-two scale factors (the scale_offset codec only offers a divide on decode). This is below the quantization floor that packing already imposes, but it means two loadable-variable tests now compare CF-packed variables with assert_allclose rather than assert_identical. See related discussion in zarr-developers/zarr-python#3874.

TODO before un-drafting

Decide whether the ~1 ULP decode_cf divergence is acceptable as-is or warrants raising upstream on zarr-python#3874.
Regenerate pixi.lock for the new cast-value-rs dep.
Consider whether DMR++ / other parsers should share the CF→codec path.

Add TestConcatMismatchedCFEncoding covering xr.concat of virtual datasets whose source files were CF-packed with different scale_factor / add_offset. These tests fail on main, demonstrating the silent data-corruption bug in zarr-developers#1004 before the fix lands.

Convert CF scale_factor/add_offset packing into the zarr scale_offset + cast_value codecs in the HDF parser, stripping those attributes. Files packed with different parameters then have different codec pipelines, so the existing check_same_codecs refuses to concatenate them instead of silently keeping only the first file's parameters and corrupting decoded values for chunks sourced from other files. No concatenation code was changed. - filters.py: cf_codecs_from_dataset builds the scale_offset + cast_value codecs and reports the decoded float dtype and the scale/offset. - hdf.py: wire the codecs into the array metadata, decode the fill value into the float domain, and drop scale_factor/add_offset from attrs. - kerchunk.py: the zarr v2 writer can't represent these codecs, so revert them to scale_factor/add_offset attributes (and the integer dtype) on write, where xarray's decode_cf reapplies them on read. - pyproject: add cast-value-rs to the hdf extra (needed by cast_value). Codec-based decoding (packed / scale) agrees with xarray's decode_cf (packed * scale_factor) only to within ~1 ULP, so two loadable-variable tests compare CF-packed variables with assert_allclose. Closes zarr-developers#1004 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-10T14:49:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.94%. Comparing base (496fd0f) to head (09d374b).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1010      +/-   ##
==========================================
- Coverage   89.97%   89.94%   -0.03%     
==========================================
  Files          36       36              
  Lines        2224     2278      +54     
==========================================
+ Hits         2001     2049      +48     
- Misses        223      229       +6

Files with missing lines	Coverage Δ
virtualizarr/parsers/hdf/filters.py	`98.43% <100.00%> (+0.45%)`	⬆️
virtualizarr/parsers/hdf/hdf.py	`95.80% <100.00%> (+0.21%)`	⬆️
virtualizarr/writers/kerchunk.py	`77.27% <100.00%> (+4.44%)`	⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

TomNicholas and others added 2 commits June 5, 2026 15:21

TomNicholas temporarily deployed to test-release June 10, 2026 14:48 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Detect mismatched CF scale/offset on concat via codecs#1010

Detect mismatched CF scale/offset on concat via codecs#1010
TomNicholas wants to merge 2 commits into
zarr-developers:mainfrom
TomNicholas:detect-mismatched-cf-encoding

TomNicholas commented Jun 10, 2026

Uh oh!

codecov Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

TomNicholas commented Jun 10, 2026

What

How

Tests

Known caveat (feedback wanted)

TODO before un-drafting

Uh oh!

codecov Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented Jun 10, 2026 •

edited

Loading