Summary
When two HDF5/netCDF files store data that share dtype/chunks/codecs but were packed with different CF decoding attributes (scale_factor, add_offset, _FillValue, missing_value), MultiZarrToZarr.translate silently keeps only the first file's .zattrs and discards the others. Reading the combined references back with xarray + decode_cf=True then applies the surviving values to every chunk — including chunks that were packed with a different scale/offset — silently mis-decoding everything sourced from non-first files.
Reproducer
import os, tempfile
import h5py, numpy as np, xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

def write_packed_hdf5(path, scale_factor, add_offset, x_start):
    raw = np.array([0, 1, 2, 3], dtype="i2")
    with h5py.File(path, "w") as f:
        d = f.create_dataset("foo", data=raw.reshape(4, 1), chunks=(4, 1))
        d.attrs["scale_factor"] = np.float64(scale_factor)
        d.attrs["add_offset"] = np.float64(add_offset)
        d.attrs["_FillValue"] = np.int16(-9999)
        x = f.create_dataset("x", data=np.arange(x_start, x_start + 4, dtype="i4"))
        x.make_scale("x"); d.dims[0].attach_scale(x)
        y = f.create_dataset("y", data=np.arange(1, dtype="i4"))
        y.make_scale("y"); d.dims[1].attach_scale(y)

with tempfile.TemporaryDirectory() as td:
    td = os.path.realpath(td)
    a, b = os.path.join(td, "a.nc"), os.path.join(td, "b.nc")
    # Same packed bytes in each file; different CF decoding metadata;
    # non-overlapping x so kerchunk's coordinate-based concat keeps both.
    write_packed_hdf5(a, scale_factor=0.1, add_offset=0.0, x_start=0)
    write_packed_hdf5(b, scale_factor=0.01, add_offset=100.0, x_start=4)
    refs_a = SingleHdf5ToZarr(a).translate()
    refs_b = SingleHdf5ToZarr(b).translate()
    combined = MultiZarrToZarr(
        [refs_a, refs_b], concat_dims=["x"], identical_dims=["y"],
    ).translate()
    ds = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={"consolidated": False,
                        "storage_options": {"fo": combined}},
        decode_cf=True,
    )
    print("combined foo/.zattrs:", combined["refs"]["foo/.zattrs"])
    print("expected:", np.concatenate([
        xr.open_dataset(a)["foo"].values, xr.open_dataset(b)["foo"].values,
    ]).ravel())
    print("actual: ", ds["foo"].values.ravel())
Output:
combined foo/.zattrs: {"_ARRAY_DIMENSIONS": ["x", "y"], "add_offset": 0.0, "scale_factor": 0.1}
expected: [0. 0.1 0.2 0.3 100. 100.01 100.02 100.03]
actual: [0. 0.1 0.2 0.3 0. 0.1 0.2 0.3]
4 of 8 decoded values are silently wrong. scale_factor=0.01 / add_offset=100.0 from b.nc has been discarded; a.nc's 0.1 / 0.0 is applied to both files' chunks.
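To make the arithmetic behind those numbers explicit, here is a minimal pure-Python sketch of CF unpacking (decoded = packed * scale_factor + add_offset). The `cf_decode` helper is hypothetical and exists only for illustration; it is not how xarray implements decoding, and it ignores `_FillValue` masking.

```python
# CF unpacking convention: decoded = packed * scale_factor + add_offset.
# cf_decode is a hypothetical helper for illustration only.
def cf_decode(packed, scale_factor, add_offset):
    return [p * scale_factor + add_offset for p in packed]

packed = [0, 1, 2, 3]  # the same raw int16 values are stored in both files
attrs_a = {"scale_factor": 0.1, "add_offset": 0.0}     # a.nc
attrs_b = {"scale_factor": 0.01, "add_offset": 100.0}  # b.nc

correct_b = cf_decode(packed, **attrs_b)  # what b.nc's chunk should decode to
wrong_b = cf_decode(packed, **attrs_a)    # what it decodes to with a.nc's attrs

for p, good, bad in zip(packed, correct_b, wrong_b):
    print(f"packed={p}  expected={good:.2f}  actual={bad:.2f}")
```

Because the packed bytes are identical, the only thing distinguishing the two files' decoded values is the attribute pair, which is exactly what the combine step drops.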
Why it happens
MultiZarrToZarr.second_pass (combine.py:592-594) writes the data variable's .zattrs from the first occurrence of the array and never compares the subsequent files' .zattrs:
self.out[f"{var or v}/.zarray"] = ujson.dumps(zarray)
# other attributes copied as-is from first occurrence of this array
self.out[f"{var or v}/.zattrs"] = ujson.dumps(zattrs)
Subsequent inputs are gated by the did_them set (line 574), so their .zattrs is read (line 550) but never inspected. Chunk shape mismatch is checked (lines 540-548); attribute mismatch is not.
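As a rough illustration of the missing comparison, the sketch below checks the CF decoding attributes across reference sets before they are combined. `find_cf_attr_conflicts` is a hypothetical helper, not part of kerchunk. It assumes each input is a dict shaped like `{"refs": {"foo/.zattrs": "<json string>"}}`, as in the reproducer, and it only inspects `.zattrs`. The reproducer's combined `.zattrs` does not show `_FillValue`, so fill values may be stored elsewhere and would not be caught here.

```python
import json

# Attributes that change how packed values are decoded under CF conventions.
CF_DECODING_ATTRS = ("scale_factor", "add_offset", "_FillValue", "missing_value")

def find_cf_attr_conflicts(ref_sets, attrs=CF_DECODING_ATTRS):
    """Return {variable: {attr: [value per input]}} for attrs that differ.

    Hypothetical helper, not part of kerchunk. Each item of ref_sets is
    assumed to look like {"refs": {"foo/.zattrs": "<json string>"}}.
    """
    seen = {}  # variable name -> list of {attr: value}, one entry per input
    for refs in ref_sets:
        for key, value in refs["refs"].items():
            if not key.endswith("/.zattrs"):
                continue
            # Metadata is assumed to be a JSON string; tolerate a plain dict too.
            zattrs = value if isinstance(value, dict) else json.loads(value)
            var = key[: -len("/.zattrs")]
            seen.setdefault(var, []).append({a: zattrs.get(a) for a in attrs})

    conflicts = {}
    for var, per_input in seen.items():
        for a in attrs:
            values = [d[a] for d in per_input]
            if any(v != values[0] for v in values[1:]):
                conflicts.setdefault(var, {})[a] = values
    return conflicts

# Minimal stand-ins for the reference sets produced in the reproducer.
refs_a = {"refs": {"foo/.zattrs": json.dumps(
    {"_ARRAY_DIMENSIONS": ["x", "y"], "scale_factor": 0.1, "add_offset": 0.0})}}
refs_b = {"refs": {"foo/.zattrs": json.dumps(
    {"_ARRAY_DIMENSIONS": ["x", "y"], "scale_factor": 0.01, "add_offset": 100.0})}}

conflicts = find_cf_attr_conflicts([refs_a, refs_b])
if conflicts:
    print("CF decoding attributes differ between inputs:", conflicts)
```

A check along these lines could raise or warn instead of printing. Which behavior is appropriate inside `MultiZarrToZarr` is a design decision for the maintainers.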
Impact
Affects any user combining files with MultiZarrToZarr where the source files differ in CF packing (common across satellite missions, reprocessed datasets, or files produced by tools that re-derive scale_factor per file). No warning, no error — just wrong numbers after decode_cf. The user typically doesn't see it until they cross-validate against the source files or notice the values are physically implausible for the affected slice.
Related
Filed against VirtualiZarr for the equivalent bug in its ManifestArray concat path: zarr-developers/VirtualiZarr#1004.