MultiZarrToZarr silently drops mismatched CF encoding (scale_factor, add_offset, _FillValue) across input files #586

@TomNicholas

Summary

When two HDF5/netCDF files store data with the same dtype, chunks, and codecs but were packed with different CF decoding attributes (scale_factor, add_offset, _FillValue, missing_value), MultiZarrToZarr.translate keeps only the first file's .zattrs and discards the others without warning. Reading the combined references back with xarray and decode_cf=True then applies the surviving values to every chunk, including chunks that were packed with a different scale/offset, so everything sourced from non-first files is silently mis-decoded.

Reproducer

import os, tempfile
import h5py, numpy as np, xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr


def write_packed_hdf5(path, scale_factor, add_offset, x_start):
    raw = np.array([0, 1, 2, 3], dtype="i2")
    with h5py.File(path, "w") as f:
        d = f.create_dataset("foo", data=raw.reshape(4, 1), chunks=(4, 1))
        d.attrs["scale_factor"] = np.float64(scale_factor)
        d.attrs["add_offset"] = np.float64(add_offset)
        d.attrs["_FillValue"] = np.int16(-9999)
        x = f.create_dataset("x", data=np.arange(x_start, x_start + 4, dtype="i4"))
        x.make_scale("x"); d.dims[0].attach_scale(x)
        y = f.create_dataset("y", data=np.arange(1, dtype="i4"))
        y.make_scale("y"); d.dims[1].attach_scale(y)


with tempfile.TemporaryDirectory() as td:
    td = os.path.realpath(td)
    a, b = os.path.join(td, "a.nc"), os.path.join(td, "b.nc")
    # Same packed bytes in each file; different CF decoding metadata;
    # non-overlapping x so kerchunk's coordinate-based concat keeps both.
    write_packed_hdf5(a, scale_factor=0.1,  add_offset=0.0,   x_start=0)
    write_packed_hdf5(b, scale_factor=0.01, add_offset=100.0, x_start=4)

    refs_a = SingleHdf5ToZarr(a).translate()
    refs_b = SingleHdf5ToZarr(b).translate()
    combined = MultiZarrToZarr(
        [refs_a, refs_b], concat_dims=["x"], identical_dims=["y"],
    ).translate()

    ds = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={"consolidated": False,
                        "storage_options": {"fo": combined}},
        decode_cf=True,
    )
    print("combined foo/.zattrs:", combined["refs"]["foo/.zattrs"])
    print("expected:", np.concatenate([
        xr.open_dataset(a)["foo"].values, xr.open_dataset(b)["foo"].values,
    ]).ravel())
    print("actual:  ", ds["foo"].values.ravel())

Output:

combined foo/.zattrs: {"_ARRAY_DIMENSIONS": ["x", "y"], "add_offset": 0.0, "scale_factor": 0.1}
expected: [0.   0.1   0.2   0.3   100.  100.01 100.02 100.03]
actual:   [0.   0.1   0.2   0.3     0.    0.1    0.2    0.3]

4 of 8 decoded values are silently wrong. The scale_factor=0.01 and add_offset=100.0 from b.nc have been discarded, and a.nc's 0.1 and 0.0 are applied to both files' chunks.
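The arithmetic behind those two output rows can be checked without kerchunk at all. The sketch below applies the standard CF unpacking rule (decoded = packed * scale_factor + add_offset) to the same packed values, once with each file's own attributes and once with a.nc's attributes applied to both files. The helper name cf_decode is made up for this illustration and is not part of xarray or kerchunk.

```python
import math

# Packed integers stored identically in a.nc and b.nc (see the reproducer).
raw = [0, 1, 2, 3]


def cf_decode(packed, scale_factor, add_offset):
    """CF unpacking rule: decoded = packed * scale_factor + add_offset."""
    return [p * scale_factor + add_offset for p in packed]


# Each file decoded with its own attributes.
expected = cf_decode(raw, 0.1, 0.0) + cf_decode(raw, 0.01, 100.0)

# What the combined references produce: a.nc's attributes for every chunk.
actual = cf_decode(raw, 0.1, 0.0) * 2

wrong = [not math.isclose(e, a) for e, a in zip(expected, actual)]
print(sum(wrong), "of", len(wrong), "values differ")  # 4 of 8 values differ
```

Every value that came from b.nc is off by roughly 100, which is the discarded add_offset.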

Why it happens

MultiZarrToZarr.second_pass (combine.py:592-594) writes the data variable's .zattrs from the first occurrence of the array and never compares the subsequent files' .zattrs:

self.out[f"{var or v}/.zarray"] = ujson.dumps(zarray)
# other attributes copied as-is from first occurrence of this array
self.out[f"{var or v}/.zattrs"] = ujson.dumps(zattrs)

Subsequent inputs are gated by the did_them set (line 574), so their .zattrs is read (line 550) but never inspected. Chunk shape mismatch is checked (lines 540-548); attribute mismatch is not.

Impact

Affects any user combining files with MultiZarrToZarr where the source files differ in CF packing (common across satellite missions, reprocessed datasets, or files produced by tools that re-derive scale_factor per file). There is no warning and no error, just wrong numbers after decode_cf. The user typically doesn't see it until they cross-validate against the source files or notice the values are physically implausible for the affected slice.

Related

Filed against VirtualiZarr for the equivalent bug in its ManifestArray concat path: zarr-developers/VirtualiZarr#1004.
