MultiZarrToZarr silently drops mismatched CF encoding (scale_factor, add_offset, _FillValue) across input files

### Summary

When two HDF5/netCDF files store a variable with the same dtype, chunks, and codecs but were packed with different CF decoding attributes (`scale_factor`, `add_offset`, `_FillValue`, `missing_value`), `MultiZarrToZarr.translate` keeps only the first file's `.zattrs` and discards the others without warning. Reading the combined references back with xarray and `decode_cf=True` then applies the surviving values to every chunk, including chunks that were packed with a different scale/offset, so everything sourced from the non-first files is silently mis-decoded.

### Reproducer

```python
import os, tempfile
import h5py, numpy as np, xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr


def write_packed_hdf5(path, scale_factor, add_offset, x_start):
    raw = np.array([0, 1, 2, 3], dtype="i2")
    with h5py.File(path, "w") as f:
        d = f.create_dataset("foo", data=raw.reshape(4, 1), chunks=(4, 1))
        d.attrs["scale_factor"] = np.float64(scale_factor)
        d.attrs["add_offset"] = np.float64(add_offset)
        d.attrs["_FillValue"] = np.int16(-9999)
        # Attach dimension scales so the dataset reads back as foo(x, y).
        x = f.create_dataset("x", data=np.arange(x_start, x_start + 4, dtype="i4"))
        x.make_scale("x")
        d.dims[0].attach_scale(x)
        y = f.create_dataset("y", data=np.arange(1, dtype="i4"))
        y.make_scale("y")
        d.dims[1].attach_scale(y)


with tempfile.TemporaryDirectory() as td:
    td = os.path.realpath(td)
    a, b = os.path.join(td, "a.nc"), os.path.join(td, "b.nc")
    # Same packed bytes in each file; different CF decoding metadata;
    # non-overlapping x so kerchunk's coordinate-based concat keeps both.
    write_packed_hdf5(a, scale_factor=0.1,  add_offset=0.0,   x_start=0)
    write_packed_hdf5(b, scale_factor=0.01, add_offset=100.0, x_start=4)

    refs_a = SingleHdf5ToZarr(a).translate()
    refs_b = SingleHdf5ToZarr(b).translate()
    combined = MultiZarrToZarr(
        [refs_a, refs_b], concat_dims=["x"], identical_dims=["y"],
    ).translate()

    ds = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={"consolidated": False,
                        "storage_options": {"fo": combined}},
        decode_cf=True,
    )
    print("combined foo/.zattrs:", combined["refs"]["foo/.zattrs"])
    print("expected:", np.concatenate([
        xr.open_dataset(a)["foo"].values, xr.open_dataset(b)["foo"].values,
    ]).ravel())
    print("actual:  ", ds["foo"].values.ravel())
```

Output:

```
combined foo/.zattrs: {"_ARRAY_DIMENSIONS": ["x", "y"], "add_offset": 0.0, "scale_factor": 0.1}
expected: [0.   0.1   0.2   0.3   100.  100.01 100.02 100.03]
actual:   [0.   0.1   0.2   0.3     0.    0.1    0.2    0.3]
```

Four of the eight decoded values are silently wrong. The `scale_factor=0.01` / `add_offset=100.0` pair from `b.nc` has been discarded, and `a.nc`'s `0.1` / `0.0` pair is applied to both files' chunks.
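The wrong values follow directly from the CF unpacking rule, `decoded = raw * scale_factor + add_offset`. The sketch below is plain Python with no kerchunk or xarray involved; the `decode` helper is hypothetical and only illustrates the arithmetic. It shows what `b.nc`'s packed values become under each attribute pair:

```python
def decode(raw, scale_factor, add_offset):
    """Apply CF unpacking; rounded only to keep float noise out of the printout."""
    return [round(v * scale_factor + add_offset, 2) for v in raw]


raw = [0, 1, 2, 3]  # packed int16 values written to both files in the reproducer

# b.nc decoded with its own attributes (what the user expects).
correct_b = decode(raw, scale_factor=0.01, add_offset=100.0)
# b.nc decoded with a.nc's attributes (what the combined references produce).
wrong_b = decode(raw, scale_factor=0.1, add_offset=0.0)

print("b.nc with its own attrs:", correct_b)
print("b.nc with a.nc's attrs: ", wrong_b)
```

The two lists correspond to the last four entries of the `expected` and `actual` rows in the output above.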

### Why it happens

`MultiZarrToZarr.second_pass` (`combine.py:592-594`) writes the data variable's `.zattrs` from the first occurrence of the array and never compares the subsequent files' `.zattrs`:

```python
self.out[f"{var or v}/.zarray"] = ujson.dumps(zarray)
# other attributes copied as-is from first occurrence of this array
self.out[f"{var or v}/.zattrs"] = ujson.dumps(zattrs)
```

Subsequent inputs are gated by the `did_them` set (line 574), so their `.zattrs` is read (line 550) but never inspected. Chunk shape mismatch is checked (lines 540-548); attribute mismatch is not.
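One possible shape for the missing check is a comparison of the CF decoding attributes across all inputs before they are combined. The following is a minimal sketch, not kerchunk code: `find_cf_attr_conflicts` and `load_zattrs` are hypothetical helpers, and the sketch assumes each input is a reference dict whose `"refs"` mapping holds `"<var>/.zattrs"` as a JSON string (or an already-parsed dict). It inspects `.zattrs` only. In the output above `_FillValue` does not appear in the combined `.zattrs`, so a fill-value mismatch stored elsewhere would need separate handling.

```python
import json

# CF attributes that change how packed values are decoded.
CF_DECODING_ATTRS = ("scale_factor", "add_offset", "_FillValue", "missing_value")


def load_zattrs(refs, var):
    """Return the parsed .zattrs of `var` from one reference set (assumed layout)."""
    store = refs.get("refs", refs)  # accept {"refs": {...}} or a bare mapping
    raw = store.get(f"{var}/.zattrs", "{}")
    return json.loads(raw) if isinstance(raw, (str, bytes)) else dict(raw)


def find_cf_attr_conflicts(refs_list, var):
    """Map each CF decoding attribute to its per-input values when they disagree."""
    per_input = [load_zattrs(refs, var) for refs in refs_list]
    conflicts = {}
    for name in CF_DECODING_ATTRS:
        values = [attrs.get(name) for attrs in per_input]
        if any(v != values[0] for v in values[1:]):
            conflicts[name] = values
    return conflicts


# Stand-ins for the reproducer's refs_a / refs_b, reduced to the relevant key.
refs_a = {"refs": {"foo/.zattrs": json.dumps({"scale_factor": 0.1, "add_offset": 0.0})}}
refs_b = {"refs": {"foo/.zattrs": json.dumps({"scale_factor": 0.01, "add_offset": 100.0})}}

conflicts = find_cf_attr_conflicts([refs_a, refs_b], "foo")
if conflicts:
    print("CF decoding attributes differ across inputs:", conflicts)
```

Run as a pre-flight step on the per-file references, a check like this would let a caller warn or raise before calling `MultiZarrToZarr(...).translate()`. Whether kerchunk itself should warn, raise, or expose an option is left open here.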

### Impact

This affects any user combining files with `MultiZarrToZarr` where the source files differ in CF packing (common across satellite missions, reprocessed datasets, or files produced by tools that re-derive `scale_factor` per file). There is no warning and no error, only wrong numbers after `decode_cf`. The user typically doesn't notice until they cross-validate against the source files or see values that are physically implausible for the affected slice.

### Related

The equivalent bug in VirtualiZarr's `ManifestArray` concat path is filed at https://github.com/zarr-developers/VirtualiZarr/issues/1004.
