
How to handle non-JSON serializable attributes? #715

@maxrjones

Description

The NISAR test currently fails because it has an attribute value of inf (the float), which leads to ValueError: Out of range float values are not JSON compliant: inf when trying to write to either Icechunk or Kerchunk. I wonder how we should handle cases of non-JSON-serializable attributes with Zarr V3? Some options:
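For reference, the failure can be reproduced with Python's `json` encoder alone, independent of Zarr/Icechunk (a minimal sketch; the exact call site inside the stores may differ):

```python
import json

# With the default allow_nan=True, json.dumps emits the non-standard
# token `Infinity`, which strict JSON parsers reject.
print(json.dumps({"value": float("inf")}))  # {"value": Infinity}

# With allow_nan=False (strict JSON compliance), the same value raises
# the ValueError quoted in the traceback above.
try:
    json.dumps({"value": float("inf")}, allow_nan=False)
except ValueError as err:
    print(err)
```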

  • Add a parameter to to_icechunk and to_kerchunk that provides the user the option to raise an error, drop the attribute, or cast to a string
  • Catch the upstream error and raise a more informative error about which variable / attribute is causing the issue
  • Defer to parsers and provide documentation about the requirement for objects to be JSON serializable
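Option 1 could look roughly like this hypothetical helper (the `sanitize_attrs` name, the `on_error` parameter, and the policy names are all illustrative, not part of the VirtualiZarr API):

```python
import math


def sanitize_attrs(attrs: dict, on_error: str = "raise") -> dict:
    """Hypothetical helper: apply a user-chosen policy to attribute
    values that are not JSON compliant (non-finite floats)."""
    out = {}
    for key, value in attrs.items():
        if isinstance(value, float) and not math.isfinite(value):
            if on_error == "raise":
                # Option 2's informative error would name the offender here
                raise ValueError(
                    f"Attribute {key!r} has non-JSON-compliant value {value!r}"
                )
            elif on_error == "drop":
                continue
            elif on_error == "cast":
                out[key] = str(value)  # e.g. inf -> "inf"
                continue
        out[key] = value
    return out
```

A `to_icechunk`/`to_kerchunk` parameter could then just forward the chosen policy to something like this before serializing metadata.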

Relevant Zarr spec discussion: zarr-developers/zarr-specs#351

It's slow to debug over the network, so a recommended approach for an MVCE is to download https://nisar.asf.earthdatacloud.nasa.gov/NISAR-SAMPLE-DATA/GCOV/ALOS1_Rosamond_20081012/NISAR_L2_PR_GCOV_001_005_A_219_4020_SHNA_A_20081012T060910_20081012T060926_P01101_F_N_J_001.h5 and reproduce locally:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "earthaccess",
#     "obstore",
#     "virtualizarr[hdf, icechunk]",
#     "xarray[io]",
#     "zarr>=3.1.3"
# ]
# ///


import xarray as xr
from obstore.store import LocalStore

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
from icechunk import (
    Repository,
    RepositoryConfig,
    Storage,
    VirtualChunkContainer,
    local_filesystem_store,
)


def main():
    # Adjust to the directory containing the downloaded sample file
    data_dir = "/Users/max/Documents/Code/zarr-developers/VirtualiZarr/.data/"
    file = "NISAR_L2_PR_GCOV_001_005_A_219_4020_SHNA_A_20081012T060910_20081012T060926_P01101_F_N_J_001.h5"

    config = RepositoryConfig.default()
    config.set_virtual_chunk_container(
        VirtualChunkContainer(
            url_prefix=f"file://{data_dir}",
            store=local_filesystem_store(data_dir),
        ),
    )

    storage = Storage.new_in_memory()
    # create an in-memory icechunk repository that includes the virtual chunk containers
    repo = Repository.create(storage, config)
    session = repo.writable_session("main")

    hdf_group = "science/LSAR/GCOV/grids/frequencyA"
    store = LocalStore()
    registry = ObjectStoreRegistry()
    registry.register("file://", store)
    drop_variables = ["listOfCovarianceTerms", "listOfPolarizations"]
    parser = HDFParser(group=hdf_group, drop_variables=drop_variables)
    with (
        xr.open_dataset(
            f"{data_dir}{file}",
            engine="h5netcdf",
            group=hdf_group,
            drop_variables=drop_variables,
            phony_dims="access",
        ) as dsXR,
        open_virtual_dataset(
            url=f"file://{data_dir}{file}",
            registry=registry,
            parser=parser,
        ) as vds,
    ):
        vds.vz.to_icechunk(session.store)

        with xr.open_zarr(session.store, zarr_format=3, consolidated=False) as dsV:
            xr.testing.assert_equal(dsXR, dsV)

if __name__ == "__main__":
    main()
