Virtual CMORisation #949

@charles-turner-1

Description

Apologies if this has been talked about before - I've done a search through all the issues here but I didn't find anything.

I've just been playing with using virtualizarr to produce icechunk stores holding different views into a multi-file netCDF dataset. The idea is roughly something like 'virtual CMORisation', where we use virtualisation to produce (at least) two virtualised datasets from the same set of files.

I assume this could also be done with a 'raw' and a 'cmor' branch in the icechunk store (or something along those lines), but I'll avoid getting into that for now.

For example, we could have one virtual zarr store where we've just virtualised to make opening the dataset more efficient:

vds = vz.open_virtual_mfdataset(...)
vds.vz.to_icechunk(...)

and a separate virtual zarr store where we've done some renaming

vds = vz.open_virtual_mfdataset(...)
vds_cmorised = vds.rename({
    "non-cmor-name-1": "cmor-name-1",
    ...
})
vds_cmorised.vz.to_icechunk(...)

and one where we apply cf metadata to change units:

vds = vz.open_virtual_mfdataset(...)
vds_cmorised = vds.rename({
    "temperature_C" : "temperature_f", # dumb example but whatever
    ...
})
vds_cmorised["temperature_f"].attrs["add_offset"] = 32
vds_cmorised["temperature_f"].attrs["scale_factor"] = 1.8
vds_cmorised.vz.to_icechunk(...)

And then use the decode_cf option to apply those transformations when opening the dataset, i.e. xr.open_zarr(session.store, consolidated=False, decode_cf=True).
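For reference (a standalone illustration, not part of virtualizarr), CF "packed data" decoding computes decoded = raw * scale_factor + add_offset, which with the attributes above is exactly the Celsius-to-Fahrenheit conversion:

```python
# CF packed-data decoding, as applied when decode_cf=True:
#   decoded = raw * scale_factor + add_offset
def decode_cf_value(raw, scale_factor=1.0, add_offset=0.0):
    return raw * scale_factor + add_offset

# With scale_factor=1.8 and add_offset=32, 100 degC decodes to 212 degF.
print(decode_cf_value(100.0, scale_factor=1.8, add_offset=32))  # 212.0
```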


As far as I can tell, this is roughly the limit of this package's scope - so more complicated CMORisation, like computing derived variables (e.g. np.sqrt(u**2 + v**2) => wind speed), would be out of scope here. Where to actually implement this (if it's feasible) is another question, I think.

What I'm wondering is whether it would in principle be possible to take the ManifestArray object and attach something that looks like post-decode steps. Conceptually this would be similar to hooking a preprocess function, like the one in xr.open_mfdataset, straight into the decoding of the manifest arrays, so that a user would never need to specify how to preprocess their chunks - it would be baked into the virtualisation.
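To make the idea concrete, here's a hypothetical sketch (none of these names are real VirtualiZarr API) of an array wrapper that carries a baked-in post-decode step, so the transformation travels with the virtualised data instead of relying on the reader passing decode_cf=True:

```python
# Hypothetical sketch: a manifest-array-like wrapper with a baked-in
# post-decode callable. "PostDecodeArray" is invented for illustration.
import numpy as np

class PostDecodeArray:
    """Wraps raw chunk data plus a post-decode step (hypothetical)."""

    def __init__(self, raw, post_decode=lambda a: a):
        self.raw = raw
        self.post_decode = post_decode

    def __array__(self, dtype=None, copy=None):
        # Applied automatically whenever the array is materialised,
        # e.g. by np.asarray - the user never has to opt in.
        out = self.post_decode(np.asarray(self.raw))
        return out if dtype is None else out.astype(dtype)

celsius = np.array([0.0, 100.0])
as_fahrenheit = PostDecodeArray(celsius, post_decode=lambda a: a * 1.8 + 32)
print(np.asarray(as_fahrenheit))  # [ 32. 212.]
```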

E.g. in the example above, if the user fails to specify decode_cf=True, they wind up with temperature in degrees C labelled as degrees F. This is obviously not great. We could bake decode_cf=True in as an opening kwarg, e.g. via intake, but it would be less error-prone to push it down as far as possible.

I've had a look into the codebase, but I've only just started thinking through this idea properly, and I'm not sure a. how feasible it is, and b. how useful it would be, so I thought I'd just ask!
