Apologies if this has been talked about before - I've done a search through all the issues here but I didn't find anything.
I've just been playing with using virtualizarr to produce icechunk stores holding different views into a multi-file netCDF dataset. The idea here is roughly that we could do something like 'virtual cmorisation', where we use a virtualisation to produce (at least) two virtualised datasets from the same set of files.
I assume this could also be done with a 'raw' and a 'cmor' branch in the icechunk store (or something along those lines), but I'll avoid getting into that for now.
For example, we could have one virtual zarr store where we've just virtualised to make opening the dataset more efficient:

```python
vds = vz.open_virtual_mfdataset(...)
vds.vz.to_icechunk(...)
```
and a separate virtual zarr store where we've done some renaming:

```python
vds = vz.open_virtual_mfdataset(...)
vds_cmorised = vds.rename({
    "non-cmor-name-1": "cmor-name-1",
    ...
})
vds_cmorised.vz.to_icechunk(...)
```
and one where we apply CF metadata to change units:

```python
vds = vz.open_virtual_mfdataset(...)
vds_cmorised = vds.rename({
    "temperature_C": "temperature_f",  # dumb example but whatever
    ...
})
vds_cmorised["temperature_f"].attrs["add_offset"] = 32
vds_cmorised["temperature_f"].attrs["scale_factor"] = 1.8
vds_cmorised.vz.to_icechunk(...)
```
And then use the `decode_cf` option to apply those transformations upon opening the dataset, i.e. `xr.open_zarr(session.store, consolidated=False, decode_cf=True)`.
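As a sanity check on the attrs trick above (independent of virtualizarr/icechunk entirely): xarray's CF decoding applies `decoded = scale_factor * raw + add_offset`, so `scale_factor=1.8` with `add_offset=32` does map degrees C onto degrees F:

```python
import numpy as np
import xarray as xr

# Plain in-memory dataset standing in for the virtualised store:
# raw values are degrees C, but the CF attrs describe the C -> F mapping.
raw = xr.Dataset({"temperature_f": ("x", np.array([0.0, 100.0]))})
raw["temperature_f"].attrs["add_offset"] = 32
raw["temperature_f"].attrs["scale_factor"] = 1.8

# decode_cf applies: decoded = scale_factor * raw + add_offset
decoded = xr.decode_cf(raw)
print(decoded["temperature_f"].values)  # [ 32. 212.]
```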
So far as I can tell, this is roughly the limit of this package's scope - so more complicated CMORisation steps, like computing derived variables (e.g. `np.sqrt(u**2 + v**2)` => wind speed), would be out of scope here. Where to actually implement that (if it's feasible) is another question, I think.
What I'm wondering is whether it would in principle be possible to take the `ManifestArray` object and apply something that looks like post-decode steps. Conceptually this would be similar to hooking a `preprocess` function, like the one in `xr.open_mfdataset`, straight into the decoding of the manifest arrays, such that a user would never need to specify how to preprocess their chunks - it would be baked into the virtualisation.
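To make that idea concrete, here's a purely hypothetical sketch - none of these names exist in virtualizarr today, and `VirtualisedView` / `cmorise` are made up for illustration. The point is just that the post-decode callable is attached when the virtualisation is produced, so every later open applies it unconditionally:

```python
from dataclasses import dataclass
from typing import Callable

import xarray as xr


@dataclass
class VirtualisedView:
    """Hypothetical: a virtualised dataset paired with the post-decode
    hook that was 'baked in' when the virtualisation was produced."""

    dataset: xr.Dataset
    postprocess: Callable[[xr.Dataset], xr.Dataset]

    def open(self) -> xr.Dataset:
        # The hook runs unconditionally - the user can't forget to ask for it.
        return self.postprocess(self.dataset)


def cmorise(ds: xr.Dataset) -> xr.Dataset:
    # Example hook: renaming only; derived variables stay out of scope.
    return ds.rename({"temperature_C": "tas"})


view = VirtualisedView(
    dataset=xr.Dataset({"temperature_C": ("x", [0.0, 100.0])}),
    postprocess=cmorise,
)
ds = view.open()
print(list(ds.data_vars))  # ['tas']
```

Whether something like this could hang off the manifest arrays themselves, rather than a wrapper object like this, is exactly the open question.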
E.g. in the example above, if the user fails to specify `decode_cf=True`, then they wind up getting temperature in degrees C, labelled as degrees F. This is obviously not great. We could bake `decode_cf=True` in as an opening kwarg, e.g. via intake, but it would be less error-prone to push it down as far as possible.
I've had a look through the codebase, but I've only just started thinking this idea through properly, and I'm not sure (a) how feasible it is, or (b) how useful it would be, so I thought I'd just ask!