Virtual CMORisation #949

@charles-turner-1

Description

Apologies if this has been talked about before - I've done a search through all the issues here but I didn't find anything.

I've just been playing with using virtualizarr to produce icechunk stores holding different views into a multi-file netCDF dataset. The idea is roughly something like 'virtual CMORisation', where we use virtualisation to produce (at least) two virtualised datasets from the same set of files.

I assume this could also be done with a 'raw' and a 'cmor' branch in the icechunk store (or something along those lines), but I'll avoid getting into that for now.

For example, we could have one virtual zarr store where we've just virtualised to make opening the dataset more efficient:

vds = vz.open_virtual_mfdataset(...)
vds.vz.to_icechunk(...)

and a separate virtual zarr store where we've done some renaming

vds = vz.open_virtual_mfdataset(...)
vds_cmorised = vds.rename({
    "non-cmor-name-1": "cmor-name-1",
    ...
})
vds_cmorised.vz.to_icechunk(...)

and one where we apply cf metadata to change units:

vds = vz.open_virtual_mfdataset(...)
vds_cmorised = vds.rename({
    "temperature_C" : "temperature_f", # dumb example but whatever
    ...
})
vds_cmorised["temperature_f"].attrs["add_offset"] = 32
vds_cmorised["temperature_f"].attrs["scale_factor"] = 1.8
vds_cmorised.vz.to_icechunk(...)

And then use the decode_cf option to apply those transformations when opening the dataset, i.e. xr.open_zarr(session.store, consolidated=False, decode_cf=True).
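For reference (a standalone illustration, not part of virtualizarr), CF "packed data" decoding computes decoded = raw * scale_factor + add_offset, which with the attributes above is exactly the Celsius-to-Fahrenheit conversion:

```python
# CF packed-data decoding, as applied when decode_cf=True:
#   decoded = raw * scale_factor + add_offset
def decode_cf_value(raw, scale_factor=1.0, add_offset=0.0):
    return raw * scale_factor + add_offset

# With scale_factor=1.8 and add_offset=32, 100 degC decodes to 212 degF.
print(decode_cf_value(100.0, scale_factor=1.8, add_offset=32))  # 212.0
```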


As far as I can tell, this is roughly the limit of this package's scope - so more complicated CMORisation, like computing derived variables (e.g. np.sqrt(u**2 + v**2) => wind speed), would be out of scope here. Where to actually implement this (if it's feasible) is another question, I think.

What I'm wondering is whether it would in principle be possible to take the ManifestArray object and attach something that looks like post-decode steps. Conceptually this would be similar to hooking a preprocess function, like the one in xr.open_mfdataset, straight into the decoding of the manifest arrays, so that a user would never need to specify how to preprocess their chunks - it would be baked into the virtualisation.
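To make the idea concrete, here's a hypothetical sketch (none of these names are real VirtualiZarr API) of an array wrapper that carries a baked-in post-decode step, so the transformation travels with the virtualised data instead of relying on the reader passing decode_cf=True:

```python
# Hypothetical sketch: a manifest-array-like wrapper with a baked-in
# post-decode callable. "PostDecodeArray" is invented for illustration.
import numpy as np

class PostDecodeArray:
    """Wraps raw chunk data plus a post-decode step (hypothetical)."""

    def __init__(self, raw, post_decode=lambda a: a):
        self.raw = raw
        self.post_decode = post_decode

    def __array__(self, dtype=None, copy=None):
        # Applied automatically whenever the array is materialised,
        # e.g. by np.asarray - the user never has to opt in.
        out = self.post_decode(np.asarray(self.raw))
        return out if dtype is None else out.astype(dtype)

celsius = np.array([0.0, 100.0])
as_fahrenheit = PostDecodeArray(celsius, post_decode=lambda a: a * 1.8 + 32)
print(np.asarray(as_fahrenheit))  # [ 32. 212.]
```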

E.g. in the example above, if the user fails to specify decode_cf=True, they wind up with temperature in degrees C labelled as degrees F. This is obviously not great. We could bake decode_cf=True in as an opening kwarg, e.g. via intake, but it would be less error-prone to push it down as far as possible.

I've had a look into the codebase, but I've only just started thinking through this idea properly, and I'm not sure a. how feasible it is, and b. how useful it would be, so I thought I'd just ask!
