
Feature/virtualizarr support #861

Closed

neilSchroeder wants to merge 7 commits into cubed-dev:main from neilSchroeder:feature/virtualizarr-support

Conversation

@neilSchroeder
Contributor

Virtual Array Support

What

This is a shot in the dark at implementing ManifestArray support in cubed.

Why

There's an ideal workflow that looks like:

  1. A file is stored as a NetCDF
  2. Open that file as a virtual dataset with VirtualiZarr
  3. Manipulate that array with Cubed
  4. Write to Zarr

This makes use of the zero-copy functionality of VirtualiZarr and allows for embarrassingly parallel workflows that start with NetCDFs and do not need intermediate Zarrs.
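Roughly, in code (a sketch only; the paths are placeholders, and the step 3 call stands in for the cubed support this PR proposes):

import virtualizarr

# Steps 1-2: the NetCDF already lives in object storage; opening it virtually
# reads only chunk references (path, offset, length), never the data itself.
vds = virtualizarr.open_virtual_dataset("s3://bucket/data.nc")  # placeholder path

# Step 3: build a lazy cubed computation graph on top of the ManifestArrays.
# This is the part the PR tries to enable; see the pseudo-code later in the thread.
result = some_cubed_computation(vds)  # hypothetical stand-in for any cubed graph

# Step 4: execute the graph and write the derived result out as a real Zarr store.
result.to_zarr("s3://bucket/output.zarr")  # placeholder path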

Some Comments

@tomwhite & @TomNicholas I would love your feedback on the implementation here. I sorta hacked this together and tried to get it working with a real-world example. It works for my use case now, but I'm not convinced I've come up with a particularly thorough/generalizable implementation.

@TomNicholas
Member

Hi @neilSchroeder! This is cool to see, but I'm very unclear on what problem you're trying to solve here. I don't understand what the purpose of step 3 (Manipulate that array with Cubed) is.

I must be missing something fundamental - maybe you could show us what your user-space code actually does?

@neilSchroeder
Contributor Author

neilSchroeder commented Jan 13, 2026

Hi @TomNicholas, happy to clarify. Here's some pseudo code that I think illustrates the problem:

# Step 1: Create virtual dataset
manifest_dataset = virtualizarr.open_virtual_mfdataset(
    s3_files,  # 8,760 file paths
    parallel="lithops"  # Uses serverless workers to read metadata
)
# Result: manifest_dataset contains only references
#   Example: {"path": "s3://bucket/file.nc", "offset": 1024, "length": 8192}
#   No actual data downloaded

# Step 2: Convert VirtualiZarr manifests to Cubed arrays
temperature = cubed.from_manifest(
    load_chunk=lambda key: fetch_from_s3(manifest[key]),  # Serializable loader
    shape=(8760, 39, 150, 150),
    chunks=(1, 39, 50, 50),
    dtype='float32'
)
# Result: Cubed array with lazy chunk loaders
#   Still no data movement - just a computation plan

# Step 3: Build computation DAG (still lazy)
pressure = cubed.from_manifest(...)
interpolated = cubed_operations.interp_to_pressure_levels(
    temperature, 
    pressure, 
    target_levels=[1000, 850, 500]  # Standard atmospheric levels
)
# Result: Computation graph, no execution yet

# Step 4: Execute on serverless workers (Lithops/Lambda)
interpolated.to_zarr(output_path, mode="w", consolidated=True)
# Result: Each worker:
#   1. Gets chunk key: (time=100, level=slice, y=0:50, x=0:50)
#   2. Calls load_chunk(key) → fetches ONLY that chunk from S3
#   3. Processes chunk → interpolates to pressure levels
#   4. Writes result to Zarr
# Data never passes through client

I was getting stuck trying to go directly from virtual datasets to cubed because of the ManifestArrays, so this is my attempt to solve that issue. I tried to find documentation on how to go about this, but couldn't find anything that would let me use lithops to get embarrassingly parallel output to zarr.
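For context, a minimal sketch of where this falls over (the path and variable name are illustrative): a ManifestArray only carries byte-range references, so anything that tries to load its values fails rather than fetching data.

import numpy as np
import virtualizarr

vds = virtualizarr.open_virtual_dataset("s3://bucket/file_0001.nc")  # illustrative path
marr = vds["temperature"].data  # a virtualizarr ManifestArray of chunk references

np.asarray(marr)  # expected to raise: the manifest holds references, not data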

@TomNicholas
Member

TomNicholas commented Jan 13, 2026

Thanks for the clarification, that's helpful.

IMO this is a very weird way to use VirtualiZarr.

You're never persisting the virtual references anywhere, so how is this any better than doing xr.open_mfdataset with chunked_array_type="cubed"?

(Admittedly the parallel='lithops' kwarg wouldn't work in xarray today, but that's a pretty easy addition, which I've suggested before somewhere.)
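For reference, that call would look roughly like this (the engine, chunks, and Spec arguments are illustrative; cubed-xarray needs to be installed so the "cubed" chunk manager is registered):

import cubed
import xarray as xr

ds = xr.open_mfdataset(
    s3_files,                      # the same 8,760 file paths
    engine="h5netcdf",             # assumption: netCDF4/HDF5 files
    combine="by_coords",
    chunks={},                     # keep the arrays lazy, one chunk per file variable
    chunked_array_type="cubed",    # back the variables with cubed arrays instead of dask
    from_array_kwargs={"spec": cubed.Spec(allowed_mem="2GB")},  # illustrative spec
)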

@TomNicholas
Member

TomNicholas commented Jan 13, 2026

I was getting stuck attempting to go directly from virtual datasets to cubed because of the ManifestArrays

You're supposed to serialize the contents of the ManifestArrays to Icechunk/Kerchunk before attempting to read back the data. That works today.
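For the Kerchunk route, a minimal sketch (filenames and storage options are illustrative) would be to write the references out as JSON and open them back through fsspec's reference filesystem:

import xarray as xr

# write the combined virtual dataset's chunk references to a kerchunk JSON file
manifest_dataset.vz.to_kerchunk("combined.json", format="json")

# open the references back as a normal dataset, backed by cubed arrays
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    chunked_array_type="cubed",
    backend_kwargs={
        "storage_options": {"fo": "combined.json", "remote_protocol": "s3"},
        "consolidated": False,
    },
)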

@neilSchroeder
Contributor Author

(Admittedly the parallel='lithops' kwarg wouldn't work in xarray today, but that's a pretty easy addition, which I've suggested before somewhere.)

This is a huge limiting factor for me. The dataset I'm trying to read is written such that each hourly timestep is its own file, so without lithops I'm stuck reading 8,761 files for each year, and there are some 120 years I'd like to process across 5 simulations and a host of grid sizes (on the order of 8,761 × 120 × 5 ≈ 5.3 million file opens per grid size).

I guess I could have gone the other way and implemented lithops in xarray, but this was the path that made the most sense to me.

@TomNicholas how would you have chosen to solve this?

@TomNicholas
Member

TomNicholas commented Jan 13, 2026

@TomNicholas how would you have chosen to solve this?

If I expected to open these datasets as a single datacube regularly in future, I would have simply done

manifest_dataset = virtualizarr.open_virtual_mfdataset(
    s3_files,  # 8,760 file paths
    parallel="lithops"  # Uses serverless workers to read metadata
)

# serialize the manifests as virtual chunks
ic_repo = ... # this could be arraylake's `client.get_repo()` of course
session = ic_repo.writable_session("main")
manifest_dataset.vz.to_icechunk(session.store)
session.commit("wrote all the virtual chunks")
# (would also need to set the virtual chunk containers)

# to access, just open the Icechunk store as normal, but as cubed arrays
session = ic_repo.readonly_session("main")
ds = xr.open_zarr(session.store, chunked_array_type="cubed")

If for some reason I didn't want to create or serialize virtual refs, then I would have looked at adding lithops parallelization to xarray.open_mfdataset.

@TomNicholas
Member

Does that make sense, @neilSchroeder? Did you try the approach I suggested instead?

@neilSchroeder
Contributor Author

Yeah @TomNicholas, I tried the approach you suggested. It's probably the right way to do things, as it does allow me to persist the virtual references, but it's a bit slower than I'd like. Skipping the virtual persistence was the point of my code, as it gives a significant speed boost going from the legacy product to the derived ARCO product. But it's not a great philosophical approach.

@TomNicholas
Member

TomNicholas commented Jan 14, 2026

it's a bit slower than I'd like. Skipping the virtual persistence was the point of my code, as it makes a significant speed boost from legacy product to derived ARCO product.

Why does that matter? Isn't persisting the virtualization a one-time cost? Whereas in your approach above you have to use an entire fleet of Lambdas to open the data every time...

@neilSchroeder
Contributor Author

It's a one-time cost assuming the data doesn't move, right? I'm not too worried about it moving in this case, so it is a one-time cost. But I could see a use case for less stable datasets.
