
Feature/virtualizarr support #861

Closed

neilSchroeder wants to merge 7 commits into cubed-dev:main from neilSchroeder:feature/virtualizarr-support

Conversation

@neilSchroeder
Contributor

Virtual Array Support

What

This is a shot in the dark at implementing ManifestArray support in cubed.

Why

There's an ideal workflow that looks like:

  1. A file is stored as a NetCDF
  2. Open that file as a virtual dataset with VirtualiZarr
  3. Manipulate that array with Cubed
  4. Write to Zarr

This makes use of the zero-copy functionality of VirtualiZarr and allows for embarrassingly parallel workflows that start with NetCDFs and do not need intermediate Zarrs.
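Roughly, in code (a sketch only; the paths are placeholders, and the step 3 call stands in for the cubed support this PR proposes):

import virtualizarr

# Steps 1-2: the NetCDF already lives in object storage; opening it virtually
# reads only chunk references (path, offset, length), never the data itself.
vds = virtualizarr.open_virtual_dataset("s3://bucket/data.nc")  # placeholder path

# Step 3: build a lazy cubed computation graph on top of the ManifestArrays.
# This is the part the PR tries to enable; see the pseudo-code later in the thread.
result = some_cubed_computation(vds)  # hypothetical stand-in for any cubed graph

# Step 4: execute the graph and write the derived result out as a real Zarr store.
result.to_zarr("s3://bucket/output.zarr")  # placeholder path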

Some Comments

@tomwhite & @TomNicholas I would love your feedback on the implementation here. I sorta hacked this together and tried to get it working with a real-world example. It works for my use case now, but I'm not convinced I've come up with a particularly thorough/generalizable implementation.

@TomNicholas
Member

Hi @neilSchroeder! This is cool to see, but I'm very unclear on what problem you're trying to solve here. I don't understand what the purpose of step 3 (Manipulate that array with Cubed) is.

I must be missing something fundamental - maybe you could show us what your user-space code actually does?

@neilSchroeder
Contributor Author

neilSchroeder commented Jan 13, 2026

Hi @TomNicholas, happy to clarify. Here's some pseudo code that I think illustrates the problem:

# Step 1: Create virtual dataset
manifest_dataset = virtualizarr.open_virtual_mfdataset(
    s3_files,  # 8,760 file paths
    parallel="lithops"  # Uses serverless workers to read metadata
)
# Result: manifest_dataset contains only references
#   Example: {"path": "s3://bucket/file.nc", "offset": 1024, "length": 8192}
#   No actual data downloaded

# Step 2: Convert VirtualiZarr manifests to Cubed arrays
temperature = cubed.from_manifest(
    load_chunk=lambda key: fetch_from_s3(manifest[key]),  # Serializable loader
    shape=(8760, 39, 150, 150),
    chunks=(1, 39, 50, 50),
    dtype='float32'
)
# Result: Cubed array with lazy chunk loaders
#   Still no data movement - just a computation plan

# Step 3: Build computation DAG (still lazy)
pressure = cubed.from_manifest(...)
interpolated = cubed_operations.interp_to_pressure_levels(
    temperature, 
    pressure, 
    target_levels=[1000, 850, 500]  # Standard atmospheric levels
)
# Result: Computation graph, no execution yet

# Step 4: Execute on serverless workers (Lithops/Lambda)
interpolated.to_zarr(output_path, mode="w", consolidated=True)
# Result: Each worker:
#   1. Gets chunk key: (time=100, level=slice, y=0:50, x=0:50)
#   2. Calls load_chunk(key) → fetches ONLY that chunk from S3
#   3. Processes chunk → interpolates to pressure levels
#   4. Writes result to Zarr
# Data never passes through client

I was getting stuck trying to go directly from virtual datasets to cubed because of the ManifestArrays, so this is my attempt to solve that issue. I tried to find documentation on how to go about this, but couldn't find anything that would let me use lithops to get embarrassingly parallel output to zarr.
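For context, a minimal sketch of where this falls over (the path and variable name are illustrative): a ManifestArray only carries byte-range references, so anything that tries to load its values fails rather than fetching data.

import numpy as np
import virtualizarr

vds = virtualizarr.open_virtual_dataset("s3://bucket/file_0001.nc")  # illustrative path
marr = vds["temperature"].data  # a virtualizarr ManifestArray of chunk references

np.asarray(marr)  # expected to raise: the manifest holds references, not data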

@TomNicholas
Member

TomNicholas commented Jan 13, 2026

Thanks for the clarification, that's helpful.

IMO this is a very weird way to use VirtualiZarr.

You're never persisting the virtual references anywhere, so how is this any better than doing xr.open_mfdataset with chunked_array_type="cubed"?

(Admittedly the parallel='lithops' kwarg wouldn't work in xarray today, but that's a pretty easy addition, which I've suggested before somewhere.)
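For reference, that call would look roughly like this (the engine, chunks, and Spec arguments are illustrative; cubed-xarray needs to be installed so the "cubed" chunk manager is registered):

import cubed
import xarray as xr

ds = xr.open_mfdataset(
    s3_files,                      # the same 8,760 file paths
    engine="h5netcdf",             # assumption: netCDF4/HDF5 files
    combine="by_coords",
    chunks={},                     # keep the arrays lazy, one chunk per file variable
    chunked_array_type="cubed",    # back the variables with cubed arrays instead of dask
    from_array_kwargs={"spec": cubed.Spec(allowed_mem="2GB")},  # illustrative spec
)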

@TomNicholas
Member

TomNicholas commented Jan 13, 2026

I was getting stuck attempting to go directly from virtual datasets to cubed because of the ManifestArrays

You're supposed to serialize the contents of the ManifestArrays to Icechunk/Kerchunk before attempting to read back the data. That works today.
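For the Kerchunk route, a minimal sketch (filenames and storage options are illustrative) would be to write the references out as JSON and open them back through fsspec's reference filesystem:

import xarray as xr

# write the combined virtual dataset's chunk references to a kerchunk JSON file
manifest_dataset.vz.to_kerchunk("combined.json", format="json")

# open the references back as a normal dataset, backed by cubed arrays
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    chunked_array_type="cubed",
    backend_kwargs={
        "storage_options": {"fo": "combined.json", "remote_protocol": "s3"},
        "consolidated": False,
    },
)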

@neilSchroeder
Contributor Author

(Admittedly the parallel='lithops' kwarg wouldn't work in xarray today, but that's a pretty easy addition, which I've suggested before somewhere.)

This is a huge limiting factor for me. The dataset I'm trying to read is written such that each hourly timestep is its own file, so without lithops I'm stuck reading 8,761 files for each year, and there are some 120 years I'd like to process across 5 simulations and a host of grid sizes (on the order of 8,761 × 120 × 5 ≈ 5.3 million file opens per grid size).

I guess I could have gone the other way and implemented lithops in xarray, but this was the path that made the most sense to me.

@TomNicholas how would you have chosen to solve this?

@TomNicholas
Member

TomNicholas commented Jan 13, 2026

@TomNicholas how would you have chosen to solve this?

If I expected to open these datasets as a single datacube regularly in future, I would have simply done

manifest_dataset = virtualizarr.open_virtual_mfdataset(
    s3_files,  # 8,760 file paths
    parallel="lithops"  # Uses serverless workers to read metadata
)

# serialize the manifests as virtual chunks
ic_repo = ... # this could be arraylake's `client.get_repo()` of course
session = ic_repo.writable_session("main")
manifest_dataset.vz.to_icechunk(session.store)
session.commit("wrote all the virtual chunks")
# (would also need to set the virtual chunk containers)

# to access, just open the Icechunk store as normal, but as cubed arrays
session = ic_repo.readonly_session("main")
ds = xr.open_zarr(session.store, chunked_array_type="cubed")

If for some reason I didn't want to create or serialize virtual refs, then I would have looked at adding lithops parallelization to xarray.open_mfdataset.

@TomNicholas
Member

Does that make sense, @neilSchroeder? Did you try the approach I suggested instead?

@neilSchroeder
Contributor Author

Yeah @TomNicholas, I tried the approach you suggested. It's probably the right way to do things, as it does allow me to persist the virtual references, but it's a bit slower than I'd like. Skipping the virtual persistence was the point of my code, as it gives a significant speed boost going from the legacy product to the derived ARCO product. But it's not a great philosophical approach.

@TomNicholas
Member

TomNicholas commented Jan 14, 2026

it's a bit slower than I'd like. Skipping the virtual persistence was the point of my code, as it makes a significant speed boost from legacy product to derived ARCO product.

Why does that matter? Isn't persisting the virtualization a one-time cost? Whereas in your approach above you have to use an entire fleet of Lambdas to open the data every time...

@neilSchroeder
Contributor Author

It's a one-time cost assuming the data doesn't move, right? I'm not too worried about it moving in this case, so it is a one-time cost. But I could see a use case for less stable datasets.
