Feature/virtualizarr support #861
Conversation
Hi @neilSchroeder! This is cool to see, but I'm very unclear what problem you're trying to solve here. I don't understand what the purpose of step […] is. I must be missing something fundamental - maybe you could show us what your user-space code actually does?
Hi @TomNicholas, happy to clarify. Here's some pseudocode that I think illustrates the problem:

```python
# Step 1: Create virtual dataset
manifest_dataset = virtualizarr.open_virtual_mfdataset(
    s3_files,            # 8,760 file paths
    parallel="lithops",  # uses serverless workers to read metadata
)
# Result: manifest_dataset contains only references
# Example: {"path": "s3://bucket/file.nc", "offset": 1024, "length": 8192}
# No actual data downloaded

# Step 2: Convert VirtualiZarr manifests to Cubed arrays
temperature = cubed.from_manifest(
    load_chunk=lambda key: fetch_from_s3(manifest[key]),  # serializable loader
    shape=(8760, 39, 150, 150),
    chunks=(1, 39, 50, 50),
    dtype="float32",
)
# Result: Cubed array with lazy chunk loaders
# Still no data movement - just a computation plan

# Step 3: Build computation DAG (still lazy)
pressure = cubed.from_manifest(...)
interpolated = cubed_operations.interp_to_pressure_levels(
    temperature,
    pressure,
    target_levels=[1000, 850, 500],  # standard atmospheric levels
)
# Result: computation graph, no execution yet

# Step 4: Execute on serverless workers (Lithops/Lambda)
interpolated.to_zarr(output_path, mode="w", consolidated=True)
# Result: each worker:
# 1. gets its chunk key: (time=100, level=slice, y=0:50, x=0:50)
# 2. calls load_chunk(key) -> fetches ONLY that chunk from S3
# 3. processes the chunk -> interpolates to pressure levels
# 4. writes the result to Zarr
# Data never passes through the client
```

I was getting stuck attempting to go directly from virtual datasets to Cubed because of the `ManifestArray`s, so this is my attempt to solve that issue. I tried to find documentation on how to go about this, but couldn't come up with anything that would let me use Lithops to get embarrassingly parallel output to Zarr.
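As an aside for readers: the "references only" idea can be illustrated with a minimal, self-contained sketch (no VirtualiZarr or S3 involved; all names here are hypothetical stand-ins). A manifest entry is just a path plus a byte range, so a worker can fetch exactly one chunk without reading the rest of the file:

```python
# A fake "object store": file path -> raw bytes (stand-in for S3).
store = {
    "s3://bucket/file.nc": bytes(range(256)) * 64,  # 16 KiB of dummy bytes
}

# A manifest entry, as in the example above: a reference, not data.
manifest = {
    ("temperature", 0, 0): {"path": "s3://bucket/file.nc", "offset": 1024, "length": 8192},
}

def load_chunk(key):
    """Fetch ONLY the referenced byte range for one chunk (hypothetical helper)."""
    ref = manifest[key]
    blob = store[ref["path"]]  # a real loader would issue an S3 ranged GET here
    return blob[ref["offset"]: ref["offset"] + ref["length"]]

chunk = load_chunk(("temperature", 0, 0))
print(len(chunk))  # 8192 - the chunk's bytes, and nothing else
```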
|
Thanks for the clarification, that's helpful. IMO this is a very weird way to use VirtualiZarr. You're never persisting the virtual references anywhere, so how is this any better than doing […]? (Admittedly the […])
You're supposed to serialize the contents of the `ManifestArray`s to Icechunk/Kerchunk before attempting to read back the data. That works today.
This is a huge limiting factor for me. The dataset I'm trying to read is written such that each hourly timestep is its own file, so without Lithops I'm stuck reading 8,761 files for each year; there are some 120 years I'd like to process, across 5 simulations and a host of grid sizes. I guess I could have gone the other way and implemented Lithops in xarray, but this was the path that made the most sense to me. @TomNicholas how would you have chosen to solve this?
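For scale, a rough back-of-envelope using the numbers above (my arithmetic; the grid-size variants are left out):

```python
files_per_year = 8761  # one file per hourly timestep
years = 120            # roughly 120 years per simulation
simulations = 5

total_files = files_per_year * years * simulations
print(f"{total_files:,} files' metadata to read")  # 5,256,600 files' metadata to read
```

Serial metadata reads at that count are clearly impractical, which is what motivates the `parallel="lithops"` open in the first place.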
If I expected to open these datasets as a single datacube regularly in future, I would have simply done:

```python
manifest_dataset = virtualizarr.open_virtual_mfdataset(
    s3_files,            # 8,760 file paths
    parallel="lithops",  # uses serverless workers to read metadata
)

# serialize the manifests as virtual chunks
ic_repo = ...  # this could be arraylake's `client.get_repo()` of course
session = ic_repo.writable_session("main")
manifest_dataset.vz.to_icechunk(session.store)
session.commit("wrote all the virtual chunks")
# (would also need to set the virtual chunk containers)

# to access, just open the Icechunk store as normal, but as cubed arrays
session = ic_repo.readonly_session("main")
ds = xr.open_zarr(session.store, chunked_array_type="cubed")
```

If for some reason I didn't want to create or serialize virtual refs, then I would have looked at adding lithops parallelization to […]
Does that make sense @neilSchroeder? Did you try the approach I suggested instead?
Yeah @TomNicholas, I tried the approach you suggested. It's probably the right way to do things, as it does allow me to persist the virtual references, but it's a bit slower than I'd like. Skipping the virtual persistence was the point of my code, as it gives a significant speed boost going from the legacy product to the derived ARCO product. But it's not a great philosophical approach.
Why does that matter? Isn't persisting the virtualization a one-time cost? Whereas in your approach above you have to use an entire fleet of lambdas to open the data every time...
It's a one-time cost assuming the data doesn't move, right? I'm not too worried about it moving in this case, so it is a one-time cost. But I could see a use case for less stable datasets.
Virtual Array Support
What
This is a shot in the dark at attempting to implement `ManifestArray` support in `cubed`.
Why
There's an ideal workflow that looks like: open a virtual multi-file dataset with VirtualiZarr, convert the resulting `ManifestArray`s to Cubed arrays, build the computation lazily, and execute straight to Zarr on serverless workers.
This makes use of the zero-copy functionality of VirtualiZarr and allows for embarrassingly parallel workflows that start with NetCDFs and do not need intermediate Zarrs.
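A library-free sketch of the laziness this workflow relies on (all names hypothetical; these are stand-ins for Cubed/Lithops, not their APIs): the per-chunk plan is composed first, and data moves only when the plan executes:

```python
loaded = []  # records which chunks were actually fetched

def load_chunk(key):
    loaded.append(key)           # stand-in for a ranged S3 read
    return key[0] * 10 + key[1]  # dummy "data"

def interp(value):
    return value + 0.5           # stand-in for interp_to_pressure_levels

chunk_keys = [(t, 0) for t in range(3)]

# Step 3 analogue: compose the plan; nothing runs, nothing is loaded yet.
plan = {key: (lambda k=key: interp(load_chunk(k))) for key in chunk_keys}
assert loaded == []

# Step 4 analogue: each "worker" pulls exactly its own chunk, then computes.
results = {key: task() for key, task in plan.items()}
print(results)  # {(0, 0): 0.5, (1, 0): 10.5, (2, 0): 20.5}
```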
Some Comments
@tomwhite & @TomNicholas I would love your feedback on the implementation here; I sorta hacked this together and tried to get it to work using a real-world example. It works for my use case now, but I'm not convinced I've come up with a particularly thorough/generalizable implementation.