Generalize ChunkManifest to hold native chunks as well as virtual refs #851

@TomNicholas

Generalizing the ChunkManifest class to hold native chunks as well as virtual refs would unlock several features.

It's needed for:

  1. Writing an IcechunkParser (see "Reading virtual references back out into VirtualiZarr Manifests", earth-mover/icechunk#104)
  2. Concatenating virtual and non-virtual data in memory (Icechunk already supports doing this on-disk via append_dim)
  3. ManifestStore.to_icechunk()/kerchunk(), and thereby making xarray an optional dependency (see "Make xarray an optional dependency?", #521)

The problem is that the current implementation of ChunkManifest uses a clever trick: it's just 3 numpy arrays in a trenchcoat. This gives us loads of stuff for free:

  • Efficient contiguous in-memory representation
  • Efficient handling of variable-length strings (via the variable-width string dtype added in NumPy 2)
  • Efficient functions for iterating over every element
  • Efficient multi-dimensional concat/stack functions for merging chunk manifests
  • No unusual or non-python dependencies
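To make the "3 numpy arrays" idea concrete, here is a minimal sketch of that layout. It is not the actual ChunkManifest code: `ThreeArrayManifest`, `make_manifest`, `concat_manifests`, `iter_refs`, and the `s3://bucket/...` paths are all hypothetical names chosen for illustration.

```python
from typing import NamedTuple

import numpy as np

# NumPy >= 2.0 provides a variable-width string dtype; fall back to object
# arrays on older versions so the sketch still runs.
_dtypes = getattr(np, "dtypes", None)
STR_DTYPE = _dtypes.StringDType() if hasattr(_dtypes, "StringDType") else object


class ThreeArrayManifest(NamedTuple):
    """Hypothetical stand-in: three arrays shaped like the chunk grid."""

    paths: np.ndarray    # file containing each chunk
    offsets: np.ndarray  # byte offset of each chunk within its file
    lengths: np.ndarray  # byte length of each chunk


def make_manifest(path, n_chunks, chunk_nbytes):
    """Build a 1 x n_chunks manifest of equally sized chunks in one file."""
    paths = np.array([[path] * n_chunks], dtype=STR_DTYPE)
    offsets = (np.arange(n_chunks) * chunk_nbytes).astype(np.uint64).reshape(1, n_chunks)
    lengths = np.full((1, n_chunks), chunk_nbytes, dtype=np.uint64)
    return ThreeArrayManifest(paths, offsets, lengths)


def concat_manifests(manifests, axis):
    """Merging manifests is just np.concatenate applied to each array."""
    return ThreeArrayManifest(
        *(
            np.concatenate([getattr(m, name) for m in manifests], axis=axis)
            for name in ThreeArrayManifest._fields
        )
    )


def iter_refs(manifest):
    """Yield (chunk_index, path, offset, length) for every chunk."""
    for idx in np.ndindex(manifest.paths.shape):
        yield (
            idx,
            str(manifest.paths[idx]),
            int(manifest.offsets[idx]),
            int(manifest.lengths[idx]),
        )


a = make_manifest("s3://bucket/a.nc", n_chunks=2, chunk_nbytes=100)
b = make_manifest("s3://bucket/b.nc", n_chunks=2, chunk_nbytes=100)
combined = concat_manifests([a, b], axis=0)  # chunk grid of shape (2, 2)
refs = list(iter_refs(combined))
```

Every entry here is a (path, offset, length) triple, so there is no natural slot for the bytes of a chunk that lives in memory rather than in a file.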

But I don't know how to keep that design and also store non-virtual chunks in arbitrary locations within those arrays.

Some alternatives:

  • Numpy object array (almost certainly very inefficient)
  • Keep the 3 numpy arrays, but have an auxiliary buffer in which to store non-virtual chunks in memory (proposed by @maxrjones on "Support in-lined Kerchunk references", #794)
  • Write our own in-memory chunk manifest class in Rust
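As a rough sketch of what the auxiliary-buffer option could look like (not the design proposed in #794, and not existing VirtualiZarr API), one could keep the three arrays and add a mapping from chunk index to raw bytes. `HybridManifest`, `inlined`, `get`, and `concat` are hypothetical names, and object-dtype path arrays are used only to keep the example portable.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class HybridManifest:
    """Hypothetical: three arrays plus a side buffer for in-memory chunks."""

    paths: np.ndarray    # "" marks a slot whose data lives in `inlined`
    offsets: np.ndarray
    lengths: np.ndarray
    inlined: dict = field(default_factory=dict)  # chunk index -> raw bytes

    def get(self, idx):
        """Return raw bytes for a native chunk, else a (path, offset, length) ref."""
        if idx in self.inlined:
            return self.inlined[idx]
        return (str(self.paths[idx]), int(self.offsets[idx]), int(self.lengths[idx]))


def concat(a, b, axis):
    """Concatenate the arrays and shift b's buffer keys along `axis`."""
    shift = a.paths.shape[axis]
    inlined = dict(a.inlined)
    for idx, data in b.inlined.items():
        shifted = list(idx)
        shifted[axis] += shift
        inlined[tuple(shifted)] = data
    return HybridManifest(
        paths=np.concatenate([a.paths, b.paths], axis=axis),
        offsets=np.concatenate([a.offsets, b.offsets], axis=axis),
        lengths=np.concatenate([a.lengths, b.lengths], axis=axis),
        inlined=inlined,
    )


# Two virtual chunks referencing byte ranges in a file.
virtual = HybridManifest(
    paths=np.array([["s3://bucket/a.nc", "s3://bucket/a.nc"]], dtype=object),
    offsets=np.array([[0, 100]], dtype=np.uint64),
    lengths=np.array([[100, 100]], dtype=np.uint64),
)

# One native chunk whose bytes are held in memory.
payload = b"\x00\x01\x02\x03"
native = HybridManifest(
    paths=np.array([[""]], dtype=object),
    offsets=np.array([[0]], dtype=np.uint64),
    lengths=np.array([[len(payload)]], dtype=np.uint64),
    inlined={(0, 0): payload},
)

merged = concat(virtual, native, axis=1)  # chunk grid of shape (1, 3)
```

The trade-off this sketch exposes is that concat/stack is no longer a pure array operation: the side buffer has to be re-keyed on every merge, which gives up some of the "for free" behaviour listed above.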
