Zarr Components

Zarr consists of several components, both abstract and concrete. These span both the physical storage layer and the conceptual structural layer. Zarr-related projects all follow the Zarr specification (and hence data model), but otherwise may choose to implement other layers however they wish.
The abstract components together describe what types of data can be stored in Zarr, and how to store them, without assuming that you are working in a particular programming language or with a particular storage system.
Specification: All zarr-related projects obey the Zarr Specification, which formally describes how to serialize and de-serialize array data as byte streams as well as store metadata via an Abstract Key-Value Store Interface. A system of Codecs is used to describe the encoding and serialization steps.
Data Model: The specification's description of the Stored Representation implies a particular data model, based on the HDF Abstract Data Model. It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.
Format: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format". Most, but not all, zarr implementations will serialize to this format.
Extensions: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific metadata conventions, new codecs, or additions to the data model via extension points. These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in.
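
To make the relationship between the data model, the key-value interface, and the native format more concrete, here is a minimal sketch using only the Python standard library. It assumes Zarr v3-style conventions (a `zarr.json` metadata document per node and chunk keys of the form `c/0/0`); the metadata contents are heavily simplified, and the names `store` and `write_native` are hypothetical rather than part of any Zarr library.

```python
import json
import tempfile
from pathlib import Path

# An abstract key-value store: string keys, byte values.
# The tree (root group -> "temperature" array -> one chunk) is implied by the keys.
# Metadata is a simplified illustration, not a complete Zarr metadata document.
store = {
    "zarr.json": json.dumps({"zarr_format": 3, "node_type": "group"}).encode(),
    "temperature/zarr.json": json.dumps(
        {"zarr_format": 3, "node_type": "array", "shape": [2, 2]}
    ).encode(),
    "temperature/c/0/0": b"\x00" * 16,  # placeholder chunk bytes
}


def write_native(store, root):
    """Map every key, unaltered, to a file path below `root` (hypothetical helper)."""
    for key, value in store.items():
        path = Path(root) / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


with tempfile.TemporaryDirectory() as tmp:
    write_native(store, tmp)
    # Collect the relative paths of all files that ended up on disk.
    written = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )

print(written)
```

Because the keys are used unaltered as relative paths, the files on disk mirror the key space exactly, which is the property the "Native Zarr Format" paragraph describes.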
The abstract components can be implemented concretely in any language. The canonical reference implementation is Zarr-Python, but there are many other implementations. Zarr-Python contains reference examples of useful constructs that can be re-implemented in other languages.
Abstract Base Classes: Zarr-Python's zarr.abc module contains abstract base classes that enforce a particular Python realization of the specification's Key-Value Store interface, based on a MutableMapping-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
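
As a rough illustration of what a MutableMapping-like store interface can look like, the sketch below defines a hypothetical abstract base class and an in-memory subclass using only the standard library. The names `KVStore`, `InMemoryStore`, and `list_prefix` are assumptions made for this example; the actual classes in zarr.abc may differ in names and method signatures.

```python
from abc import abstractmethod
from collections.abc import MutableMapping


class KVStore(MutableMapping):
    """Hypothetical store ABC: string keys map to bytes values."""

    @abstractmethod
    def list_prefix(self, prefix: str) -> list:
        """Return all keys that start with `prefix`."""


class InMemoryStore(KVStore):
    """Concrete store that keeps everything in a plain dict."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def list_prefix(self, prefix):
        return sorted(k for k in self._data if k.startswith(prefix))


store = InMemoryStore()
store["group/zarr.json"] = b"{}"
store["group/array/c/0"] = b"\x01\x02"
print(store.list_prefix("group/array/"))  # ['group/array/c/0']
```

The abstract class fixes the syntax for getting and setting values, while each subclass decides where the bytes actually live.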
Store Implementations: Zarr-python's zarr.storage module contains concrete implementations of the Store ABC for interacting with particular storage systems, such as a local filesystem or object storage.
These write data in the Native Zarr Format.
Most users of Zarr from Python are expected to use one of these implementations.
User API: Zarr-python's zarr.api module contains functions and classes for interacting with any concrete implementation of the zarr.abc.Store interface.
This allows user applications to use a standard zarr API to read and write from a variety of common storage systems.
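
The separation between a user API and interchangeable stores can be sketched with the standard library alone. In the example below, `save_array` and `load_array` play the role of a user API and only rely on `store[key]` access, so they work unchanged against a plain dict and a directory-backed store. All names here (`DirectoryStore`, `save_array`, `load_array`) are hypothetical, the metadata is a toy rather than valid Zarr metadata, and zlib stands in for a codec.

```python
import json
import struct
import tempfile
import zlib
from pathlib import Path


class DirectoryStore:
    """Toy store mapping keys to files below a root directory (hypothetical)."""

    def __init__(self, root):
        self.root = Path(root)

    def __setitem__(self, key, value):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def __getitem__(self, key):
        return (self.root / key).read_bytes()


def save_array(store, name, values, chunk_size):
    """Write a 1-D list of floats as compressed chunks plus toy metadata."""
    meta = {"length": len(values), "chunk_size": chunk_size}
    store[f"{name}/zarr.json"] = json.dumps(meta).encode()
    for start in range(0, len(values), chunk_size):
        chunk = values[start : start + chunk_size]
        raw = struct.pack(f"<{len(chunk)}d", *chunk)  # serialize to bytes
        store[f"{name}/c/{start // chunk_size}"] = zlib.compress(raw)  # codec step


def load_array(store, name):
    """Read the chunks back in order and decode them."""
    meta = json.loads(store[f"{name}/zarr.json"])
    n_chunks = -(-meta["length"] // meta["chunk_size"])  # ceiling division
    values = []
    for i in range(n_chunks):
        raw = zlib.decompress(store[f"{name}/c/{i}"])
        values.extend(struct.unpack(f"<{len(raw) // 8}d", raw))
    return values


data = [0.5, 1.5, 2.5, 3.5, 4.5]

# Same API calls, two different storage systems.
memory_store = {}
save_array(memory_store, "temperature", data, chunk_size=2)
from_memory = load_array(memory_store, "temperature")

with tempfile.TemporaryDirectory() as tmp:
    disk_store = DirectoryStore(tmp)
    save_array(disk_store, "temperature", data, chunk_size=2)
    from_disk = load_array(disk_store, "temperature")
```

The design choice illustrated here is that the read and write logic never inspects the store's type, so adding a new storage system only requires a new store class.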
One of Zarr's greatest strengths is its flexibility, or "hackability". In addition to the generality of using key-value stores as the main abstraction, individual projects can achieve powerful functionality by intelligently using only some of the Zarr components. Here are a few interesting zarr-related projects, which selectively make use of a subset of different zarr components, both abstract and concrete.
- MongoDBStore is a concrete store implementation in Python, which stores values in a MongoDB NoSQL database under zarr keys. It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.
- VirtualiZarr provides a concrete store implementation in Python (the ManifestStore) which stores, inside "chunk manifests", references to the locations and byte ranges of chunks that reside in files stored in other formats such as netCDF. These references are generated by "readers", which do the job of parsing the file structure and mapping the contents to the zarr data model. VirtualiZarr therefore eschews the native zarr format but still provides spec-compliant access to non-zarr-formatted data using zarr-python's API, without duplicating the original data. The manifests effectively act as an indirection layer between the zarr-spec-compliant key interface and the actual location of the chunks in storage.
- NCZarr and Lindi can both in some sense be considered the opposite of VirtualiZarr: they allow interacting with zarr-formatted data on disk via a non-zarr API. Lindi maps zarr's data model to the HDF data model and allows access via the h5py library through the LindiH5pyFile class. NCZarr allows interacting with zarr-formatted data via the netcdf-c library. Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended.
- Tensorstore is a general storage library written in C++ that can write to the Zarr format (so is a spec-compliant non-Python "native" store implementation) but also to other array formats such as N5. As it can write to multiple different storage systems, it effectively has its own set of concrete store implementations. Additional features are provided, notably using an Optionally-Cooperative Distributed B+Tree (OCDBT) on top of a base key-value store to implement ACID transactions. It still stores all data using the native Zarr Format, but versions keys at the store level.
- Icechunk is a cloud-native tensor storage engine which also provides ACID transactions, but does so via indirection between a zarr-spec-compliant key-value store interface and a specialized non-zarr-native storage layout on disk (for which Icechunk has its own format spec). Whilst the core icechunk client is written in Rust, the icechunk-python client implements a concrete subclass of the zarr-python Store ABC. Therefore libraries such as xarray can use the zarr-python user API to read and write to icechunk stores, effectively treating them as version-controlled zarr stores. Icechunk also integrates with VirtualiZarr as a serialization format for the byte range references, together allowing data stored in non-zarr formats to be committed to a persistent icechunk store and read back via the zarr-python API without copying the original data.
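
The indirection idea shared by VirtualiZarr and Icechunk, where a key resolves to a byte range stored somewhere else, can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `ManifestBackedStore`, the manifest layout `(path, offset, length)`, and the file `legacy.bin` are hypothetical, and the sketch does not reflect the actual ManifestStore implementation or Icechunk's on-disk format.

```python
import tempfile
from pathlib import Path


class ManifestBackedStore:
    """Read-only toy store: each key resolves to a byte range in an existing file."""

    def __init__(self, manifest):
        # manifest maps key -> (path, offset, length)
        self._manifest = manifest

    def __getitem__(self, key):
        path, offset, length = self._manifest[key]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file in some other format that already contains chunk data.
    legacy = Path(tmp) / "legacy.bin"
    legacy.write_bytes(b"HEADER" + b"AAAA" + b"BBBB" + b"FOOTER")

    # A "reader" would normally produce these references by parsing the file.
    manifest = {
        "data/c/0": (legacy, 6, 4),
        "data/c/1": (legacy, 10, 4),
    }
    store = ManifestBackedStore(manifest)
    chunk0 = store["data/c/0"]
    chunk1 = store["data/c/1"]

print(chunk0, chunk1)  # b'AAAA' b'BBBB'
```

The original file is never copied or rewritten; only the small manifest of references is new, which is why this pattern avoids duplicating data.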