Every Zarr array has a "data type", which defines the meaning and physical layout of the array's elements. As Zarr Python is tightly integrated with NumPy, it's easy to create arrays with NumPy data types:
>>> import zarr
>>> import numpy as np
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
>>> z
<Array memory:... shape=(10,) dtype=uint8>
Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr implementations in different programming languages. This means Zarr data types must be interpreted correctly when clients read an array. Each Zarr data type defines procedures for encoding and decoding both the data type itself and scalars of that data type to and from Zarr array metadata. These serialization procedures depend on the Zarr format.
Version 2 of the Zarr format defined its data types relative to NumPy's data types, and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data type is just the NumPy str attribute of that data type:
>>> import zarr
>>> import numpy as np
>>> import json
>>>
>>> store = {}
>>> np_dtype = np.dtype('int64')
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
>>> dtype_meta
'<i8'
>>> assert dtype_meta == np_dtype.str
Note
The < character in the data type metadata encodes the
endianness,
or "byte order", of the data type. Following NumPy's example,
in Zarr version 2 each data type has an endianness where applicable.
However, Zarr version 3 data types do not store endianness information.
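To make the byte-order point concrete, here is a minimal sketch that uses only Python's standard struct module rather than Zarr or NumPy. It packs the same 64-bit integer in little-endian ("<") and big-endian (">") order, and shows what happens if a reader assumes the wrong order.

```python
import struct

value = 1

# "<q" packs a signed 64-bit integer in little-endian order,
# ">q" packs the same integer in big-endian order.
little = struct.pack("<q", value)
big = struct.pack(">q", value)

print(little.hex())  # least significant byte comes first
print(big.hex())     # most significant byte comes first

# Reading little-endian bytes as if they were big-endian
# produces a different number, not an error.
misread = struct.unpack(">q", little)[0]
print(misread)
```

The silent misread is the reason byte order has to be recorded somewhere in the array metadata for multi-byte data types.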
In addition to defining a representation of the data type itself (which in the example above was
just a simple string "<i8"), Zarr also
defines a metadata representation for scalars associated with each data type. This is necessary
because Zarr arrays have a JSON-serializable fill_value attribute that defines a scalar value to use when reading
uninitialized chunks of a Zarr array.
Integer and float scalars are stored as JSON numbers, except for special floats like NaN,
positive infinity, and negative infinity, which are stored as strings.
More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in
JSON.
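The rule for float scalars can be sketched in plain Python. The helpers below are hypothetical and are not part of Zarr Python's API. They assume the string spellings "NaN", "Infinity", and "-Infinity" for the special values, and they ignore other encodings that a given format may allow.

```python
import json
import math

# Hypothetical helpers for illustration; not Zarr Python's API.
_SPECIAL_FLOATS = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}


def encode_float_scalar(value: float):
    """Return a JSON-compatible representation of a float scalar."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value  # ordinary floats stay JSON numbers


def decode_float_scalar(data) -> float:
    """Invert encode_float_scalar."""
    if isinstance(data, str):
        return _SPECIAL_FLOATS[data]
    return float(data)


encoded = [encode_float_scalar(v) for v in (1.5, math.nan, math.inf, -math.inf)]
print(json.dumps(encoded))
```

Strings are needed here because strict JSON has no literal for NaN or the infinities.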
Zarr V3 brings several key changes to how data types are represented:
- Zarr V3 identifies the basic data types as strings like "int8", "int16", etc. By contrast, Zarr V2 uses the NumPy character code representation for data types: in Zarr V2, int8 is represented as "|i1".
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte data types are defined with endianness information. Instead, Zarr V3 requires that endianness, where applicable, is specified in the codecs attribute of array metadata.
- While some Zarr V3 data types are identified by strings, others can be identified by a JSON object. For example, consider this specification of a datetime data type:
  { "name": "numpy.datetime64", "configuration": { "unit": "s", "scale_factor": 10 } }
  Zarr V2 generally uses structured string representations to convey the same information. The data type given in the previous example would be represented as the string ">M8[10s]" in Zarr V2. This is more compact, but can be harder to parse.
For more about data types in Zarr V3, see the V3 specification.
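Because a V3 data type may appear as either a bare string or a JSON object, a reader typically has to normalize both forms before interpreting them. The following is a minimal sketch of that step. The function name normalize_v3_dtype is hypothetical, and the sketch assumes the object form carries a "name" key and an optional "configuration" key, as in the datetime example above.

```python
import json


def normalize_v3_dtype(meta):
    """Return (name, configuration) for V3 data type metadata.

    Hypothetical helper: accepts either a bare string or a JSON object
    with a "name" key and an optional "configuration" key.
    """
    if isinstance(meta, str):
        return meta, {}
    if isinstance(meta, dict) and "name" in meta:
        return meta["name"], dict(meta.get("configuration", {}))
    raise TypeError(f"unsupported data type metadata: {meta!r}")


simple = normalize_v3_dtype(json.loads('"int8"'))
datetime_like = normalize_v3_dtype(
    json.loads(
        '{"name": "numpy.datetime64",'
        ' "configuration": {"unit": "s", "scale_factor": 10}}'
    )
)
print(simple)
print(datetime_like)
```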
The two Zarr formats that Zarr Python supports specify data types in two different ways:
data types in Zarr version 2 are encoded as NumPy-compatible strings, while data types in Zarr version
3 are encoded as either strings or JSON objects,
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.
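As a rough illustration of these differences, the sketch below splits a V2 data type string into a V3-style name and a byte order, then places the byte order in a codec entry. Everything here is hypothetical: the lookup table covers only a few numeric types, the function names are invented for this example, and the use of a "bytes" codec with an "endian" setting is an assumption about how a V3 document would typically express byte order.

```python
# Hypothetical lookup tables covering only a few integer and float types.
_V2_CODE_TO_V3_NAME = {
    "i1": "int8", "i2": "int16", "i4": "int32", "i8": "int64",
    "u1": "uint8", "u2": "uint16", "u4": "uint32", "u8": "uint64",
    "f4": "float32", "f8": "float64",
}
_BYTE_ORDER = {"<": "little", ">": "big", "|": None}


def split_v2_dtype(v2_dtype: str):
    """Split a V2 string such as '<i8' into (V3-style name, byte order)."""
    order, code = v2_dtype[0], v2_dtype[1:]
    return _V2_CODE_TO_V3_NAME[code], _BYTE_ORDER[order]


def v3_fragment_from_v2(v2_dtype: str) -> dict:
    """Build a partial V3-style metadata fragment from a V2 data type string."""
    name, endian = split_v2_dtype(v2_dtype)
    # Byte order moves out of the data type and into the codec configuration.
    configuration = {"endian": endian} if endian is not None else {}
    return {
        "data_type": name,
        "codecs": [{"name": "bytes", "configuration": configuration}],
    }


print(split_v2_dtype("<i8"))
print(v3_fragment_from_v2("<i8"))
print(v3_fragment_from_v2("|i1"))
```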
To abstract over these syntactic and semantic differences, Zarr Python uses a class called
ZDType to provide Zarr V2 and Zarr V3 compatibility
routines for "native" data types. In this context, a "native" data type is a Python class,
typically defined in another library, that models an array's data type. For example, np.uint8 is a native
data type defined in NumPy, which Zarr Python wraps with a ZDType subclass called
UInt8.
Each data type supported by Zarr Python is modeled by a ZDType subclass, which provides an
API for the following operations:
- Wrapping / unwrapping a native data type
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata.
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata.
Create a ZDType from a native data type:
>>> from zarr.core.dtype import Int8
>>> import numpy as np
>>> int8 = Int8.from_native_dtype(np.dtype('int8'))
Convert back to native data type:
>>> native_dtype = int8.to_native_dtype()
>>> assert native_dtype == np.dtype('int8')
Get the default scalar value for the data type:
>>> default_value = int8.default_scalar()
>>> assert default_value == np.int8(0)
Serialize to JSON for Zarr V2 and V3:
>>> json_v2 = int8.to_json(zarr_format=2)
>>> json_v2
{'name': '|i1', 'object_codec_id': None}
>>> json_v3 = int8.to_json(zarr_format=3)
>>> json_v3
'int8'
Serialize a scalar value to JSON:
>>> json_value = int8.to_json_scalar(42, zarr_format=3)
>>> json_value
42
Deserialize a scalar value from JSON:
>>> scalar_value = int8.from_json_scalar(42, zarr_format=3)
>>> assert scalar_value == np.int8(42)
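To show how these operations fit together, here is a toy class written with the standard library only. It is not Zarr Python's Int8 or ZDType. The class name ToyInt8 and its internals are hypothetical, and it only mirrors the method names and outputs shown in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyInt8:
    """Toy stand-in for an 8-bit signed integer wrapper (not Zarr Python's Int8)."""

    v2_name: str = "|i1"
    v3_name: str = "int8"

    def to_json(self, zarr_format: int):
        # V2 and V3 use different identifiers for the same data type.
        if zarr_format == 2:
            return {"name": self.v2_name, "object_codec_id": None}
        if zarr_format == 3:
            return self.v3_name
        raise ValueError(f"unsupported zarr_format: {zarr_format}")

    def to_json_scalar(self, value, zarr_format: int) -> int:
        # In this toy, integers become plain JSON numbers in both formats.
        return self._checked(value)

    def from_json_scalar(self, data, zarr_format: int) -> int:
        return self._checked(data)

    @staticmethod
    def _checked(value) -> int:
        # Reject values outside the signed 8-bit range.
        as_int = int(value)
        if not -128 <= as_int <= 127:
            raise ValueError(f"{value!r} does not fit in int8")
        return as_int


toy = ToyInt8()
print(toy.to_json(zarr_format=2))
print(toy.to_json(zarr_format=3))
print(toy.to_json_scalar(42, zarr_format=3))
```

Keeping the format-specific branching inside one class means calling code can pass zarr_format through without knowing how each format spells the data type.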