Our config right now contains this logic for defining a default encoding scheme for a given data type:
```python
"v2_default_filters": {
    "numeric": None,
    "string": [{"id": "vlen-utf8"}],
    "bytes": [{"id": "vlen-bytes"}],
    "raw": None,
},
"v3_default_filters": {"numeric": [], "string": [], "bytes": []},
"v3_default_serializer": {
    "numeric": {"name": "bytes", "configuration": {"endian": "little"}},
    "string": {"name": "vlen-utf8"},
    "bytes": {"name": "vlen-bytes"},
},
"v3_default_compressors": {
    "numeric": [
        {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
    ],
    "string": [
        {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
    ],
    "bytes": [
        {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
    ],
},
```
This approach is problematic because it requires dividing our data types into separate categories that are not well defined. For example, is a fixed-length utf32 data type a "string" type or a "numeric" type?
Given the changes coming in #2874, I propose the following alteration to our approach here:
- Pull this stuff out of the config entirely.
- Confine all this logic to a single function for automatically picking a chunk encoding based on a data type + a requested chunk encoding. This function should also check for incompatibility between a data type and a requested chunk encoding. For example, if someone requests a variable-length string data type but does not specify vlen-utf8 as a serializer, then they should get a clear, early error.
These would be breaking changes, but our current approach is, IMO, unworkable.
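To make the proposal concrete, here is a rough sketch of what such a function could look like. Everything in it is hypothetical: `ChunkEncoding`, `resolve_chunk_encoding`, `REQUIRED_SERIALIZER` and the string data type names are made up for illustration and are not part of zarr-python. It also assumes that any data type without a required serializer defaults to the `bytes` serializer. The codec defaults are copied from the current v3 config.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical sketch: none of these names exist in zarr-python.
# Data types are plain strings here to keep the example self-contained.

ZSTD = {"name": "zstd", "configuration": {"level": 0, "checksum": False}}
BYTES = {"name": "bytes", "configuration": {"endian": "little"}}

# Variable-length data types can only be stored with one specific serializer.
REQUIRED_SERIALIZER = {
    "variable_length_utf8": "vlen-utf8",
    "variable_length_bytes": "vlen-bytes",
}


@dataclass(frozen=True)
class ChunkEncoding:
    """A (possibly partial) chunk encoding; None means "choose a default"."""

    serializer: dict | None = None
    compressors: tuple | None = None
    filters: tuple | None = None


def resolve_chunk_encoding(
    dtype: str, requested: ChunkEncoding | None = None
) -> ChunkEncoding:
    """Fill in defaults for `dtype` and reject incompatible requests early."""
    requested = requested or ChunkEncoding()
    required = REQUIRED_SERIALIZER.get(dtype)

    serializer = requested.serializer
    if serializer is None:
        # The default comes from the data type itself, not from a category.
        # Assumption: everything without a required serializer uses "bytes".
        serializer = {"name": required} if required else BYTES
    elif required is not None and serializer["name"] != required:
        raise ValueError(
            f"Data type {dtype!r} requires the {required!r} serializer, "
            f"got {serializer['name']!r}."
        )

    compressors = (ZSTD,) if requested.compressors is None else requested.compressors
    filters = () if requested.filters is None else requested.filters
    return ChunkEncoding(serializer, compressors, filters)
```

With something like this, a fixed-length utf32 type never has to be classified as "string" or "numeric": `resolve_chunk_encoding("fixed_length_utf32")` simply falls through to the `bytes` serializer, while `resolve_chunk_encoding("variable_length_utf8", ChunkEncoding(serializer=BYTES))` raises a `ValueError` before any data is written.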
The config snippet quoted above comes from `zarr-python/src/zarr/core/config.py` (lines 85 to 107 in af55fcf).