Add label_multiset data type and Zarr-native codec#55
mkitti wants to merge 2 commits into zarr-developers:main
Defines a variable-width Zarr data type for label multisets, where each voxel holds a multiset of (uint64 labelId, uint32 count) pairs. Used by imglib2-label-multisets and the Paintera annotation tool for volumetric segmentation with multi-resolution downscaling support. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Defines an array-to-bytes codec for the label_multiset data type using an all-little-endian layout: listEntryOffsets[N] (uint32 LE) followed by listData (LE). Exploits per-chunk list deduplication for efficient storage of uniform regions. For N5 interoperability use n5_label_multiset inside n5_varlen instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clbarnes left a comment
The codec seems like it would be more broadly applicable than just label multisets: its label-multiset use isn't distinguished from any other case where you may want to hold an index into (large) data. The index could have its own codecs and location, just like the sharding_indexed codec. This could be used either for deduplication/compression, as it is here, or for making variable-length types partially decodable.
> Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`)
> is valid and occupies exactly 4 bytes.
>
> Entries within a list should be sorted by `labelId` in ascending unsigned order, with no
SHOULD be sorted, but I guess MUST not have duplicates?
> ## Reserved label IDs
>
> Five label ID values are reserved at the top of the unsigned 64-bit range:
BACKGROUND isn't at the top of the range
> ## Multiresolution
>
> Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group,
> compatible with the [OME-Zarr multiscales
> specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates
> the entry lists from its higher-resolution children, summing counts for matching label IDs.
This could be presented as a non-normative recommendation so it's not strictly tied to OME-Zarr. There is another proposal for multiscales from the geo community https://github.com/zarr-conventions/multiscales
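The aggregation rule quoted above (sum the counts for matching label IDs across the higher-resolution children) can be illustrated with a minimal Python sketch. The function name `downscale_voxel` and the list-of-tuples representation are hypothetical and chosen only for illustration; this is not taken from the PR or from imglib2-label-multisets.

```python
from collections import Counter


def downscale_voxel(children):
    """Merge the entry lists of the higher-resolution child voxels.

    `children` is a list of entry lists; each entry list is a list of
    (label_id, count) pairs. Counts for matching label IDs are summed.
    The result is sorted by label_id in ascending order (Python ints are
    non-negative here, so this matches unsigned ordering).
    """
    totals = Counter()
    for entries in children:
        for label_id, count in entries:
            totals[label_id] += count
    return sorted(totals.items())


# Four children of one downscaled voxel (hypothetical data).
children = [[(1, 1)], [(1, 1)], [(2, 1)], [(1, 1), (7, 3)]]
merged = downscale_voxel(children)
```

Because the merge only needs the child entry lists, a sketch like this is independent of how the resolution levels are grouped, whether by OME-Zarr multiscales or by another convention.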
> @@ -0,0 +1,173 @@
> # label_multiset data type
This seems like it could be described as a use case and conventions on top of a generic varlen/ list data type on top of a struct.
> A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤
> `0xFFFFFFFFFFFFFFFC`).
>
> ## ArgMax
This isn't really part of the data type, nor is it encoded; it seems like it's a convention in one type of processing for it. Maybe this could be in a "usage patterns" section, possibly along with the reserved label IDs above?
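As a rough illustration of the *regular* label test quoted above, here is a short Python sketch. It uses only the constants whose values appear in this PR (`BACKGROUND`, `MAX_ID`, and the `INVALID` fill value); the values of `OUTSIDE` and `TRANSPARENT` are not shown in this conversation, so they are left out. The function name `is_regular` is hypothetical.

```python
# Values stated in the PR description; OUTSIDE and TRANSPARENT are omitted
# because their values do not appear in this conversation.
BACKGROUND = 0x0
MAX_ID = 0xFFFFFFFFFFFFFFFC
INVALID = 0xFFFFFFFFFFFFFFFE  # used as the fill value in the example metadata

UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def is_regular(label_id: int) -> bool:
    """Return True if label_id is <= MAX_ID as an unsigned 64-bit integer."""
    if not 0 <= label_id <= UINT64_MAX:
        raise ValueError("label_id must fit in an unsigned 64-bit integer")
    return label_id <= MAX_ID
```

Under this definition, `BACKGROUND` and `MAX_ID` both count as regular even though they are listed among the reserved names, which is consistent with the earlier remark that `BACKGROUND` is not at the top of the range.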
Summary
Registers the `label_multiset` data type and its Zarr-native `label_multiset` array-to-bytes codec, used by imglib2-label-multisets and the Paintera connectome annotation tool.

Data type (`label_multiset`)
A variable-width data type where each voxel holds a multiset of `(uint64 labelId, uint32 count)` pairs. Key properties:
- Reserved label IDs: `BACKGROUND` (`0x0`), `MAX_ID` (`0xFFFFFFFFFFFFFFFC`), `OUTSIDE`, `INVALID`, `TRANSPARENT`
- Fill value: `"0xFFFFFFFFFFFFFFFE"` (INVALID singleton)

Codec (`label_multiset`)
An all-little-endian array-to-bytes codec with no configuration:
List deduplication is the key compression mechanism: voxels sharing identical entry lists reference the same byte offset, efficiently compressing uniform regions such as background.
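The deduplication idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not the codec's normative definition: it assumes the 4 bytes in `4 + 12·numEntries` are a uint32 `numEntries` prefix, that each entry is a uint64 `labelId` followed by a uint32 `count`, and that the offsets in `listEntryOffsets` are relative to the start of `listData`. The function name `encode_chunk` is hypothetical.

```python
import struct


def encode_chunk(voxels):
    """Encode N voxels as listEntryOffsets[N] (uint32 LE) followed by listData.

    `voxels` is a list of entry lists, each a list of (label_id, count) pairs.
    ASSUMPTIONS (not confirmed by the PR text shown here): each entry list is
    a uint32 numEntries prefix plus 12-byte (uint64, uint32) entries, and
    offsets are measured from the start of listData.
    """
    offsets = []
    list_data = bytearray()
    seen = {}  # serialized entry list -> byte offset within list_data

    for entries in voxels:
        entries = sorted(entries)  # ascending labelId
        blob = struct.pack("<I", len(entries))
        for label_id, count in entries:
            blob += struct.pack("<QI", label_id, count)
        if blob not in seen:
            # First time this exact list appears: append it once.
            seen[blob] = len(list_data)
            list_data += blob
        # Identical lists share the same offset.
        offsets.append(seen[blob])

    header = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + bytes(list_data)


# Two identical background voxels, one two-label voxel, one empty voxel.
voxels = [[(0, 1)], [(0, 1)], [(5, 2), (3, 1)], []]
encoded = encode_chunk(voxels)
decoded_offsets = struct.unpack("<4I", encoded[:16])
```

In this example the two identical background voxels point at the same offset, so their entry list is stored only once in `listData`.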
Example metadata
{
  "data_type": "label_multiset",
  "fill_value": "0xFFFFFFFFFFFFFFFE",
  "codecs": [
    {"name": "label_multiset"},
    {"name": "gzip", "configuration": {"level": 6}}
  ]
}

Notes
- `n5_label_multiset` codec (to be submitted separately) used inside `n5_varlen` (Add n5_varlen codec for N5 varlength block format #54).

Test plan
- `data-types/label_multiset/schema.json` and `codecs/label_multiset/schema.json` validate correctly

🤖 Generated with Claude Code