Add label_multiset data type and Zarr-native codec by mkitti · Pull Request #55 · zarr-developers/zarr-extensions

mkitti · 2026-04-13T19:07:51Z

Summary

Registers the label_multiset data type and its Zarr-native label_multiset array-to-bytes codec, used by imglib2-label-multisets and the Paintera connectome annotation tool.

Data type (`label_multiset`)

A variable-width data type where each voxel holds a multiset of (uint64 labelId, uint32 count) pairs. Key properties:

Five reserved label IDs: BACKGROUND (0x0), MAX_ID (0xFFFFFFFFFFFFFFFC), OUTSIDE, INVALID, TRANSPARENT
ArgMax: the label ID with the highest count (ties broken by smallest ID)
Fill value: JSON string "0xFFFFFFFFFFFFFFFE" (INVALID singleton)
Supports multi-resolution downscaling via OME-Zarr multiscales groups
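The ArgMax rule above can be sketched in a few lines of Python. This is only an illustration, not code from the PR: the `argmax` function name and the tuple representation of entries are assumptions, and the behaviour on an empty entry list (raising an error) is a guess, since the description does not say what ArgMax means for an empty multiset.

```python
from typing import Iterable, Tuple

Entry = Tuple[int, int]  # (labelId as unsigned 64-bit value, count)


def argmax(entries: Iterable[Entry]) -> int:
    """Return the label ID with the highest count; ties go to the smallest ID."""
    entries = list(entries)
    if not entries:
        # Assumption: the PR text does not define ArgMax for an empty list.
        raise ValueError("argmax of an empty entry list is undefined here")
    # Sort key: highest count first, then smallest label ID.
    # Python ints are unbounded, so IDs compare as unsigned values.
    label_id, _count = min(entries, key=lambda e: (-e[1], e[0]))
    return label_id
```

For example, `argmax([(3, 5), (1, 5), (7, 2)])` picks label 1, because labels 3 and 1 tie on count and the smaller ID wins.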

Codec (`label_multiset`)

An all-little-endian array-to-bytes codec with no configuration:

listEntryOffsets[0..N-1]  (uint32 LE each)   4·N bytes
listData  (all LE)                            remaining bytes

List deduplication is the key compression mechanism: voxels sharing identical entry lists reference the same byte offset, efficiently compressing uniform regions such as background.
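To make the layout and the deduplication concrete, here is a minimal Python sketch of an encoder and decoder. It rests on several assumptions that the summary does not confirm: N is the number of voxels in the chunk, each offset is a byte offset relative to the start of listData, and each entry list is a uint32 `numEntries` followed by (uint64 labelId, uint32 count) pairs, which matches the `4 + 12·numEntries` size in the README excerpt quoted in the review. The names `encode_chunk` and `decode_chunk` are hypothetical.

```python
import struct
from typing import Dict, List, Sequence, Tuple

EntryList = Tuple[Tuple[int, int], ...]  # ((labelId, count), ...)


def encode_chunk(voxels: Sequence[EntryList]) -> bytes:
    """Encode one chunk as listEntryOffsets followed by deduplicated listData."""
    offsets: List[int] = []
    list_data = bytearray()
    seen: Dict[EntryList, int] = {}  # entry list -> byte offset in list_data
    for entries in voxels:
        key = tuple(entries)
        if key not in seen:
            seen[key] = len(list_data)
            list_data += struct.pack("<I", len(key))  # numEntries, uint32 LE
            for label_id, count in key:
                list_data += struct.pack("<QI", label_id, count)  # 12 bytes
        offsets.append(seen[key])  # identical lists share one offset
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(list_data)


def decode_chunk(data: bytes, n_voxels: int) -> List[EntryList]:
    """Inverse of encode_chunk under the same layout assumptions."""
    offsets = struct.unpack_from(f"<{n_voxels}I", data, 0)
    base = 4 * n_voxels  # listData starts right after the offset table
    voxels: List[EntryList] = []
    for off in offsets:
        (num_entries,) = struct.unpack_from("<I", data, base + off)
        start = base + off + 4
        voxels.append(tuple(
            struct.unpack_from("<QI", data, start + 12 * i)
            for i in range(num_entries)
        ))
    return voxels
```

Under these assumptions, a chunk of four voxels in which three share the same single-entry list stores that list once: 16 bytes of offsets, one 16-byte list, and one 28-byte two-entry list, 60 bytes in total before any bytes-to-bytes codec such as gzip is applied.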

Example metadata

{
    "data_type": "label_multiset",
    "fill_value": "0xFFFFFFFFFFFFFFFE",
    "codecs": [
        {"name": "label_multiset"},
        {"name": "gzip", "configuration": {"level": 6}}
    ]
}
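As a rough sketch of how a reader might interpret this metadata, the following Python parses the `fill_value` hex string into an unsigned 64-bit label ID and builds a singleton entry list from it. The function name `parse_fill_value` is hypothetical, and the count of 1 for the singleton is an assumption; the description only says the fill value is the "INVALID singleton".

```python
import json

INVALID = 0xFFFFFFFFFFFFFFFE  # value taken from the fill_value in the example

METADATA = """
{
    "data_type": "label_multiset",
    "fill_value": "0xFFFFFFFFFFFFFFFE",
    "codecs": [
        {"name": "label_multiset"},
        {"name": "gzip", "configuration": {"level": 6}}
    ]
}
"""


def parse_fill_value(fill_value: str):
    """Turn a hex string such as "0xFFFFFFFFFFFFFFFE" into a singleton entry list."""
    label_id = int(fill_value, 16)  # base 16 accepts the "0x" prefix
    if not 0 <= label_id <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError("fill value does not fit in an unsigned 64-bit integer")
    # Assumption: a "singleton" holds the label with a count of 1.
    return ((label_id, 1),)


metadata = json.loads(METADATA)
fill_entries = parse_fill_value(metadata["fill_value"])
```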

Notes

For N5 interoperability with existing datasets, see the companion n5_label_multiset codec (to be submitted separately), which is used inside n5_varlen (#54, "Add n5_varlen codec for N5 varlength block format").
This PR is independent of #54 and can be reviewed and merged separately.

Test plan

Confirm data-types/label_multiset/schema.json and codecs/label_multiset/schema.json validate correctly
Verify fill value encoding and argmax semantics described in the README

🤖 Generated with Claude Code

Defines a variable-width Zarr data type for label multisets, where each voxel holds a multiset of (uint64 labelId, uint32 count) pairs. Used by imglib2-label-multisets and the Paintera annotation tool for volumetric segmentation with multi-resolution downscaling support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Defines an array-to-bytes codec for the label_multiset data type using an all-little-endian layout: listEntryOffsets[N] (uint32 LE) followed by listData (LE). Exploits per-chunk list deduplication for efficient storage of uniform regions. For N5 interoperability use n5_label_multiset inside n5_varlen instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

clbarnes

The codec seems like it would be more broadly applicable than just label multisets: nothing about it is specific to label multisets, as opposed to any other case where you want to hold an index into (large) data. The index could have its own codecs and location, just like in the sharding_indexed codec. This could be used either for deduplication/compression, as it is here, or for making variable-length types partially decodable.

clbarnes · 2026-04-24T09:58:51Z

+Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`)
+is valid and occupies exactly 4 bytes.
+
+Entries within a list should be sorted by `labelId` in ascending unsigned order, with no


SHOULD be sorted, but I guess MUST not have duplicates?

clbarnes · 2026-04-24T10:01:40Z

+
+## Reserved label IDs
+
+Five label ID values are reserved at the top of the unsigned 64-bit range:


BACKGROUND isn't at the top of the range

clbarnes · 2026-04-24T10:05:39Z

+## Multiresolution
+
+Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group,
+compatible with the [OME-Zarr multiscales
+specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates
+the entry lists from its higher-resolution children, summing counts for matching label IDs.


This could be presented as a non-normative recommendation so that it isn't strictly tied to OME-Zarr. There is another multiscales proposal from the geo community: https://github.com/zarr-conventions/multiscales

clbarnes · 2026-04-24T10:38:14Z

@@ -0,0 +1,173 @@
+# label_multiset data type


This seems like it could be described as a use case and conventions on top of a generic varlen/ list data type on top of a struct.

clbarnes · 2026-04-24T10:42:09Z

+A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤
+`0xFFFFFFFFFFFFFFFC`).
+
+## ArgMax


This isn't really part of the data type, nor is it encoded; it seems like it's a convention in one type of processing for it. Maybe this could be in a "usage patterns" section, possibly along with the reserved label IDs above?

mkitti and others added 2 commits April 13, 2026 14:59

mkitti mentioned this pull request Apr 13, 2026

Add n5_label_multiset codec for N5 legacy label multiset payload #56

Draft

3 tasks

clbarnes reviewed Apr 24, 2026

clbarnes mentioned this pull request Apr 28, 2026

generic container data types #57

Open

mkitti wants to merge 2 commits into zarr-developers:main from mkitti:mkitti-label-multiset
