Add label_multiset data type and Zarr-native codec #55

Open
mkitti wants to merge 2 commits into zarr-developers:main from mkitti:mkitti-label-multiset
@mkitti mkitti commented Apr 13, 2026

Summary

Registers the label_multiset data type and its Zarr-native label_multiset array-to-bytes codec, used by imglib2-label-multisets and the Paintera connectome annotation tool.

Data type (label_multiset)

A variable-width data type where each voxel holds a multiset of (uint64 labelId, uint32 count) pairs. Key properties:

  • Five reserved label IDs: BACKGROUND (0x0), MAX_ID (0xFFFFFFFFFFFFFFFC), OUTSIDE, INVALID, TRANSPARENT
  • ArgMax: the label ID with the highest count (ties broken by smallest ID)
  • Fill value: JSON string "0xFFFFFFFFFFFFFFFE" (INVALID singleton)
  • Supports multi-resolution downscaling via OME-Zarr multiscales groups
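
The ArgMax rule in the list above can be sketched as follows. This is an illustrative helper, not code from the PR: the name `argmax_label` and the choice to raise on an empty entry list are assumptions, since the summary does not say what ArgMax yields for an empty multiset.

```python
from typing import Iterable, Tuple

Entry = Tuple[int, int]  # (labelId as uint64, count as uint32)


def argmax_label(entries: Iterable[Entry]) -> int:
    """Return the label ID with the highest count; ties go to the smallest ID."""
    entries = list(entries)
    if not entries:
        # Assumption: the behaviour for an empty list is not stated in the summary.
        raise ValueError("argmax is undefined for an empty entry list")
    # Order by descending count, then ascending label ID, and take the first.
    best_id, _ = min(entries, key=lambda e: (-e[1], e[0]))
    return best_id
```

For example, `argmax_label([(5, 3), (2, 3), (9, 1)])` picks label 2, because labels 5 and 2 tie on count and 2 is the smaller ID.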

Codec (label_multiset)

An all-little-endian array-to-bytes codec with no configuration:

listEntryOffsets[0..N-1]  (uint32 LE each)   4·N bytes
listData  (all LE)                            remaining bytes

List deduplication is the key compression mechanism: voxels sharing identical entry lists reference the same byte offset, efficiently compressing uniform regions such as background.
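
A minimal sketch of the layout and the deduplication step, under stated assumptions: each entry list is serialized as a uint32 `numEntries` followed by 12-byte `(uint64 labelId, uint32 count)` entries (consistent with the `4 + 12·numEntries` size quoted from the README later in this thread), and offsets are measured in bytes from the start of `listData`. The function name `encode_chunk` is hypothetical and the README remains the authority on the exact format.

```python
import struct
from typing import Dict, List, Sequence, Tuple

Entry = Tuple[int, int]  # (labelId as uint64, count as uint32)


def encode_chunk(voxels: Sequence[Sequence[Entry]]) -> bytes:
    """Encode N voxels as listEntryOffsets[0..N-1] followed by listData."""
    offsets: List[int] = []
    list_data = bytearray()
    seen: Dict[bytes, int] = {}  # serialized entry list -> byte offset in listData
    for entries in voxels:
        # Assumed per-list layout: uint32 numEntries, then (uint64, uint32) pairs, all LE.
        blob = struct.pack("<I", len(entries)) + b"".join(
            struct.pack("<QI", label_id, count) for label_id, count in entries
        )
        offset = seen.get(blob)
        if offset is None:
            # First time this list appears: append it and remember where it lives.
            offset = len(list_data)
            seen[blob] = offset
            list_data += blob
        offsets.append(offset)
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(list_data)
```

With four voxels, three of which hold the same single-entry background list, `listData` stores that list once, so the encoded chunk is 16 bytes of offsets plus 16 + 28 bytes of list data.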

Example metadata

{
    "data_type": "label_multiset",
    "fill_value": "0xFFFFFFFFFFFFFFFE",
    "codecs": [
        {"name": "label_multiset"},
        {"name": "gzip", "configuration": {"level": 6}}
    ]
}
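
As a quick check of the fill value above, the hex string can be parsed into an unsigned 64-bit integer and compared against `MAX_ID`. This is only a sketch: `parse_fill_value` is a hypothetical helper, not part of any Zarr library.

```python
import json

MAX_ID = 0xFFFFFFFFFFFFFFFC  # largest regular label ID, per the summary above

metadata = json.loads("""
{
    "data_type": "label_multiset",
    "fill_value": "0xFFFFFFFFFFFFFFFE",
    "codecs": [
        {"name": "label_multiset"},
        {"name": "gzip", "configuration": {"level": 6}}
    ]
}
""")


def parse_fill_value(text: str) -> int:
    """Parse a hex string such as "0xFFFFFFFFFFFFFFFE" into a uint64 value."""
    value = int(text, 16)  # base 16 accepts the "0x" prefix
    if not 0 <= value <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError(f"fill value {text!r} does not fit in uint64")
    return value


fill = parse_fill_value(metadata["fill_value"])
is_regular = fill <= MAX_ID  # False: INVALID lies above MAX_ID
```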

Notes

Test plan

  • Confirm data-types/label_multiset/schema.json and codecs/label_multiset/schema.json validate correctly
  • Verify fill value encoding and argmax semantics described in the README

🤖 Generated with Claude Code

mkitti and others added 2 commits April 13, 2026 14:59
Defines a variable-width Zarr data type for label multisets, where
each voxel holds a multiset of (uint64 labelId, uint32 count) pairs.
Used by imglib2-label-multisets and the Paintera annotation tool for
volumetric segmentation with multi-resolution downscaling support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Defines an array-to-bytes codec for the label_multiset data type using
an all-little-endian layout: listEntryOffsets[N] (uint32 LE) followed
by listData (LE). Exploits per-chunk list deduplication for efficient
storage of uniform regions. For N5 interoperability use n5_label_multiset
inside n5_varlen instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@clbarnes clbarnes left a comment

The codec seems more broadly applicable than just label multisets: nothing about it is specific to label multisets as opposed to any other case where you want to hold an index into (large) data. The index could have its own codecs and location, just like in the sharding_indexed codec. This could be used either for deduplication/compression, as it is here, or for making variable-length types partially decodable.

Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`)
is valid and occupies exactly 4 bytes.

Entries within a list should be sorted by `labelId` in ascending unsigned order, with no

SHOULD be sorted, but I guess MUST not have duplicates?
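
If the README adopted the reading suggested here (sorting as a SHOULD, uniqueness as a MUST), a validator might treat the two differently, as in the sketch below. The function name and the error-versus-warning split are assumptions for illustration; the PR has not settled this.

```python
from typing import List, Sequence, Tuple

Entry = Tuple[int, int]  # (labelId as uint64, count as uint32)


def check_entry_list(entries: Sequence[Entry]) -> List[str]:
    """Raise on duplicate label IDs (MUST); return warnings for unsorted lists (SHOULD)."""
    label_ids = [label_id for label_id, _ in entries]
    if len(set(label_ids)) != len(label_ids):
        raise ValueError("duplicate labelId in entry list")
    warnings: List[str] = []
    # Non-negative Python ints compare the same way as unsigned 64-bit values.
    if label_ids != sorted(label_ids):
        warnings.append("entries are not sorted by labelId in ascending order")
    return warnings
```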


## Reserved label IDs

Five label ID values are reserved at the top of the unsigned 64-bit range:

BACKGROUND isn't at the top of the range

Comment on lines +160 to +165
## Multiresolution

Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group,
compatible with the [OME-Zarr multiscales
specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates
the entry lists from its higher-resolution children, summing counts for matching label IDs.

This could be presented as a non-normative recommendation so it's not strictly tied to OME-Zarr. There is another proposal for multiscales from the geo community: https://github.com/zarr-conventions/multiscales
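
Independent of which multiscales convention is used, the aggregation rule in the quoted passage (summing counts for matching label IDs across the higher-resolution children) can be sketched as below. `aggregate_children` is a hypothetical name; the sketch returns entries sorted by label ID and does not handle counts overflowing uint32.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Entry = Tuple[int, int]  # (labelId as uint64, count as uint32)


def aggregate_children(children: Iterable[Sequence[Entry]]) -> List[Entry]:
    """Merge the entry lists of child voxels into one downscaled entry list."""
    totals: Counter = Counter()
    for entries in children:
        for label_id, count in entries:
            totals[label_id] += count  # matching label IDs accumulate
    # Emit entries in ascending labelId order.
    return sorted(totals.items())
```

For example, children `[(1, 2)]`, `[(1, 1), (5, 3)]` and an empty list aggregate to `[(1, 3), (5, 3)]`.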

@@ -0,0 +1,173 @@
# label_multiset data type

This seems like it could be described as a use case and a set of conventions on top of a generic varlen/list data type, itself on top of a struct.

A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤
`0xFFFFFFFFFFFFFFFC`).

## ArgMax

This isn't really part of the data type, nor is it encoded; it seems like it's a convention in one type of processing for it. Maybe this could be in a "usage patterns" section, possibly along with the reserved label IDs above?
