From c52cc638257aff1a75b4d1d6d21da82c8c7412b5 Mon Sep 17 00:00:00 2001 From: Mark Kittisopikul Date: Mon, 13 Apr 2026 14:52:57 -0400 Subject: [PATCH 1/2] Add label_multiset data type Defines a variable-width Zarr data type for label multisets, where each voxel holds a multiset of (uint64 labelId, uint32 count) pairs. Used by imglib2-label-multisets and the Paintera annotation tool for volumetric segmentation with multi-resolution downscaling support. Co-Authored-By: Claude Sonnet 4.6 --- data-types/label_multiset/README.md | 173 ++++++++++++++++++++++++++ data-types/label_multiset/schema.json | 20 +++ 2 files changed, 193 insertions(+) create mode 100644 data-types/label_multiset/README.md create mode 100644 data-types/label_multiset/schema.json diff --git a/data-types/label_multiset/README.md b/data-types/label_multiset/README.md new file mode 100644 index 0000000..f46797c --- /dev/null +++ b/data-types/label_multiset/README.md @@ -0,0 +1,173 @@ +# label_multiset data type + +Defines a variable-width data type for label multisets, where each array element holds a +multiset of label IDs, each with a non-negative integer count. This data type is used by +[imglib2-label-multisets](https://github.com/saalfeldlab/imglib2-label-multisets) for +volumetric segmentation data, particularly in the +[Paintera](https://github.com/saalfeldlab/paintera) connectome annotation tool. + +## Background + +A label multiset voxel stores a multiset of label IDs, each carrying a non-negative integer +count representing the number of occurrences of that label. At full resolution, every voxel +is typically a singleton — one label with count 1. After downsampling, a voxel may represent +many labels aggregated from higher-resolution voxels, with counts recording how many +sub-voxels carried each label. + +## Data type representation + +### Name + +The name of this data type is the string `"label_multiset"`. + +### Configuration + +No configuration is required or permitted for this data type. 
+ +## Element structure + +Each array element is a list of `(labelId, count)` pairs: + +| Field | Type | Description | +|-----------|--------|-------------| +| `labelId` | uint64 | Label identifier | +| `count` | uint32 | Number of occurrences | + +The list should be sorted by `labelId` in ascending unsigned order; duplicate label IDs +must not appear (their counts must be summed). An empty list (zero pairs) is valid and +represents a voxel with no label information. + +## Reserved label IDs + +Five label ID values are named; all except `BACKGROUND` lie at the top of the unsigned 64-bit range: + +| Name | uint64 value (hex) | int64 value | Meaning | +|---------------|------------------------|-------------|---------| +| `BACKGROUND` | `0x0000000000000000` | `0` | Background label | +| `MAX_ID` | `0xFFFFFFFFFFFFFFFC` | `-4` | Largest usable regular label ID | +| `OUTSIDE` | `0xFFFFFFFFFFFFFFFD` | `-3` | Voxel is outside the dataset bounds | +| `INVALID` | `0xFFFFFFFFFFFFFFFE` | `-2` | Uninitialized / no data | +| `TRANSPARENT` | `0xFFFFFFFFFFFFFFFF` | `-1` | Fully transparent (display hint) | + +A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤ +`0xFFFFFFFFFFFFFFFC`). + +## ArgMax + +The **argmax** of a voxel's multiset is the label ID with the highest count. Ties are +broken by the smaller label ID (unsigned comparison). If the multiset is empty, the argmax +is `INVALID` (`0xFFFFFFFFFFFFFFFE`). + +``` +argmax := INVALID +maxCount := 0 +for each (labelId, count) in entries: + if count > maxCount or (count == maxCount and labelId < argmax): + argmax = labelId + maxCount = count +``` + +The argmax is useful as a scalar integer projection of the multiset for visualization and +interoperability with single-label data consumers. + +## Fill value representation + +The `fill_value` field in array metadata must be a JSON string containing the hexadecimal +representation of a uint64 label ID. 
This label ID represents the sole element of a +singleton multiset with count 1: + +- `"0xFFFFFFFFFFFFFFFE"` — singleton `{INVALID → 1}` (canonical fill value) +- `"0x0000000000000000"` — singleton `{BACKGROUND → 1}` + +## Codec compatibility + +This data type must be used with exactly one array-to-bytes codec from the following: + +- [`"label_multiset"`](../../codecs/label_multiset/README.md): Zarr v3 native + serialization (all little-endian). Recommended for new arrays. +- [`"n5_varlen"`](../../codecs/n5_varlen/README.md): N5 varlength block format, + for interoperability with existing N5-based label multiset datasets. The first inner + codec inside `n5_varlen` must be + [`"n5_label_multiset"`](../../codecs/n5_label_multiset/README.md). + +Optional bytes-to-bytes codecs (e.g., `gzip`, `blosc`, `zstd`) may follow the +array-to-bytes codec (or be placed inside `n5_varlen`'s inner codec chain). + +## Array metadata example + +For a new Zarr v3 label multiset array: + +```json +{ + "zarr_format": 3, + "node_type": "array", + "shape": [80, 64, 64], + "data_type": "label_multiset", + "chunk_grid": { + "name": "regular", + "configuration": { + "chunk_shape": [32, 32, 32] + } + }, + "chunk_key_encoding": {"name": "default"}, + "fill_value": "0xFFFFFFFFFFFFFFFE", + "codecs": [ + {"name": "label_multiset"}, + {"name": "gzip", "configuration": {"level": 6}} + ], + "attributes": { + "label_multisets": true, + "maxId": 99 + } +} +``` + +For reading an existing N5 label multiset dataset: + +```json +{ + "zarr_format": 3, + "node_type": "array", + "shape": [80, 64, 64], + "data_type": "label_multiset", + "chunk_grid": { + "name": "regular", + "configuration": { + "chunk_shape": [32, 32, 32] + } + }, + "chunk_key_encoding": { + "name": "v2", + "configuration": {"separator": "/"} + }, + "fill_value": "0xFFFFFFFFFFFFFFFE", + "codecs": [ + { + "name": "n5_varlen", + "configuration": { + "codecs": [ + {"name": "n5_label_multiset"} + ] + } + } + ], + "attributes": { + 
"label_multisets": true + } +} +``` + +## Multiresolution + +Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group, +compatible with the [OME-Zarr multiscales +specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates +the entry lists from its higher-resolution children, summing counts for matching label IDs. + +## Change log + +No changes yet. + +## Current maintainers + +* [Mark Kittisopikul](https://github.com/mkitti) diff --git a/data-types/label_multiset/schema.json b/data-types/label_multiset/schema.json new file mode 100644 index 0000000..115d4fe --- /dev/null +++ b/data-types/label_multiset/schema.json @@ -0,0 +1,20 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "oneOf": [ + { + "type": "object", + "properties": { + "name": { + "const": "label_multiset" + }, + "configuration": { + "type": "object", + "additionalProperties": false + } + }, + "required": ["name"], + "additionalProperties": false + }, + { "const": "label_multiset" } + ] +} From cf9eca3149a838fef728d25afa7f72cf1a4e2c3e Mon Sep 17 00:00:00 2001 From: Mark Kittisopikul Date: Mon, 13 Apr 2026 14:53:30 -0400 Subject: [PATCH 2/2] Add label_multiset codec for Zarr-native label multiset serialization Defines an array-to-bytes codec for the label_multiset data type using an all-little-endian layout: listEntryOffsets[N] (uint32 LE) followed by listData (LE). Exploits per-chunk list deduplication for efficient storage of uniform regions. For N5 interoperability use n5_label_multiset inside n5_varlen instead. 
Co-Authored-By: Claude Sonnet 4.6 --- codecs/label_multiset/README.md | 116 ++++++++++++++++++++++++++++++ codecs/label_multiset/schema.json | 20 ++++++ 2 files changed, 136 insertions(+) create mode 100644 codecs/label_multiset/README.md create mode 100644 codecs/label_multiset/schema.json diff --git a/codecs/label_multiset/README.md b/codecs/label_multiset/README.md new file mode 100644 index 0000000..f9f29d9 --- /dev/null +++ b/codecs/label_multiset/README.md @@ -0,0 +1,116 @@ +# label_multiset codec + +Defines an `array -> bytes` codec that serializes arrays of the +[`label_multiset`](../../data-types/label_multiset/README.md) data type into a compact +binary representation using the Zarr-native (all little-endian) format. The codec exploits +per-chunk list deduplication: voxels sharing identical entry lists reference the same +offset, which compresses regions of uniform labeling (e.g., background) very efficiently. + +For N5 interoperability with existing imglib2-label-multisets datasets, use the +[`n5_label_multiset`](../n5_label_multiset/README.md) codec inside +[`n5_varlen`](../n5_varlen/README.md) instead. + +## Codec name + +The value of the `name` member in the codec object MUST be `label_multiset`. + +## Configuration parameters + +No configuration is required or permitted for this codec. + +## Compatibility + +This codec is only compatible with the +[`"label_multiset"`](../../data-types/label_multiset/README.md) data type. + +## Example + +```json +{ + "data_type": "label_multiset", + "codecs": [{"name": "label_multiset"}] +} +``` + +## Format and algorithm + +This is an `array -> bytes` codec. The chunk contains `N` voxels in the chunk-linearization +order defined by the chunk grid. `N` equals the product of the chunk shape dimensions +(partial boundary chunks are treated as padded to the full chunk shape for the purposes of +this codec). 
+ +### Chunk layout + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ listEntryOffsets[0..N-1] (uint32 each, little-endian) │ 4·N bytes +├──────────────────────────────────────────────────────────────────┤ +│ listData (variable, all little-endian) │ remaining bytes +└──────────────────────────────────────────────────────────────────┘ +``` + +Total encoded size: `4·N + listDataSize` bytes. + +### `listEntryOffsets` array + +One entry per voxel, in chunk-linearization (C / row-major) order. Each entry is an +unsigned 32-bit little-endian byte offset into the `listData` region where that voxel's +entry list begins. Multiple voxels may share the same offset (list deduplication). + +### `listData` region + +A concatenation of unique entry lists in the order they were first encountered during +encoding. Each entry list has the following structure: + +``` +Offset Size Endian Content +------ ---- ------ ------------------------------------------- +0 4 LE numEntries (uint32) — number of entries +4 8 LE entries[0].labelId (uint64) +12 4 LE entries[0].count (uint32) +16 8 LE entries[1].labelId (uint64) +24 4 LE entries[1].count (uint32) +... +``` + +Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`) +is valid and occupies exactly 4 bytes. + +Entries within a list should be sorted by `labelId` in ascending unsigned order, with no +duplicate `labelId` values. + +> **Note:** Existing N5 datasets produced by imglib2-label-multisets may contain unsorted +> entry lists. Implementations SHOULD accept unsorted lists when decoding and SHOULD write +> sorted lists when encoding. + +### Encoding procedure + +1. Iterate voxels in chunk-linearization order. +2. For each voxel, serialize its entry list to bytes. +3. If an identical byte sequence already exists in `listData`, record its existing offset + in `listEntryOffsets`; otherwise append the byte sequence to `listData` and record the + new offset. +4. 
Write `listEntryOffsets` (uint32 LE each), then `listData`. + +### Decoding procedure + +1. Read `N` × uint32 LE values as `listEntryOffsets`. +2. Read the remaining bytes as `listData`. +3. For each voxel, locate its entry list in `listData` using the corresponding offset and + parse `numEntries` followed by the `(labelId, count)` pairs. +4. Compute the argmax for each voxel from its entry list (or reuse a cached argmax when the + same offset has already been decoded). + +### Null / all-empty chunks + +If all voxels in a chunk have empty entry lists (zero entries), an implementation MAY +represent the chunk as absent in the store (using the fill value mechanism) rather than +writing an explicit byte sequence. + +## Change log + +No changes yet. + +## Current maintainers + +* [Mark Kittisopikul](https://github.com/mkitti) diff --git a/codecs/label_multiset/schema.json b/codecs/label_multiset/schema.json new file mode 100644 index 0000000..115d4fe --- /dev/null +++ b/codecs/label_multiset/schema.json @@ -0,0 +1,20 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "oneOf": [ + { + "type": "object", + "properties": { + "name": { + "const": "label_multiset" + }, + "configuration": { + "type": "object", + "additionalProperties": false + } + }, + "required": ["name"], + "additionalProperties": false + }, + { "const": "label_multiset" } + ] +}
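The chunk layout and the encoding and decoding procedures described in the codec README can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `encode_chunk` and `decode_chunk` are hypothetical, a chunk is assumed to arrive as a plain list of entry lists already in chunk-linearization order, and each entry list is assumed to contain no duplicate label IDs. Argmax computation and bytes-to-bytes compression are left out.

```python
import struct


def encode_chunk(voxels):
    """Encode a chunk given as one entry list per voxel.

    Each entry list is a list of (label_id, count) pairs. Hypothetical helper,
    assuming duplicate label IDs have already been merged by the caller.
    """
    offsets = []
    list_data = bytearray()
    seen = {}  # serialized entry list -> byte offset into list_data

    for entries in voxels:
        # Label IDs are non-negative Python ints, so a plain sort gives
        # ascending unsigned order.
        entries = sorted(entries)
        blob = struct.pack("<I", len(entries))  # numEntries (uint32 LE)
        for label_id, count in entries:
            blob += struct.pack("<QI", label_id, count)  # uint64 LE, uint32 LE
        if blob not in seen:
            # First time this list is seen: append it and remember where.
            seen[blob] = len(list_data)
            list_data += blob
        offsets.append(seen[blob])

    # listEntryOffsets (uint32 LE each) followed by listData.
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(list_data)


def decode_chunk(data, n):
    """Decode n voxels from an encoded chunk into a list of entry lists."""
    offsets = struct.unpack_from(f"<{n}I", data, 0)
    list_data = memoryview(data)[4 * n:]
    cache = {}  # offset -> parsed entry list, so shared lists are parsed once
    voxels = []
    for off in offsets:
        if off not in cache:
            (num_entries,) = struct.unpack_from("<I", list_data, off)
            cache[off] = [
                struct.unpack_from("<QI", list_data, off + 4 + 12 * i)
                for i in range(num_entries)
            ]
        voxels.append(cache[off])
    return voxels
```

For a four-voxel chunk `[[(0, 1)], [(0, 1)], [(7, 3), (0, 5)], [(0, 1)]]`, this sketch produces 60 bytes: 16 bytes of offsets, one shared 16-byte list for the three background voxels, and one 28-byte list for the two-entry voxel. Keying the deduplication table on the serialized bytes mirrors step 3 of the encoding procedure, which compares byte sequences rather than logical multisets.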