From c52cc638257aff1a75b4d1d6d21da82c8c7412b5 Mon Sep 17 00:00:00 2001
From: Mark Kittisopikul <kittisopikulm@janelia.hhmi.org>
Date: Mon, 13 Apr 2026 14:52:57 -0400
Subject: [PATCH 1/2] Add label_multiset data type

Defines a variable-width Zarr data type for label multisets, where
each voxel holds a multiset of (uint64 labelId, uint32 count) pairs.
Used by imglib2-label-multisets and the Paintera annotation tool for
volumetric segmentation with multi-resolution downscaling support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 data-types/label_multiset/README.md   | 173 ++++++++++++++++++++++++++
 data-types/label_multiset/schema.json |  20 +++
 2 files changed, 193 insertions(+)
 create mode 100644 data-types/label_multiset/README.md
 create mode 100644 data-types/label_multiset/schema.json
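
Note on the element, argmax, and fill-value rules in the README added below: the
following is a minimal Python sketch of those rules, not a reference implementation.
The function names (`canonicalize`, `argmax`, `fill_value_to_entries`) are hypothetical
and chosen for illustration only; entries are assumed to be plain `(labelId, count)`
tuples of Python integers, so comparisons are naturally unsigned.

```python
INVALID = 0xFFFFFFFFFFFFFFFE  # reserved label ID, taken from the README table


def canonicalize(entries):
    """Return entries sorted by labelId, summing counts of duplicate label IDs."""
    merged = {}
    for label_id, count in entries:
        merged[label_id] = merged.get(label_id, 0) + count
    return sorted(merged.items())


def argmax(entries):
    """Label ID with the highest count; ties go to the smaller label ID."""
    best, best_count = INVALID, 0
    for label_id, count in entries:
        if count > best_count or (count == best_count and label_id < best):
            best, best_count = label_id, count
    return best


def fill_value_to_entries(fill_value):
    """Interpret a hexadecimal fill_value string as a singleton multiset."""
    return [(int(fill_value, 16), 1)]
```

An empty list yields `INVALID`, matching the README's rule for empty multisets, and
`canonicalize` is one way to satisfy the "sorted, no duplicates" recommendation before
writing.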

diff --git a/data-types/label_multiset/README.md b/data-types/label_multiset/README.md
new file mode 100644
index 0000000..f46797c
--- /dev/null
+++ b/data-types/label_multiset/README.md
@@ -0,0 +1,173 @@
+# label_multiset data type
+
+Defines a variable-width data type for label multisets, where each array element holds a
+multiset of label IDs, each with a non-negative integer count. This data type is used by
+[imglib2-label-multisets](https://github.com/saalfeldlab/imglib2-label-multisets) for
+volumetric segmentation data, particularly in the
+[Paintera](https://github.com/saalfeldlab/paintera) connectome annotation tool.
+
+## Background
+
+A label multiset voxel stores a multiset of label IDs, each carrying a non-negative integer
+count representing the number of occurrences of that label. At full resolution, every voxel
+is typically a singleton — one label with count 1. After downsampling, a voxel may represent
+many labels aggregated from higher-resolution voxels, with counts recording how many
+sub-voxels carried each label.
+
+## Data type representation
+
+### Name
+
+The name of this data type is the string `"label_multiset"`.
+
+### Configuration
+
+No configuration is required or permitted for this data type.
+
+## Element structure
+
+Each array element is a list of `(labelId, count)` pairs:
+
+| Field     | Type   | Description |
+|-----------|--------|-------------|
+| `labelId` | uint64 | Label identifier |
+| `count`   | uint32 | Number of occurrences |
+
+The list should be sorted by `labelId` in ascending unsigned order; duplicate label IDs
+must not appear (their counts must be summed). An empty list (zero pairs) is valid and
+represents a voxel with no label information.
+
+## Reserved label IDs
+
+Five label ID values are reserved: zero, plus four at the top of the unsigned 64-bit range:
+
+| Name          | uint64 value (hex)     | int64 value | Meaning |
+|---------------|------------------------|-------------|---------|
+| `BACKGROUND`  | `0x0000000000000000`   | `0`         | Background label |
+| `MAX_ID`      | `0xFFFFFFFFFFFFFFFC`   | `-4`        | Largest usable regular label ID |
+| `OUTSIDE`     | `0xFFFFFFFFFFFFFFFD`   | `-3`        | Voxel is outside the dataset bounds |
+| `INVALID`     | `0xFFFFFFFFFFFFFFFE`   | `-2`        | Uninitialized / no data |
+| `TRANSPARENT` | `0xFFFFFFFFFFFFFFFF`   | `-1`        | Fully transparent (display hint) |
+
+A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤
+`0xFFFFFFFFFFFFFFFC`).
+
+## ArgMax
+
+The **argmax** of a voxel's multiset is the label ID with the highest count. Ties are
+broken by the smaller label ID (unsigned comparison). If the multiset is empty, the argmax
+is `INVALID` (`0xFFFFFFFFFFFFFFFE`).
+
+```
+argmax := INVALID
+maxCount := 0
+for each (labelId, count) in entries:
+    if count > maxCount or (count == maxCount and labelId < argmax):
+        argmax := labelId
+        maxCount := count
+```
+
+The argmax is useful as a scalar integer projection of the multiset for visualization and
+interoperability with single-label data consumers.
+
+## Fill value representation
+
+The `fill_value` field in array metadata must be a JSON string containing the hexadecimal
+representation of a uint64 label ID. The fill element is the singleton multiset that
+holds this label ID with count 1:
+
+- `"0xFFFFFFFFFFFFFFFE"` — singleton `{INVALID → 1}` (canonical fill value)
+- `"0x0000000000000000"` — singleton `{BACKGROUND → 1}`
+
+## Codec compatibility
+
+This data type must be used with exactly one array-to-bytes codec from the following:
+
+- [`"label_multiset"`](../../codecs/label_multiset/README.md): Zarr v3 native
+  serialization (all little-endian). Recommended for new arrays.
+- [`"n5_varlen"`](../../codecs/n5_varlen/README.md): N5 varlength block format,
+  for interoperability with existing N5-based label multiset datasets. The first inner
+  codec inside `n5_varlen` must be
+  [`"n5_label_multiset"`](../../codecs/n5_label_multiset/README.md).
+
+Optional bytes-to-bytes codecs (e.g., `gzip`, `blosc`, `zstd`) may follow the
+array-to-bytes codec (or be placed inside `n5_varlen`'s inner codec chain).
+
+## Array metadata example
+
+For a new Zarr v3 label multiset array:
+
+```json
+{
+    "zarr_format": 3,
+    "node_type": "array",
+    "shape": [80, 64, 64],
+    "data_type": "label_multiset",
+    "chunk_grid": {
+        "name": "regular",
+        "configuration": {
+            "chunk_shape": [32, 32, 32]
+        }
+    },
+    "chunk_key_encoding": {"name": "default"},
+    "fill_value": "0xFFFFFFFFFFFFFFFE",
+    "codecs": [
+        {"name": "label_multiset"},
+        {"name": "gzip", "configuration": {"level": 6}}
+    ],
+    "attributes": {
+        "label_multisets": true,
+        "maxId": 99
+    }
+}
+```
+
+For reading an existing N5 label multiset dataset:
+
+```json
+{
+    "zarr_format": 3,
+    "node_type": "array",
+    "shape": [80, 64, 64],
+    "data_type": "label_multiset",
+    "chunk_grid": {
+        "name": "regular",
+        "configuration": {
+            "chunk_shape": [32, 32, 32]
+        }
+    },
+    "chunk_key_encoding": {
+        "name": "v2",
+        "configuration": {"separator": "/"}
+    },
+    "fill_value": "0xFFFFFFFFFFFFFFFE",
+    "codecs": [
+        {
+            "name": "n5_varlen",
+            "configuration": {
+                "codecs": [
+                    {"name": "n5_label_multiset"}
+                ]
+            }
+        }
+    ],
+    "attributes": {
+        "label_multisets": true
+    }
+}
+```
+
+## Multiresolution
+
+Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group,
+compatible with the [OME-Zarr multiscales
+specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates
+the entry lists from its higher-resolution children, summing counts for matching label IDs.
+
+## Change log
+
+No changes yet.
+
+## Current maintainers
+
+* [Mark Kittisopikul](https://github.com/mkitti)
diff --git a/data-types/label_multiset/schema.json b/data-types/label_multiset/schema.json
new file mode 100644
index 0000000..115d4fe
--- /dev/null
+++ b/data-types/label_multiset/schema.json
@@ -0,0 +1,20 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "oneOf": [
+    {
+      "type": "object",
+      "properties": {
+        "name": {
+          "const": "label_multiset"
+        },
+        "configuration": {
+          "type": "object",
+          "additionalProperties": false
+        }
+      },
+      "required": ["name"],
+      "additionalProperties": false
+    },
+    { "const": "label_multiset" }
+  ]
+}

From cf9eca3149a838fef728d25afa7f72cf1a4e2c3e Mon Sep 17 00:00:00 2001
From: Mark Kittisopikul <kittisopikulm@janelia.hhmi.org>
Date: Mon, 13 Apr 2026 14:53:30 -0400
Subject: [PATCH 2/2] Add label_multiset codec for Zarr-native label multiset
 serialization

Defines an array-to-bytes codec for the label_multiset data type using
an all-little-endian layout: listEntryOffsets[N] (uint32 LE) followed
by listData (LE). Exploits per-chunk list deduplication for efficient
storage of uniform regions. For N5 interoperability use n5_label_multiset
inside n5_varlen instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 codecs/label_multiset/README.md   | 116 ++++++++++++++++++++++++++++++
 codecs/label_multiset/schema.json |  20 ++++++
 2 files changed, 136 insertions(+)
 create mode 100644 codecs/label_multiset/README.md
 create mode 100644 codecs/label_multiset/schema.json
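
Note on the chunk layout and the encoding/decoding procedures in the README added
below: the following is a minimal Python sketch of that layout, under stated
assumptions. The names `encode_chunk` and `decode_chunk` are hypothetical, each voxel is
assumed to be a list of `(labelId, count)` tuples already in chunk-linearization order,
and no bounds or overflow checking is attempted.

```python
import struct


def encode_chunk(voxels):
    """Serialize voxels as listEntryOffsets (uint32 LE) followed by listData."""
    offsets = []
    list_data = bytearray()
    seen = {}  # serialized entry list -> byte offset into list_data
    for entries in voxels:
        blob = struct.pack("<I", len(entries))  # numEntries
        for label_id, count in entries:
            blob += struct.pack("<QI", label_id, count)  # 8 + 4 bytes, no padding
        if blob not in seen:  # deduplicate identical serialized lists
            seen[blob] = len(list_data)
            list_data += blob
        offsets.append(seen[blob])
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(list_data)


def decode_chunk(data, n):
    """Parse n voxels back into lists of (labelId, count) tuples."""
    offsets = struct.unpack_from(f"<{n}I", data, 0)
    base = 4 * n  # listData starts right after the offsets array
    voxels = []
    for off in offsets:
        (num_entries,) = struct.unpack_from("<I", data, base + off)
        start = base + off + 4
        voxels.append(
            [struct.unpack_from("<QI", data, start + 12 * i) for i in range(num_entries)]
        )
    return voxels
```

For four voxels `[[(0, 1)], [(0, 1)], [(3, 2), (9, 6)], []]`, the first two share offset
0, so the encoded size is `4·4 + 16 + 28 + 4 = 64` bytes.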

diff --git a/codecs/label_multiset/README.md b/codecs/label_multiset/README.md
new file mode 100644
index 0000000..f9f29d9
--- /dev/null
+++ b/codecs/label_multiset/README.md
@@ -0,0 +1,116 @@
+# label_multiset codec
+
+Defines an `array -> bytes` codec that serializes arrays of the
+[`label_multiset`](../../data-types/label_multiset/README.md) data type into a compact
+binary representation using the Zarr-native (all little-endian) format. The codec exploits
+per-chunk list deduplication: voxels sharing identical entry lists reference the same
+offset, so a uniformly labeled region (e.g., background) stores its list once per chunk.
+
+For N5 interoperability with existing imglib2-label-multisets datasets, use the
+[`n5_label_multiset`](../n5_label_multiset/README.md) codec inside
+[`n5_varlen`](../n5_varlen/README.md) instead.
+
+## Codec name
+
+The value of the `name` member in the codec object MUST be `label_multiset`.
+
+## Configuration parameters
+
+No configuration is required or permitted for this codec.
+
+## Compatibility
+
+This codec is only compatible with the
+[`"label_multiset"`](../../data-types/label_multiset/README.md) data type.
+
+## Example
+
+```json
+{
+    "data_type": "label_multiset",
+    "codecs": [{"name": "label_multiset"}]
+}
+```
+
+## Format and algorithm
+
+This is an `array -> bytes` codec. The chunk contains `N` voxels in the chunk-linearization
+order defined by the chunk grid. `N` equals the product of the chunk shape dimensions
+(partial boundary chunks are treated as padded to the full chunk shape for the purposes of
+this codec).
+
+### Chunk layout
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│  listEntryOffsets[0..N-1]  (uint32 each, little-endian)          │  4·N bytes
+├──────────────────────────────────────────────────────────────────┤
+│  listData  (variable, all little-endian)                         │  remaining bytes
+└──────────────────────────────────────────────────────────────────┘
+```
+
+Total encoded size: `4·N + listDataSize` bytes.
+
+### `listEntryOffsets` array
+
+One entry per voxel, in chunk-linearization (C / row-major) order. Each entry is an
+unsigned 32-bit little-endian byte offset into the `listData` region where that voxel's
+entry list begins. Multiple voxels may share the same offset (list deduplication).
+
+### `listData` region
+
+A concatenation of unique entry lists in the order they were first encountered during
+encoding. Each entry list has the following structure:
+
+```
+Offset  Size  Endian  Content
+------  ----  ------  -------------------------------------------
+0       4     LE      numEntries (uint32) — number of entries
+4       8     LE      entries[0].labelId (uint64)
+12      4     LE      entries[0].count   (uint32)
+16      8     LE      entries[1].labelId (uint64)
+24      4     LE      entries[1].count   (uint32)
+...
+```
+
+Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`)
+is valid and occupies exactly 4 bytes.
+
+Entries within a list should be sorted by `labelId` in ascending unsigned order, with no
+duplicate `labelId` values.
+
+> **Note:** Existing N5 datasets produced by imglib2-label-multisets may contain unsorted
+> entry lists. Implementations SHOULD accept unsorted lists when decoding and SHOULD write
+> sorted lists when encoding.
+
+### Encoding procedure
+
+1. Iterate voxels in chunk-linearization order.
+2. For each voxel, serialize its entry list to bytes.
+3. If an identical byte sequence already exists in `listData`, record its existing offset
+   in `listEntryOffsets`; otherwise append the byte sequence to `listData` and record the
+   new offset.
+4. Write `listEntryOffsets` (uint32 LE each), then `listData`.
+
+### Decoding procedure
+
+1. Read `N` × uint32 LE values as `listEntryOffsets`.
+2. Read the remaining bytes as `listData`.
+3. For each voxel, locate its entry list in `listData` using the corresponding offset and
+   parse `numEntries` followed by the `(labelId, count)` pairs.
+4. Compute the argmax for each voxel from its entry list, or reuse a cached argmax when
+   the same offset has already been processed.
+
+### Null / all-empty chunks
+
+If all voxels in a chunk have empty entry lists (zero entries), an implementation MAY
+represent the chunk as absent in the store (using the fill value mechanism) rather than
+writing an explicit byte sequence.
+
+## Change log
+
+No changes yet.
+
+## Current maintainers
+
+* [Mark Kittisopikul](https://github.com/mkitti)
diff --git a/codecs/label_multiset/schema.json b/codecs/label_multiset/schema.json
new file mode 100644
index 0000000..115d4fe
--- /dev/null
+++ b/codecs/label_multiset/schema.json
@@ -0,0 +1,20 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "oneOf": [
+    {
+      "type": "object",
+      "properties": {
+        "name": {
+          "const": "label_multiset"
+        },
+        "configuration": {
+          "type": "object",
+          "additionalProperties": false
+        }
+      },
+      "required": ["name"],
+      "additionalProperties": false
+    },
+    { "const": "label_multiset" }
+  ]
+}