Add label_multiset data type and Zarr-native codec #55
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,116 @@ | ||
| # label_multiset codec | ||
|
|
||
| Defines an `array -> bytes` codec that serializes arrays of the | ||
| [`label_multiset`](../../data-types/label_multiset/README.md) data type into a compact | ||
| binary representation using the Zarr-native (all little-endian) format. The codec exploits | ||
| per-chunk list deduplication: voxels sharing identical entry lists reference the same | ||
| offset, which compresses regions of uniform labeling (e.g., background) very efficiently. | ||
|
|
||
| For N5 interoperability with existing imglib2-label-multisets datasets, use the | ||
| [`n5_label_multiset`](../n5_label_multiset/README.md) codec inside | ||
| [`n5_varlen`](../n5_varlen/README.md) instead. | ||
|
|
||
| ## Codec name | ||
|
|
||
| The value of the `name` member in the codec object MUST be `label_multiset`. | ||
|
|
||
| ## Configuration parameters | ||
|
|
||
| No configuration is required or permitted for this codec. | ||
|
|
||
| ## Compatibility | ||
|
|
||
| This codec is only compatible with the | ||
| [`"label_multiset"`](../../data-types/label_multiset/README.md) data type. | ||
|
|
||
| ## Example | ||
|
|
||
| ```json | ||
| { | ||
| "data_type": "label_multiset", | ||
| "codecs": [{"name": "label_multiset"}] | ||
| } | ||
| ``` | ||
|
|
||
| ## Format and algorithm | ||
|
|
||
| This is an `array -> bytes` codec. The chunk contains `N` voxels in the chunk-linearization | ||
| order defined by the chunk grid. `N` equals the product of the chunk shape dimensions | ||
| (partial boundary chunks are treated as padded to the full chunk shape for the purposes of | ||
| this codec). | ||
|
|
||
| ### Chunk layout | ||
|
|
||
| ``` | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ listEntryOffsets[0..N-1] (uint32 each, little-endian) │ 4·N bytes | ||
| ├──────────────────────────────────────────────────────────────────┤ | ||
| │ listData (variable, all little-endian) │ remaining bytes | ||
| └──────────────────────────────────────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| Total encoded size: `4·N + listDataSize` bytes. | ||
|
|
||
| ### `listEntryOffsets` array | ||
|
|
||
| One entry per voxel, in chunk-linearization (C / row-major) order. Each entry is an | ||
| unsigned 32-bit little-endian byte offset into the `listData` region where that voxel's | ||
| entry list begins. Multiple voxels may share the same offset (list deduplication). | ||
|
|
||
| ### `listData` region | ||
|
|
||
| A concatenation of unique entry lists in the order they were first encountered during | ||
| encoding. Each entry list has the following structure: | ||
|
|
||
| ``` | ||
| Offset Size Endian Content | ||
| ------ ---- ------ ------------------------------------------- | ||
| 0 4 LE numEntries (uint32) — number of entries | ||
| 4 8 LE entries[0].labelId (uint64) | ||
| 12 4 LE entries[0].count (uint32) | ||
| 16 8 LE entries[1].labelId (uint64) | ||
| 24 4 LE entries[1].count (uint32) | ||
| ... | ||
| ``` | ||
|
|
||
| Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`) | ||
| is valid and occupies exactly 4 bytes. | ||
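The entry-list layout above can be sketched in Python with the standard `struct` module. This is a minimal illustration, not part of the proposed spec text; the function names `pack_entry_list` and `unpack_entry_list` are hypothetical.

```python
import struct

def pack_entry_list(entries):
    """Serialize (labelId, count) pairs: uint32 numEntries, then 12-byte records.

    The "<" prefix selects little-endian with no padding, so each
    (uint64, uint32) record occupies exactly 12 bytes.
    """
    out = bytearray(struct.pack("<I", len(entries)))
    for label_id, count in entries:
        out += struct.pack("<QI", label_id, count)
    return bytes(out)

def unpack_entry_list(buf, offset=0):
    """Parse one entry list starting at byte `offset` of `buf`."""
    (num_entries,) = struct.unpack_from("<I", buf, offset)
    return [
        struct.unpack_from("<QI", buf, offset + 4 + 12 * i)
        for i in range(num_entries)
    ]
```

An empty list packs to the four zero bytes of `numEntries`, matching the 4-byte minimum stated above.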
|
|
||
| Entries within a list should be sorted by `labelId` in ascending unsigned order, with no | ||
| duplicate `labelId` values. | ||
|
|
||
| > **Note:** Existing N5 datasets produced by imglib2-label-multisets may contain unsorted | ||
| > entry lists. Implementations SHOULD accept unsorted lists when decoding and SHOULD write | ||
| > sorted lists when encoding. | ||
|
|
||
| ### Encoding procedure | ||
|
|
||
| 1. Iterate voxels in chunk-linearization order. | ||
| 2. For each voxel, serialize its entry list to bytes. | ||
| 3. If an identical byte sequence already exists in `listData`, record its existing offset | ||
| in `listEntryOffsets`; otherwise append the byte sequence to `listData` and record the | ||
| new offset. | ||
| 4. Write `listEntryOffsets` (uint32 LE each), then `listData`. | ||
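The encoding steps above might look roughly like the following Python sketch. It assumes each voxel is given as a list of `(labelId, count)` pairs; the name `encode_chunk` and the dictionary used to find previously written lists are illustrative choices, not something the proposal prescribes. The sketch also sorts each list before serializing, following the recommendation that encoders write sorted lists.

```python
import struct

def encode_chunk(voxels):
    """Encode voxels (in chunk-linearization order) into offsets + listData."""
    offsets = []
    list_data = bytearray()
    seen = {}  # serialized entry list -> byte offset within listData

    for entries in voxels:
        # Step 2: serialize the entry list (sorted by labelId).
        blob = struct.pack("<I", len(entries)) + b"".join(
            struct.pack("<QI", label_id, count)
            for label_id, count in sorted(entries)
        )
        # Step 3: reuse the offset of an identical list, else append.
        offset = seen.get(blob)
        if offset is None:
            offset = len(list_data)
            seen[blob] = offset
            list_data += blob
        offsets.append(offset)

    # Step 4: listEntryOffsets (uint32 LE each), then listData.
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(list_data)
```

Keying the lookup on the serialized bytes means two voxels share an offset only when their byte sequences are identical, which is the condition stated in step 3.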
|
|
||
| ### Decoding procedure | ||
|
|
||
| 1. Read `N` × uint32 LE values as `listEntryOffsets`. | ||
| 2. Read the remaining bytes as `listData`. | ||
| 3. For each voxel, locate its entry list in `listData` using the corresponding offset and | ||
| parse `numEntries` followed by the `(labelId, count)` pairs. | ||
| 4. Compute the argmax for each voxel from its entry list, reusing a cached argmax for | ||
| offsets that have already been processed. | ||
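As a rough sketch of decoding steps 1 to 3, the following Python function parses an encoded chunk back into per-voxel entry lists. The name `decode_chunk` and the per-offset cache are assumptions made for illustration; the argmax computation of step 4 is left out.

```python
import struct

def decode_chunk(buf, n_voxels):
    """Decode an encoded chunk into one entry list per voxel."""
    # Step 1: N uint32 little-endian offsets.
    offsets = struct.unpack_from(f"<{n_voxels}I", buf, 0)
    # Step 2: everything after the offsets is listData.
    list_data = memoryview(buf)[4 * n_voxels:]

    cache = {}  # offset -> parsed entry list, so shared lists are parsed once
    voxels = []
    for offset in offsets:
        if offset not in cache:
            # Step 3: numEntries, then (labelId, count) pairs.
            (num_entries,) = struct.unpack_from("<I", list_data, offset)
            cache[offset] = [
                struct.unpack_from("<QI", list_data, offset + 4 + 12 * i)
                for i in range(num_entries)
            ]
        voxels.append(cache[offset])
    return voxels
```

Voxels that share an offset receive the same list object here, so a caller that mutates the result would need to copy first.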
|
|
||
| ### Null / all-empty chunks | ||
|
|
||
| If all voxels in a chunk have empty entry lists (zero entries), an implementation MAY | ||
| represent the chunk as absent in the store (using the fill value mechanism) rather than | ||
| writing an explicit byte sequence. | ||
|
|
||
| ## Change log | ||
|
|
||
| No changes yet. | ||
|
|
||
| ## Current maintainers | ||
|
|
||
| * [Mark Kittisopikul](https://github.com/mkitti) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| { | ||
| "$schema": "https://json-schema.org/draft/2020-12/schema", | ||
| "oneOf": [ | ||
| { | ||
| "type": "object", | ||
| "properties": { | ||
| "name": { | ||
| "const": "label_multiset" | ||
| }, | ||
| "configuration": { | ||
| "type": "object", | ||
| "additionalProperties": false | ||
| } | ||
| }, | ||
| "required": ["name"], | ||
| "additionalProperties": false | ||
| }, | ||
| { "const": "label_multiset" } | ||
| ] | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,173 @@ | ||
| # label_multiset data type | ||
|
Contributor
This seems like it could be described as a use case and conventions on top of a generic varlen/ list data type on top of a struct.
||
|
|
||
| Defines a variable-width data type for label multisets, where each array element holds a | ||
| multiset of label IDs, each with a non-negative integer count. This data type is used by | ||
| [imglib2-label-multisets](https://github.com/saalfeldlab/imglib2-label-multisets) for | ||
| volumetric segmentation data, particularly in the | ||
| [Paintera](https://github.com/saalfeldlab/paintera) connectome annotation tool. | ||
|
|
||
| ## Background | ||
|
|
||
| A label multiset voxel stores a multiset of label IDs, each carrying a non-negative integer | ||
| count representing the number of occurrences of that label. At full resolution, every voxel | ||
| is typically a singleton — one label with count 1. After downsampling, a voxel may represent | ||
| many labels aggregated from higher-resolution voxels, with counts recording how many | ||
| sub-voxels carried each label. | ||
|
|
||
| ## Data type representation | ||
|
|
||
| ### Name | ||
|
|
||
| The name of this data type is the string `"label_multiset"`. | ||
|
|
||
| ### Configuration | ||
|
|
||
| No configuration is required or permitted for this data type. | ||
|
|
||
| ## Element structure | ||
|
|
||
| Each array element is a list of `(labelId, count)` pairs: | ||
|
|
||
| | Field | Type | Description | | ||
| |-----------|--------|-------------| | ||
| | `labelId` | uint64 | Label identifier | | ||
| | `count` | uint32 | Number of occurrences | | ||
|
|
||
| The list should be sorted by `labelId` in ascending unsigned order; duplicate label IDs | ||
| must not appear (their counts must be summed). An empty list (zero pairs) is valid and | ||
| represents a voxel with no label information. | ||
|
|
||
| ## Reserved label IDs | ||
|
|
||
| Five label ID values are reserved at the top of the unsigned 64-bit range: | ||
|
Contributor
BACKGROUND isn't at the top of the range
||
|
|
||
| | Name | uint64 value (hex) | int64 value | Meaning | | ||
| |---------------|------------------------|-------------|---------| | ||
| | `BACKGROUND` | `0x0000000000000000` | `0` | Background label | | ||
| | `MAX_ID` | `0xFFFFFFFFFFFFFFFC` | `-4` | Largest usable regular label ID | | ||
| | `OUTSIDE` | `0xFFFFFFFFFFFFFFFD` | `-3` | Voxel is outside the dataset bounds | | ||
| | `INVALID` | `0xFFFFFFFFFFFFFFFE` | `-2` | Uninitialized / no data | | ||
| | `TRANSPARENT` | `0xFFFFFFFFFFFFFFFF` | `-1` | Fully transparent (display hint) | | ||
|
|
||
| A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤ | ||
| `0xFFFFFFFFFFFFFFFC`). | ||
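To make the unsigned comparison and the two value columns of the table concrete, here is a small Python sketch. The helper names `is_regular` and `as_int64` are hypothetical, and label IDs are assumed to be held as non-negative Python integers.

```python
MAX_ID = 0xFFFFFFFFFFFFFFFC

def is_regular(label_id):
    """True if label_id is <= MAX_ID when compared as an unsigned 64-bit value."""
    return 0 <= label_id <= MAX_ID

def as_int64(label_id):
    """Reinterpret an unsigned 64-bit label ID as a signed 64-bit integer."""
    return label_id - (1 << 64) if label_id >= (1 << 63) else label_id
```

Under this definition `BACKGROUND` (0) and `MAX_ID` count as regular, while `OUTSIDE`, `INVALID`, and `TRANSPARENT` do not.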
|
|
||
| ## ArgMax | ||
|
Contributor
This isn't really part of the data type, nor is it encoded; it seems like it's a convention in one type of processing for it. Maybe this could be in a "usage patterns" section, possibly along with the reserved label IDs above?
||
|
|
||
| The **argmax** of a voxel's multiset is the label ID with the highest count. Ties are | ||
| broken by the smaller label ID (unsigned comparison). If the multiset is empty, the argmax | ||
| is `INVALID` (`0xFFFFFFFFFFFFFFFE`). | ||
|
|
||
| ``` | ||
| argmax := INVALID | ||
| maxCount := 0 | ||
| for each (labelId, count) in entries: | ||
| if count > maxCount or (count == maxCount and labelId < argmax): | ||
| argmax = labelId | ||
| maxCount = count | ||
| ``` | ||
|
|
||
| The argmax is useful as a scalar integer projection of the multiset for visualization and | ||
| interoperability with single-label data consumers. | ||
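The argmax pseudocode above translates almost line for line into Python. This is an illustrative transcription only; it assumes entries are `(labelId, count)` tuples with label IDs held as non-negative integers, so that ordinary integer comparison behaves as unsigned comparison.

```python
INVALID = 0xFFFFFFFFFFFFFFFE

def argmax(entries):
    """Return the label ID with the highest count; ties go to the smaller ID."""
    best = INVALID
    max_count = 0
    for label_id, count in entries:
        if count > max_count or (count == max_count and label_id < best):
            best = label_id
            max_count = count
    return best
```

For example, `argmax([(9, 4), (2, 4)])` yields `2` because the tie on count 4 is broken by the smaller label ID, and `argmax([])` yields `INVALID`.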
|
|
||
| ## Fill value representation | ||
|
|
||
| The `fill_value` field in array metadata must be a JSON string containing the hexadecimal | ||
| representation of a uint64 label ID. This label ID represents the sole element of a | ||
| singleton multiset with count 1: | ||
|
|
||
| - `"0xFFFFFFFFFFFFFFFE"` — singleton `{INVALID → 1}` (canonical fill value) | ||
| - `"0x0000000000000000"` — singleton `{BACKGROUND → 1}` | ||
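One way a reader could interpret this `fill_value` string is sketched below in Python. The function name `parse_fill_value` and the range check are assumptions for illustration; the proposal itself only states that the string is the hexadecimal form of a uint64 label ID.

```python
def parse_fill_value(fill_value):
    """Turn a hex string such as "0xFFFFFFFFFFFFFFFE" into a singleton multiset."""
    label_id = int(fill_value, 16)  # base 16 accepts an optional "0x" prefix
    if not 0 <= label_id <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError(f"fill_value out of uint64 range: {fill_value!r}")
    # A singleton multiset: the label ID with count 1.
    return [(label_id, 1)]
```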
|
|
||
| ## Codec compatibility | ||
|
|
||
| This data type must be used with exactly one array-to-bytes codec from the following: | ||
|
|
||
| - [`"label_multiset"`](../../codecs/label_multiset/README.md): Zarr v3 native | ||
| serialization (all little-endian). Recommended for new arrays. | ||
| - [`"n5_varlen"`](../../codecs/n5_varlen/README.md): N5 varlength block format, | ||
| for interoperability with existing N5-based label multiset datasets. The first inner | ||
| codec inside `n5_varlen` must be | ||
| [`"n5_label_multiset"`](../../codecs/n5_label_multiset/README.md). | ||
|
|
||
| Optional bytes-to-bytes codecs (e.g., `gzip`, `blosc`, `zstd`) may follow the | ||
| array-to-bytes codec (or be placed inside `n5_varlen`'s inner codec chain). | ||
|
|
||
| ## Array metadata example | ||
|
|
||
| For a new Zarr v3 label multiset array: | ||
|
|
||
| ```json | ||
| { | ||
| "zarr_format": 3, | ||
| "node_type": "array", | ||
| "shape": [80, 64, 64], | ||
| "data_type": "label_multiset", | ||
| "chunk_grid": { | ||
| "name": "regular", | ||
| "configuration": { | ||
| "chunk_shape": [32, 32, 32] | ||
| } | ||
| }, | ||
| "chunk_key_encoding": {"name": "default"}, | ||
| "fill_value": "0xFFFFFFFFFFFFFFFE", | ||
| "codecs": [ | ||
| {"name": "label_multiset"}, | ||
| {"name": "gzip", "configuration": {"level": 6}} | ||
| ], | ||
| "attributes": { | ||
| "label_multisets": true, | ||
| "maxId": 99 | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| For reading an existing N5 label multiset dataset: | ||
|
|
||
| ```json | ||
| { | ||
| "zarr_format": 3, | ||
| "node_type": "array", | ||
| "shape": [80, 64, 64], | ||
| "data_type": "label_multiset", | ||
| "chunk_grid": { | ||
| "name": "regular", | ||
| "configuration": { | ||
| "chunk_shape": [32, 32, 32] | ||
| } | ||
| }, | ||
| "chunk_key_encoding": { | ||
| "name": "v2", | ||
| "configuration": {"separator": "/"} | ||
| }, | ||
| "fill_value": "0xFFFFFFFFFFFFFFFE", | ||
| "codecs": [ | ||
| { | ||
| "name": "n5_varlen", | ||
| "configuration": { | ||
| "codecs": [ | ||
| {"name": "n5_label_multiset"} | ||
| ] | ||
| } | ||
| } | ||
| ], | ||
| "attributes": { | ||
| "label_multisets": true | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ## Multiresolution | ||
|
|
||
| Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group, | ||
| compatible with the [OME-Zarr multiscales | ||
| specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates | ||
| the entry lists from its higher-resolution children, summing counts for matching label IDs. | ||
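The aggregation described here, summing counts for matching label IDs across child voxels, can be sketched in Python as follows. The function name `aggregate` is hypothetical, and the sketch assumes the child entry lists have already been gathered for one downscaled voxel; how children are selected depends on the downsampling factors and is not shown.

```python
from collections import Counter

def aggregate(children):
    """Merge the entry lists of child voxels into one sorted entry list."""
    totals = Counter()
    for entries in children:
        for label_id, count in entries:
            totals[label_id] += count
    # Sort by labelId so the result follows the recommended ordering.
    return sorted(totals.items())
```

For instance, three singleton children labeled 1, 1, and 2 plus one empty child aggregate to `[(1, 2), (2, 1)]`.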
|
Comment on lines +160 to +165
Contributor
This could be presented as a non-normative recommendation so it's not strictly tied to OME-Zarr. There is another proposal for multiscales from the geo community https://github.com/zarr-conventions/multiscales
||
|
|
||
| ## Change log | ||
|
|
||
| No changes yet. | ||
|
|
||
| ## Current maintainers | ||
|
|
||
| * [Mark Kittisopikul](https://github.com/mkitti) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| { | ||
| "$schema": "https://json-schema.org/draft/2020-12/schema", | ||
| "oneOf": [ | ||
| { | ||
| "type": "object", | ||
| "properties": { | ||
| "name": { | ||
| "const": "label_multiset" | ||
| }, | ||
| "configuration": { | ||
| "type": "object", | ||
| "additionalProperties": false | ||
| } | ||
| }, | ||
| "required": ["name"], | ||
| "additionalProperties": false | ||
| }, | ||
| { "const": "label_multiset" } | ||
| ] | ||
| } |
SHOULD be sorted, but I guess MUST not have duplicates?