116 changes: 116 additions & 0 deletions codecs/label_multiset/README.md
@@ -0,0 +1,116 @@
# label_multiset codec

Defines an `array -> bytes` codec that serializes arrays of the
[`label_multiset`](../../data-types/label_multiset/README.md) data type into a compact
binary representation using the Zarr-native (all little-endian) format. The codec exploits
per-chunk list deduplication: voxels sharing identical entry lists reference the same
offset, which compresses regions of uniform labeling (e.g., background) very efficiently.

For N5 interoperability with existing imglib2-label-multisets datasets, use the
[`n5_label_multiset`](../n5_label_multiset/README.md) codec inside
[`n5_varlen`](../n5_varlen/README.md) instead.

## Codec name

The value of the `name` member in the codec object MUST be `label_multiset`.

## Configuration parameters

No configuration is required or permitted for this codec.

## Compatibility

This codec is only compatible with the
[`"label_multiset"`](../../data-types/label_multiset/README.md) data type.

## Example

```json
{
  "data_type": "label_multiset",
  "codecs": [{"name": "label_multiset"}]
}
```

## Format and algorithm

This is an `array -> bytes` codec. The chunk contains `N` voxels in the chunk-linearization
order defined by the chunk grid. `N` equals the product of the chunk shape dimensions
(partial boundary chunks are treated as padded to the full chunk shape for the purposes of
this codec).

### Chunk layout

```
┌──────────────────────────────────────────────────────────────────┐
│ listEntryOffsets[0..N-1] (uint32 each, little-endian) │ 4·N bytes
├──────────────────────────────────────────────────────────────────┤
│ listData (variable, all little-endian) │ remaining bytes
└──────────────────────────────────────────────────────────────────┘
```

Total encoded size: `4·N + listDataSize` bytes.
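
As a rough worked example of the size formula, the sketch below assumes each unique entry list occupies `4 + 12·numEntries` bytes, as stated in the `listData` section. `encoded_size` is a hypothetical helper introduced only for illustration.

```python
def encoded_size(chunk_shape, unique_list_entry_counts):
    """Return 4*N offset bytes plus the bytes of every unique entry list.

    unique_list_entry_counts holds numEntries for each distinct list
    stored in listData (hypothetical input, for illustration only).
    """
    n_voxels = 1
    for dim in chunk_shape:
        n_voxels *= dim
    list_data_size = sum(4 + 12 * k for k in unique_list_entry_counts)
    return 4 * n_voxels + list_data_size


# A 32x32x32 chunk in which every voxel shares one single-entry list:
# 4 * 32768 offset bytes + one 16-byte list.
uniform_size = encoded_size((32, 32, 32), [1])
```

In this uniform case the offset table dominates the encoded size, which is why a following bytes-to-bytes compressor can still shrink such chunks considerably.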

### `listEntryOffsets` array

One entry per voxel, in chunk-linearization (C / row-major) order. Each entry is an
unsigned 32-bit little-endian byte offset into the `listData` region where that voxel's
entry list begins. Multiple voxels may share the same offset (list deduplication).

### `listData` region

A concatenation of unique entry lists in the order they were first encountered during
encoding. Each entry list has the following structure:

```
Offset  Size  Endian  Content
------  ----  ------  ----------------------------------------
0       4     LE      numEntries (uint32): number of entries
4       8     LE      entries[0].labelId (uint64)
12      4     LE      entries[0].count (uint32)
16      8     LE      entries[1].labelId (uint64)
24      4     LE      entries[1].count (uint32)
...
```

Each entry list occupies `4 + 12·numEntries` bytes. An empty entry list (`numEntries = 0`)
is valid and occupies exactly 4 bytes.
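
The entry list layout can be sketched with Python's standard `struct` module. This is a minimal illustration, not a reference implementation; `pack_entry_list` and `unpack_entry_list` are hypothetical names, and entries are assumed to be `(labelId, count)` tuples of non-negative Python integers.

```python
import struct


def pack_entry_list(entries):
    """Serialize [(label_id, count), ...] as numEntries (uint32 LE)
    followed by (uint64 LE, uint32 LE) pairs with no padding."""
    out = struct.pack("<I", len(entries))
    for label_id, count in entries:
        out += struct.pack("<QI", label_id, count)
    return out


def unpack_entry_list(buf, offset=0):
    """Parse one entry list starting at byte `offset` of `buf`."""
    (num_entries,) = struct.unpack_from("<I", buf, offset)
    return [
        struct.unpack_from("<QI", buf, offset + 4 + 12 * i)
        for i in range(num_entries)
    ]


packed = pack_entry_list([(7, 3), (9, 1)])
```

The `<` prefix selects little-endian byte order with standard sizes and no alignment padding, so each `(labelId, count)` pair takes exactly 12 bytes.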

Entries within a list should be sorted by `labelId` in ascending unsigned order, with no

Review comment (Contributor):

SHOULD be sorted, but I guess MUST not have duplicates?

duplicate `labelId` values.

> **Note:** Existing N5 datasets produced by imglib2-label-multisets may contain unsorted
> entry lists. Implementations SHOULD accept unsorted lists when decoding and SHOULD write
> sorted lists when encoding.

### Encoding procedure

1. Iterate voxels in chunk-linearization order.
2. For each voxel, serialize its entry list to bytes.
3. If an identical byte sequence already exists in `listData`, record its existing offset
in `listEntryOffsets`; otherwise append the byte sequence to `listData` and record the
new offset.
4. Write `listEntryOffsets` (uint32 LE each), then `listData`.
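
The encoding steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: `encode_chunk` is a hypothetical name, voxels are given as Python lists of `(labelId, count)` tuples already in chunk-linearization order, and deduplication uses a dictionary keyed on each list's serialized bytes rather than searching `listData` directly.

```python
import struct


def encode_chunk(voxels):
    """Encode a list of entry lists into listEntryOffsets + listData."""
    offsets = []
    list_data = bytearray()
    seen = {}  # serialized entry list -> byte offset into list_data
    for entries in voxels:
        blob = struct.pack("<I", len(entries)) + b"".join(
            struct.pack("<QI", label_id, count) for label_id, count in entries
        )
        if blob not in seen:
            # First occurrence: append to listData and remember its offset.
            seen[blob] = len(list_data)
            list_data += blob
        offsets.append(seen[blob])
    return struct.pack(f"<{len(offsets)}I", *offsets) + bytes(list_data)


voxels = [[(0, 1)], [(0, 1)], [(5, 2), (7, 1)], []]
encoded = encode_chunk(voxels)
```

Here the first two voxels share offset 0, so their 16-byte list is stored only once. The sketch does not check that offsets fit in a uint32; `struct.pack` raises an error if one does not.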

### Decoding procedure

1. Read `N` × uint32 LE values as `listEntryOffsets`.
2. Read the remaining bytes as `listData`.
3. For each voxel, locate its entry list in `listData` using the corresponding offset and
parse `numEntries` followed by the `(labelId, count)` pairs.
4. Compute the argmax for each voxel from its entry list. For repeated offsets, an
implementation may reuse a cached argmax instead of recomputing it.
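
Steps 1 to 3 of the decoding procedure can be sketched as below; the argmax step is left out. `decode_chunk` is a hypothetical name, and the caller is assumed to know `N` (here `n_voxels`) from the chunk shape. Parsed lists are cached per offset so that shared lists are parsed only once.

```python
import struct


def decode_chunk(buf, n_voxels):
    """Decode listEntryOffsets + listData into one entry list per voxel."""
    offsets = struct.unpack_from(f"<{n_voxels}I", buf, 0)
    list_data = memoryview(buf)[4 * n_voxels:]
    cache = {}  # offset -> parsed entry list
    voxels = []
    for off in offsets:
        if off not in cache:
            (num_entries,) = struct.unpack_from("<I", list_data, off)
            cache[off] = [
                struct.unpack_from("<QI", list_data, off + 4 + 12 * i)
                for i in range(num_entries)
            ]
        voxels.append(cache[off])
    return voxels


# Hand-built example: three voxels, the first two sharing one list.
example = (
    struct.pack("<3I", 0, 0, 16)
    + struct.pack("<IQI", 1, 0, 1)
    + struct.pack("<IQIQI", 2, 5, 2, 7, 1)
)
decoded = decode_chunk(example, 3)
```

Voxels that share an offset receive the same list object in this sketch, so a caller that mutates entry lists would need to copy them first.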

### Null / all-empty chunks

If all voxels in a chunk have empty entry lists (zero entries), an implementation MAY
represent the chunk as absent in the store (using the fill value mechanism) rather than
writing an explicit byte sequence.

## Change log

No changes yet.

## Current maintainers

* [Mark Kittisopikul](https://github.com/mkitti)
20 changes: 20 additions & 0 deletions codecs/label_multiset/schema.json
@@ -0,0 +1,20 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "label_multiset"
        },
        "configuration": {
          "type": "object",
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    { "const": "label_multiset" }
  ]
}
173 changes: 173 additions & 0 deletions data-types/label_multiset/README.md
@@ -0,0 +1,173 @@
# label_multiset data type

Review comment (Contributor):

This seems like it could be described as a use case and conventions on top of a generic varlen/ list data type on top of a struct.


Defines a variable-width data type for label multisets, where each array element holds a
multiset of label IDs, each with a non-negative integer count. This data type is used by
[imglib2-label-multisets](https://github.com/saalfeldlab/imglib2-label-multisets) for
volumetric segmentation data, particularly in the
[Paintera](https://github.com/saalfeldlab/paintera) connectome annotation tool.

## Background

A label multiset voxel stores a multiset of label IDs, each carrying a non-negative integer
count representing the number of occurrences of that label. At full resolution, every voxel
is typically a singleton — one label with count 1. After downsampling, a voxel may represent
many labels aggregated from higher-resolution voxels, with counts recording how many
sub-voxels carried each label.

## Data type representation

### Name

The name of this data type is the string `"label_multiset"`.

### Configuration

No configuration is required or permitted for this data type.

## Element structure

Each array element is a list of `(labelId, count)` pairs:

| Field | Type | Description |
|-----------|--------|-------------|
| `labelId` | uint64 | Label identifier |
| `count` | uint32 | Number of occurrences |

The list should be sorted by `labelId` in ascending unsigned order; duplicate label IDs
must not appear (their counts must be summed). An empty list (zero pairs) is valid and
represents a voxel with no label information.
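
One way to bring an arbitrary list of pairs into this form is sketched below. `normalize_entries` is a hypothetical helper; it assumes label IDs are non-negative Python integers, so ordinary integer ordering matches unsigned ordering, and it does not check that summed counts still fit in a uint32.

```python
from collections import Counter


def normalize_entries(pairs):
    """Sum the counts of duplicate label IDs and sort by label ID."""
    totals = Counter()
    for label_id, count in pairs:
        totals[label_id] += count
    return sorted(totals.items())


normalized = normalize_entries([(9, 1), (3, 2), (9, 4)])
```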

## Reserved label IDs

Five label ID values have reserved meanings: `BACKGROUND` is zero, and the other four sit
at the top of the unsigned 64-bit range:

Review comment (Contributor):

BACKGROUND isn't at the top of the range


| Name | uint64 value (hex) | int64 value | Meaning |
|---------------|------------------------|-------------|---------|
| `BACKGROUND` | `0x0000000000000000` | `0` | Background label |
| `MAX_ID` | `0xFFFFFFFFFFFFFFFC` | `-4` | Largest usable regular label ID |
| `OUTSIDE` | `0xFFFFFFFFFFFFFFFD` | `-3` | Voxel is outside the dataset bounds |
| `INVALID` | `0xFFFFFFFFFFFFFFFE` | `-2` | Uninitialized / no data |
| `TRANSPARENT` | `0xFFFFFFFFFFFFFFFF` | `-1` | Fully transparent (display hint) |

A label ID is *regular* if it is ≤ `MAX_ID` as an unsigned integer (i.e., ≤
`0xFFFFFFFFFFFFFFFC`).
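
The table and the *regular* test can be expressed as a short sketch. The constant values come from the table above; `is_regular` and `as_int64` are hypothetical helper names, with `as_int64` showing how the unsigned values map to the signed column by two's complement.

```python
BACKGROUND = 0x0000000000000000
MAX_ID = 0xFFFFFFFFFFFFFFFC
OUTSIDE = 0xFFFFFFFFFFFFFFFD
INVALID = 0xFFFFFFFFFFFFFFFE
TRANSPARENT = 0xFFFFFFFFFFFFFFFF


def is_regular(label_id):
    """A label ID is regular if it is <= MAX_ID as an unsigned integer."""
    return 0 <= label_id <= MAX_ID


def as_int64(label_id):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit value."""
    return label_id - (1 << 64) if label_id >= (1 << 63) else label_id
```

Languages without an unsigned 64-bit type (Java, for example) see these IDs as the negative values in the int64 column, so comparisons there need an unsigned comparison routine.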

## ArgMax

Review comment (Contributor):

This isn't really part of the data type, nor is it encoded; it seems like it's a convention in one type of processing for it. Maybe this could be in a "usage patterns" section, possibly along with the reserved label IDs above?


The **argmax** of a voxel's multiset is the label ID with the highest count. Ties are
broken by the smaller label ID (unsigned comparison). If the multiset is empty, the argmax
is `INVALID` (`0xFFFFFFFFFFFFFFFE`).

```
argmax := INVALID
maxCount := 0
for each (labelId, count) in entries:
    if count > maxCount or (count == maxCount and labelId < argmax):
        argmax := labelId
        maxCount := count
```

The argmax is useful as a scalar integer projection of the multiset for visualization and
interoperability with single-label data consumers.
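
A direct Python transcription of the pseudocode above might look like this. It is a sketch only: the function name `argmax` and the tuple-based entry representation are assumptions, and label IDs are taken to be non-negative Python integers so that `<` is an unsigned comparison.

```python
INVALID = 0xFFFFFFFFFFFFFFFE


def argmax(entries):
    """Return the label ID with the highest count; ties go to the
    smaller label ID, and an empty list yields INVALID."""
    best_id, best_count = INVALID, 0
    for label_id, count in entries:
        if count > best_count or (count == best_count and label_id < best_id):
            best_id, best_count = label_id, count
    return best_id
```

Because ties are resolved by comparing label IDs, the result does not depend on the order of entries in the list, which matters when decoding unsorted lists.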

## Fill value representation

The `fill_value` field in array metadata must be a JSON string containing the hexadecimal
representation of a uint64 label ID. This label ID represents the sole element of a
singleton multiset with count 1:

- `"0xFFFFFFFFFFFFFFFE"` — singleton `{INVALID → 1}` (canonical fill value)
- `"0x0000000000000000"` — singleton `{BACKGROUND → 1}`
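
A reader might turn the `fill_value` string into an entry list roughly as follows. `fill_value_to_entries` is a hypothetical helper, and the strictness of the validation shown here is an assumption rather than something this document specifies.

```python
def fill_value_to_entries(fill_value):
    """Parse a hexadecimal fill_value string into a singleton entry list."""
    if not isinstance(fill_value, str):
        raise TypeError("fill_value must be a JSON string")
    label_id = int(fill_value, 16)  # accepts an optional "0x" prefix
    if not 0 <= label_id <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError("fill_value does not fit in a uint64")
    return [(label_id, 1)]


canonical = fill_value_to_entries("0xFFFFFFFFFFFFFFFE")
```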

## Codec compatibility

This data type must be used with exactly one array-to-bytes codec from the following:

- [`"label_multiset"`](../../codecs/label_multiset/README.md): Zarr v3 native
serialization (all little-endian). Recommended for new arrays.
- [`"n5_varlen"`](../../codecs/n5_varlen/README.md): N5 varlength block format,
for interoperability with existing N5-based label multiset datasets. The first inner
codec inside `n5_varlen` must be
[`"n5_label_multiset"`](../../codecs/n5_label_multiset/README.md).

Optional bytes-to-bytes codecs (e.g., `gzip`, `blosc`, `zstd`) may follow the
array-to-bytes codec (or be placed inside `n5_varlen`'s inner codec chain).

## Array metadata example

For a new Zarr v3 label multiset array:

```json
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [80, 64, 64],
  "data_type": "label_multiset",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [32, 32, 32]
    }
  },
  "chunk_key_encoding": {"name": "default"},
  "fill_value": "0xFFFFFFFFFFFFFFFE",
  "codecs": [
    {"name": "label_multiset"},
    {"name": "gzip", "configuration": {"level": 6}}
  ],
  "attributes": {
    "label_multisets": true,
    "maxId": 99
  }
}
```

For reading an existing N5 label multiset dataset:

```json
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [80, 64, 64],
  "data_type": "label_multiset",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [32, 32, 32]
    }
  },
  "chunk_key_encoding": {
    "name": "v2",
    "configuration": {"separator": "/"}
  },
  "fill_value": "0xFFFFFFFFFFFFFFFE",
  "codecs": [
    {
      "name": "n5_varlen",
      "configuration": {
        "codecs": [
          {"name": "n5_label_multiset"}
        ]
      }
    }
  ],
  "attributes": {
    "label_multisets": true
  }
}
```

## Multiresolution

Downscaled resolution levels are stored as separate Zarr arrays within a multiscale group,
compatible with the [OME-Zarr multiscales
specification](https://ngff.openmicroscopy.org/latest/). Each downscaled voxel aggregates
the entry lists from its higher-resolution children, summing counts for matching label IDs.
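
The aggregation for a single downscaled voxel can be sketched as below. `downsample_voxel` is a hypothetical name; how children are gathered (for example, a 2×2×2 neighborhood) is left to the caller, and the sketch does not guard against summed counts exceeding the uint32 range.

```python
from collections import Counter


def downsample_voxel(children):
    """Merge the entry lists of higher-resolution child voxels,
    summing counts for matching label IDs."""
    totals = Counter()
    for entries in children:
        for label_id, count in entries:
            totals[label_id] += count
    return sorted(totals.items())


# Eight full-resolution singleton voxels: five carry label 1, three label 2.
children = [[(1, 1)]] * 5 + [[(2, 1)]] * 3
merged = downsample_voxel(children)
```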

Review comment (Contributor) on lines +160 to +165:

This could be presented as a non-normative recommendation so it's not strictly tied to OME-Zarr. There is another proposal for multiscales from the geo community https://github.com/zarr-conventions/multiscales


## Change log

No changes yet.

## Current maintainers

* [Mark Kittisopikul](https://github.com/mkitti)
20 changes: 20 additions & 0 deletions data-types/label_multiset/schema.json
@@ -0,0 +1,20 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "label_multiset"
        },
        "configuration": {
          "type": "object",
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    { "const": "label_multiset" }
  ]
}