
Add n5_varlen codec for N5 varlength block format #54

Open
mkitti wants to merge 1 commit into zarr-developers:main from mkitti:mkitti-n5-varlen


mkitti commented Apr 13, 2026

Summary

Registers the n5_varlen codec, the varlength-mode counterpart to the existing n5_default codec.

  • Handles N5 blocks with mode = 0x0001 (varlength)
  • Wraps an inner codec pipeline (array-to-bytes + optional bytes-to-bytes) with the N5 varlength block header
  • Intended for variable-width data types where the per-chunk byte size cannot be derived from the chunk shape alone (e.g. label_multiset)

N5 varlength header format

Offset    Size    Endian  Field
------    ------  ------  ----------------------------------------
0         2       BE      mode      uint16  = 0x0001 (varlength)
2         2       BE      ndim      uint16  number of dimensions
4         4·ndim  BE      dims[]    uint32 each  block shape
4+4·ndim  4       BE      num_bytes uint32  byte count of payload
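
The layout in the table can be read and written with Python's `struct` module. The following is a minimal sketch, not the codec's implementation: the function names `pack_varlen_header` and `parse_varlen_header` are hypothetical, and the final field is treated as `num_bytes` exactly as the table describes it.

```python
import struct

MODE_VARLENGTH = 0x0001  # mode value from the table above


def pack_varlen_header(shape, num_bytes):
    """Build the header: mode, ndim, dims[], num_bytes, all big-endian."""
    ndim = len(shape)
    return struct.pack(f">HH{ndim}II", MODE_VARLENGTH, ndim, *shape, num_bytes)


def parse_varlen_header(buf):
    """Return (shape, num_bytes, header_size) for a varlength block."""
    mode, ndim = struct.unpack_from(">HH", buf, 0)
    if mode != MODE_VARLENGTH:
        raise ValueError(f"expected mode 0x0001, got {mode:#06x}")
    # dims[] starts at offset 4; num_bytes follows at offset 4 + 4*ndim.
    shape = struct.unpack_from(f">{ndim}I", buf, 4)
    (num_bytes,) = struct.unpack_from(">I", buf, 4 + 4 * ndim)
    return shape, num_bytes, 8 + 4 * ndim
```

For a three-dimensional block the header occupies 8 + 4·3 = 20 bytes, and the payload starts immediately after it.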

Codec chain (example with label_multiset)

{
  "name": "n5_varlen",
  "configuration": {
    "codecs": [
      {"name": "n5_label_multiset"},
      {"name": "gzip", "configuration": {"level": 6}}
    ]
  }
}

Test plan

  • Confirm codecs/n5_varlen/schema.json validates correctly
  • Verify the annotated binary layout against a real N5 varlength block (see README)

🤖 Generated with Claude Code

Defines an array-to-bytes codec that wraps an inner codec pipeline
with the N5 varlength block header (mode=0x0001). Intended for
variable-width data types such as label_multiset where the per-chunk
byte size cannot be derived from the chunk shape alone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mkitti commented Apr 13, 2026

@clbarnes here is the variable length N5 codec. Could you review?

Comment on lines +80 to +81
> **Note:** For default-mode N5 blocks, the header omits `num_bytes` and instead contains
> the number of array elements. The varlength header always includes `num_bytes` in place

default-mode... header... contains the number of array elements.

I don't think this is correct, default mode blocks just have a shorter header (according to the spec anyway).

}
```

## Annotated binary layout

Is there a canonical description of this somewhere or is it gleaned from the implementation? Either way a link would be great.

Comment on lines +195 to +196
arrays written by imglib2-label-multisets. The `num_bytes` field in the header equals
`file_size - header_size` for uncompressed data.

Not strictly related to this schema, but what is the num_bytes field actually for? You're reading to the end of the block anyway because you don't know where each value is.

  "name": "n5_varlen",
  "configuration": {
    "codecs": [
      {"name": "n5_label_multiset"}

I know that this is the same pattern used by the n5_default codec, where the inner codecs describe the whole pipeline for the payload, but in this case we have to invent (and specify and implement) a second new codec, n5_label_multiset, which will only ever be used in this n5_varlen context, and n5_varlen will always use n5_label_multiset.

With n5_default, I went back and forth on whether to explicitly include the transpose and bytes codecs or whether to more closely match N5 by handling them in the outer codec and just having an optional compressor field which would contain one bytes-to-bytes codec. I settled on including the whole chain because it meant the whole thing could be delegated to core-spec codecs. In this case, I'd say there's more argument for having more of the logic in the n5_varlen codec and just an optional compressor in the configuration.

But I'm ambivalent about it.
