Add n5_varlen codec for N5 varlength block format #54
mkitti wants to merge 1 commit into zarr-developers:main
Defines an array-to-bytes codec that wraps an inner codec pipeline with the N5 varlength block header (mode=0x0001). Intended for variable-width data types such as label_multiset where the per-chunk byte size cannot be derived from the chunk shape alone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
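A minimal sketch of the wrapping step described above, assuming the header layout given in the N5 specification: big-endian `uint16` mode, `uint16` dimension count, one `uint32` per dimension, then a `uint32` that this PR calls `num_bytes`. The name `encode_varlen_block` and the `compress` parameter are illustrative and not part of the PR; storing the pre-compression payload length in `num_bytes` is also an assumption.

```python
import struct
import zlib

MODE_VARLENGTH = 0x0001  # mode value stated in the PR description


def encode_varlen_block(block_size, payload, compress=None):
    """Prefix an encoded payload with an N5 varlength-style header.

    Assumed layout (all big-endian):
      uint16 mode, uint16 ndim, ndim x uint32 dims, uint32 num_bytes.
    `compress` is an optional bytes-to-bytes callable applied to the payload
    only; the header itself is never compressed.
    """
    ndim = len(block_size)
    header = struct.pack(">HH", MODE_VARLENGTH, ndim)
    header += struct.pack(f">{ndim}I", *block_size)
    # Assumption: num_bytes records the payload length before compression.
    header += struct.pack(">I", len(payload))
    body = compress(payload) if compress is not None else payload
    return header + body


# Uncompressed block for a 2 x 3 chunk carrying a 4-byte payload.
raw_block = encode_varlen_block((2, 3), b"abcd")

# Same payload with a zlib-compressed body (zlib stands in for any compressor).
packed_block = encode_varlen_block((2, 3), b"abcd", compress=zlib.compress)
```

For a two-dimensional chunk the header occupies 2 + 2 + 2 × 4 + 4 = 16 bytes, so `raw_block` is 20 bytes long.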
@clbarnes here is the variable length N5 codec. Could you review?
> **Note:** For default-mode N5 blocks, the header omits `num_bytes` and instead contains
> the number of array elements. The varlength header always includes `num_bytes` in place
default-mode... header... contains the number of array elements.
I don't think this is correct, default mode blocks just have a shorter header (according to the spec anyway).
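For reference, a sketch of how a reader could tell the two header layouts apart, assuming the layout given in the N5 specification: mode and dimension count as big-endian `uint16`, each dimension as big-endian `uint32`, and one extra `uint32` only when the mode is varlength. Under that reading the default-mode header simply ends after the dimensions. The function name `parse_n5_header` is hypothetical.

```python
import struct


def parse_n5_header(block):
    """Return (mode, dims, extra_field, header_size) for an N5 block.

    Handles only mode 0 (default) and mode 1 (varlength); the layout is an
    assumption taken from the N5 specification, not from this PR.
    """
    mode, ndim = struct.unpack_from(">HH", block, 0)
    if mode not in (0, 1):
        raise ValueError(f"unsupported N5 block mode: {mode}")
    dims = struct.unpack_from(f">{ndim}I", block, 4)
    offset = 4 + 4 * ndim
    extra_field = None
    if mode == 1:
        # Varlength blocks carry one additional uint32 after the dimensions.
        (extra_field,) = struct.unpack_from(">I", block, offset)
        offset += 4
    return mode, dims, extra_field, offset


default_header = struct.pack(">HHII", 0, 2, 2, 3)
varlen_header = struct.pack(">HHIII", 1, 2, 2, 3, 7)

default_parsed = parse_n5_header(default_header)
varlen_parsed = parse_n5_header(varlen_header)
```

With these inputs the default-mode header is 12 bytes and the varlength header is 16 bytes, a difference of exactly one `uint32`.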
## Annotated binary layout
Is there a canonical description of this somewhere or is it gleaned from the implementation? Either way a link would be great.
arrays written by imglib2-label-multisets. The `num_bytes` field in the header equals
`file_size - header_size` for uncompressed data.
Not strictly related to this schema, but what is the num_bytes field actually for? You're reading to the end of the block anyway because you don't know where each value is.
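One possible use is as a consistency check. The sketch below tests the quoted relation `num_bytes == file_size - header_size` on an uncompressed block, assuming a varlength header of `uint16` mode, `uint16` dimension count, one `uint32` per dimension and a trailing `uint32` `num_bytes`, all big-endian. The helper name `is_consistent_uncompressed_block` is hypothetical.

```python
import struct


def is_consistent_uncompressed_block(block):
    """Check num_bytes against the bytes that actually follow the header.

    Only meaningful for uncompressed varlength blocks; the header layout
    is an assumption, as noted in the text above.
    """
    mode, ndim = struct.unpack_from(">HH", block, 0)
    if mode != 1:
        return False
    header_size = 2 + 2 + 4 * ndim + 4
    (num_bytes,) = struct.unpack_from(">I", block, header_size - 4)
    return num_bytes == len(block) - header_size


good_block = struct.pack(">HHIII", 1, 2, 2, 3, 5) + b"hello"
truncated_block = good_block[:-1]

good_ok = is_consistent_uncompressed_block(good_block)
truncated_ok = is_consistent_uncompressed_block(truncated_block)
```

A truncated or padded file fails the check, which a reader that simply consumes bytes to the end of the block would not notice.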
"name": "n5_varlen",
"configuration": {
  "codecs": [
    {"name": "n5_label_multiset"}
I know that this is the same pattern used by the `n5_default` codec, where the inner codecs describe the whole pipeline for the payload. In this case, though, we're having to invent (and specify and implement) a second new codec, `n5_label_multiset`, which will only ever be used in this `n5_varlen` context, and `n5_varlen` will always use `n5_label_multiset`.
With n5_default, I went back and forth on whether to explicitly include the transpose and bytes codecs or whether to more closely match N5 by handling them in the outer codec and just having an optional compressor field which would contain one bytes-to-bytes codec. I settled on including the whole chain because it meant the whole thing could be delegated to core-spec codecs. In this case, I'd say there's more argument for having more of the logic in the n5_varlen codec and just an optional compressor in the configuration.
But I'm ambivalent about it.
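To make the two options concrete, here is a sketch that derives the compressor-only shape from the pipeline shape used in this PR. The `compressor` key follows the wording above, but its exact schema is hypothetical, as is the helper `to_compressor_style`; neither is defined by this PR.

```python
import json

# Shape proposed in the PR: the inner pipeline is spelled out in full.
pipeline_style = {
    "name": "n5_varlen",
    "configuration": {
        "codecs": [
            {"name": "n5_label_multiset"},
            {"name": "gzip", "configuration": {"level": 6}},
        ]
    },
}


def to_compressor_style(codec):
    """Fold the label-multiset step into the outer codec (hypothetical).

    Whatever remains of the inner pipeline must be at most one
    bytes-to-bytes codec, which becomes the optional `compressor`.
    """
    inner = codec["configuration"]["codecs"]
    rest = [c for c in inner if c["name"] != "n5_label_multiset"]
    if len(rest) > 1:
        raise ValueError("expected at most one bytes-to-bytes codec")
    configuration = {"compressor": rest[0]} if rest else {}
    return {"name": codec["name"], "configuration": configuration}


compressor_style = to_compressor_style(pipeline_style)
print(json.dumps(compressor_style, indent=2))
```

The trade-off is the one described above: the pipeline shape reuses the existing codec machinery, while the compressor-only shape avoids registering a second codec that has a single possible caller.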
Summary
Registers the `n5_varlen` codec, the varlength-mode counterpart to the existing `n5_default` codec.
- `mode = 0x0001` (varlength)
- … `label_multiset`)
N5 varlength header format
Codec chain (example with label_multiset)
    {
      "name": "n5_varlen",
      "configuration": {
        "codecs": [
          {"name": "n5_label_multiset"},
          {"name": "gzip", "configuration": {"level": 6}}
        ]
      }
    }
Test plan
- `codecs/n5_varlen/schema.json` validates correctly

🤖 Generated with Claude Code