fix: write minimum byte width in chunked dimension encoded length (closes #53) #157
Open
ChenZeiShuai wants to merge 2 commits into Apollo3zehn:dev from
Conversation
…ollo3zehn#53)

ChunkedStoragePropertyDescription4.Encode() always wrote (byte)8 as the "Dimension Size Encoded Length" field, regardless of the actual dimension magnitudes. The HDF5 spec requires this field to hold the *minimum* number of bytes needed to encode the largest chunk dimension, and libhdf5's H5D__chunk_set_sizes() in src/H5Dchunk.c strictly enforces this with a direct `!=` check that aborts with: "stored chunk dimension encoding length does not match value calculated from chunk dimensions"

As a result, every chunked file written by PureHDF (with or without filters) was rejected by libhdf5-based readers (h5py, HDFView, MATLAB, Imaris, Bio-Formats), even though PureHDF could read them back itself. This is the same symptom users reported in Apollo3zehn#53 (pandas via h5py) and is likely related to Apollo3zehn#88 (h5dump hang).

Fix:
- Compute the encoded length as `1 + floor(log2(max_dim) / 8)` (min 1 byte, capped at 8 by the HDF5 spec), mirroring libhdf5's byte-counting loop.
- Replace `driver.Write((byte)8)` with the computed value.
- Replace the fixed-width `for (i = 0; i < Rank-1) Write(ulong)` loop plus the trailing `Write((ulong)4)` (which also hardcoded the element-size word to 4, correct only by coincidence for 4-byte element types) with a single `Rank`-iteration loop using `WriteUtils.WriteUlongArbitrary` at the computed width; the element size now uses the real `DimensionSizes[Rank-1]`.
- Update `GetEncodeSize()` to reflect the variable byte width.

Test:
- New `[Theory] ChunkedFile_IsReadableBy_libhdf5` round-trips through HDF.PInvoke (the same libhdf5 that h5py uses). Covers 1-byte, 2-byte, and 3-byte encoded-length cases plus a 6D microscopy-style chunk shape.
- Verified independently with h5py 3.16 / numpy 2.4 / hdf5 lib 2.0 on a 6D SPAD-counts data export pipeline (chunked + Deflate-1 → 47x compression, chunked + Deflate-9 → 112x compression, all readable round-trip).
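The encoded-length rule from the commit message can be illustrated with a standalone sketch (Python for illustration only; the PR's actual change is C#). It shows that the closed form `1 + floor(log2(max_dim) / 8)` agrees with the byte-counting loop over the practical range:

```python
import math

def encoded_length_loop(max_dim):
    """libhdf5-style byte counting: shift off one byte at a time."""
    n = 0
    v = max_dim
    while v != 0:
        n += 1
        v >>= 8
    return min(max(n, 1), 8)  # at least 1 byte, capped at 8 per spec

def encoded_length_formula(max_dim):
    """The commit's closed form: 1 + floor(log2(max_dim) / 8), min 1,
    capped at 8. Exact only while log2 stays clear of double-precision
    rounding; the loop is the authoritative form."""
    if max_dim <= 1:
        return 1
    return min(1 + int(math.log2(max_dim) // 8), 8)
```

The loop is what libhdf5 effectively computes, so it is the safer reference implementation; the formula is just the compact way to state it.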
Summary
Fixes the long-standing bug where files containing chunked datasets written by PureHDF cannot be opened by libhdf5-based readers (h5py, HDFView, MATLAB, Imaris, Bio-Formats, etc.).
This is the same root cause as #53 (pandas-via-h5py reports "return nothing") and likely related to #88 (h5dump hang).
Reproduction (pre-fix)
Root cause
ChunkedStoragePropertyDescription4.Encode() (in StoragePropertyDescriptions.cs) hardcoded the Dimension Size Encoded Length field to 8, then wrote each dimension as a ulong (8 bytes). The HDF5 file format spec (Data Layout Message v4 Properties) requires this field to hold the minimum number of bytes needed to encode the largest chunk dimension, and libhdf5 enforces this strictly in H5D__chunk_set_sizes() (src/H5Dchunk.c).
Since 8 != min_bytes_for_largest_dim for any reasonable dimension value (anything < 2^56), every chunked file PureHDF wrote was rejected. PureHDF's own decoder happens to use the stored value as the read width, so PureHDF-to-PureHDF round-trips work, which masked the bug.
The trailing driver.Write((ulong)4) was a second, related bug: it hardcoded the element-size term of DimensionSizes to 4 (correct only for int/uint/float element types), losing the actual typeSize for any other dtype.
Fix
- Added ComputeEncodedLength(ulong[]), mirroring libhdf5's byte-counting loop (while (v != 0) { len++; v >>= 8; }).
- Encode: write the computed length, then loop 0..Rank (not 0..Rank-1) using WriteUtils.WriteUlongArbitrary(value, encLen) for a variable byte width; this also picks up the real element size from DimensionSizes[Rank-1].
- GetEncodeSize: replace sizeof(ulong) * Rank with encLen * Rank so the layout-message size matches the actual on-disk bytes.
- WriteUtils.WriteUlongArbitrary already exists in this codebase (used by H5D_Chunk4_FixedArray.cs for chunk-size encoding), so no new utility is needed.
Tests
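A minimal sketch of the encode path described above (illustrative Python; the actual change is C#, and these function names mirror the PR's but are not its source):

```python
def compute_encoded_length(dimension_sizes):
    """Minimum bytes to encode the largest entry, via the
    libhdf5-style byte-counting loop."""
    v = max(dimension_sizes)
    enc_len = 0
    while v != 0:
        enc_len += 1
        v >>= 8
    return min(max(enc_len, 1), 8)  # min 1 byte, capped at 8 per spec

def encode_chunk_dimensions(dimension_sizes):
    """Write the length byte, then all Rank entries (including the
    trailing element-size word) at that width, little-endian."""
    enc_len = compute_encoded_length(dimension_sizes)
    out = bytearray([enc_len])
    for value in dimension_sizes:  # loops 0..Rank, not 0..Rank-1
        out += value.to_bytes(enc_len, "little")
    return bytes(out)
```

Note how the element-size word is just the last entry of the array, so it is written at the same computed width as the chunk dimensions rather than as a hardcoded 4.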
New [Theory] test ChunkedFile_IsReadableBy_libhdf5 round-trips chunked files through HDF.PInvoke (the same libhdf5 that h5py uses). Covers 1-byte, 2-byte, and 3-byte encoded-length cases plus a 6D microscopy-style chunk shape ([4,4,32,32,16,1]).
Pre-fix: all 4 cases fail (libhdf5 returns a negative handle on H5F.open). Post-fix: all 4 pass.
External verification
Verified independently against a real-world 6D SPAD-counts microscopy export pipeline using h5py 3.16 / numpy 2.4 / hdf5 lib 2.0:
100/100 random-sample value match across all cases.
Related issues
Notes for reviewer
- The (ulong)4 was specifically suspicious because DimensionSizes[Rank-1] already contains the correct typeSize set by DataLayoutMessage4.Create(). The hardcoded 4 only worked when the element size was 4 bytes (int/uint/float), silently corrupting the layout for any other dtype.
- No change is needed in Decode, since it correctly reads dimensionSizeEncodedLength from disk and uses it. Pre-fix files (where stored = 8) decode correctly; post-fix files (where stored = minimum) also decode correctly.
- If the hardcoded (ulong)4 were still there, the layout-message offset would shift and H5D.open would fail.
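The decode-side compatibility argument can be sketched as follows (illustrative Python, not the actual Decode source): because the reader takes the width from the stored length byte, both pre-fix and post-fix encodings decode to the same dimensions.

```python
def decode_dimensions(payload, rank):
    """Read the stored encoded-length byte, then read `rank`
    little-endian values at exactly that width."""
    enc_len = payload[0]
    dims = []
    offset = 1
    for _ in range(rank):
        dims.append(int.from_bytes(payload[offset:offset + enc_len], "little"))
        offset += enc_len
    return dims
```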