fix: write minimum byte width in chunked dimension encoded length (closes #53) #157
Open
ChenZeiShuai wants to merge 2 commits into Apollo3zehn:dev from
Conversation
…ollo3zehn#53)

ChunkedStoragePropertyDescription4.Encode() always wrote (byte)8 as the "Dimension Size Encoded Length" field, regardless of the actual dimension magnitudes. The HDF5 spec requires this field to hold the *minimum* number of bytes needed to encode the largest chunk dimension, and libhdf5's H5D__chunk_set_sizes() in src/H5Dchunk.c strictly enforces this with a direct `!=` check that aborts with: "stored chunk dimension encoding length does not match value calculated from chunk dimensions"

As a result, every chunked file written by PureHDF (with or without filters) was rejected by libhdf5-based readers (h5py, HDFView, MATLAB, Imaris, Bio-Formats), even though PureHDF could read them back itself. This is the same symptom users reported in Apollo3zehn#53 (pandas via h5py) and is likely related to Apollo3zehn#88 (h5dump hang).

Fix:
- Compute the encoded length as `1 + floor(log2(max_dim) / 8)` (min 1 byte, capped at 8 by the HDF5 spec), mirroring libhdf5's byte-counting loop.
- Replace `driver.Write((byte)8)` with the computed value.
- Replace the fixed-width `for (i = 0; i < Rank-1) Write(ulong)` loop plus the trailing `Write((ulong)4)` (which also hardcoded the element-size word to 4, correct only by coincidence for 4-byte element types) with a single `Rank`-iteration loop using `WriteUtils.WriteUlongArbitrary` at the computed width; the element size now uses the real `DimensionSizes[Rank-1]`.
- Update `GetEncodeSize()` to reflect the variable byte width.

Test:
- New `[Theory] ChunkedFile_IsReadableBy_libhdf5` round-trips through HDF.PInvoke (the same libhdf5 that h5py uses). Covers 1-byte, 2-byte, and 3-byte encoded-length cases plus a 6D microscopy-style chunk shape.
- Verified independently with h5py 3.16 / numpy 2.4 / hdf5 lib 2.0 on a 6D SPAD-counts data export pipeline (chunked + Deflate-1 → 47x compression, chunked + Deflate-9 → 112x compression, all readable round-trip).
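The encoded-length rule from the commit message can be illustrated with a standalone sketch (Python for illustration only; the PR's actual change is C#). It shows that the closed form `1 + floor(log2(max_dim) / 8)` agrees with the byte-counting loop over the practical range:

```python
import math

def encoded_length_loop(max_dim):
    """libhdf5-style byte counting: shift off one byte at a time."""
    n = 0
    v = max_dim
    while v != 0:
        n += 1
        v >>= 8
    return min(max(n, 1), 8)  # at least 1 byte, capped at 8 per spec

def encoded_length_formula(max_dim):
    """The commit's closed form: 1 + floor(log2(max_dim) / 8), min 1,
    capped at 8. Exact only while log2 stays clear of double-precision
    rounding; the loop is the authoritative form."""
    if max_dim <= 1:
        return 1
    return min(1 + int(math.log2(max_dim) // 8), 8)
```

The loop is what libhdf5 effectively computes, so it is the safer reference implementation; the formula is just the compact way to state it.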
Summary
Fixes the long-standing bug where files containing chunked datasets written by PureHDF cannot be opened by libhdf5-based readers (h5py, HDFView, MATLAB, Imaris, Bio-Formats, etc.).
This is the same root cause as #53 (pandas-via-h5py reports "return nothing") and likely related to #88 (h5dump hang).
Reproduction (pre-fix)
Root cause
ChunkedStoragePropertyDescription4.Encode() (in StoragePropertyDescriptions.cs) hardcoded the Dimension Size Encoded Length field to 8, then wrote each dimension as a ulong (8 bytes). The HDF5 file format spec (Data Layout Message v4 Properties) requires this field to hold the minimum number of bytes needed to encode the largest chunk dimension, and libhdf5 enforces this strictly in H5D__chunk_set_sizes() (src/H5Dchunk.c).
Since 8 != min_bytes_for_largest_dim for any reasonable dimension value (anything < 2^56), every chunked file PureHDF wrote was rejected. PureHDF's own decoder happens to use the stored value as the read width, so PureHDF-to-PureHDF round-trips work, which masked the bug.
The trailing driver.Write((ulong)4) was a second, related bug: it hardcoded the element-size term of DimensionSizes to 4 (correct only for int/uint/float element types), losing the actual typeSize for any other dtype.
Fix
- Added ComputeEncodedLength(ulong[]), mirroring libhdf5's byte-counting loop (while (v != 0) { len++; v >>= 8; }).
- Encode: write the computed length, then loop 0..Rank (not 0..Rank-1) using WriteUtils.WriteUlongArbitrary(value, encLen) for a variable byte width; this also picks up the real element size from DimensionSizes[Rank-1].
- GetEncodeSize: replace sizeof(ulong) * Rank with encLen * Rank so the layout-message size matches the actual on-disk bytes.
- WriteUtils.WriteUlongArbitrary already exists in this codebase (used by H5D_Chunk4_FixedArray.cs for chunk-size encoding), so no new utility is needed.
Tests
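A minimal sketch of the encode path described above (illustrative Python; the actual change is C#, and these function names mirror the PR's but are not its source):

```python
def compute_encoded_length(dimension_sizes):
    """Minimum bytes to encode the largest entry, via the
    libhdf5-style byte-counting loop."""
    v = max(dimension_sizes)
    enc_len = 0
    while v != 0:
        enc_len += 1
        v >>= 8
    return min(max(enc_len, 1), 8)  # min 1 byte, capped at 8 per spec

def encode_chunk_dimensions(dimension_sizes):
    """Write the length byte, then all Rank entries (including the
    trailing element-size word) at that width, little-endian."""
    enc_len = compute_encoded_length(dimension_sizes)
    out = bytearray([enc_len])
    for value in dimension_sizes:  # loops 0..Rank, not 0..Rank-1
        out += value.to_bytes(enc_len, "little")
    return bytes(out)
```

Note how the element-size word is just the last entry of the array, so it is written at the same computed width as the chunk dimensions rather than as a hardcoded 4.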
New [Theory] test ChunkedFile_IsReadableBy_libhdf5 round-trips chunked files through HDF.PInvoke (the same libhdf5 that h5py uses). Covers 1-byte, 2-byte, and 3-byte encoded-length cases plus a 6D microscopy-style chunk shape ([4,4,32,32,16,1]).
Pre-fix: all 4 cases fail (libhdf5 returns a negative handle on H5F.open). Post-fix: all 4 pass.
External verification
Verified independently against a real-world 6D SPAD-counts microscopy export pipeline using h5py 3.16 / numpy 2.4 / hdf5 lib 2.0:
100/100 random-sample value match across all cases.
Related issues
Notes for reviewer
- The (ulong)4 was specifically suspicious because DimensionSizes[Rank-1] already contains the correct typeSize set by DataLayoutMessage4.Create(). The hardcoded 4 only worked when the element size was 4 bytes (int/uint/float), silently corrupting the layout for any other dtype.
- No change is needed in Decode, since it correctly reads dimensionSizeEncodedLength from disk and uses it. Pre-fix files (where stored = 8) decode correctly; post-fix files (where stored = minimum) also decode correctly.
- If the hardcoded (ulong)4 were still there, the layout-message offset would shift and H5D.open would fail.
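The decode-side compatibility argument can be sketched as follows (illustrative Python, not the actual Decode source): because the reader takes the width from the stored length byte, both pre-fix and post-fix encodings decode to the same dimensions.

```python
def decode_dimensions(payload, rank):
    """Read the stored encoded-length byte, then read `rank`
    little-endian values at exactly that width."""
    enc_len = payload[0]
    dims = []
    offset = 1
    for _ in range(rank):
        dims.append(int.from_bytes(payload[offset:offset + enc_len], "little"))
        offset += enc_len
    return dims
```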