
fix: write minimum byte width in chunked dimension encoded length (closes #53)#157

Open
ChenZeiShuai wants to merge 2 commits into Apollo3zehn:dev from ChenZeiShuai:fix/chunked-dim-encoded-length

Conversation

@ChenZeiShuai

Summary

Fixes the long-standing bug where files containing chunked datasets written by PureHDF cannot be opened by libhdf5-based readers (h5py, HDFView, MATLAB, Imaris, Bio-Formats, etc.).

This is the same root cause as #53 (pandas-via-h5py reports "return nothing") and likely related to #88 (h5dump hang).

Reproduction (pre-fix)

// PureHDF v2.1.2
var data = new int[10];
var file = new H5File { ["chunked"] = new H5Dataset(data, chunks: new uint[] { 10 }) };
file.Write("test.h5");

# h5py 3.16 / hdf5 2.0
import h5py
with h5py.File('test.h5', 'r') as f:
    f['chunked']  # KeyError: 'Unable to synchronously open object (stored chunk
                  # dimension encoding length does not match value calculated
                  # from chunk dimensions)'

Root cause

ChunkedStoragePropertyDescription4.Encode() (StoragePropertyDescriptions.cs) hardcoded the Dimension Size Encoded Length field to 8, then wrote each dimension as a ulong (8 bytes):

// dimension size encoded length
driver.Write((byte)8);                                  // hardcoded

// dimension sizes
for (int i = 0; i < Rank - 1; i++)
    driver.Write(DimensionSizes[i]);                    // 8 bytes each

driver.Write((ulong)4);                                 // hardcoded element size

The HDF5 file format spec (Data Layout Message v4 Properties) requires this field to hold the minimum number of bytes needed to encode the largest chunk dimension. libhdf5 enforces this strictly in H5D__chunk_set_sizes() (src/H5Dchunk.c):

if (dset->shared->layout.u.chunk.enc_bytes_per_dim) {
    if (dset->shared->layout.u.chunk.enc_bytes_per_dim != max_enc_bytes_per_dim)
        HGOTO_ERROR(H5E_DATASET, H5E_BADVALUE, FAIL,
                    "stored chunk dimension encoding length does not match value "
                    "calculated from chunk dimensions");
}

Since 8 != min_bytes_for_largest_dim for any reasonable dimension value (anything < 2^56), every chunked file PureHDF wrote was rejected. PureHDF's own decoder happens to use the stored value as the read width, so PureHDF-to-PureHDF round-trip works — masking the bug.

The trailing driver.Write((ulong)4) was a second, related bug: it hardcoded the element-size term of DimensionSizes to 4 bytes (correct only for 4-byte element types such as int/uint/float), discarding the actual typeSize for every other dtype.
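To make the mismatch concrete, here is a small Python model (not PureHDF code) of libhdf5's byte-counting loop, assuming — as described above — that the element-size term is included as the last entry of the dimension array:

```python
def min_enc_bytes(dims):
    """Minimum bytes needed to encode the largest value in dims,
    mirroring libhdf5's per-dimension byte-counting loop."""
    max_bytes = 1  # at least one byte even for zero-valued dims
    for v in dims:
        n = 0
        while v != 0:
            n += 1
            v >>= 8
        max_bytes = max(max_bytes, n)
    return max_bytes

stored = 8                          # what PureHDF v2.1.2 hardcoded
computed = min_enc_bytes([10, 4])   # chunk dim 10 plus element size 4
# stored (8) != computed (1), so libhdf5's strict check rejects the file
```

Any chunk dimension below 2^56 needs fewer than 8 bytes, which is why every real-world chunked file hit the error.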

Fix

  • Add ComputeEncodedLength(ulong[]) mirroring libhdf5's byte-counting loop (while (v != 0) { len++; v >>= 8; }).
  • Encode: write the computed length, then loop 0..Rank (not 0..Rank-1) using WriteUtils.WriteUlongArbitrary(value, encLen) for variable byte width — also picks up the real element size from DimensionSizes[Rank-1].
  • GetEncodeSize: replace sizeof(ulong) * Rank with encLen * Rank so the layout-message size matches actual on-disk bytes.

WriteUtils.WriteUlongArbitrary already exists in this codebase (used by H5D_Chunk4_FixedArray.cs for chunk size encoding) — no new utility needed.
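A language-neutral sketch of the fixed Encode path (Python for brevity, not the actual C# change; it assumes WriteUtils.WriteUlongArbitrary behaves like a little-endian write of the low encLen bytes):

```python
def encode_chunked_props(dimension_sizes):
    """Sketch of the fixed Encode(): variable-width dimension sizes.
    dimension_sizes already carries the element size as its last entry."""
    # minimum byte width of the largest value (at least 1 byte)
    enc_len = max(1, max((v.bit_length() + 7) // 8 for v in dimension_sizes))
    out = bytearray()
    out.append(enc_len)                       # computed, no longer hardcoded 8
    for v in dimension_sizes:                 # full 0..Rank loop, element size included
        out += v.to_bytes(enc_len, "little")  # WriteUlongArbitrary-style write
    return bytes(out)
```

With this shape, GetEncodeSize naturally becomes 1 + enc_len * Rank, matching the bytes actually emitted.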

Tests

New [Theory] test ChunkedFile_IsReadableBy_libhdf5 round-trips chunked files through HDF.PInvoke (the same libhdf5 that h5py uses). Covers:

  • 1D, max dim 10 → 1-byte encoded length
  • 1D, max dim 256 → 2-byte encoded length
  • 1D, max dim 65536 → 3-byte encoded length
  • 6D real-world microscopy chunk shape [4,4,32,32,16,1]

Pre-fix: all 4 cases fail (libhdf5 returns negative handle on H5F.open).
Post-fix: all 4 pass.
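The expected encoded lengths in the case list above follow directly from the byte-count rule; as a sanity check, they also match the closed-form 1 + floor(log2(max_dim) / 8) for max_dim >= 1:

```python
import math

def enc_len(max_dim):
    # closed-form equivalent of the while-loop byte count (max_dim >= 1)
    return 1 + int(math.log2(max_dim) // 8)

# test cases from the list above: (largest chunk dimension, expected width)
cases = [(10, 1), (256, 2), (65536, 3), (32, 1)]  # 32 = max of [4,4,32,32,16,1]
assert all(enc_len(d) == e for d, e in cases)
```

Note the boundary behavior: 255 still fits in 1 byte, 256 tips over to 2; 65535 fits in 2, 65536 needs 3.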

External verification

Verified independently against a real-world 6D SPAD-counts microscopy export pipeline using h5py 3.16 / numpy 2.4 / hdf5 lib 2.0:

File                      Pre-fix    Post-fix   Compression
chunked, no compression   rejected   OK         1x
chunked + Deflate-1       rejected   OK         47x
chunked + Deflate-9       rejected   OK         112x
contiguous (control)      OK         OK         1x

100/100 random-sample value match across all cases.

Related issues

  • #53 — pandas via h5py cannot read chunked files written by PureHDF (closed by this PR)
  • #88 — h5dump hang (likely related)

Notes for reviewer

  • The (ulong)4 was specifically suspicious because DimensionSizes[Rank-1] already contains the correct typeSize set by DataLayoutMessage4.Create(). The hardcoded 4 only worked when element size was 4 bytes (int/uint/float), silently corrupting layout for any other dtype.
  • I did not change Decode since it correctly reads dimensionSizeEncodedLength from disk and uses it. Pre-fix files (where stored = 8) decode correctly; post-fix files (where stored = min) also decode correctly.
  • I considered adding a separate test for non-int32 element sizes (the second hardcoded bug), but the libhdf5 round-trip test already covers it implicitly — if (ulong)4 were still there, the layout-message offset would shift and H5D.open would fail.
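The decode-compatibility point can be illustrated with a tiny model decoder (Python; function name is hypothetical, and little-endian variable-width values are assumed as above): a reader that takes the width from the stored field handles both the old 8-byte encoding and the new minimum-width encoding.

```python
def decode_dims(payload, rank):
    """Model of Decode(): read the stored width, then rank values at that width."""
    enc_len, rest = payload[0], payload[1:]
    return [
        int.from_bytes(rest[i * enc_len:(i + 1) * enc_len], "little")
        for i in range(rank)
    ]

# pre-fix file: stored width 8, post-fix file: stored width 1 — same logical dims
pre_fix = bytes([8]) + (10).to_bytes(8, "little") + (4).to_bytes(8, "little")
post_fix = bytes([1, 10, 4])
assert decode_dims(pre_fix, 2) == decode_dims(post_fix, 2) == [10, 4]
```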

Vincent Wilms and others added 2 commits March 17, 2026 23:54
…ollo3zehn#53)

ChunkedStoragePropertyDescription4.Encode() always wrote (byte)8 as the
"Dimension Size Encoded Length" field, regardless of the actual dimension
magnitudes. The HDF5 spec requires this field to hold the *minimum* number
of bytes needed to encode the largest chunk dimension, and libhdf5's
H5D__chunk_set_sizes() in src/H5Dchunk.c strictly enforces this with a
direct `!=` check that aborts with:

  "stored chunk dimension encoding length does not match value calculated
   from chunk dimensions"

As a result every chunked file written by PureHDF (with or without
filters) was rejected by libhdf5-based readers — h5py, HDFView, MATLAB,
Imaris, Bio-Formats — even though PureHDF could read them back itself.
This is the same symptom users reported in Apollo3zehn#53 (pandas via h5py) and is
likely related to Apollo3zehn#88 (h5dump hang).

Fix:
- Compute the encoded length as `1 + floor(log2(max_dim) / 8)` (min 1 byte,
  capped at 8 by HDF5 spec), mirroring libhdf5's byte-counting loop.
- Replace `driver.Write((byte)8)` with the computed value.
- Replace the fixed-width `for (i = 0; i < Rank-1) Write(ulong)` +
  trailing `Write((ulong)4)` (which also hardcoded the element-size word
  to 4, breaking any non-int32 dataset coincidentally) with a single
  `Rank`-iteration loop using `WriteUtils.WriteUlongArbitrary` at the
  computed width — element size now uses the real `DimensionSizes[Rank-1]`.
- `GetEncodeSize()` updated to reflect the variable byte width.

Test:
- New `[Theory] ChunkedFile_IsReadableBy_libhdf5` round-trips through
  HDF.PInvoke (same libhdf5 h5py uses). Covers 1-byte, 2-byte, 3-byte
  encoded length cases plus a 6D microscopy-style chunk shape.

Verified independently with h5py 3.16 / numpy 2.4 / hdf5 lib 2.0 on a 6D
SPAD-counts data export pipeline (chunked + Deflate-1 → 47x compression,
chunked + Deflate-9 → 112x compression, all readable round-trip).

Development

Successfully merging this pull request may close these issues:

  • Can't read file from pandas library
  • Unable to open file written with PureHDF
