## Problem
Some HDF5 files store arrays with extra length-1 dimensions. MATLAB v7.3 `.mat` files are the most common example: MATLAB wraps every array in at least one singleton, so a 1D vector of 3266 latitudes is stored as shape `(3266, 1)` rather than `(3266,)`.
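The padding is easy to confirm with plain h5py, since a v7.3 `.mat` file is just an HDF5 container (a minimal sketch, assuming a local copy of the CReSIS file used in the reproduction below; shapes match the table further down):

```python
import h5py

# v7.3 .mat files are HDF5 containers, so h5py opens them directly
with h5py.File("Data_20160517_06_001.mat", "r") as f:
    print(f["Latitude"].shape)  # (3266, 1) -- 1D vector padded with a singleton
    print(f["Data"].shape)      # (3266, 13483)
```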
When `HDFParser` assigns phony dimension names to these arrays, the singletons create conflicts that prevent `open_virtual_dataset` from building an `xr.Dataset`:
```python
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry
from obspec_utils.stores import AiohttpStore

# CReSIS polar radar data -- public, no auth required
BASE_URL = "https://data.cresis.ku.edu"
MAT_PATH = "/data/rds/2016_Greenland_Polar6/CSARP_standard/20160517_06/Data_20160517_06_001.mat"

store = AiohttpStore(BASE_URL)
registry = ObjectStoreRegistry({BASE_URL: store})

# Must drop MATLAB-internal groups to avoid a separate crash (h5py.Reference
# fillvalue). These groups contain object references, not array data.
parser = HDFParser(drop_variables=[
    "#refs#", "#subsystem#",
    "param_combine", "param_csarp", "param_records",
])

vds = open_virtual_dataset(url=BASE_URL + MAT_PATH, registry=registry, parser=parser)
```

```
ValueError: conflicting sizes for dimension 'phony_dim_1':
length 13483 on 'Data' and length 1 on {'phony_dim_0': 'Bottom', 'phony_dim_1': 'Bottom'}
```
What's happening internally:
| HDF5 variable | Stored shape | Assigned dims |
| --- | --- | --- |
| `Data` | (3266, 13483) | `phony_dim_0=3266`, `phony_dim_1=13483` |
| `Latitude` | (3266, 1) | `phony_dim_0=3266`, `phony_dim_1=1` |
| `Time` | (1, 13483) | `phony_dim_0=1`, `phony_dim_1=13483` |
`phony_dim_0` means two different things (3266 vs 1), and so does `phony_dim_1` (13483 vs 1). xarray rightly rejects this.
The root cause is that `_dataset_dims` in `parsers/hdf/hdf.py` (L275-318) assigns `phony_dim_N` by position index within each dataset independently. When two datasets have different numbers of "real" dimensions but the same number of stored dimensions (because of padding singletons), the names collide.
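A minimal illustration of the collision (this mimics the naming scheme described above; it is not the library code):

```python
# Positional naming applied to each dataset independently (illustration only)
def phony_dims(shape: tuple[int, ...]) -> dict[str, int]:
    return {f"phony_dim_{i}": size for i, size in enumerate(shape)}

phony_dims((3266, 13483))  # Data:     {'phony_dim_0': 3266, 'phony_dim_1': 13483}
phony_dims((3266, 1))      # Latitude: {'phony_dim_0': 3266, 'phony_dim_1': 1}
phony_dims((1, 13483))     # Time:     {'phony_dim_0': 1,    'phony_dim_1': 13483}
# phony_dim_1 is claimed at both 13483 and 1, so xr.Dataset cannot unify them
```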
## Proposed fix

Add a `squeeze` parameter to `HDFParser` that removes length-1 dimensions from array metadata during parsing:

```python
parser = HDFParser(squeeze=True)
```
The work happens in `_construct_manifest_array` (L38-92). Before creating the `ManifestArray`, the function would:

1. Identify the singleton axes: `keep = [i for i, s in enumerate(shape) if s != 1]`
2. Filter `shape`, `chunks`, and `dims` down to the kept axes (see the sketch after this list)
3. Reshape the `ChunkManifest` to drop the singleton axes
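Steps 1 and 2 are straightforward filtering. A sketch, using `Latitude`'s metadata as the example; the local variable names inside `_construct_manifest_array` are assumed, not taken from `hdf.py`:

```python
# Hypothetical sketch of steps 1-2, using Latitude's metadata as the example
shape = (3266, 1)
chunks = (3266, 1)   # contiguous storage treated as one chunk (assumption)
dims = ["phony_dim_0", "phony_dim_1"]

keep = [i for i, s in enumerate(shape) if s != 1]
shape = tuple(shape[i] for i in keep)    # (3266,)
chunks = tuple(chunks[i] for i in keep)  # (3266,)
dims = [dims[i] for i in keep]           # ['phony_dim_0']
```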
Step 3 is the main implementation detail. The manifest stores chunk references in shaped numpy arrays (`_paths`, `_offsets`, `_lengths`). For singletons, there is exactly one chunk along that axis, so the reshape is a safe `np.squeeze` on those arrays. For contiguous (non-chunked) datasets, the manifest has a single entry whose key changes from e.g. `"0.0"` to `"0"`.
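A sketch of step 3 via the `ChunkManifest.from_arrays()` route mentioned under "Where to look" below; `squeeze_manifest` is a hypothetical helper, not existing API:

```python
import numpy as np

from virtualizarr.manifests import ChunkManifest


def squeeze_manifest(manifest: ChunkManifest, keep: list[int]) -> ChunkManifest:
    """Hypothetical helper: drop singleton axes from a manifest's shaped arrays."""
    # Each singleton axis holds exactly one chunk, so removing it cannot merge
    # or reorder chunk entries: this is a pure reshape of the three arrays.
    drop = tuple(i for i in range(manifest._paths.ndim) if i not in keep)
    # from_arrays re-derives chunk keys from array positions, which should
    # also handle the contiguous case where "0.0" becomes "0"
    return ChunkManifest.from_arrays(
        paths=np.squeeze(manifest._paths, axis=drop),
        offsets=np.squeeze(manifest._offsets, axis=drop),
        lengths=np.squeeze(manifest._lengths, axis=drop),
    )
```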
After squeezing, the example above becomes:
| Variable | Squeezed shape | Dims |
| --- | --- | --- |
| `Data` | (3266, 13483) | `phony_dim_0`, `phony_dim_1` |
| `Latitude` | (3266,) | `phony_dim_0` |
| `Time` | (13483,) | `phony_dim_0` |
This removes the size conflict on `phony_dim_1`, but note that `Latitude` and `Time` now both have `phony_dim_0` at different sizes (3266 vs 13483). That's a separate naming issue (see the companion issue on dimension naming) — squeeze alone removes the spurious conflicts from padding dimensions.
## Where to look

- `HDFParser.__init__`: add a `squeeze: bool = False` parameter — L140-165
- `_construct_manifest_array`: apply the squeeze to shape/chunks/dims and the manifest — L38-92
- `_dataset_chunk_manifest`: the returned `ChunkManifest` stores data in numpy arrays (`_paths`, `_offsets`, `_lengths`) that would need their singleton axes removed — L200-272
- `ChunkManifest` class in `manifests/manifest.py`: may need a `squeeze()` method, or callers can use `ChunkManifest.from_arrays()` with reshaped arrays
## Who this affects

Any HDF5 file with length-1 dimensions that wasn't written by NetCDF4. MATLAB `.mat` files are the most common case, but some instrument-generated HDF5 files have the same pattern.
## Related work

#879 adds support for `vds.squeeze()` after parsing — it implements indexing that removes a length-1 dimension from an existing `ManifestArray`. That solves a different use case (post-parse manipulation, e.g. the workflow in #872 where a user wants to squeeze and then `expand_dims` to reorder dimensions).

This issue is about squeezing during parsing, before the `ManifestArray` is constructed. The crash happens inside `construct_fully_virtual_dataset` when `xr.Dataset()` tries to unify the dimension names — the user never gets a virtual dataset to call `.squeeze()` on.

That said, the two could share an implementation: if VirtualiZarr added a squeeze pass between `ManifestGroup` construction and `xr.Dataset` creation (inside `construct_fully_virtual_dataset` in `xarray.py`), #879's indexing machinery could be reused rather than duplicating the logic in the parser.
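A purely illustrative sketch of that shared pass; it assumes #879's integer indexing on `ManifestArray` drops the indexed length-1 axis (numpy semantics), and `squeeze_all` is a hypothetical helper, not existing API:

```python
from virtualizarr.manifests import ManifestArray


def squeeze_all(arrays: dict[str, ManifestArray]) -> dict[str, ManifestArray]:
    """Hypothetical pass between ManifestGroup construction and xr.Dataset creation."""
    out = {}
    for name, arr in arrays.items():
        # Keep full axes as-is; integer-index singleton axes so that, with
        # #879's indexing semantics, each length-1 axis is dropped
        index = tuple(0 if size == 1 else slice(None) for size in arr.shape)
        out[name] = arr[index]
    return out
```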