
Feature: Squeeze singleton dimensions in HDFParser #971


Problem

Some HDF5 files store arrays with extra length-1 dimensions. MATLAB v7.3 .mat files are the most common example: MATLAB wraps every array in at least one singleton, so a 1D vector of 3266 latitudes is stored as shape (3266, 1) rather than (3266,).
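
The padding is easy to confirm with h5py, since v7.3 .mat files are plain HDF5 (a quick check, assuming a local copy of the file):

```python
import h5py

# MATLAB v7.3 files are HDF5 under the hood, so h5py opens them directly.
with h5py.File("Data_20160517_06_001.mat", "r") as f:
    print(f["Latitude"].shape)  # (3266, 1) -- padded with a singleton
    print(f["Data"].shape)      # (3266, 13483)
```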

When HDFParser assigns phony dimension names to these arrays, the singletons create conflicts that prevent open_virtual_dataset from building an xr.Dataset:

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry
from obspec_utils.stores import AiohttpStore

# CReSIS polar radar data -- public, no auth required
BASE_URL = "https://data.cresis.ku.edu"
MAT_PATH = "/data/rds/2016_Greenland_Polar6/CSARP_standard/20160517_06/Data_20160517_06_001.mat"

store = AiohttpStore(BASE_URL)
registry = ObjectStoreRegistry({BASE_URL: store})

# Must drop MATLAB-internal groups to avoid a separate crash (h5py.Reference
# fillvalue). These groups contain object references, not array data.
parser = HDFParser(drop_variables=[
    "#refs#", "#subsystem#",
    "param_combine", "param_csarp", "param_records",
])

vds = open_virtual_dataset(url=BASE_URL + MAT_PATH, registry=registry, parser=parser)
```

This fails with:

```
ValueError: conflicting sizes for dimension 'phony_dim_1':
  length 13483 on 'Data' and length 1 on {'phony_dim_0': 'Bottom', 'phony_dim_1': 'Bottom'}
```

What's happening internally:

| HDF5 variable | Stored shape | Assigned dims |
| --- | --- | --- |
| Data | (3266, 13483) | phony_dim_0=3266, phony_dim_1=13483 |
| Latitude | (3266, 1) | phony_dim_0=3266, phony_dim_1=1 |
| Time | (1, 13483) | phony_dim_0=1, phony_dim_1=13483 |

phony_dim_0 is used at two different sizes (3266 vs 1), and so is phony_dim_1 (13483 vs 1). xarray rightly rejects the conflict.

The root cause is that _dataset_dims in parsers/hdf/hdf.py (L275-318) assigns phony_dim_N by positional index within each dataset, independently of every other dataset. When two datasets have the same number of stored dimensions but different numbers of "real" dimensions (because of padding singletons), the names collide.
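
A stripped-down model of that positional naming (illustrative only, not the actual _dataset_dims code) shows the collision:

```python
# Each dataset names axis i "phony_dim_i" regardless of that axis's size,
# so two datasets can claim the same name at different sizes.
def phony_dims(shape: tuple[int, ...]) -> dict[str, int]:
    return {f"phony_dim_{i}": size for i, size in enumerate(shape)}

print(phony_dims((3266, 13483)))  # {'phony_dim_0': 3266, 'phony_dim_1': 13483}
print(phony_dims((3266, 1)))      # {'phony_dim_0': 3266, 'phony_dim_1': 1}
print(phony_dims((1, 13483)))     # {'phony_dim_0': 1, 'phony_dim_1': 13483}
```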

Proposed fix

Add a squeeze parameter to HDFParser that removes length-1 dimensions from array metadata during parsing:

```python
parser = HDFParser(squeeze=True)
```

The work happens in _construct_manifest_array (L38-92). Before creating the ManifestArray, the function would:

  1. Identify singleton axes: keep = [i for i, s in enumerate(shape) if s != 1]
  2. Filter shape, chunks, and dims to only the kept axes
  3. Reshape the ChunkManifest to drop the singleton axes

Step 3 is the main implementation detail. The manifest stores chunk references in shaped numpy arrays (_paths, _offsets, _lengths). For singletons, there is exactly one chunk along that axis, so the reshape is a safe np.squeeze on those arrays. For contiguous (non-chunked) datasets, the manifest has a single entry whose key changes from e.g. "0.0" to "0".
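
A minimal sketch of the three steps, assuming direct access to the manifest's shaped _paths/_offsets/_lengths arrays (the actual attribute names, and how a new ChunkManifest gets constructed, may differ):

```python
import numpy as np

def squeeze_singletons(shape, chunks, dims, paths, offsets, lengths):
    """Drop length-1 axes from array metadata and the chunk-grid arrays.

    Hypothetical helper; paths/offsets/lengths stand in for the manifest's
    internal shaped numpy arrays.
    """
    # Step 1: identify singleton axes.
    drop = tuple(i for i, size in enumerate(shape) if size == 1)
    keep = [i for i in range(len(shape)) if i not in drop]

    # Step 2: filter shape, chunks, and dims to the kept axes.
    shape = tuple(shape[i] for i in keep)
    chunks = tuple(chunks[i] for i in keep)
    dims = tuple(dims[i] for i in keep)

    # Step 3: a singleton axis holds exactly one chunk, so the chunk grid
    # also has size 1 there and np.squeeze on those axes is safe.
    if drop:
        paths = np.squeeze(paths, axis=drop)
        offsets = np.squeeze(offsets, axis=drop)
        lengths = np.squeeze(lengths, axis=drop)
    return shape, chunks, dims, paths, offsets, lengths

# Latitude: stored (3266, 1), contiguous, so a single chunk.
shape, chunks, dims, *_ = squeeze_singletons(
    (3266, 1), (3266, 1), ("phony_dim_0", "phony_dim_1"),
    np.full((1, 1), "data.mat"),
    np.zeros((1, 1), dtype=np.uint64),
    np.zeros((1, 1), dtype=np.uint64),
)
print(shape, dims)  # (3266,) ('phony_dim_0',)
```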

After squeezing, the example above becomes:

| Variable | Squeezed shape | Dims |
| --- | --- | --- |
| Data | (3266, 13483) | phony_dim_0, phony_dim_1 |
| Latitude | (3266,) | phony_dim_0 |
| Time | (13483,) | phony_dim_0 |

This removes the size conflict on phony_dim_1, but note that Latitude and Time now both have phony_dim_0 at different sizes (3266 vs 13483). That's a separate naming issue (see companion issue for dimension naming) — squeeze alone removes the spurious conflicts from padding dimensions.

Where to look

  • HDFParser.__init__: add squeeze: bool = False parameter — L140-165
  • _construct_manifest_array: apply squeeze to shape/chunks/dims and manifest — L38-92
  • _dataset_chunk_manifest: the returned ChunkManifest stores data in numpy arrays (_paths, _offsets, _lengths) that would need their singleton axes removed — L200-272
  • ChunkManifest class in manifests/manifest.py: may need a squeeze() method, or callers can use ChunkManifest.from_arrays() with reshaped arrays
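
For that last bullet, here is what a squeeze over the dict form of a manifest would do to the chunk keys (a hedged sketch; the "i.j"-style key format and entry layout are assumptions, and ChunkManifest.from_arrays with already-squeezed arrays may be the simpler route):

```python
def squeeze_manifest_keys(entries, drop_axes):
    # Drop the singleton axes' coordinates from each "i.j.k"-style chunk key.
    out = {}
    for key, ref in entries.items():
        parts = key.split(".")
        kept = [p for i, p in enumerate(parts) if i not in drop_axes]
        out[".".join(kept) or "0"] = ref  # fully squeezed -> scalar key "0"
    return out

# Contiguous dataset: single entry, key "0.0" collapses to "0".
refs = {"0.0": {"path": "data.mat", "offset": 512, "length": 4096}}
print(squeeze_manifest_keys(refs, drop_axes=(1,)))
# {'0': {'path': 'data.mat', 'offset': 512, 'length': 4096}}
```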

Who this affects

Any HDF5 file with length-1 dimensions that was not written by NetCDF4 (NetCDF4 attaches named dimension scales, so the parser gets real dimension names instead of phony ones). MATLAB .mat files are the most common case, but some instrument-generated HDF5 files show the same pattern.

Related work

#879 adds support for vds.squeeze() after parsing — it implements indexing that removes a length-1 dimension from an existing ManifestArray. That solves a different use case (post-parse manipulation, e.g. the workflow in #872 where a user wants to squeeze then expand_dims to reorder dimensions).

This issue is about squeezing during parsing, before the ManifestArray is constructed. The crash happens inside construct_fully_virtual_dataset when xr.Dataset() tries to unify the dimension names — the user never gets a virtual dataset to call .squeeze() on.

That said, the two could share implementation: if VirtualiZarr added a squeeze pass between ManifestGroup construction and xr.Dataset creation (inside construct_fully_virtual_dataset in xarray.py), #879's indexing machinery could be reused rather than duplicating the logic in the parser.
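
Roughly, such a pass could look like the following (a sketch under two assumptions: that #879's integer indexing on ManifestArray lands, and that the variables are available as a name-to-ManifestArray mapping at that point; neither is current virtualizarr API):

```python
def squeeze_pass(arrays):
    """Drop singleton dims from each ManifestArray before xr.Dataset creation.

    `arrays` is assumed to map variable names to ManifestArray objects;
    integer indexing that removes a dimension relies on #879's machinery.
    """
    squeezed = {}
    for name, marr in arrays.items():
        # Remove singleton axes back-to-front so earlier indices stay valid.
        for axis in reversed([i for i, s in enumerate(marr.shape) if s == 1]):
            ndim = len(marr.shape)
            marr = marr[tuple(0 if i == axis else slice(None) for i in range(ndim))]
        squeezed[name] = marr
    return squeezed
```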
