
Feature: Squeeze singleton dimensions in HDFParser #971


Problem

Some HDF5 files store arrays with extra length-1 dimensions. MATLAB v7.3 .mat files are the most common example: MATLAB wraps every array in at least one singleton, so a 1D vector of 3266 latitudes is stored as shape (3266, 1) rather than (3266,).
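
The padding is easy to confirm with h5py, since v7.3 .mat files are plain HDF5 (a quick check, assuming a local copy of the file):

```python
import h5py

# MATLAB v7.3 files are HDF5 under the hood, so h5py opens them directly.
with h5py.File("Data_20160517_06_001.mat", "r") as f:
    print(f["Latitude"].shape)  # (3266, 1) -- padded with a singleton
    print(f["Data"].shape)      # (3266, 13483)
```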

When HDFParser assigns phony dimension names to these arrays, the singletons create conflicts that prevent open_virtual_dataset from building an xr.Dataset:

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry
from obspec_utils.stores import AiohttpStore

# CReSIS polar radar data -- public, no auth required
BASE_URL = "https://data.cresis.ku.edu"
MAT_PATH = "/data/rds/2016_Greenland_Polar6/CSARP_standard/20160517_06/Data_20160517_06_001.mat"

store = AiohttpStore(BASE_URL)
registry = ObjectStoreRegistry({BASE_URL: store})

# Must drop MATLAB-internal groups to avoid a separate crash (h5py.Reference
# fillvalue). These groups contain object references, not array data.
parser = HDFParser(drop_variables=[
    "#refs#", "#subsystem#",
    "param_combine", "param_csarp", "param_records",
])

vds = open_virtual_dataset(url=BASE_URL + MAT_PATH, registry=registry, parser=parser)
```

This fails with:

```
ValueError: conflicting sizes for dimension 'phony_dim_1':
  length 13483 on 'Data' and length 1 on {'phony_dim_0': 'Bottom', 'phony_dim_1': 'Bottom'}
```

What's happening internally:

| HDF5 variable | Stored shape | Assigned dims |
| --- | --- | --- |
| Data | (3266, 13483) | phony_dim_0=3266, phony_dim_1=13483 |
| Latitude | (3266, 1) | phony_dim_0=3266, phony_dim_1=1 |
| Time | (1, 13483) | phony_dim_0=1, phony_dim_1=13483 |

phony_dim_0 is used at two different sizes (3266 vs 1), and so is phony_dim_1 (13483 vs 1). xarray rightly rejects the conflict.

The root cause is that _dataset_dims in parsers/hdf/hdf.py (L275-318) assigns phony_dim_N by positional index within each dataset, independently of every other dataset. When two datasets have the same number of stored dimensions but different numbers of "real" dimensions (because of padding singletons), the names collide.
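
A stripped-down model of that positional naming (illustrative only, not the actual _dataset_dims code) shows the collision:

```python
# Each dataset names axis i "phony_dim_i" regardless of that axis's size,
# so two datasets can claim the same name at different sizes.
def phony_dims(shape: tuple[int, ...]) -> dict[str, int]:
    return {f"phony_dim_{i}": size for i, size in enumerate(shape)}

print(phony_dims((3266, 13483)))  # {'phony_dim_0': 3266, 'phony_dim_1': 13483}
print(phony_dims((3266, 1)))      # {'phony_dim_0': 3266, 'phony_dim_1': 1}
print(phony_dims((1, 13483)))     # {'phony_dim_0': 1, 'phony_dim_1': 13483}
```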

Proposed fix

Add a squeeze parameter to HDFParser that removes length-1 dimensions from array metadata during parsing:

```python
parser = HDFParser(squeeze=True)
```

The work happens in _construct_manifest_array (L38-92). Before creating the ManifestArray, the function would:

  1. Identify singleton axes: keep = [i for i, s in enumerate(shape) if s != 1]
  2. Filter shape, chunks, and dims to only the kept axes
  3. Reshape the ChunkManifest to drop the singleton axes

Step 3 is the main implementation detail. The manifest stores chunk references in shaped numpy arrays (_paths, _offsets, _lengths). For singletons, there is exactly one chunk along that axis, so the reshape is a safe np.squeeze on those arrays. For contiguous (non-chunked) datasets, the manifest has a single entry whose key changes from e.g. "0.0" to "0".
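
A minimal sketch of the three steps, assuming direct access to the manifest's shaped _paths/_offsets/_lengths arrays (the actual attribute names, and how a new ChunkManifest gets constructed, may differ):

```python
import numpy as np

def squeeze_singletons(shape, chunks, dims, paths, offsets, lengths):
    """Drop length-1 axes from array metadata and the chunk-grid arrays.

    Hypothetical helper; paths/offsets/lengths stand in for the manifest's
    internal shaped numpy arrays.
    """
    # Step 1: identify singleton axes.
    drop = tuple(i for i, size in enumerate(shape) if size == 1)
    keep = [i for i in range(len(shape)) if i not in drop]

    # Step 2: filter shape, chunks, and dims to the kept axes.
    shape = tuple(shape[i] for i in keep)
    chunks = tuple(chunks[i] for i in keep)
    dims = tuple(dims[i] for i in keep)

    # Step 3: a singleton axis holds exactly one chunk, so the chunk grid
    # also has size 1 there and np.squeeze on those axes is safe.
    if drop:
        paths = np.squeeze(paths, axis=drop)
        offsets = np.squeeze(offsets, axis=drop)
        lengths = np.squeeze(lengths, axis=drop)
    return shape, chunks, dims, paths, offsets, lengths

# Latitude: stored (3266, 1), contiguous, so a single chunk.
shape, chunks, dims, *_ = squeeze_singletons(
    (3266, 1), (3266, 1), ("phony_dim_0", "phony_dim_1"),
    np.full((1, 1), "data.mat"),
    np.zeros((1, 1), dtype=np.uint64),
    np.zeros((1, 1), dtype=np.uint64),
)
print(shape, dims)  # (3266,) ('phony_dim_0',)
```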

After squeezing, the example above becomes:

| Variable | Squeezed shape | Dims |
| --- | --- | --- |
| Data | (3266, 13483) | phony_dim_0, phony_dim_1 |
| Latitude | (3266,) | phony_dim_0 |
| Time | (13483,) | phony_dim_0 |

This removes the size conflict on phony_dim_1, but note that Latitude and Time now both have phony_dim_0 at different sizes (3266 vs 13483). That's a separate naming issue (see companion issue for dimension naming) — squeeze alone removes the spurious conflicts from padding dimensions.

Where to look

  • HDFParser.__init__: add squeeze: bool = False parameter — L140-165
  • _construct_manifest_array: apply squeeze to shape/chunks/dims and manifest — L38-92
  • _dataset_chunk_manifest: the returned ChunkManifest stores data in numpy arrays (_paths, _offsets, _lengths) that would need their singleton axes removed — L200-272
  • ChunkManifest class in manifests/manifest.py: may need a squeeze() method, or callers can use ChunkManifest.from_arrays() with reshaped arrays
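
For that last bullet, here is what a squeeze over the dict form of a manifest would do to the chunk keys (a hedged sketch; the "i.j"-style key format and entry layout are assumptions, and ChunkManifest.from_arrays with already-squeezed arrays may be the simpler route):

```python
def squeeze_manifest_keys(entries, drop_axes):
    # Drop the singleton axes' coordinates from each "i.j.k"-style chunk key.
    out = {}
    for key, ref in entries.items():
        parts = key.split(".")
        kept = [p for i, p in enumerate(parts) if i not in drop_axes]
        out[".".join(kept) or "0"] = ref  # fully squeezed -> scalar key "0"
    return out

# Contiguous dataset: single entry, key "0.0" collapses to "0".
refs = {"0.0": {"path": "data.mat", "offset": 512, "length": 4096}}
print(squeeze_manifest_keys(refs, drop_axes=(1,)))
# {'0': {'path': 'data.mat', 'offset': 512, 'length': 4096}}
```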

Who this affects

Any HDF5 file with length-1 dimensions that was not written by NetCDF4 (NetCDF4 attaches named dimension scales, so the parser gets real dimension names instead of phony ones). MATLAB .mat files are the most common case, but some instrument-generated HDF5 files show the same pattern.

Related work

#879 adds support for vds.squeeze() after parsing — it implements indexing that removes a length-1 dimension from an existing ManifestArray. That solves a different use case (post-parse manipulation, e.g. the workflow in #872 where a user wants to squeeze then expand_dims to reorder dimensions).

This issue is about squeezing during parsing, before the ManifestArray is constructed. The crash happens inside construct_fully_virtual_dataset when xr.Dataset() tries to unify the dimension names — the user never gets a virtual dataset to call .squeeze() on.

That said, the two could share implementation: if VirtualiZarr added a squeeze pass between ManifestGroup construction and xr.Dataset creation (inside construct_fully_virtual_dataset in xarray.py), #879's indexing machinery could be reused rather than duplicating the logic in the parser.
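
Roughly, such a pass could look like the following (a sketch under two assumptions: that #879's integer indexing on ManifestArray lands, and that the variables are available as a name-to-ManifestArray mapping at that point; neither is current virtualizarr API):

```python
def squeeze_pass(arrays):
    """Drop singleton dims from each ManifestArray before xr.Dataset creation.

    `arrays` is assumed to map variable names to ManifestArray objects;
    integer indexing that removes a dimension relies on #879's machinery.
    """
    squeezed = {}
    for name, marr in arrays.items():
        # Remove singleton axes back-to-front so earlier indices stay valid.
        for axis in reversed([i for i, s in enumerate(marr.shape) if s == 1]):
            ndim = len(marr.shape)
            marr = marr[tuple(0 if i == axis else slice(None) for i in range(ndim))]
        squeezed[name] = marr
    return squeezed
```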
