HDF5 validator incorrectly handles attributes with variable-length string arrays #61

Description

@agolovanov

HDF5 supports two ways of storing an array of strings: fixed-length and variable-length.

openPMD uses arrays of strings for some attributes, for example, for axisLabels. When a fixed-length array is used,

// h5dump output
ATTRIBUTE "axisLabels" {
    DATATYPE  H5T_STRING {
        STRSIZE 2;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
    }
    DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
    DATA {
    (0): "x", "z"
    }
}

openPMD-validator considers that a valid attribute. However, when a variable-length array is used,

ATTRIBUTE "axisLabels" {
    DATATYPE  H5T_STRING {
        STRSIZE H5T_VARIABLE;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
    }
    DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
    DATA {
    (0): "x", "z"
    }
}

openPMD-validator fails with the following error message:

Error: Attribute axisLabels in `/data/0/meshes/inv` is not of type ndarray of '<map object at 0x7fe5256acbb0>' (is ndarray of 'object_')!

Variable-length string arrays are a legitimate feature of the HDF5 data format, and the openPMD standard does not explicitly ban them: it only states that axisLabels should be a "1-dimensional array containing N (string) elements", which is satisfied in both cases. I therefore believe that using variable-length strings does not violate the openPMD standard, and that openPMD-validator should not fail in this case.

This probably happens because h5py internally represents variable-length string arrays as np.ndarray with dtype=object rather than a NumPy string type (see https://docs.h5py.org/en/stable/special.html). Instead of checking arr.dtype.type (which is np.object_ for variable-length arrays), the validator should use the h5py.check_string_dtype(arr.dtype) function, which returns string information for both fixed- and variable-length string arrays and None for anything else.
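To illustrate the suggestion, here is a minimal sketch of such a check. The helper names is_string_array and read_axis_labels are hypothetical and are not taken from the openPMD-validator code; the sketch assumes h5py 3.x and writes the attribute to an in-memory file, so nothing touches the disk.

```python
import h5py
import numpy as np


def is_string_array(arr):
    """Return True if arr is an ndarray of fixed- or variable-length strings.

    Hypothetical helper, not part of openPMD-validator.
    """
    if not isinstance(arr, np.ndarray):
        return False
    # h5py.check_string_dtype returns a string_info tuple for fixed-length
    # ('S') dtypes and for object dtypes tagged as variable-length strings,
    # and None for every other dtype (including a plain object dtype).
    return h5py.check_string_dtype(arr.dtype) is not None


def read_axis_labels(labels):
    """Store labels as the axisLabels attribute and read it back.

    Uses the 'core' driver without a backing store, i.e. an in-memory file.
    """
    with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
        f.attrs["axisLabels"] = labels
        return f.attrs["axisLabels"]


if __name__ == "__main__":
    fixed = read_axis_labels(np.array([b"x", b"z"], dtype="S2"))
    variable = read_axis_labels(
        np.array(["x", "z"], dtype=h5py.string_dtype())
    )
    for name, arr in (("fixed", fixed), ("variable", variable)):
        print(name, arr.dtype.type, is_string_array(arr))
```

With this approach, arr.dtype.type still differs between the two cases (np.bytes_ versus np.object_), but the string check itself accepts both, while a generic object array without the h5py string tag is still rejected.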

Attached are two example output files with fixed- and variable-length used for axisLabels: examples.zip
