Skip to content

scan_grib unable to process spectra data #540

@gpickering-WEL

Description

@gpickering-WEL

Hi there! I've been processing 2D wave spectra grib files using kerchunk and have found that it doesn't correctly identify the coordinates, nor the dimensions of the data . The variable d2fd has the dimensions (directionNumber, frequencyNumber, latitude, longitude), so a 2D array per lat/lon point. Xarray and cfgrib read the data format correctly, but kerchunk does not. See the below,

cfgrib

xr.load_dataset("ec_spectra/T1P12180000010106001", engine="cfgrib")

<xarray.Dataset> Size: 3MB
Dimensions:          (directionNumber: 36, frequencyNumber: 29, latitude: 21,
                      longitude: 31)
Coordinates:
    number           int64 8B 0
    time             datetime64[ns] 8B 2024-12-18
    step             timedelta64[ns] 8B 14 days 06:00:00
    meanSea          float64 8B 0.0
  * directionNumber  (directionNumber) int64 288B 1 2 3 4 5 6 ... 32 33 34 35 36
  * frequencyNumber  (frequencyNumber) int64 232B 1 2 3 4 5 6 ... 25 26 27 28 29
  * latitude         (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.9 -40.0
  * longitude        (longitude) float64 248B 141.0 141.1 141.2 ... 143.9 144.0
    valid_time       datetime64[ns] 8B 2025-01-01T06:00:00
Data variables:
    d2fd             (directionNumber, frequencyNumber, latitude, longitude) float32 3MB ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-01-23T14:57 GRIB to CDM+CF via cfgrib-0.9.1...

kerchunk

single_ref = scan_grib("./ec_spectra/T1P12180000010100001")
print(f"Lenght of single_ref: {len(single_ref)}")
with open("single_ref.json", "w") as f:
    json.dump(single_ref[0], f)

fs = fsspec.filesystem("reference", fo="single_ref.json")
xr.open_dataset(fs.get_mapper(""), engine="zarr")

Lenght of single_ref: 1
/home/gpickering/projects/welvops-Py/.venv/lib/python3.10/site-packages/xarray/backends/api.py:571: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  backend_ds = backend.open_dataset(
<xarray.Dataset> Size: 6kB
Dimensions:     (latitude: 21, longitude: 31)
Coordinates:
  * latitude    (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.8 -39.9 -40.0
  * longitude   (longitude) float64 248B 141.0 141.1 141.2 ... 143.8 143.9 144.0
    meanSea     float64 8B ...
    number      int64 8B ...
    step        timedelta64[ns] 8B ...
    time        datetime64[ns] 8B ...
    valid_time  datetime64[ns] 8B ...
Data variables:
    d2fd        (latitude, longitude) float64 5kB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_edition:            1
    GRIB_subCentre:          0
    institution:             European Centre for Medium-Range Weather Forecasts

I've been trying to understand the problem but unfortunately haven't reached a solution, here's a few things I've noticed though.

  • cfgrib has a specific variable for spectra keys (directionNumber and frequencyNumber). Kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates. This doesn't contain either directionNumber, nor frequencyNumber.
  • _split_file doesn't break the grib file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the ECMWF example on reading messages from a byte stream to create a new _split_file but that wasn't a complete fix.
$ grib_ls ./notebooks/ec_spectra/T1P12180000010100001 | tail -n 10
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1044 of 1044 messages in ./notebooks/ec_spectra/T1P12180000010100001

1044 of 1044 total messages in 1 files
def _split_file2(f: io.FileIO, skip=0):
    data = f.read()
    offset = 0
    part = 0

    while offset < len(data):
        h = codes_new_from_message(data[offset:])
        total_length = codes_get(h, "totalLength")
        yield offset, total_length, data[offset : offset + total_length]
        codes_release(h)
        offset += total_length
        part += 1
        if skip and part >= skip:
            break

Any suggestions would be greatly appreciated!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions