PrithviWxC_inference.py example fails with "No valid data" despite successful downloads #53

@dun280

Description

Problem Summary

The inference example in examples/PrithviWxC_inference.py fails with "AssertionError: There doesn't seem to be any valid data." even after all Hugging Face downloads complete successfully.

Environment

  • Python 3.12.0
  • Fresh installation following repository instructions
  • All dependencies installed via pip install '.[examples]'

Steps to Reproduce

  1. Clone repository and install dependencies
  2. Run examples/PrithviWxC_inference.py without modifications
  3. All download steps complete successfully:
    Fetching 1 files: 100%|████████| 1/1 [00:02<00:00,  2.43s/it]
    Fetching 1 files: 100%|████████| 1/1 [00:02<00:00,  2.88s/it]
    # ... all downloads succeed
    
  4. Dataset creation fails:
    assert len(dataset) > 0, "There doesn't seem to be any valid data."
    AssertionError: There doesn't seem to be any valid data.

Root Cause

The example uses an 18-hour lead time but downloads only one day of data (Jan 1st), so forecast targets that fall on Jan 2nd have no data behind them:

  • Required for 18h forecast: Input data (Jan 1st) + Target data (Jan 2nd) + Day-2 climatology
  • Actually downloaded: Input data (Jan 1st) + Day-1 climatology only
  • Result: Merra2Dataset correctly rejects all samples due to missing dependencies
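
The dependency above can be checked with plain date arithmetic. The following is a minimal sketch, not code from the repository: `required_days` and `day_of_year_tags` are hypothetical helper names, and the 12:00 input timestamp is an assumed example (any input at or after 06:00 gives the same result for an 18-hour lead).

```python
from datetime import date, datetime, timedelta


def required_days(input_time: datetime, lead_hours: int) -> list[date]:
    """Return every calendar day from the input time through the forecast target."""
    first = input_time.date()
    last = (input_time + timedelta(hours=lead_hours)).date()
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def day_of_year_tags(days: list[date]) -> list[str]:
    """Format days as the 'doyNNN' fragment used in the climatology patterns."""
    return [f"doy{d.timetuple().tm_yday:03d}" for d in days]


# Assumed example: input at 12:00 on Jan 1st with the 18-hour lead time.
days = required_days(datetime(2020, 1, 1, 12), lead_hours=18)
print(days)                    # Jan 1st and Jan 2nd
print(day_of_year_tags(days))  # ['doy001', 'doy002']
```

With a 6-hour lead from 06:00 the same helper returns Jan 1st only, which matches the alternative fix further down.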

Missing Downloads

The example downloads:

allow_patterns="merra-2/MERRA2_sfc_2020010[1].nc"  # Only Jan 1st
allow_patterns="climatology/climate_*_doy001_*.nc"   # Only day-1 climatology

But needs:

allow_patterns="merra-2/MERRA2_sfc_2020010[12].nc"  # Jan 1st AND 2nd  
allow_patterns="climatology/climate_*_doy00[12]_*.nc" # Day-1 AND day-2 climatology
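
To see what the widened patterns select, they can be tried against a few file names with Python's `fnmatch`, which, as far as I can tell, is the same glob style `allow_patterns` uses. This is only a sketch: the climatology file names below are assumed for illustration and were not copied from the repository listing.

```python
from fnmatch import fnmatchcase

patterns = [
    "merra-2/MERRA2_sfc_2020010[12].nc",
    "climatology/climate_*_doy00[12]_*.nc",
]

# Illustrative names only; the exact climatology naming is an assumption.
candidates = [
    "merra-2/MERRA2_sfc_20200101.nc",
    "merra-2/MERRA2_sfc_20200102.nc",
    "merra-2/MERRA2_sfc_20200103.nc",
    "climatology/climate_surface_doy001_hour00.nc",
    "climatology/climate_surface_doy002_hour00.nc",
    "climatology/climate_surface_doy003_hour00.nc",
]

# Keep every candidate that matches at least one pattern.
matched = [name for name in candidates if any(fnmatchcase(name, p) for p in patterns)]
for name in matched:
    print(name)
```

The `[12]` character class accepts exactly one of `1` or `2`, so the Jan 3rd and day-3 names are left out.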

Quick Fix

Add these downloads to the example:

# Add Jan 2nd data
snapshot_download(
    repo_id="ibm-nasa-geospatial/Prithvi-WxC-1.0-2300M",
    allow_patterns="merra-2/MERRA2_sfc_20200102.nc",
    local_dir="../data",
)
snapshot_download(
    repo_id="ibm-nasa-geospatial/Prithvi-WxC-1.0-2300M",
    allow_patterns="merra-2/MERRA_pres_20200102.nc", 
    local_dir="../data",
)

# Add day-2 climatology  
snapshot_download(
    repo_id="ibm-nasa-geospatial/Prithvi-WxC-1.0-2300M",
    allow_patterns="climatology/climate_*_doy002_*.nc",
    local_dir="../data",
)
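
The separate calls can also be folded into one, since `snapshot_download` accepts a list for `allow_patterns`. A minimal sketch under that assumption: `build_patterns` and `download_inputs` are hypothetical helper names, and the file names are the ones quoted in this issue.

```python
def build_patterns(days=("20200101", "20200102"), doys=(1, 2)):
    """Collect the surface, pressure and climatology patterns for the given days."""
    patterns = []
    for day in days:
        patterns.append(f"merra-2/MERRA2_sfc_{day}.nc")
        patterns.append(f"merra-2/MERRA_pres_{day}.nc")
    for doy in doys:
        patterns.append(f"climatology/climate_*_doy{doy:03d}_*.nc")
    return patterns


def download_inputs(local_dir="../data"):
    """Fetch all matching files in a single call (needs network access)."""
    # Imported here so the pattern builder can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="ibm-nasa-geospatial/Prithvi-WxC-1.0-2300M",
        allow_patterns=build_patterns(),
        local_dir=local_dir,
    )


# Print the patterns that would be requested; nothing is downloaded here.
for pattern in build_patterns():
    print(pattern)
```

Calling `download_inputs()` afterwards performs the actual download into `../data`.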

Alternative Fix

Use shorter lead times that stay within a single day:

lead_times = [6]  # Instead of [18]
time_range = ("2020-01-01T06:00:00", "2020-01-01T21:00:00")  # Single day
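
As a rough check of which start times this leaves usable, the sketch below steps through the range and keeps the start times whose 6-hour target still falls on Jan 1st. It rests on two assumptions that are not confirmed in this issue: that `time_range` bounds the forecast start times, and that samples are spaced 3 hours apart.

```python
from datetime import datetime, timedelta

STEP_HOURS = 3  # assumed sampling interval, not confirmed by the example
LEAD_HOURS = 6

start = datetime.fromisoformat("2020-01-01T06:00:00")
end = datetime.fromisoformat("2020-01-01T21:00:00")

kept, dropped = [], []
t = start
while t <= end:
    target = t + timedelta(hours=LEAD_HOURS)
    # A start time is usable only if its target is still on the downloaded day.
    bucket = kept if target.date() == start.date() else dropped
    bucket.append(t.strftime("%H:%M"))
    t += timedelta(hours=STEP_HOURS)

print("usable start times:", kept)
print("target crosses midnight:", dropped)
```

Under these assumptions the start times from 18:00 onwards still point at Jan 2nd, so the dataset would presumably skip them while the earlier ones remain valid.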

Suggested Improvements

  1. Fix the example - Add the missing downloads so it works out-of-the-box

  2. Better error messaging - Instead of a generic assertion, Merra2Dataset could report which files are missing:

    AssertionError: No valid samples found. Missing data for forecast targets:
    - Need data files: MERRA2_sfc_20200102.nc, MERRA_pres_20200102.nc  
    - Need climatology: climate_*_doy002_*.nc
    
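  3. Documentation - Clarify the data dependency relationship:

    "For an N-hour forecast, you need data and matching climatology for every calendar day from the input time through the target time (input time + N hours); an 18-hour forecast started at or after 06:00 therefore needs two days"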

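Until the error message improves, a pre-flight check in the example could produce a similar report. The following is a minimal sketch, not Merra2Dataset's actual logic: `missing_inputs` and `check_inputs` are hypothetical helpers, and the directory layout (`merra-2/` and `climatology/` under the data directory) is assumed to mirror the repository paths.

```python
import tempfile
from pathlib import Path


def missing_inputs(data_dir, days, doys):
    """List the expected input files (or patterns) that are absent under data_dir."""
    root = Path(data_dir)
    missing = []
    for day in days:
        for name in (f"MERRA2_sfc_{day}.nc", f"MERRA_pres_{day}.nc"):
            if not (root / "merra-2" / name).is_file():
                missing.append(f"merra-2/{name}")
    clim_dir = root / "climatology"
    for doy in doys:
        pattern = f"climate_*_doy{doy:03d}_*.nc"
        if not (clim_dir.is_dir() and any(clim_dir.glob(pattern))):
            missing.append(f"climatology/{pattern}")
    return missing


def check_inputs(data_dir, days, doys):
    """Raise a descriptive error before the dataset is built."""
    missing = missing_inputs(data_dir, days, doys)
    if missing:
        raise FileNotFoundError("Missing forecast inputs:\n  - " + "\n  - ".join(missing))


# Demo on a throwaway directory that mimics the failing setup (Jan 1st files only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "merra-2").mkdir()
    (root / "climatology").mkdir()
    for name in (
        "merra-2/MERRA2_sfc_20200101.nc",
        "merra-2/MERRA_pres_20200101.nc",
        "climatology/climate_surface_doy001_hour00.nc",  # assumed file name
    ):
        (root / name).touch()
    report = missing_inputs(root, days=["20200101", "20200102"], doys=[1, 2])

print("\n".join(report))
```

Calling `check_inputs("../data", ["20200101", "20200102"], [1, 2])` right after the downloads would then fail early and name the missing files.
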
Impact

This affects anyone trying the official examples for the first time. The issue is confusing because:

  • All downloads appear to succeed ✅
  • No clear error about what's missing ❌
  • Requires deep understanding of the forecasting logic to diagnose ❌

Verification

After applying the quick fix above, the example runs successfully:

Dataset length: 2
🎉 SUCCESS! Full pipeline working

Thanks for this excellent model! This issue just needs a small documentation/example fix to improve the new user experience.
