The datasets used in this repository are publicly available open data and are licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license. See LICENSES/CC-BY-4.0.txt for details.
In particular, this repository uses the Liander Power Dataset, which is publicly accessible via the Liander Open Data portal.
This dataset consists of:
- Historical Power Measurements: 5-minute averaged active power readings (in MW) collected across Alliander's electrical grid, with timestamps in UTC.
- Historical Weather Data: Sourced from providers such as Open-Meteo, offering temperature, wind, and shortwave_radiation records, with timestamps in UTC.
- Powerpile dataset: Publicly available datasets containing power consumption data from various locations.
The data is organized to link measurements, metadata, and locations. We use numpy.memmap because it gives efficient access to large arrays stored on disk without loading the entire array into memory. The core components are:
- Contains a 1D NumPy array with all historical measurement values. Each value corresponds to a 5-minute averaged active power reading.
- Stores spans of measurement data, each representing a time window for a specific location.
- Columns:
  - `id`: Start index in the NumPy array
  - `location`: Location identifier
  - `datetime_start`: Start time (Unix timestamp)
  - `num_values`: Number of values in the span
- Connects the `.np` and `.parquet` files
- Includes metadata for locations, such as:
  - Name
  - Longitude and latitude
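
As a rough illustration of how a span row can index into the flat measurement array, here is a minimal sketch. The file name, the `float32` dtype, and the assumption that `datetime_start` is in seconds are all hypothetical and may differ from what the repository actually uses; only the column names come from the description above.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical file name and dtype; the repository's actual values may differ.
tmp_dir = Path(tempfile.mkdtemp())
values_path = tmp_dir / "measurements.np"

# Write a small flat array to disk, standing in for the real measurement file.
np.arange(10, dtype=np.float32).tofile(values_path)

# One span per row, using the columns described above.
spans = [
    {"id": 0, "location": "A", "datetime_start": 1_700_000_000, "num_values": 4},
    {"id": 4, "location": "B", "datetime_start": 1_700_000_000, "num_values": 6},
]

# Open the file as a read-only memmap: data is only read when it is sliced.
values = np.memmap(values_path, dtype=np.float32, mode="r")


def read_span(span):
    """Return (timestamps, readings) for one span of 5-minute values."""
    start, n = span["id"], span["num_values"]
    readings = np.asarray(values[start:start + n])
    # Assumes datetime_start is a Unix timestamp in seconds; 300 s = 5 minutes.
    timestamps = span["datetime_start"] + 300 * np.arange(n)
    return timestamps, readings


ts_b, power_b = read_span(spans[1])
```

Because each span stores only a start index and a length, looking up a location's window is a single slice of the memmap rather than a scan of the whole file.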
The dataset is also available in the Croissant format (MLCommons v1.1), a JSON-LD standard for making ML datasets discoverable and interoperable across platforms like Hugging Face, Kaggle, and OpenML. The Croissant release is self-contained: it does not require the memmap or the internal .json descriptor.
The release consists of two Parquet files described by croissant.json:
- `measurements.parquet`: one row per (location, span), with columns for `location_id`, `location_name`, `lat`, `lon`, `start_time`, `sample_interval_s`, `num_values`, and a `values` list column containing the power readings.
- `weather.parquet`: one row per (location, feature, span), with columns for `weather_location_id`, `lat`, `lon`, `feature_name`, `feature_unit`, `start_time`, `sample_interval_s`, `num_values`, and a `values` list column containing the weather readings.
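
Since each row packs a whole span into a `values` list, the per-reading timestamps have to be rebuilt from `start_time` and `sample_interval_s`. The sketch below shows one way this could be done for a single row. The helper `expand_row` and the example row are hypothetical, and it assumes `start_time` is stored as a timezone-aware timestamp, which the actual files may or may not do.

```python
from datetime import datetime, timedelta, timezone


def expand_row(row):
    """Turn one span row into a list of (timestamp, value) pairs."""
    values = list(row["values"])
    # Guard against rows where the declared length and the list disagree.
    if len(values) != row["num_values"]:
        raise ValueError("num_values does not match the length of values")
    step = timedelta(seconds=row["sample_interval_s"])
    return [(row["start_time"] + i * step, v) for i, v in enumerate(values)]


# Hypothetical row using the column names of measurements.parquet.
example_row = {
    "location_id": 1,
    "location_name": "example",
    "lat": 52.0,
    "lon": 5.9,
    "start_time": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "sample_interval_s": 300,
    "num_values": 3,
    "values": [0.8, 0.9, 1.1],
}

series = expand_row(example_row)
```

Rows read with pandas, for example via `pandas.read_parquet(...).to_dict("records")`, could likely be passed to the same helper, provided the timestamp assumption holds.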
To load the Croissant release into the project's internal NumpyData format, use the adapter in croissant_adapter.py:
```python
from pathlib import Path

from croissant_adapter import load_measurements_from_croissant, load_weather_from_croissant

measurements = load_measurements_from_croissant(Path("data/LianderPower"))
weather = load_weather_from_croissant(Path("data/LianderPower"))
```

The adapter rebuilds the flat memmap-backed layout that the rest of the pipeline expects (NumpyData, IntervalDataset, TimeseriesDataset), so downstream code runs unchanged.
You can integrate data from external providers as long as it follows the same structure used by the core dataset:
- Prepare a Parquet file containing at least `time` and `measurements` columns.
  This can be created with pandas. The measurement values should be provided in watts (W); the system automatically converts them to megawatts (MW) during loading.
- Add the new location to the dataset's `.json` descriptor, including a `filename` field pointing to the Parquet file you created.
You can follow the layout used in the benchmarking dataset for reference.
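
The two steps above might look roughly like the following sketch. Only the `time` and `measurements` columns and the `filename` field come from the description above. The helper names, the other descriptor keys (`name`, `longitude`, `latitude`), and the 5-minute sampling interval are assumptions made for illustration, so check them against the benchmarking dataset before relying on them.

```python
import json
from pathlib import Path

import pandas as pd


def build_measurement_frame(start, watts, freq="5min"):
    """Build a frame with the required `time` and `measurements` columns (in W)."""
    time = pd.date_range(start=start, periods=len(watts), freq=freq, tz="UTC")
    return pd.DataFrame({"time": time, "measurements": watts})


def descriptor_entry(name, longitude, latitude, filename):
    """Hypothetical descriptor entry; only `filename` is a documented field."""
    return {
        "name": name,
        "longitude": longitude,
        "latitude": latitude,
        "filename": filename,
    }


def write_location(frame, entry, out_dir):
    """Write the Parquet file and return the entry as JSON text."""
    out_dir = Path(out_dir)
    # to_parquet needs a Parquet engine such as pyarrow to be installed.
    frame.to_parquet(out_dir / entry["filename"], index=False)
    return json.dumps(entry, indent=2)


# Values are given in watts; the loader is described as converting them to MW.
frame = build_measurement_frame("2024-01-01", [1.2e6, 1.5e6, 0.9e6])
entry = descriptor_entry("example-substation", 5.9, 52.0, "example-substation.parquet")
```

Calling `write_location(frame, entry, "data/")` would then produce the Parquet file and a JSON snippet that could be merged into the descriptor by hand.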