We enforce the following requirements in all projects.
- Open-source code: projects must use either the Apache 2.0 or MIT license, both of which are OSI-approved, and the code must be openly available on platforms such as GitHub.
- Open data: data produced by projects (dataset tools and data modules in particular) should use a CC-BY-4.0 license whenever possible.
- Versioning: projects must be version controlled with official releases, which can be used to specify the version of the project used in a study and/or dataset, and an accompanying CHANGELOG. Project developers are free to choose their preferred approach (e.g., SemVer or CalVer).
- Testing: projects must employ some type of testing to ensure quality and long-term stability. The approach will vary depending on the type of project:
    - Software tools: these require a testing framework (e.g., pytest). The chosen approach (e.g., unit testing, functional testing) is up to the developer, but some kind of testing must be present.
    - Dataset tools: the method used to produce the data must include some kind of quality assurance or evaluation. The choice of method is up to the maintainers on a case-by-case basis.
    - Data modules: a minimal set of tests, provided by our template, fulfilling two goals:
        - Verifying that module inputs/outputs are placed in the right locations.
        - Serving as a small example of the module's operation, which will be used to build standardised module documentation. We recommend delegating more complex testing to the software tools and datasets used by the module.
    - Model builders: we refrain from requiring a specific testing method for these repositories. Nevertheless, we recommend at least using lightweight Snakemake unit tests to verify that the steps of the workflow work as intended.
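For software tools, a test suite can start very small. The sketch below shows the shape of a minimal pytest module; the function under test is hypothetical and only stands in for a tool's real API:

```python
# test_capacity.py -- a minimal pytest example with a hypothetical function.
# Run with: pytest test_capacity.py

def installable_capacity_mw(area_km2: float, mw_per_km2: float) -> float:
    """Toy function standing in for a tool's real API."""
    if area_km2 < 0 or mw_per_km2 < 0:
        raise ValueError("Inputs must be non-negative.")
    return area_km2 * mw_per_km2

def test_installable_capacity():
    # A plain assertion on known inputs and outputs.
    assert installable_capacity_mw(2.0, 80.0) == 160.0

def test_negative_area_rejected():
    # Error paths are worth a test too.
    import pytest
    with pytest.raises(ValueError):
        installable_capacity_mw(-1.0, 80.0)
```

Even a handful of such tests makes regressions visible as the tool evolves.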
- Documentation: projects must provide documentation to ensure the methods and reasoning behind their code can be understood:
    - Software tools: versioned documentation website with, at minimum, API documentation and useful examples.
    - Dataset tools: either versioned documentation or a README file explaining the methodology and assumptions employed, allowing others to reuse the tool in the future to reproduce or update the dataset when necessary.
    - Data modules: projects must have versioned and standardised documentation with the following in mind:
        - A README file explaining the different steps of the workflow, citing relevant material, and detailing their methodology.
        - Following our templating, which requires explanations of key components (configuration options, input/output files, wildcards) and enables standardised documentation generation.
    - Model builders: no specific requirements, but documentation is nevertheless recommended.
The following is a list of general advice on how to format files so that tools can interact seamlessly. We encourage developers to validate user-provided files, either through built-in Snakemake functionality or through libraries like `pydantic` and `pandera`.
- Configuration data: we prefer to use YAML (.yaml) files.
    - These files must always be validated to detect invalid user settings.
    - Unit-specific configuration settings must state the unit explicitly in their naming.
???+ example "Naming configuration variables clearly"

    ```yaml
    maximum_installable_mw_per_km2: # unit for the section
      pv_tilted: 160
      pv_flat: 80
    maximum_roof_ratio: 0.80 # unit for a value
    ```
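As a sketch of the validation we encourage, the settings above could be checked with `pydantic` (one of the libraries we suggest; the model names here are our own):

```python
from pydantic import BaseModel, ValidationError

class InstallableCapacity(BaseModel):
    pv_tilted: float
    pv_flat: float

class Config(BaseModel):
    maximum_installable_mw_per_km2: InstallableCapacity
    maximum_roof_ratio: float

# e.g., the result of yaml.safe_load() on the file above
user_settings = {
    "maximum_installable_mw_per_km2": {"pv_tilted": 160, "pv_flat": 80},
    "maximum_roof_ratio": 0.80,
}

try:
    config = Config(**user_settings)
except ValidationError as err:
    raise SystemExit(f"Invalid configuration: {err}")
```

Misspelled keys, missing sections, or non-numeric values then fail loudly at load time instead of deep inside a workflow.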
- Tabular data: we prefer Apache Parquet (.parquet) files due to their performance, storage efficiency and metadata compatibility, with the following requirements:
    - Follow tidy data principles (columns are variables, rows are observations) to make data machine-readable.
???+ example "Example of a tidy table"

    | year | country_id | shape_id | demand_mwh |
    |------|------------|----------|------------|
    | 2020 | ITA        | North    | 4500       |
    | 2020 | ITA        | East     | 4800       |
    | 2020 | ITA        | South    | 3000       |
    - Embed important metadata within the file, with at minimum the following values:
        - `units`: a dictionary specifying per-column units, using `no_unit` for unitless cases. Used to simplify parsing and enable easier integration of unit-checking tools like `pint`.
        - `source`: a string specifying the source and/or author of a dataset.
        - `license`: a string specifying the license of the dataset.
??? example "Embedding metadata in `pandas`"

    `pandas` will automatically convert data in `df.attrs` into [file-level metadata](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas.DataFrame.attrs) when saving to `.parquet`:

    ```python
    dataframe.attrs["units"] = {
        "year": "yr",
        "country_id": "no_unit",
        "shape_id": "no_unit",
        "demand": "mwh",
    }
    dataframe.attrs["source"] = "github.com/modelblocks-org/docs"
    dataframe.attrs["license"] = "CC-BY-4.0"
    dataframe.to_parquet("my_data.parquet")
    ```
- Raster data: we prefer to use GeoTIFF (.tiff) files.
- Polygon data: we prefer GeoParquet (.parquet) files.
- Gridded data: we prefer to use netCDF (.nc) files.
- For all data: use snake case (`foo_bar`) for headers, keys, indexes, variables, etc. Avoid hyphens (`foo-bar`) and camel case (`FooBar`).
- For timeseries data: timestamps must follow the ISO 8601 UTC format (e.g., 2024-08-01T15:00:00Z).
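Both conventions are easy to enforce mechanically. A stdlib-only sketch (the helper names are our own):

```python
import re
from datetime import datetime, timezone

def to_snake_case(name: str) -> str:
    """Normalise camel case and hyphens to snake case, e.g. FooBar -> foo_bar."""
    name = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)
    return name.replace("-", "_").lower()

def parse_utc_timestamp(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp like 2024-08-01T15:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
```

Helpers like these can run over every header and timestamp column as part of a module's tests.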
- For national / subnational data:
    - Country IDs should always be under the `country_id` naming and follow ISO 3166-1 alpha-3 (e.g., CHE, CHN, GBR, MEX, etc.).
    - Sub-regions must be under the `shape_id` naming. This applies even in cases where national resolution is requested (i.e., `country_id` and `shape_id` should match).
    - A `shape_spec` key or header should be present, specifying the version of the subregion standard used (e.g., NUTS2024, GADM4.1, ISO 3166-2:2013). This aids in replicability since subregion codes change quite often.
???+ example "Example of tabular subnational data"

    | country_id | shape_id | shape_spec | demand_mwh |
    |------------|----------|------------|------------|
    | DEU        | DE13     | NUTS2024   | 4500       |
    | DEU        | DE14     | NUTS2024   | 4800       |
    | ITA        | ITA0     | GADM4.1    | 20000      |
- For spatial data:
    - Use `longitude`|`latitude` to express position and avoid ambiguous values like `x`|`y`.
    - Make sure to save the CRS with the spatial data. With the recommended file types for GIS data (see above), this is guaranteed.
    - For geodetic data, where preserving position matters, use a geodetic CRS (e.g., EPSG:4326).
    - For projected data, where preserving distance or area matters, allow users to specify the reference system that best fits the needs of the calculation (e.g., EPSG:3035 for Europe).
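For instance, reprojecting geodetic coordinates to a user-selected projected CRS can be sketched with `pyproj` (the library underpinning most Python GIS tools; the coordinates are illustrative):

```python
from pyproj import Transformer

# longitude|latitude in EPSG:4326 (geodetic), e.g. a point in northern Italy
longitude, latitude = 9.19, 45.46

# Let users pick the projected CRS that fits their calculation
# (here: EPSG:3035, an equal-area CRS for Europe).
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3035", always_xy=True)
easting, northing = transformer.transform(longitude, latitude)
```

Note `always_xy=True`, which keeps the `longitude`|`latitude` (x, y) axis order recommended above regardless of how the CRS defines its axes.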
- For currency data: currency codes must follow ISO 4217 alpha-3 codes in combination with the year of the currency (e.g., CHF2024, EUR2015, USD2020) to allow for inflation adjustments.
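A convention like this is straightforward to check. A stdlib-only sketch (the helper name is our own; it checks the CODE+year shape only, not membership in the ISO 4217 registry):

```python
import re

# ISO 4217 alpha-3 code followed by a four-digit year, e.g. CHF2024
CURRENCY_CODE = re.compile(r"^[A-Z]{3}\d{4}$")

def is_valid_currency_code(code: str) -> bool:
    """True if the code matches the CODE+year convention, e.g. EUR2015."""
    return bool(CURRENCY_CODE.fullmatch(code))
```

This catches common slips such as lowercase codes (`chf2024`) or non-alpha-3 names (`EURO2015`).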