An end-to-end data engineering pipeline for meteorological data, starting with ~700 automatic weather stations from Brazil's national meteorological network (INMET), spanning from 2000 to the present day.
The project is designed to grow over time. The goal is to gradually expand coverage to other countries and data sources, building towards a broader multi-national meteorological platform.
Work in progress. Architecture and implementation are being developed iteratively and documented as the project evolves.
Weather data is messy, large, and spatially structured. INMET's historical archive spans over two decades of hourly readings across all Brazilian states. The dataset is large enough to make distributed processing a natural fit, and rich enough to support meaningful spatiotemporal feature engineering and forecasting.
Data is sourced from INMET (Instituto Nacional de Meteorologia), Brazil's national meteorological institute. The network currently comprises ~700 automatic weather stations with hourly granularity. Coverage was significantly smaller in 2000 and has grown steadily over the years.
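For a rough sense of scale, the figures above (~700 stations, hourly readings, 2000 onward) give an upper bound on the number of station-hour records. The sketch below is only a back-of-the-envelope estimate: the function name `max_hourly_rows` and the 2025 cut-off date are illustrative assumptions, and the real row count will be lower because the network was much smaller in its early years.

```python
from datetime import datetime, timezone


def max_hourly_rows(stations: int, start: datetime, end: datetime) -> int:
    """Upper bound on hourly readings if every station reported every hour in [start, end)."""
    if end < start:
        raise ValueError("end must not precede start")
    # Whole hours elapsed between the two timestamps.
    hours = int((end - start).total_seconds() // 3600)
    return stations * hours


# Assumed window: 2000-01-01 up to an illustrative cut-off of 2025-01-01 (UTC).
start = datetime(2000, 1, 1, tzinfo=timezone.utc)
end = datetime(2025, 1, 1, tzinfo=timezone.utc)

upper_bound = max_hourly_rows(700, start, end)
print(f"At most {upper_bound:,} station-hour records")
```

Under these assumptions the bound is on the order of 150 million rows, before multiplying by the number of measured variables per reading.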
Raw data files are not included in this repository. To download them, run:

    python ingest/historical_data.py

| Layer | Tools |
|---|---|
| Language | Python 3.14+ |
| Package manager | uv |
| Processing | PySpark |
| Runtime | Java 21 |
| IDE | PyCharm |

| Phase | Status |
|---|---|
| Exploratory data analysis | 🔄 In progress |
| Ingestion layer | 🔜 Planned |
| Processing layer | 🔜 Planned |
| Storage layer | 🔜 Planned |
| Model training | 🔜 Planned |
| Serving layer | 🔜 Planned |
| Spatial interpolation (Phase 2) | 🔜 Planned |
| Multi-country expansion (Phase 3) | 🔜 Planned |

    git clone https://github.com/DanielTrivelli/millibar.git
    cd millibar
    uv sync

This project is licensed under the MIT License. See the LICENSE file for details.