This document describes the data architecture: storage stages, database schema, pipeline orchestration, and naming conventions.
The system ingests raw emissions and activity data from global and national sources, transforms and models it, and loads it into the GlobalAPI database for use in city-level inventories.
```mermaid
flowchart LR
    A[External Sources\ne.g. ClimateTRACE\nSEEG, EDGAR] --> B[Stage 1\nfiles S3\nRaw landing zone]
    B --> C[Stage 2\nraw_data S3 + DB\nCleaned & staged]
    C --> D[Stage 3\nmodelled DB\nStandardised]
    D --> E[Stage 4\nreporting DB\nMetabase-ready]
    E --> F[GlobalAPI\nCity inventories]
```
**Stage 1: `files` (S3).** The raw data landing zone. Source files are stored here exactly as received from the publisher, with no transformation.
- Any original file format (CSV, JSON, XLSX, Parquet, etc.)
- All pipelines read from here as their starting point (a listing sketch follows this list)
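For orientation, here is a minimal sketch of inspecting the landing zone with boto3. The bucket and prefix names are assumptions, not the real configuration.

```python
# Minimal sketch: list raw source files in the Stage 1 landing zone.
# The bucket and prefix names here are hypothetical.
import boto3

s3 = boto3.client("s3")

def list_raw_files(bucket="global-api-raw", prefix="files/climatetrace/"):
    """Yield keys of raw files exactly as received from the publisher."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

for key in list_raw_files():
    print(key)
```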
**Stage 2: `raw_data` (S3 + DB).** Cleaned and structured data, stored as Parquet in S3 and/or as staging tables in the `raw_data` PostgreSQL schema. Data is parsed, typed, and prepared for modelling but not yet mapped to the target schema.
- Staging tables are named `raw_data.<source>_<sector>_staging`
- This stage is where most Python transformation logic runs (see the sketch after this list)
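A minimal sketch of what a Stage 2 load looks like, assuming a CSV source; the bucket, file, and column names are hypothetical, and the real pipelines run equivalent logic inside Mage blocks (see the pipeline section below). Reading `s3://` paths with pandas requires `s3fs`.

```python
# Hedged sketch of a Stage 2 staging load: read a raw CSV from S3,
# enforce types, and append to a raw_data staging table.
# Bucket, file, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/ccglobal")

# Stage 1 file, read exactly as received (requires s3fs for s3:// paths)
df = pd.read_csv("s3://global-api-raw/files/edgar/emissions.csv")

# Parse and type columns; Stage 2 does not yet map to the target model
df["year"] = df["year"].astype(int)
df["emissions_quantity"] = pd.to_numeric(df["emissions_quantity"], errors="coerce")

df.to_sql(
    "edgar_emissions_staging",  # raw_data.<source>_<sector>_staging
    engine,
    schema="raw_data",
    if_exists="append",
    index=False,
)
```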
**Stage 3: `modelled` (DB).** Standardised, modelled data. This is the primary output of the pipeline system and the main input to the GlobalAPI. Data here is aggregated to city level and conforms to the target data model.
- Lives in the `modelled` schema of the `ccglobal` database
- Also mirrored to S3 for datasets not yet loaded to the DB
**Stage 4: `reporting` (DB).** Reporting-optimised tables for Metabase dashboards. These are lean and query-performant, not a full copy of the modelled data.
The `ccglobal` PostgreSQL database has two schemas: `raw_data` (staging) and `modelled` (final).
Canonical schema: [dbdiagram.io/d/GlobalData-6993ca23bd82f5fce2e31149](https://dbdiagram.io/d/GlobalData-6993ca23bd82f5fce2e31149). The core tables are summarised below; a hedged query sketch follows the table.
| Table | Purpose |
|---|---|
| `emissions` | Central table: one row per city / reference number / year / gas / datasource |
| `activity_subcategory` | Activity types and units associated with emission records |
| `emissions_factor` | Emission factors linking an activity unit to a GHG quantity |
| `ghgi_methodology` | Methodology reference records, linked to a reporting reference number |
| `publisher_datasource` | Registry of data publishers and their datasets |
| `formula_input` | Input parameters for formula-based calculations |
| `city_polygon` | City boundary geometries; `locode` is the primary city identifier used across tables |
| `global_warming_potential` | GWP reference values per gas and IPCC assessment report |
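To show how these tables fit together, here is a hedged query sketch that rolls emissions up to CO2e per city and year. The join and column names (`gas_name`, `emissions_value`, `gwp_value`) are assumptions based on the descriptions above; the dbdiagram link is authoritative.

```python
# Hedged sketch: join emissions to GWP values to compute CO2e per city/year.
# Column names are assumptions; locode as the city key comes from the
# city_polygon description above.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/ccglobal")

query = """
SELECT e.locode,
       e.year,
       SUM(e.emissions_value * gwp.gwp_value) AS co2e
FROM modelled.emissions AS e
JOIN modelled.global_warming_potential AS gwp
  ON gwp.gas_name = e.gas_name
GROUP BY e.locode, e.year;
"""

co2e_by_city = pd.read_sql(query, engine)
```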
Intermediate staging tables are created per dataset during ingestion, named `raw_data.<source>_<sector>_staging`. These exist only to support SQL transformations into `modelled` and are not part of the final data model.
Pipelines run in Mage.ai (`cc-mage/`), orchestrated locally via Docker on port 6789. A typical ingestion pipeline has the block structure below; a sketch of the first block follows the tree.
```
data_loader (Python)                  reads raw file from S3
└── data_exporter (Python)            writes to raw_data staging table
    └── transformer (SQL)             staging transform + sector mapping
        ├── data_exporter (SQL) → modelled.ghgi_methodology
        ├── data_exporter (SQL) → modelled.activity_subcategory
        ├── data_exporter (SQL) → modelled.emissions_factor
        └── data_exporter (SQL) → modelled.emissions
```
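As a sketch of the initial block, following Mage's standard decorator scaffold; the variable names and S3 path are hypothetical.

```python
# Sketch of the initial data_loader block, using Mage's standard
# decorator scaffold. Variable names and the S3 path are hypothetical.
import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_raw_file(*args, **kwargs):
    # Pipeline variables from metadata.yaml are surfaced through kwargs
    bucket = kwargs.get('bucket_name')
    year = kwargs.get('year')
    return pd.read_csv(f"s3://{bucket}/files/climatetrace/onroad_{year}.csv")
```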
All SQL blocks use PostgreSQL with `use_raw_sql: true` and `export_write_policy: append`. Dataset-specific parameters (S3 bucket, year, sector) are defined as variables in the pipeline's `metadata.yaml`.
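For example, the variables block of a pipeline's `metadata.yaml` might look like this (names and values are hypothetical):

```yaml
# Hypothetical variables in a pipeline's metadata.yaml
variables:
  bucket_name: global-api-raw
  year: 2025
  sector: onroad
```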
| Prefix | Meaning |
|---|---|
| `ghgi_` | Greenhouse Gas Inventory |
| `ccra_` | Climate Risk Assessment |
| `dq_` | Data quality / validation |
| `actions_` | City Action Plan |
| `_v<year>` suffix | Tied to a specific data release version |
Staging tables follow the pattern `raw_data.<source>_<sector>_staging`. Examples: `raw_data.ct_onroad_v2025_staging`, `raw_data.edgar_emissions_staging`.
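A quick, hedged way to sanity-check new staging table names against this convention:

```python
# Hedged sketch: validate staging table names against the
# raw_data.<source>_<sector>_staging convention.
import re

STAGING_NAME = re.compile(r"^raw_data\.[a-z0-9]+_[a-z0-9_]+_staging$")

for name in ("raw_data.ct_onroad_v2025_staging", "raw_data.edgar_emissions_staging"):
    assert STAGING_NAME.match(name), f"non-conforming table name: {name}"
```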