A worked example of building a local data warehouse on the GitHub Archive using dbt-duckdb — from raw JSON to a star schema with slowly changing dimensions, running entirely locally.
- From Raw JSON to Data Warehouse: Analyzing GitHub Data Locally
- Bringing the Data Warehouse to Life: Exploring GitHub Data Interactively with Rill
- Python 3.13 or higher
- uv — Python package manager
- Git
- wget or curl for downloading dumps
- ~3 GB free disk space per day of GitHub Archive data
Clone the repository and install dependencies:
git clone https://github.com/idesis-gmbh/githubexperiments.git
cd githubexperiments
uv sync

uv sync reads pyproject.toml and installs all dependencies into a local virtual environment automatically.
Download GitHub Archive data into the data/gharchive/ directory.
Using wget:
wget -P data/gharchive/ https://data.gharchive.org/2026-03-01-{0..23}.json.gz

Adjust the date and hour range to your needs, e.g. a full day or month:

wget -P data/gharchive/ https://data.gharchive.org/2026-03-{01..31}-{0..23}.json.gz

| Data | Compressed | DuckDB |
|---|---|---|
| 1 hour | ~50 MB | ~100 MB |
| 1 day | ~1 GB | ~2 GB |
| 1 week | ~7 GB | ~14 GB |
| 1 month | ~30 GB | ~60 GB |
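
The wget commands above rely on shell brace expansion, which is a feature of shells such as bash and zsh, and for months shorter than 31 days the `{01..31}` pattern would also request dates that do not exist. As an alternative, the following is a minimal sketch that builds the same URL list in Python. The function name `gharchive_urls` and the constant `BASE_URL` are hypothetical and not part of this repository; the URL pattern is taken from the commands above.

```python
from datetime import date, timedelta

# Base URL taken from the wget examples above.
BASE_URL = "https://data.gharchive.org"


def gharchive_urls(start: date, end: date) -> list[str]:
    """Return the hourly dump URLs for every day from start to end (inclusive).

    Hypothetical helper, not part of this repository.
    """
    urls = []
    day = start
    while day <= end:
        for hour in range(24):
            # Dates are zero-padded (2026-03-01); hours are not (0..23),
            # matching the {0..23} pattern in the wget commands.
            urls.append(f"{BASE_URL}/{day.isoformat()}-{hour}.json.gz")
        day += timedelta(days=1)
    return urls


if __name__ == "__main__":
    # Print one URL per line for a single day.
    for url in gharchive_urls(date(2026, 3, 1), date(2026, 3, 1)):
        print(url)
```

Saved as, say, `urls.py` (a hypothetical file name), the output could be piped into wget with `uv run urls.py | wget -P data/gharchive/ -i -`, where `-i -` tells wget to read the URL list from standard input.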
On first run, process the first file and generate dbt models using the canonical sample:
uv run main.py --canonical-schema

Then process all remaining files incrementally:

uv run main.py

Each file is processed through the full dbt pipeline (staging, snapshots, dimensions, facts, and marts) and acknowledged in the control schema on success. Re-running skips already processed files.
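
The skip-and-acknowledge behaviour described above can be pictured with a small sketch. This is not the code in etl.py: the function `run_incremental`, the in-memory `acknowledged` set (standing in for the control schema), and the `process` callback (standing in for the dbt run) are all hypothetical, and stopping at the first failure is an assumption made for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_incremental(
    files: Iterable[Path],
    acknowledged: set[str],
    process: Callable[[Path], None],
) -> list[str]:
    """Process files that are not yet acknowledged, in the order given.

    A file is added to `acknowledged` only after `process` succeeds,
    so a failed file is picked up again on the next run.
    """
    processed = []
    for path in files:
        if path.name in acknowledged:
            continue  # already loaded in an earlier run
        try:
            process(path)  # stand-in for the full dbt pipeline
        except Exception:
            # Assumption: stop at the first failure and leave the file
            # unacknowledged so that a re-run retries it.
            break
        acknowledged.add(path.name)
        processed.append(path.name)
    return processed
```

Calling the function twice with the same file list and the same `acknowledged` set returns an empty list the second time, which mirrors the statement that re-running skips already processed files.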
The generated SQL models and the canonical sample are checked in, so
--canonical-schema only needs to be rerun after a database reset or if the
GitHub Archive schema changes. The canonical sample is also included in every
staging run to ensure correct type inference; see analytics/README.md for details.
Advanced usage: If the canonical sample does not yet exist, it can be generated
from a real file using --infer-schema (infers schema directly) followed by
--canonical-sample (generates a canonical sample).
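
To make the two steps more concrete, here is a rough sketch of what inferring a schema from JSON records and deriving a single sample record that contains every field might look like. It is only an illustration under simplifying assumptions (field names contain no dots, lists are not descended into, mixed types fall back to strings) and is not the logic implemented in sd.py. The names `infer_schema`, `canonical_sample`, and `PLACEHOLDERS` are hypothetical.

```python
# Placeholder value per JSON type; hypothetical choices for illustration.
PLACEHOLDERS = {"str": "", "int": 0, "float": 0.0, "bool": False, "list": []}


def infer_schema(records):
    """Union the dotted field paths and value types seen across all records."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif value is not None:  # nulls carry no type information
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


def canonical_sample(schema):
    """Build one record that contains every field path found in the schema."""
    sample = {}
    for path, types in sorted(schema.items()):
        *parents, leaf = path.split(".")
        node = sample
        for key in parents:
            if not isinstance(node.get(key), dict):
                node[key] = {}  # create (or replace a scalar with) a nested object
            node = node[key]
        # Assumption: fields seen with more than one type fall back to a string.
        type_name = next(iter(types)) if len(types) == 1 else "str"
        node[leaf] = PLACEHOLDERS[type_name]
    return sample
```

Because a single archive file may not contain every event type, a sample record with all known fields gives a JSON reader that infers types from its input a complete picture of the columns in every run. That is the general idea behind including such a sample; the project's actual mechanism is described in analytics/README.md.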
See analytics/README.md for the data model, schema discovery details, and example
analyses.
githubexperiments/
├── main.py                        # regenerate dbt models from schema discovery
├── etl.py                         # incremental pipeline: process new archive files into the warehouse
├── sd.py                          # schema discovery and SQL code generation
├── pyproject.toml                 # project metadata and dependencies
├── uv.lock                        # locked dependencies
├── .gitignore
├── README.md
├── analytics/                     # dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml               # dbt connection configuration
│   ├── models/
│   │   ├── staging/               # raw JSON ingestion via read_json_auto
│   │   ├── dimensions/            # current state of each entity
│   │   ├── facts/                 # event fact table
│   │   └── marts/                 # aggregated models
│   ├── snapshots/                 # slowly changing dimension definitions
│   ├── analyses/                  # example queries
│   └── dev.duckdb                 # DuckDB database (generated, gitignored)
├── data/
│   └── gharchive/                 # canonical sample and downloaded .json.gz files
│       └── canonical_sample.json  # canonical sample for schema discovery
└── docs/
    ├── blog1/
    │   └── README.md              # Blog post 1 (English)
    └── blog2/
        ├── README.md              # Blog post 2 (English)
        ├── explore.gif
        └── pivot.gif
This project is licensed under the MIT License.