GitHubExperiments

A worked example of building a local data warehouse on the GitHub Archive using dbt-duckdb — from raw JSON to a star schema with slowly changing dimensions, running entirely locally.

Blog Posts

Getting Started

Prerequisites

Python 3.13 or higher
uv — Python package manager
Git
wget or curl for downloading dumps
~3 GB free disk space per day of GitHub Archive data

Installation

Clone the repository and install dependencies:

git clone https://github.com/idesis-gmbh/githubexperiments.git
cd githubexperiments
uv sync

uv sync reads pyproject.toml and installs all dependencies into a local virtual environment automatically.

Data Download

Download GitHub Archive data into the data/gharchive/ directory.

Using wget:

wget -P data/gharchive/ https://data.gharchive.org/2026-03-01-{0..23}.json.gz

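Adjust the date and hour ranges as needed. The {0..23} range is expanded by the shell (bash or zsh), not by wget. The command above fetches one full day; the following fetches a full month: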

wget -P data/gharchive/ https://data.gharchive.org/2026-03-{01..31}-{0..23}.json.gz
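Brace expansion is a shell feature, so the commands above depend on bash or zsh. As a shell-independent alternative, the same URL list can be generated in Python. This is a minimal sketch, not part of the repository: the helper name `gharchive_urls` and its parameters are hypothetical, and only the URL pattern is taken from the wget examples.

```python
from datetime import date, timedelta

BASE_URL = "https://data.gharchive.org"  # same host as in the wget examples


def gharchive_urls(start: date, days: int = 1, hours: range = range(24)) -> list[str]:
    """Return one URL per (day, hour), e.g. .../2026-03-01-0.json.gz."""
    urls = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        for hour in hours:
            # The date part is zero-padded (ISO format); the hour is not.
            urls.append(f"{BASE_URL}/{day.isoformat()}-{hour}.json.gz")
    return urls


if __name__ == "__main__":
    # Print the first three hours of two consecutive days.
    for url in gharchive_urls(date(2026, 3, 1), days=2, hours=range(0, 3)):
        print(url)
```

The printed list can be saved to a file and passed to wget with its -i option, together with -P data/gharchive/ as above.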

Time span	Compressed download	DuckDB database
1 hour	~50 MB	~100 MB
1 day	~1 GB	~2 GB
1 week	~7 GB	~14 GB
1 month	~30 GB	~60 GB
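The figures in the table scale roughly linearly with the number of days, so the space needed for a given range can be estimated before downloading. The sketch below assumes the approximate per-day values from the table (about 1 GB compressed, about 2 GB in DuckDB); the function name `estimate_disk_gb` is hypothetical and actual sizes vary with GitHub activity.

```python
import shutil

# Rough per-day figures taken from the table above, in GB.
COMPRESSED_GB_PER_DAY = 1.0
DUCKDB_GB_PER_DAY = 2.0


def estimate_disk_gb(days: int) -> dict[str, float]:
    """Linearly extrapolate the table's per-day sizes to a number of days."""
    compressed = days * COMPRESSED_GB_PER_DAY
    duckdb = days * DUCKDB_GB_PER_DAY
    return {"compressed": compressed, "duckdb": duckdb, "total": compressed + duckdb}


if __name__ == "__main__":
    need = estimate_disk_gb(days=7)
    # Free space on the filesystem that holds the current directory.
    free_gb = shutil.disk_usage(".").free / 1e9
    print(f"estimated need: ~{need['total']:.0f} GB, free: {free_gb:.0f} GB")
```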

Running the Pipeline

On first run, process the first file and generate dbt models using the canonical sample:

uv run main.py --canonical-schema

Then process all remaining files incrementally:

uv run main.py

Each file is processed through the full dbt pipeline — staging, snapshots, dimensions, facts, and marts — and acknowledged in the control schema on success. Re-running skips already processed files.
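The acknowledge-on-success behaviour can be illustrated with a small control table. This is only a sketch of the general pattern, not the code in etl.py: the project keeps its control schema in DuckDB, whereas this example uses sqlite3 from the standard library to stay self-contained, and the table name `processed_files`, the function `process_new_files`, and the `run_pipeline` callback are all hypothetical.

```python
import sqlite3
from pathlib import Path
from typing import Callable, Iterable


def process_new_files(
    con: sqlite3.Connection,
    files: Iterable[Path],
    run_pipeline: Callable[[Path], None],
) -> list[str]:
    """Run the pipeline for each unacknowledged file; return the names processed."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS processed_files (file_name TEXT PRIMARY KEY)"
    )
    processed = []
    # sorted() only makes the order deterministic; it is lexicographic.
    for path in sorted(files):
        seen = con.execute(
            "SELECT 1 FROM processed_files WHERE file_name = ?", (path.name,)
        ).fetchone()
        if seen:
            continue  # already acknowledged, so a re-run skips it
        run_pipeline(path)  # raises on failure, so no acknowledgement is written
        con.execute("INSERT INTO processed_files VALUES (?)", (path.name,))
        con.commit()
        processed.append(path.name)
    return processed
```

Because the acknowledgement is written only after `run_pipeline` returns, a file whose run fails is picked up again on the next invocation.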

The generated SQL models and canonical sample are checked in — --canonical-schema only needs to be rerun after a database reset or if the GitHub Archive schema changes. The canonical sample is also included in every staging run to ensure correct type inference — see analytics/README.md for details.
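To see why a canonical sample helps type inference, consider a file in which some field is null in every event: nothing in that file reveals the field's type. The toy example below illustrates the idea in plain Python. It is not DuckDB's inference algorithm, and the function `infer_types` and the record contents are made up for illustration.

```python
def infer_types(records: list[dict]) -> dict[str, str]:
    """Map each top-level field to the type name of its first non-null value."""
    types: dict[str, str] = {}
    for record in records:
        for key, value in record.items():
            if value is None:
                types.setdefault(key, "unknown")
            elif types.get(key, "unknown") == "unknown":
                types[key] = type(value).__name__
    return types


# A batch in which "org" happens to be null in every event.
batch = [{"id": "1", "org": None}, {"id": "2", "org": None}]
# Hypothetical canonical record with every field populated.
canonical = {"id": "0", "org": {"id": 0, "login": ""}}

print(infer_types(batch))                # "org" cannot be typed from this batch
print(infer_types(batch + [canonical]))  # "org" is now recognised as a nested object
```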

Advanced usage: if the canonical sample does not exist yet, generate it from a real archive file by running --infer-schema (infers the schema directly from that file) followed by --canonical-sample (generates the canonical sample).
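One plausible way to build such a sample is to merge many real records so that every field ends up with a non-null value wherever one was observed. The sketch below shows that idea only; it is an assumption about the general approach, not a description of what sd.py does, and `merge_sample` is a hypothetical name.

```python
from functools import reduce


def merge_sample(sample: dict, record: dict) -> dict:
    """Return a copy of sample with missing or null fields filled from record."""
    merged = dict(sample)
    for key, value in record.items():
        if value is None:
            merged.setdefault(key, None)  # remember the field, type still unknown
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_sample(merged[key], value)  # merge nested objects
        elif merged.get(key) is None:
            merged[key] = value  # first non-null value wins
    return merged


records = [
    {"id": "1", "org": None},
    {"id": "2", "org": {"login": "x"}},
    {"org": {"id": 5}},
]
sample = reduce(merge_sample, records, {})
print(sample)
```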

Exploring the Data

See analytics/README.md for the data model, schema discovery details, and example analyses.

Project Structure

githubexperiments/
├── main.py          # regenerate dbt models from schema discovery
├── etl.py           # incremental pipeline: process new archive files into the warehouse
├── sd.py            # schema discovery and SQL code generation
├── pyproject.toml   # project metadata and dependencies
├── uv.lock          # locked dependencies
├── .gitignore
├── README.md
├── analytics/       # dbt project
│   ├── dbt_project.yml
│   ├── profiles.yml # dbt connection configuration
│   ├── models/
│   │   ├── staging/     # raw JSON ingestion via read_json_auto
│   │   ├── dimensions/  # current state of each entity
│   │   ├── facts/       # event fact table
│   │   └── marts/       # aggregated models
│   ├── snapshots/   # slowly changing dimension definitions
│   ├── analyses/    # example queries
│   └── dev.duckdb   # DuckDB database (generated, gitignored)
├── data/
│   └── gharchive/   # canonical sample and downloaded .json.gz files
│       └── canonical_sample.json  # canonical sample for schema discovery
└── docs/
    ├── blog1/
    │   └── README.md    # Blog post 1 (English)
    └── blog2/
        ├── README.md    # Blog post 2 (English)
        ├── explore.gif
        └── pivot.gif

License

This project is licensed under the MIT License.
