Skip to content

Sam-Ny/modular-data-modeling-dbt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ”§ Modular Data Modeling with dbt Core

dbt SQL Python Jinja Git MIT License

A focused dbt Core project demonstrating modular SQL transformation β€” from raw source data to analytics-ready models using Staging, Intermediate, and Analytics layers with SCD Type 2, custom tests, Jinja macros, and auto-generated documentation.

Architecture β€’ Features β€’ Project Structure β€’ Setup β€’ Key Decisions


πŸ“Œ Project Overview

This project treats data transformation like software engineering β€” modular, tested, documented, and version controlled. Rather than writing one giant SQL query, each transformation is broken into small, reusable models with clear responsibilities.

Built to demonstrate dbt Core proficiency across the full toolkit β€” snapshots, seeds, macros, tests, and docs β€” on a real dataset with meaningful business logic.

Metric Value
dbt Models Staging + Intermediate + Analytics layers
Snapshot Strategy SCD Type 2 (check strategy)
Tests Generic + Custom data quality tests
Macros Jinja-powered reusable SQL logic
Seeds Static CSV reference data
Documentation Auto-generated data dictionary + lineage DAG

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    SOURCE DATA                        β”‚
β”‚          Raw normalized tables / CSV seeds            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  STAGING LAYER                        β”‚
β”‚   Type casting Β· Column renaming Β· Basic cleaning     β”‚
β”‚   One model per source table Β· Views                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               INTERMEDIATE / CORE LAYER               β”‚
β”‚   Business logic joins Β· Deduplication                β”‚
β”‚   Wide table construction Β· Aggregations              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 ANALYTICS LAYER                       β”‚
β”‚   One Big Table (OBT) Β· Analytics ready              β”‚
β”‚   BI tool optimized Β· Tested + Documented            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β–Ό                 β–Ό
           SNAPSHOTS           SEEDS
        SCD Type 2          Static reference
        history tracking     data (CSV)

βœ… dbt Features Covered

Feature What was built
Staging Models Standardised types, renamed columns, basic cleaning per source
Intermediate Models Business logic joins, wide table construction
Analytics Models Final OBT ready for BI consumption
SCD Type 2 Snapshots Historical tracking of slowly changing dimensions using check strategy
Seeds Static business mapping CSVs loaded as reference tables
Generic Tests unique, not_null, accepted_values, relationships
Custom Tests Singular SQL tests for complex business rule validation
Jinja Macros Reusable SQL logic to eliminate repetition across models
dbt Docs Auto-generated searchable data dictionary and lineage DAG
Source Freshness Defined loaded_at_field for staleness detection

πŸ“ Project Structure

modular-data-modeling-dbt/
β”‚
β”œβ”€β”€ sameer_dbt_practice/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ staging/
β”‚   β”‚   β”‚   β”œβ”€β”€ sources.yml              # Source definitions
β”‚   β”‚   β”‚   β”œβ”€β”€ schema.yml               # Staging tests + docs
β”‚   β”‚   β”‚   └── stg_*.sql               # One model per source
β”‚   β”‚   β”œβ”€β”€ intermediate/
β”‚   β”‚   β”‚   └── int_*.sql               # Business logic joins
β”‚   β”‚   └── analytics/
β”‚   β”‚       β”œβ”€β”€ schema.yml               # Analytics tests + docs
β”‚   β”‚       └── obt_*.sql               # Final wide tables
β”‚   β”œβ”€β”€ snapshots/
β”‚   β”‚   └── *_snapshot.sql              # SCD Type 2 snapshots
β”‚   β”œβ”€β”€ seeds/
β”‚   β”‚   └── *.csv                       # Static reference data
β”‚   β”œβ”€β”€ tests/
β”‚   β”‚   └── *.sql                       # Custom singular tests
β”‚   β”œβ”€β”€ macros/
β”‚   β”‚   └── *.sql                       # Reusable Jinja macros
β”‚   β”œβ”€β”€ analyses/
β”‚   └── dbt_project.yml
β”‚
β”œβ”€β”€ .gitignore
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ pyproject.toml
└── README.md

βš™οΈ Key Engineering Decisions

1. Modular layered architecture over a single query Each layer has a single responsibility. Staging cleans, intermediate joins, analytics serves. This means any layer can be debugged, tested, or replaced independently without touching the others.

2. SCD Type 2 for historical tracking Used dbt's check strategy snapshot to track changes in slowly changing dimensions β€” capturing full history with dbt_valid_from and dbt_valid_to timestamps. Enables point-in-time analysis without any custom merge logic.

3. Seeds for static reference data Business mapping tables (categories, status codes, lookup values) are version controlled as CSV seeds β€” not hardcoded in SQL. Changes to reference data are tracked in Git like any other code change.

4. Jinja macros for DRY SQL Repetitive filtering logic extracted into macros. Instead of copying the same WHERE clause across 10 models, one macro call keeps everything consistent and maintainable.

5. Tests as data contracts Generic tests on every primary key and foreign key. Custom singular tests for complex business rules. If data breaks these contracts, dbt fails loudly rather than silently producing wrong numbers.


πŸš€ Setup Guide

Prerequisites

  • Python 3.11+
  • dbt Core + relevant adapter (e.g. dbt-snowflake, dbt-postgres)
  • A database connection (Snowflake, Postgres, etc.)

1. Clone the repo

git clone https://github.com/Sam-Ny/modular-data-modeling-dbt.git
cd modular-data-modeling-dbt

2. Create virtual environment

uv venv
source .venv/bin/activate      # Mac/Linux
.venv\Scripts\activate         # Windows

3. Install dependencies

uv pip install -r requirements.txt

4. Configure dbt profile

Create ~/.dbt/profiles.yml with your database connection. See dbt docs for format.

5. Run the project

cd sameer_dbt_practice

dbt debug          # verify connection
dbt seed           # load reference CSVs
dbt run            # run all models
dbt snapshot       # run SCD Type 2 snapshots
dbt test           # run all data quality tests
dbt docs generate  # generate documentation
dbt docs serve     # view docs at localhost:8080

6. Run everything at once

dbt build          # seed + run + test + snapshot in dependency order

πŸ“– How dbt Concepts Connect

sources.yml
└── {{ source('raw', 'orders') }}      ← reads from database source
        ↓
stg_orders.sql                         ← staging model cleans it
        ↓
{{ ref('stg_orders') }}                ← ref() builds the DAG
        ↓
int_orders_enriched.sql                ← intermediate joins it
        ↓
{{ ref('int_orders_enriched') }}
        ↓
obt_orders.sql                         ← analytics layer serves it
        ↓
schema.yml tests                       ← tests validate it
        ↓
dbt docs                               ← docs describe it

πŸ‘€ Author

Sameer Nayak Data Engineer | Azure Β· Snowflake Β· Databricks Β· dbt

LinkedIn GitHub


πŸ’‘ Also check out: Snowflake Supply Chain Analytics β€” a full end-to-end pipeline using Databricks + Snowflake + dbt on 1.5M rows of real e-commerce data.


⭐ If you found this useful, consider starring the repo!

About

A focused dbt Core project demonstrating proficiency in the dbt workflow: from raw source transformation to a final analytics-ready table. Features implementation of snapshots, tests, macros, and models..

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages