# SQL Data Engineering Project

End-to-end SQL project for ingesting raw CSV data, transforming it into an OLTP model, loading a dimensional warehouse, and running analytics, quality checks, performance tests, and monitoring.

## Project Layout

```text
sql-data-engineering-project/
├── database/
├── data/
├── etl/
├── warehouse/
├── analytics/
├── data_quality/
├── performance/
├── monitoring/
├── tests/
└── docs/
```

## Tech Stack

- PostgreSQL 14+
- SQL (psql-compatible scripts)
- Optional Python tools (linting/automation)

## Quick Start

One command end-to-end (recommended):

```bash
chmod +x run_all.sh
./run_all.sh
```

Manual execution:

1. Create the database (example):

```bash
createdb sql_data_engineering
```

2. Set environment values in `.env`. The commands below expect `DATABASE_URL` to be set.

3. Initialize core schemas and tables:

```bash
psql "$DATABASE_URL" -f database/schema.sql
psql "$DATABASE_URL" -f database/tables.sql
psql "$DATABASE_URL" -f database/constraints.sql
psql "$DATABASE_URL" -f database/indexes.sql
```

4. Load source data and run ETL:

```bash
psql "$DATABASE_URL" -f etl/extract.sql
psql "$DATABASE_URL" -f etl/transform.sql
psql "$DATABASE_URL" -f etl/load.sql
```

5. Build the warehouse model (includes SCD Type 2 update + fact load):

```bash
psql "$DATABASE_URL" -f warehouse/star_schema.sql
```

6. Run quality checks, analytics, and tests:

```bash
psql "$DATABASE_URL" -f data_quality/validation_queries.sql
psql "$DATABASE_URL" -f analytics/revenue_analysis.sql
psql "$DATABASE_URL" -f tests/test_data_load.sql
psql "$DATABASE_URL" -f tests/test_scd_logic.sql
psql "$DATABASE_URL" -f tests/test_quality_checks.sql
```
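
The manual steps above can also be chained in a small loop. The following is a minimal sketch, not the contents of `run_all.sh`: the function name `run_pipeline` and the `PSQL` override variable are hypothetical, and the script list is copied from the steps above. It passes `-v ON_ERROR_STOP=1` because, by default, `psql -f` keeps going after a failed statement, which would let later scripts run against a half-built schema.

```bash
#!/usr/bin/env bash
set -eu

# Runs the scripts from the manual steps in order and stops at the first failure.
# PSQL is a hypothetical override (defaults to psql) so the loop can be dry-run.
run_pipeline() {
  for script in \
    database/schema.sql \
    database/tables.sql \
    database/constraints.sql \
    database/indexes.sql \
    etl/extract.sql \
    etl/transform.sql \
    etl/load.sql \
    warehouse/star_schema.sql \
    data_quality/validation_queries.sql \
    analytics/revenue_analysis.sql \
    tests/test_data_load.sql \
    tests/test_scd_logic.sql \
    tests/test_quality_checks.sql
  do
    echo "==> $script"
    # ON_ERROR_STOP makes psql exit non-zero on the first SQL error,
    # and set -e then aborts the loop.
    "${PSQL:-psql}" "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$script"
  done
}
```

Calling `run_pipeline` from the project root with `DATABASE_URL` exported should reproduce steps 3 to 6; database creation (step 1) is left out on purpose.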

## Pipeline Flow

1. **Extract** CSVs into raw staging tables.
2. **Transform**: standardize data types and deduplicate records.
3. **Load** cleaned data into OLTP tables with upserts.
4. **Warehouse**: load dimensions and facts, and apply SCD Type 2 to customers.
5. **Analyze** KPIs and run fraud, retention, and segmentation logic.
6. **Monitor** row counts and anomaly signals.
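
To make the upsert and SCD Type 2 steps concrete, here is a small self-contained sketch. It does not reproduce `etl/load.sql` or `warehouse/star_schema.sql`: every table and column name (`demo_stage`, `demo_customers`, `demo_dim_customer`, `city`, and so on) is hypothetical, and the function `scd2_demo_sql` only prints SQL that builds temporary tables and rolls everything back at the end.

```bash
# Prints a throwaway SQL demo of an upsert followed by an SCD Type 2 update.
# All table and column names are illustrative, not the project's schema.
scd2_demo_sql() {
  cat <<'SQL'
BEGIN;

CREATE TEMP TABLE demo_stage (customer_id int, city text);
CREATE TEMP TABLE demo_customers (customer_id int PRIMARY KEY, city text);
CREATE TEMP TABLE demo_dim_customer (
  customer_key serial PRIMARY KEY,
  customer_id  int NOT NULL,
  city         text,
  valid_from   timestamptz NOT NULL,
  valid_to     timestamptz,
  is_current   boolean NOT NULL
);

-- Existing state: customer 1 lives in Oslo.
INSERT INTO demo_customers VALUES (1, 'Oslo');
INSERT INTO demo_dim_customer (customer_id, city, valid_from, valid_to, is_current)
VALUES (1, 'Oslo', now() - interval '30 days', NULL, true);

-- New batch: customer 1 moved, customer 2 is new.
INSERT INTO demo_stage VALUES (1, 'Bergen'), (2, 'Tromso');

-- Load step: upsert the batch into the OLTP table.
INSERT INTO demo_customers (customer_id, city)
SELECT customer_id, city FROM demo_stage
ON CONFLICT (customer_id) DO UPDATE SET city = EXCLUDED.city;

-- SCD Type 2, part 1: close current rows whose tracked attribute changed.
UPDATE demo_dim_customer d
SET valid_to = now(), is_current = false
FROM demo_customers c
WHERE d.customer_id = c.customer_id
  AND d.is_current
  AND d.city IS DISTINCT FROM c.city;

-- SCD Type 2, part 2: open a current row for changed and brand-new customers.
INSERT INTO demo_dim_customer (customer_id, city, valid_from, valid_to, is_current)
SELECT c.customer_id, c.city, now(), NULL, true
FROM demo_customers c
LEFT JOIN demo_dim_customer d
  ON d.customer_id = c.customer_id AND d.is_current
WHERE d.customer_id IS NULL;

SELECT customer_id, city, is_current
FROM demo_dim_customer
ORDER BY customer_id, valid_from;

ROLLBACK;
SQL
}
```

One way to try it is `scd2_demo_sql | psql "$DATABASE_URL" -v ON_ERROR_STOP=1`. The final `SELECT` should show the closed Oslo row for customer 1 alongside current rows for Bergen and Tromso, which is the history-preserving behavior that distinguishes Type 2 from a plain overwrite.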

## Notes

- SQL is written for PostgreSQL.
- `etl/extract.sql` uses `\copy`, so run it with `psql` from the project root.
- Example data is included in `data/raw/`.
- Architecture and data model diagrams are generated from DOT sources:
  - `docs/architecture_diagram.dot`
  - `docs/data_model.dot`
  - Regenerate with `./docs/generate_diagrams.sh`
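
The project-root requirement follows from how `\copy` works: it reads files on the client side, so relative paths resolve against the directory `psql` was started in. Assuming `etl/extract.sql` refers to the CSVs under `data/raw/` by relative path, a guard like the one below can catch a wrong working directory early. The function names `in_project_root` and `run_extract` are hypothetical; only the two paths come from this README.

```bash
# Succeeds only when the current directory looks like the project root,
# judged by two paths mentioned in this README.
in_project_root() {
  [ -f etl/extract.sql ] && [ -d data/raw ]
}

# Runs the extract step, refusing to start from the wrong directory.
run_extract() {
  if ! in_project_root; then
    echo "run this from the project root (etl/extract.sql or data/raw/ not found)" >&2
    return 1
  fi
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f etl/extract.sql
}
```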