
End-to-End Data Science Project (Linux + MySQL + Python)

📌 Overview

This project demonstrates a full-stack data science workflow entirely on the Ubuntu Linux command line.

Instead of relying solely on notebooks, I built an automated pipeline that:

  1. Ingests raw CSV data using Linux CLI tools (awk, sed).
  2. Cleans and normalizes data using Python scripts.
  3. Loads structured data into a MySQL database.
  4. Analyzes business KPIs using complex SQL queries.
  5. Visualizes results using matplotlib.
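The five steps above can be sketched end to end in miniature. This is an illustration, not the project's actual scripts: it uses Python's built-in csv and sqlite3 modules (sqlite3 standing in for MySQL), the sample rows are made up, and the column names are assumptions.

```python
import csv
import io
import sqlite3

# Inline sample standing in for data/raw/*.csv (columns are assumed).
raw = """category,sales,profit
Technology, 100.0 ,17.0
Furniture,80.0,
Technology,50.0,8.5
"""

# Steps 1-2: ingest and clean -- strip stray whitespace, drop rows
# with a missing profit value.
rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    rec = {k: v.strip() for k, v in rec.items()}
    if rec["profit"]:
        rows.append((rec["category"], float(rec["sales"]), float(rec["profit"])))

# Step 3: load into a relational database (sqlite3 here; the project targets MySQL).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (category TEXT, sales REAL, profit REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Step 4: analyze a KPI -- profit margin per category.
for cat, margin in con.execute(
    "SELECT category, SUM(profit) / SUM(sales) FROM orders GROUP BY category"
):
    print(cat, round(margin, 3))  # prints: Technology 0.17
```

Step 5 (plotting with matplotlib) is omitted here; the same query results feed the charts in notebooks/03_analysis.ipynb.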

📂 Project Structure

linux-data-science-project/
│
├── data/
│   ├── raw/                     # Original CSV file
│   └── processed/               # (Generated) Cleaned artifacts
│
├── scripts/
│   ├── 01_setup_env.sh          # Virtual env automation
│   └── 02_etl_to_mysql.ipynb    # Python ETL (CSV -> MySQL)
│
├── notebooks/
│   └── 03_analysis.ipynb        # Generates charts from SQL data
│
├── sql/
│   ├── schema.sql               # Database creation scripts
│   └── queries.sql              # 10+ business analytical queries
│
├── output_plots/                # Generated visualizations
├── LINUX_COMMANDS.md            # Documentation of CLI data exploration
├── SQL_SCENARIOS.md             # Business questions & SQL results
├── Makefile                     # Build automation commands
└── README.md                    # Project documentation

🛠 Architecture & Skills Demonstrated

Component        | Tools Used                     | Skills Demonstrated
-----------------|--------------------------------|-----------------------------------------------
Data Exploration | Linux Terminal (grep, awk, wc) | CLI proficiency, stream processing
ETL Pipeline     | Python (pandas, sqlalchemy)    | Data cleaning, database connectors, automation
Database         | MySQL                          | Schema design, relational modeling
Analytics        | SQL                            | Window functions, aggregations, subqueries
Automation       | GNU Make (Makefile)            | Build automation, reproducible workflows
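As an illustration of the "window functions" entry above, here is the kind of analytical query the project runs. It is a sketch, not one of the project's actual queries: it runs on sqlite3 (which supports window functions in SQLite 3.25+) with made-up rows, and the table and column names are assumptions; the project runs similar SQL against MySQL.

```python
import sqlite3

# Made-up orders table for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, category TEXT, sales REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("West", "Technology", 300.0), ("West", "Furniture", 120.0),
     ("East", "Technology", 90.0), ("East", "Furniture", 210.0)],
)

# Window function: rank categories by total sales within each region.
query = """
SELECT region, category,
       SUM(sales) AS total_sales,
       RANK() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) AS rnk
FROM orders
GROUP BY region, category
ORDER BY region, rnk
"""
for row in con.execute(query):
    print(row)
```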

🚀 How to Run

  1. Prerequisites
    • Ubuntu/Linux OS (or WSL)

    • MySQL Server installed and running

    • Python 3.8+

  2. Setup
    1. Create a MySQL database named superstore_db:

       CREATE DATABASE superstore_db;

    2. Update the database credentials in scripts/02_etl_to_mysql.ipynb:

       DB_USER = 'your_username'
       DB_PASS = 'your_password'

  3. Run the pipeline
    A Makefile automates the entire process. Simply run:

       # Sets up the environment, cleans the data, loads the DB, and runs the analysis
       make pipeline
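The `make pipeline` entry point suggests a Makefile of roughly this shape. This is a hypothetical sketch, not the repository's actual Makefile: only the `pipeline` target name comes from this README, and the sub-target names are illustrative assumptions based on the files in the project tree.

```makefile
# Hypothetical sketch; sub-target names are assumptions.
.PHONY: pipeline setup etl analyze

pipeline: setup etl analyze

setup:
	bash scripts/01_setup_env.sh

etl:
	jupyter nbconvert --to notebook --execute scripts/02_etl_to_mysql.ipynb

analyze:
	jupyter nbconvert --to notebook --execute notebooks/03_analysis.ipynb
```

Chaining the sub-targets as prerequisites of `pipeline` is what makes the single `make pipeline` command reproduce the whole workflow.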

📊 Key Insights

Full analysis can be found in SQL_SCENARIOS.md.

  • Top Profit Center: Technology category yields the highest profit margins (17%).

  • Shipping Efficiency: "Standard Class" shipping averages 5.0 days vs 0.04 days for "Same Day".

  • Seasonality: Sales volume consistently spikes by 30% in November/December.

📄 Documentation

  • LINUX_COMMANDS.md: documentation of CLI data exploration

  • SQL_SCENARIOS.md: business questions and SQL results
