
Click the banner to view the full analysis report

# End-to-End Data Science Project (Linux + MySQL + Python)


## 📌 Overview

This project demonstrates a full-stack data science workflow run entirely from the Ubuntu Linux command line.

Instead of relying solely on notebooks, I built an automated pipeline that:

1. Ingests raw CSV data using Linux CLI tools (`awk`, `sed`).
2. Cleans and normalizes data using Python scripts.
3. Loads structured data into a MySQL database.
4. Analyzes business KPIs using complex SQL queries.
5. Visualizes results using `matplotlib`.
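The cleaning step (2) can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names (`Order Date`, `Sales`) and rules (normalize headers, drop duplicates, coerce bad values) are placeholder assumptions; the real logic lives in `scripts/02_etl_to_mysql.ipynb`.

```python
import pandas as pd

# Hypothetical cleaning sketch; real column names and rules may differ.
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Normalize headers: "Order Date" -> "order_date"
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.drop_duplicates()
    # Coerce unparseable dates/numbers to NaT/NaN, then drop those rows
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
    return df.dropna(subset=["order_date", "sales"])

raw = pd.DataFrame({
    "Order Date": ["2023-11-01", "2023-11-01", "bad"],
    "Sales": ["100.5", "100.5", "20"],
})
print(len(clean(raw)))  # 1 — one duplicate and one unparseable row removed
```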

## 📂 Project Structure

```
linux-data-science-project/
│
├── data/
│   ├── raw/                  # Original CSV file
│   └── processed/            # (Generated) Cleaned artifacts
│
├── scripts/
│   ├── 01_setup_env.sh       # Virtual env automation
│   └── 02_etl_to_mysql.ipynb # Python ETL (CSV -> MySQL)
│
├── notebooks/
│   └── 03_analysis.ipynb     # Generates charts from SQL data
│
├── sql/
│   ├── schema.sql            # Database creation scripts
│   └── queries.sql           # 10+ business analytical queries
│
├── output_plots/             # Generated visualizations
├── LINUX_COMMANDS.md         # Documentation of CLI data exploration
├── SQL_SCENARIOS.md          # Business questions & SQL results
├── Makefile                  # Build automation commands
└── README.md                 # Project documentation
```

## 🛠 Architecture & Skills Demonstrated

| Component | Tools Used | Skills Demonstrated |
|---|---|---|
| Data Exploration | Linux Terminal (`grep`, `awk`, `wc`) | CLI proficiency, stream processing |
| ETL Pipeline | Python (`pandas`, `sqlalchemy`) | Data cleaning, database connectors, automation |
| Database | MySQL | Schema design, relational modeling |
| Analytics | SQL | Window functions, aggregations, subqueries |
| Automation | GNU Make (`Makefile`) | Build automation, reproducible workflows |
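The window-function style of analytics listed above can be demonstrated in a self-contained way. In this sketch SQLite stands in for MySQL so it runs without a server, and the `orders` table and its columns are invented for the demo; the project's actual queries live in `sql/queries.sql`.

```python
import sqlite3

# Illustrative only: SQLite in place of MySQL, with a made-up orders table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (category TEXT, profit REAL);
    INSERT INTO orders VALUES
        ('Technology', 50), ('Technology', 70),
        ('Furniture', 20), ('Office Supplies', 30);
""")

# RANK() is a window function: it is evaluated after GROUP BY,
# so it can order the grouped totals without a self-join.
rows = con.execute("""
    SELECT category,
           SUM(profit) AS total_profit,
           RANK() OVER (ORDER BY SUM(profit) DESC) AS profit_rank
    FROM orders
    GROUP BY category
""").fetchall()

for category, total, rank in rows:
    print(rank, category, total)
```

The same query runs unchanged on MySQL 8.0+, which added window-function support.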

## 🚀 How to Run

### 1. Prerequisites

- Ubuntu/Linux OS (or WSL)
- MySQL Server installed and running
- Python 3.8+

### 2. Setup

1. Create a MySQL database named `superstore_db`:

   ```sql
   CREATE DATABASE superstore_db;
   ```

2. Update database credentials in `scripts/02_etl_to_mysql.ipynb`:

   ```python
   DB_USER = 'your_username'
   DB_PASS = 'your_password'
   ```

3. Run the pipeline. I have set up a Makefile to automate the entire process; simply run:

   ```bash
   # Sets up environment, cleans data, loads DB, and runs analysis
   make pipeline
   ```
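The load step of the pipeline (CSV → database via pandas + SQLAlchemy) can be smoke-tested without a MySQL server by pointing the engine at SQLite. Everything here is a sketch: the `orders` table name and sample data are invented, and the MySQL URL in the comment (driver `pymysql`, localhost) is an assumption about how the real notebook connects.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# SQLite stands in for MySQL so this runs anywhere; against a live server the
# URL would look like "mysql+pymysql://DB_USER:DB_PASS@localhost/superstore_db".
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"order_id": [1, 2], "sales": [100.0, 250.0]})
df.to_sql("orders", engine, if_exists="replace", index=False)  # CSV -> DB load

with engine.connect() as con:
    total = con.execute(text("SELECT SUM(sales) FROM orders")).scalar()
print(total)  # 350.0
```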

## 📊 Key Insights

Full analysis can be found in `SQL_SCENARIOS.md`.

- **Top Profit Center:** the Technology category yields the highest profit margin (17%).
- **Shipping Efficiency:** "Standard Class" shipping averages 5.0 days vs 0.04 days for "Same Day".
- **Seasonality:** sales volume consistently spikes by 30% in November/December.

## 📄 Documentation