
Click the banner to view the full analysis report

# End-to-End Data Science Project (Linux + MySQL + Python)


## 📌 Overview

This project demonstrates a full-stack data science workflow run entirely from the Ubuntu Linux command line.

Instead of relying solely on notebooks, I built an automated pipeline that:

1. Ingests raw CSV data using Linux CLI tools (`awk`, `sed`).
2. Cleans and normalizes data using Python scripts.
3. Loads structured data into a MySQL database.
4. Analyzes business KPIs using complex SQL queries.
5. Visualizes results using `matplotlib`.
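The cleaning step (2) can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names (`Order Date`, `Sales`) and rules (normalize headers, drop duplicates, coerce bad values) are placeholder assumptions; the real logic lives in `scripts/02_etl_to_mysql.ipynb`.

```python
import pandas as pd

# Hypothetical cleaning sketch; real column names and rules may differ.
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Normalize headers: "Order Date" -> "order_date"
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.drop_duplicates()
    # Coerce unparseable dates/numbers to NaT/NaN, then drop those rows
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
    return df.dropna(subset=["order_date", "sales"])

raw = pd.DataFrame({
    "Order Date": ["2023-11-01", "2023-11-01", "bad"],
    "Sales": ["100.5", "100.5", "20"],
})
print(len(clean(raw)))  # 1 — one duplicate and one unparseable row removed
```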

## 📂 Project Structure

```
linux-data-science-project/
│
├── data/
│   ├── raw/                  # Original CSV file
│   └── processed/            # (Generated) Cleaned artifacts
│
├── scripts/
│   ├── 01_setup_env.sh       # Virtual env automation
│   └── 02_etl_to_mysql.ipynb # Python ETL (CSV -> MySQL)
│
├── notebooks/
│   └── 03_analysis.ipynb     # Generates charts from SQL data
│
├── sql/
│   ├── schema.sql            # Database creation scripts
│   └── queries.sql           # 10+ business analytical queries
│
├── output_plots/             # Generated visualizations
├── LINUX_COMMANDS.md         # Documentation of CLI data exploration
├── SQL_SCENARIOS.md          # Business questions & SQL results
├── Makefile                  # Build automation commands
└── README.md                 # Project documentation
```

## 🛠 Architecture & Skills Demonstrated

| Component | Tools Used | Skills Demonstrated |
|---|---|---|
| Data Exploration | Linux Terminal (`grep`, `awk`, `wc`) | CLI proficiency, stream processing |
| ETL Pipeline | Python (`pandas`, `sqlalchemy`) | Data cleaning, database connectors, automation |
| Database | MySQL | Schema design, relational modeling |
| Analytics | SQL | Window functions, aggregations, subqueries |
| Automation | GNU Make (`Makefile`) | Build automation, reproducible workflows |
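The window-function style of analytics listed above can be demonstrated in a self-contained way. In this sketch SQLite stands in for MySQL so it runs without a server, and the `orders` table and its columns are invented for the demo; the project's actual queries live in `sql/queries.sql`.

```python
import sqlite3

# Illustrative only: SQLite in place of MySQL, with a made-up orders table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (category TEXT, profit REAL);
    INSERT INTO orders VALUES
        ('Technology', 50), ('Technology', 70),
        ('Furniture', 20), ('Office Supplies', 30);
""")

# RANK() is a window function: it is evaluated after GROUP BY,
# so it can order the grouped totals without a self-join.
rows = con.execute("""
    SELECT category,
           SUM(profit) AS total_profit,
           RANK() OVER (ORDER BY SUM(profit) DESC) AS profit_rank
    FROM orders
    GROUP BY category
""").fetchall()

for category, total, rank in rows:
    print(rank, category, total)
```

The same query runs unchanged on MySQL 8.0+, which added window-function support.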

## 🚀 How to Run

### 1. Prerequisites

- Ubuntu/Linux OS (or WSL)
- MySQL Server installed and running
- Python 3.8+

### 2. Setup

1. Create a MySQL database named `superstore_db`:

   ```sql
   CREATE DATABASE superstore_db;
   ```

2. Update database credentials in `scripts/02_etl_to_mysql.ipynb`:

   ```python
   DB_USER = 'your_username'
   DB_PASS = 'your_password'
   ```

3. Run the pipeline. I have set up a Makefile to automate the entire process; simply run:

   ```bash
   # Sets up environment, cleans data, loads DB, and runs analysis
   make pipeline
   ```
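The load step of the pipeline (CSV → database via pandas + SQLAlchemy) can be smoke-tested without a MySQL server by pointing the engine at SQLite. Everything here is a sketch: the `orders` table name and sample data are invented, and the MySQL URL in the comment (driver `pymysql`, localhost) is an assumption about how the real notebook connects.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# SQLite stands in for MySQL so this runs anywhere; against a live server the
# URL would look like "mysql+pymysql://DB_USER:DB_PASS@localhost/superstore_db".
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"order_id": [1, 2], "sales": [100.0, 250.0]})
df.to_sql("orders", engine, if_exists="replace", index=False)  # CSV -> DB load

with engine.connect() as con:
    total = con.execute(text("SELECT SUM(sales) FROM orders")).scalar()
print(total)  # 350.0
```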

## 📊 Key Insights

Full analysis can be found in `SQL_SCENARIOS.md`.

- **Top Profit Center:** the Technology category yields the highest profit margin (17%).
- **Shipping Efficiency:** "Standard Class" shipping averages 5.0 days vs 0.04 days for "Same Day".
- **Seasonality:** sales volume consistently spikes by 30% in November/December.

## 📄 Documentation