This project demonstrates a full-stack data science workflow entirely on the Ubuntu Linux command line.
Instead of relying solely on notebooks, I built an automated pipeline that:
- Ingests raw CSV data using Linux CLI tools (`awk`, `sed`).
- Cleans and normalizes data using Python scripts.
- Loads structured data into a MySQL database.
- Analyzes business KPIs using complex SQL queries.
- Visualizes results using `matplotlib`.
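The clean-and-load steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual ETL code: it uses only the standard library, swaps in an in-memory SQLite database for MySQL so it runs anywhere, and the column names (`Order ID`, `Category`, `Sales`), the `orders` table, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw sample; the real pipeline reads a CSV from data/raw/.
RAW_CSV = """Order ID, Category ,Sales
A-1, Technology ,100.50
A-2,furniture,
A-3, Office Supplies,20
"""


def clean_rows(text):
    """Normalize headers to snake_case, trim whitespace, drop rows without sales."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    rows = []
    for raw in reader:
        record = dict(zip(header, (value.strip() for value in raw)))
        if not record["sales"]:
            continue  # skip rows with a missing sales value
        record["category"] = record["category"].title()
        record["sales"] = float(record["sales"])
        rows.append(record)
    return rows


def load_rows(conn, rows):
    """Create the (hypothetical) orders table and bulk-insert the cleaned rows."""
    conn.execute("CREATE TABLE orders (order_id TEXT, category TEXT, sales REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :category, :sales)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL connection
cleaned = clean_rows(RAW_CSV)
load_rows(conn, cleaned)
total = conn.execute("SELECT COUNT(*), SUM(sales) FROM orders").fetchone()
print(total)
```

Keeping the cleaning step as a pure function (text in, list of records out) makes it easy to test without a database, which is one reason to separate it from the load step.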
linux-data-science-project/
│
├── data/
│ ├── raw/ # Original CSV file
│ └── processed/ # (Generated) Cleaned artifacts
│
├── scripts/
│ ├── 01_setup_env.sh # Virtual env automation
│ └── 02_etl_to_mysql.ipynb # Python ETL (CSV -> MySQL)
│
├── notebooks/
│ └── 03_analysis.ipynb # Generates charts from SQL data
│
├── sql/
│ ├── schema.sql # Database creation scripts
│ └── queries.sql # 10+ Business analytical queries
│
├── output_plots/ # Generated visualizations
├── LINUX_COMMANDS.md # Documentation of CLI data exploration
├── SQL_SCENARIOS.md # Business questions & SQL results
├── Makefile # Build automation commands
└── README.md # Project documentation
| Component | Tools Used | Skills Demonstrated |
|---|---|---|
| Data Exploration | Linux Terminal (grep, awk, wc) | CLI proficiency, stream processing |
| ETL Pipeline | Python (pandas, sqlalchemy) | Data cleaning, database connectors, automation |
| Database | MySQL | Schema design, relational modeling |
| Analytics | SQL | Window functions, aggregations, subqueries |
| Automation | GNU Make (Makefile) | Build automation, reproducible workflows |
- Prerequisites
  - Ubuntu/Linux OS (or WSL)
  - MySQL Server installed and running
  - Python 3.8+
- Setup
  - Create a MySQL database named `superstore_db`:

        CREATE DATABASE superstore_db;

  - Update database credentials in `scripts/02_etl_to_mysql.ipynb`:

        DB_USER = 'your_username'
        DB_PASS = 'your_password'

- Run the Pipeline

  I have set up a `Makefile` to automate the entire process. Simply run:

      # Sets up environment, cleans data, loads DB, and runs analysis
      make pipeline
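
For the credentials step, one way to turn `DB_USER` and `DB_PASS` into a connection string is sketched below. The helper name `build_mysql_url`, the `pymysql` driver, and the `localhost` host are assumptions for illustration; the notebook may connect differently. Reading the values from environment variables, as done here, is one option for keeping a real password out of the repository.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="localhost", database="superstore_db"):
    """Build a SQLAlchemy-style URL, escaping special characters in the credentials."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# Fall back to the README's placeholder values when the variables are not set.
DB_USER = os.environ.get("DB_USER", "your_username")
DB_PASS = os.environ.get("DB_PASS", "your_password")

url = build_mysql_url(DB_USER, DB_PASS)
# With SQLAlchemy installed, this URL would be passed to sqlalchemy.create_engine(url).
```

Escaping matters because characters such as `@` or `/` in a password would otherwise be read as part of the URL structure.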
Full analysis can be found in SQL_SCENARIOS.md.
- Top Profit Center: Technology category yields the highest profit margins (17%).
- Shipping Efficiency: "Standard Class" shipping averages 5.0 days vs 0.04 days for "Same Day".
- Seasonality: Sales volume consistently spikes by 30% in November/December.
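
As a rough illustration of how a figure like the per-category profit margin is derived, the sketch below divides total profit by total sales for each category. The rows are made up for the example and do not reproduce the results above; the project itself computes these KPIs in SQL, and the function name `profit_margin_by_category` is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (category, sales, profit). Not taken from the project data.
ORDERS = [
    ("Technology", 200.0, 30.0),
    ("Technology", 300.0, 45.0),
    ("Furniture", 400.0, 20.0),
]


def profit_margin_by_category(orders):
    """Return profit margin in percent (total profit / total sales) per category."""
    totals = defaultdict(lambda: [0.0, 0.0])  # category -> [sales, profit]
    for category, sales, profit in orders:
        totals[category][0] += sales
        totals[category][1] += profit
    # Skip categories with zero sales to avoid dividing by zero.
    return {c: round(100 * p / s, 1) for c, (s, p) in totals.items() if s}


margins = profit_margin_by_category(ORDERS)
print(margins)
```

Summing before dividing gives the margin of the whole category, which generally differs from averaging the per-order margins.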
- Linux Command Logs: How I explored the data using only the terminal.
- SQL Business Scenarios: Detailed breakdown of 10 business questions and queries.