Welcome to ML Data Pipeline, where you’ll build an end-to-end SQL pipeline to clean, transform, and query retail sales data for ML model training! This project weaves together DQL for querying, DML for updates, Joins for merging, and Indexing for speed, preparing a clean dataset for customer purchase predictions. Perfect for your ML-Interview-Preparation-Hub portfolio! 🚀
- Scenario: You’re a data engineer at an e-commerce company. Your task is to process raw `sales` and `customers` data, clean inconsistencies, merge the datasets, and create a feature table for an ML model predicting customer purchases.
- Goals: Clean data, join tables, optimize queries, and export a CSV.
- SQL Skills: DQL (SELECT), DML (UPDATE, INSERT), Joins (INNER, LEFT), Indexing.
- Create `sales` (sale_id INT, customer_id INT, amount FLOAT, sale_date DATE, product VARCHAR) and `customers` (customer_id INT, region VARCHAR) tables.
- Insert sample data (e.g., 10 sales, 5 customers).
- Clean `sales` by setting negative `amount` values to 0 using DML.
- Join `sales` and `customers` to create a feature table.
- Add an index on `customer_id` for faster joins.
- Query total sales per region for ML features using DQL.
- Export results to a CSV for portfolio use.
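If you want to try the tasks end to end before writing the full SQL scripts, they can be prototyped locally with Python's built-in `sqlite3` module. This is a sketch: the in-memory database is an assumption for quick testing, and the rows mirror the sample data used below.

```python
import sqlite3

# In-memory database for prototyping (swap for a file path or a real DB later)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL, sale_date TEXT, product TEXT
);
INSERT INTO customers VALUES (1,'North'),(2,'South'),(3,'East'),(4,'West'),(5,'North');
INSERT INTO sales VALUES
 (1,1,100.0,'2025-04-01','Laptop'), (2,2,-50.0,'2025-04-02','Phone'),
 (3,3,200.0,'2025-04-03','Tablet'), (4,1,150.0,'2025-04-04','Laptop'),
 (5,4,-20.0,'2025-04-05','Phone'),  (6,2,300.0,'2025-04-06','Tablet'),
 (7,5,400.0,'2025-04-07','Laptop'), (8,3,-10.0,'2025-04-08','Phone'),
 (9,4,250.0,'2025-04-09','Tablet'), (10,5,180.0,'2025-04-10','Laptop');
UPDATE sales SET amount = 0 WHERE amount < 0;          -- clean negative amounts
CREATE INDEX idx_customer_id ON sales(customer_id);    -- speed up the join
""")
# Aggregate total sales per region for ML features
rows = cur.execute("""
    SELECT c.region, COUNT(*) AS sale_count, SUM(s.amount) AS total_amount
    FROM sales s JOIN customers c ON s.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)
```

With the sample data above, the three negative amounts are zeroed out, so North totals 830.0 across 4 sales, South 300.0, East 200.0, and West 250.0.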
-- Task 1: Create Tables
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
region VARCHAR(50)
);
CREATE TABLE sales (
sale_id INT PRIMARY KEY,
customer_id INT,
amount FLOAT,
sale_date DATE,
product VARCHAR(50),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- Task 2: Insert Sample Data
INSERT INTO customers VALUES
(1, 'North'), (2, 'South'), (3, 'East'), (4, 'West'), (5, 'North');
INSERT INTO sales VALUES
(1, 1, 100.0, '2025-04-01', 'Laptop'),
(2, 2, -50.0, '2025-04-02', 'Phone'),
(3, 3, 200.0, '2025-04-03', 'Tablet'),
(4, 1, 150.0, '2025-04-04', 'Laptop'),
(5, 4, -20.0, '2025-04-05', 'Phone'),
(6, 2, 300.0, '2025-04-06', 'Tablet'),
(7, 5, 400.0, '2025-04-07', 'Laptop'),
(8, 3, -10.0, '2025-04-08', 'Phone'),
(9, 4, 250.0, '2025-04-09', 'Tablet'),
(10, 5, 180.0, '2025-04-10', 'Laptop');
-- Task 3: Clean Data
UPDATE sales
SET amount = 0
WHERE amount < 0;
-- Task 4: Join Tables
CREATE TABLE sales_features AS
SELECT s.sale_id, s.customer_id, s.amount, s.sale_date, s.product, c.region
FROM sales s
INNER JOIN customers c ON s.customer_id = c.customer_id;
-- Task 5: Add Index
CREATE INDEX idx_customer_id ON sales(customer_id);
-- Task 6: Query Features
SELECT c.region, COUNT(s.sale_id) AS sale_count, SUM(s.amount) AS total_amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.region;
-- Task 7: Export (Pseudo-code, depends on DB tool)
-- COPY (SELECT ...) TO 'sales_features.csv' WITH CSV HEADER;

Explanation:
- DDL creates tables with constraints.
- DML inserts data and fixes negative amounts.
- Joins merge sales and customer data.
- Indexing speeds up joins.
- DQL aggregates for ML features, grouping by region.
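The `COPY` statement in Task 7 is PostgreSQL syntax; SQLite and some other engines have no `COPY`. One portable option is exporting from Python with the standard-library `csv` module. This is a sketch: the table here contains a single illustrative row, and the output filename matches the one used in the project.

```python
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_features (sale_id INTEGER, customer_id INTEGER, amount REAL,
                             sale_date TEXT, product TEXT, region TEXT);
INSERT INTO sales_features VALUES (1, 1, 100.0, '2025-04-01', 'Laptop', 'North');
""")
cur = conn.execute("SELECT * FROM sales_features")
with open("sales_features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # column names as header
    writer.writerows(cur)                                  # stream all rows out
```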
- A `sales_features` table with cleaned, joined data.
- A CSV file (`sales_features.csv`) with regional sales metrics.
- A portfolio-ready dataset showcasing data prep for ML.
- Add `sales_features.csv` to irohanportfolio.netlify.app.
- Write a README explaining the pipeline and its ML use (e.g., customer purchase prediction).
- Visualize with Python’s `seaborn` (e.g., a bar plot of sales by region).
- Tag skills: DQL, DML, Joins, Indexing.
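A minimal visualization sketch, assuming `seaborn`, `pandas`, and `matplotlib` are installed. The inline DataFrame stands in for `sales_features.csv` (in practice you would load it with `pd.read_csv`), and the output filename is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import pandas as pd
import seaborn as sns

# Stand-in for pd.read_csv("sales_features.csv"); totals match the sample data
df = pd.DataFrame({"region": ["North", "South", "East", "West"],
                   "amount": [830.0, 300.0, 200.0, 250.0]})
ax = sns.barplot(data=df, x="region", y="amount")
ax.set(title="Total sales by region", ylabel="total amount")
ax.figure.savefig("sales_by_region.png")
```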
Extend the pipeline by adding a Window Function to rank sales per customer within each region.
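One way to sketch that extension: rank each sale by amount within its region using `RANK() OVER (PARTITION BY ...)`. Window functions work in PostgreSQL and in SQLite 3.25+, so the example below runs on `sqlite3`; the `sales_features` rows shown are a subset of the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_features (sale_id INTEGER, customer_id INTEGER, amount REAL, region TEXT);
INSERT INTO sales_features VALUES
 (1, 1, 100.0, 'North'), (4, 1, 150.0, 'North'),
 (7, 5, 400.0, 'North'), (10, 5, 180.0, 'North'),
 (2, 2,   0.0, 'South'), (6, 2, 300.0, 'South');
""")
# Rank sales by amount (highest first) within each region
rows = conn.execute("""
    SELECT region, customer_id, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
    FROM sales_features
""").fetchall()
for r in rows:
    print(r)
```

Within North, the 400.0 sale ranks 1 and the 100.0 sale ranks 4; ranking restarts at 1 for South because of the `PARTITION BY region` clause.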