You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The raw CSV files (around 250 MB total) aren't committed to this repo because of GitHub's file size limits. The notebook notebooks/BlitzMart_01_DataGeneration.ipynb rebuilds the full dataset from scratch in about 5 minutes.
Delivered / Shipped / Cancelled / Processing / Returned
payment_method
string
Credit Card / PayPal / Klarna / SEPA / Apple Pay / Google Pay
order_items (~3.4M rows)
Column
Type
Description
order_item_id
int
Primary key
order_id
int
FK to orders
product_id
int
FK to products
quantity
int
Units ordered
unit_price
decimal
Price per unit at time of sale
discount
decimal
Discount applied (0.0 to 0.20)
line_total
decimal
quantity * unit_price * (1 - discount)
products (5,000 rows)
Column
Type
Description
product_id
int
Primary key
product_name
string
Product name
category
string
8 categories (Electronics, Fashion, etc.)
supplier_id
int
FK to suppliers
unit_cost
decimal
Cost to BlitzMart
unit_price
decimal
Catalog selling price
weight_kg
decimal
Product weight
suppliers (50 rows)
Column
Type
Description
supplier_id
int
Primary key
supplier_name
string
Supplier company name
country
string
Supplier country
rating
decimal
Supplier rating (3.0 to 5.0)
lead_time_days
int
Average lead time
warehouses (6 rows)
Column
Type
Description
warehouse_id
int
Primary key
warehouse_name
string
Warehouse name
city
string
Warehouse location
region
string
North / South / East / West
capacity_units
int
Storage capacity
shipments (~1.4M rows)
Column
Type
Description
shipment_id
int
Primary key
order_id
int
FK to orders
carrier
string
DHL / Hermes / DPD / UPS / GLS
ship_date
date
Date shipped
promised_delivery_days
int
Promised delivery window
actual_delivery_days
int
Actual delivery time
delivery_date
date
Date delivered
on_time_delivery
int
1 = on-time, 0 = late
returns (~150K rows)
Column
Type
Description
return_id
int
Primary key
order_id
int
FK to orders
return_date
date
Date of return
return_reason
string
Wrong size / Damaged / Not as described / etc.
refund_amount
decimal
Refund value in EUR
Data Quality Notes
The generated data deliberately includes realistic quality issues, all of which are detected and fixed in the ETL cleaning pipeline:
~2% missing customer ages and genders
~0.1% duplicate order records
~0.5% invalid quantities or zero prices
~0.3% impossible delivery dates (delivery before shipment)
~50 extreme refund outliers (data entry errors)
~1% missing supplier references (orphan products)
After cleaning, the data_clean/ folder holds the analysis-ready tables. The cleaning script also writes a _cleaning_report.csv audit log so every fix is traceable.