Production-style product analytics project for internship interviews. It turns a public retail dataset into a SQL-first analytics system that answers practical questions about acquisition, activation, conversion, retention, monetization, and experiment performance through a FastAPI service and a Streamlit dashboard.
Note: revenue, orders, repeat behavior, and cohorts come from real ecommerce transactions. Acquisition, top-of-funnel stages, and experiment readouts are modeled or simulated from that transaction data and are documented as assumptions, not observed product events.
- Product teams need more than raw orders. They need a decision layer that explains where users drop off, which segments are healthiest, and what to test next.
- Public datasets rarely include clean product-event funnels, so this project shows how to build an honest analytics mart with explicit assumptions instead of hiding the gaps.
- The repo is structured like an application, not a notebook dump: ingest, marts, evaluation, API, dashboard, tests, and public-facing project documentation.
flowchart LR
A[UCI Online Retail II] --> B[Ingestion Pipeline]
B --> C[(DuckDB Warehouse)]
C --> D[SQL Models and Views]
D --> E[FastAPI Metrics Service]
E --> F[Streamlit Dashboard]
D --> G[Pytest Data and API Tests]
C --> H[Evaluation Checks]
- Where are users dropping off in the funnel?
- Which acquisition channels or segments convert best?
- How does retention change by signup cohort?
- Which user segments are healthiest or most at risk?
- If a simulated product change is tested, what is the likely impact?
- Python
- DuckDB
- SQL-first marts and views
- Pandas only where needed for ingestion and modeled event generation
- FastAPI
- Streamlit
- Plotly
- pytest
- optional Docker support
Primary source:
- UCI Online Retail II: transactional retail data with invoices, products, timestamps, customers, and countries
Why a modeled analytics mart is necessary:
- The dataset contains orders, not full product events like page visits, signups, activation, or controlled experiments.
- To support real product analytics questions, the pipeline creates a reproducible modeled top-of-funnel layer with documented assumptions.
Observed facts:
- orders
- revenue
- customer purchase timing
- repeat behavior
- geography
Modeled assumptions:
acquisition_channelis inferred from first-order characteristics and timing, not directly observed- top-of-funnel events like
visited_site,signed_up, andactivated_accountare generated deterministically from observed customer cohorts plus synthetic non-converting prospects - experiment results are simulated from the same cohort mix to demonstrate a realistic A/B readout without pretending the raw dataset contains one
This tradeoff is deliberate and interview-friendly: the repo is honest about what is observed vs inferred.
src/sql_product_analytics/
pipeline/ # ingest, marts, evaluation
analytics/ # shared service logic
api/ # FastAPI app and schemas
dashboard/ # Streamlit client app
sql/ # actual SQL model files
tests/ # metrics, API, CLI coverage
data/ # raw, staging, warehouse
- user_acquisition.sql
- funnel_stage_counts.sql
- funnel_conversion_rates.sql
- cohort_retention.sql
- revenue_kpis.sql
- segment_health.sql
- experiment_summary.sql
Example:
SELECT
cohort_date,
retention_day,
cohort_size,
active_customers,
ROUND(
CAST(active_customers AS DOUBLE) / NULLIF(cohort_size, 0),
4
) AS retention_rate
FROM cohort_retention
ORDER BY cohort_date, retention_day;python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"Run the full pipeline:
sql-product-analytics run-allRun from the included fixture instead of downloading the public dataset:
sql-product-analytics run-all --source-path tests/fixtures/sample_online_retail.csvServe the API:
sql-product-analytics serve-api --host 127.0.0.1 --port 8000Serve the dashboard:
sql-product-analytics serve-dashboard --api-base-url http://127.0.0.1:8000Run tests:
pytestingestbuild-martsevaluaterun-allserve-apiserve-dashboard
GET /healthGET /kpis/overviewGET /kpis/trendsGET /funnelGET /funnel/by-segmentGET /retention/cohortsGET /segments/healthGET /experiments/summaryGET /insights/recommendations
Example:
curl "http://127.0.0.1:8000/kpis/overview"Example overview response excerpt from the included fixture:
{
"total_orders": 10,
"purchasing_customers": 5,
"total_revenue": 389.0,
"avg_order_value": 38.9,
"repeat_customer_rate": 0.8,
"signup_rate": 0.7714,
"activation_rate": 0.8148,
"purchase_conversion_rate": 0.1852,
"strongest_segment": "Email / Core"
}From the fixture-backed warehouse:
- Funnel volume falls from 35 modeled visitors to 27 signups, then to 22 activated users, with only 5 observed purchasers. That makes purchase conversion the biggest downstream bottleneck.
- The strongest sample segment is
Email / Core, whileEmail / At Risk NewandEmail / Emergingare flagged as fragile due to low D30 retention or high inactivity share. - The simulated onboarding experiment improves activation meaningfully, but purchase lift is weak in the sample. That supports more testing, not a blind rollout.
- Executive Overview: KPIs, revenue trends, strongest segments
- Funnel Explorer: stage counts and conversion by segment type
- Retention & Cohorts: cohort heatmap plus D1/D7/D30 summaries
- Experiment / Insights: simulated experiment readout and business recommendations
The evaluate command runs:
- raw row count sanity
- customer dimension integrity
- null-customer checks
- funnel monotonicity
- retention bounds
- experiment variant coverage
- Acquisition channels are inferred, not observed.
- The source is transactional, so top-of-funnel and experiment layers are modeled.
- Retention is purchase-based retention, not session-level engagement.
- This is designed for local reproducibility and interview clarity, not warehouse-scale orchestration.
- Why SQL-first marts are the center of the project
- How you handled missing event-level data without pretending it existed
- How cohort definitions and repeat behavior were implemented
- Why the API-backed dashboard is better than reading raw files directly in Streamlit
- Where the business recommendations come from and where they stop being causal claims
Optional Docker files are included via the lightweight Dockerfile, but they should be treated as manual setup and not as a verified deployment path from this repo pass. For production-style hosting, run the API and dashboard as separate services against the same DuckDB volume or swap DuckDB for a Postgres-backed adapter in v2.
Real browser screenshots captured from the running local app:
To refresh them locally, capture the Executive Overview and Retention & Cohorts pages after starting the API and dashboard.

