Skip to content

Mohammad2460/sql-product-analytics-retention-funnel-dashboard

Repository files navigation

SQL Product Analytics + Retention/Funnel Dashboard

Production-style product analytics project for internship interviews. It turns a public retail dataset into a SQL-first analytics system that answers practical questions about acquisition, activation, conversion, retention, monetization, and experiment performance through a FastAPI service and a Streamlit dashboard.

Note: revenue, orders, repeat behavior, and cohorts come from real ecommerce transactions. Acquisition, top-of-funnel stages, and experiment readouts are modeled or simulated from that transaction data and are documented as assumptions, not observed product events.

Why This Project Matters

  • Product teams need more than raw orders. They need a decision layer that explains where users drop off, which segments are healthiest, and what to test next.
  • Public datasets rarely include clean product-event funnels, so this project shows how to build an honest analytics mart with explicit assumptions instead of hiding the gaps.
  • The repo is structured like an application, not a notebook dump: ingest, marts, evaluation, API, dashboard, tests, and public-facing project documentation.

Architecture

flowchart LR
    A[UCI Online Retail II] --> B[Ingestion Pipeline]
    B --> C[(DuckDB Warehouse)]
    C --> D[SQL Models and Views]
    D --> E[FastAPI Metrics Service]
    E --> F[Streamlit Dashboard]
    D --> G[Pytest Data and API Tests]
    C --> H[Evaluation Checks]
Loading

Product Questions Answered

  • Where are users dropping off in the funnel?
  • Which acquisition channels or segments convert best?
  • How does retention change by signup cohort?
  • Which user segments are healthiest or most at risk?
  • If a simulated product change is tested, what is the likely impact?

Stack

  • Python
  • DuckDB
  • SQL-first marts and views
  • Pandas only where needed for ingestion and modeled event generation
  • FastAPI
  • Streamlit
  • Plotly
  • pytest
  • optional Docker support

Data Source And Assumptions

Primary source:

  • UCI Online Retail II: transactional retail data with invoices, products, timestamps, customers, and countries

Why a modeled analytics mart is necessary:

  • The dataset contains orders, not full product events like page visits, signups, activation, or controlled experiments.
  • To support real product analytics questions, the pipeline creates a reproducible modeled top-of-funnel layer with documented assumptions.

Observed facts:

  • orders
  • revenue
  • customer purchase timing
  • repeat behavior
  • geography

Modeled assumptions:

  • acquisition_channel is inferred from first-order characteristics and timing, not directly observed
  • top-of-funnel events like visited_site, signed_up, and activated_account are generated deterministically from observed customer cohorts plus synthetic non-converting prospects
  • experiment results are simulated from the same cohort mix to demonstrate a realistic A/B readout without pretending the raw dataset contains one

This tradeoff is deliberate and interview-friendly: the repo is honest about what is observed vs inferred.

Repo Structure

src/sql_product_analytics/
  pipeline/      # ingest, marts, evaluation
  analytics/     # shared service logic
  api/           # FastAPI app and schemas
  dashboard/     # Streamlit client app
sql/             # actual SQL model files
tests/           # metrics, API, CLI coverage
data/            # raw, staging, warehouse

Key SQL Models

Example:

SELECT
    cohort_date,
    retention_day,
    cohort_size,
    active_customers,
    ROUND(
        CAST(active_customers AS DOUBLE) / NULLIF(cohort_size, 0),
        4
    ) AS retention_rate
FROM cohort_retention
ORDER BY cohort_date, retention_day;

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Run the full pipeline:

sql-product-analytics run-all

Run from the included fixture instead of downloading the public dataset:

sql-product-analytics run-all --source-path tests/fixtures/sample_online_retail.csv

Serve the API:

sql-product-analytics serve-api --host 127.0.0.1 --port 8000

Serve the dashboard:

sql-product-analytics serve-dashboard --api-base-url http://127.0.0.1:8000

Run tests:

pytest

CLI Commands

  • ingest
  • build-marts
  • evaluate
  • run-all
  • serve-api
  • serve-dashboard

API Endpoints

  • GET /health
  • GET /kpis/overview
  • GET /kpis/trends
  • GET /funnel
  • GET /funnel/by-segment
  • GET /retention/cohorts
  • GET /segments/health
  • GET /experiments/summary
  • GET /insights/recommendations

Example:

curl "http://127.0.0.1:8000/kpis/overview"

Example overview response excerpt from the included fixture:

{
  "total_orders": 10,
  "purchasing_customers": 5,
  "total_revenue": 389.0,
  "avg_order_value": 38.9,
  "repeat_customer_rate": 0.8,
  "signup_rate": 0.7714,
  "activation_rate": 0.8148,
  "purchase_conversion_rate": 0.1852,
  "strongest_segment": "Email / Core"
}

Sample Business Insights

From the fixture-backed warehouse:

  • Funnel volume falls from 35 modeled visitors to 27 signups, then to 22 activated users, with only 5 observed purchasers. That makes purchase conversion the biggest downstream bottleneck.
  • The strongest sample segment is Email / Core, while Email / At Risk New and Email / Emerging are flagged as fragile due to low D30 retention or high inactivity share.
  • The simulated onboarding experiment improves activation meaningfully, but purchase lift is weak in the sample. That supports more testing, not a blind rollout.

Dashboard Pages

  • Executive Overview: KPIs, revenue trends, strongest segments
  • Funnel Explorer: stage counts and conversion by segment type
  • Retention & Cohorts: cohort heatmap plus D1/D7/D30 summaries
  • Experiment / Insights: simulated experiment readout and business recommendations

Evaluation Checks

The evaluate command runs:

  • raw row count sanity
  • customer dimension integrity
  • null-customer checks
  • funnel monotonicity
  • retention bounds
  • experiment variant coverage

Limitations

  • Acquisition channels are inferred, not observed.
  • The source is transactional, so top-of-funnel and experiment layers are modeled.
  • Retention is purchase-based retention, not session-level engagement.
  • This is designed for local reproducibility and interview clarity, not warehouse-scale orchestration.

Interview Talking Points

  • Why SQL-first marts are the center of the project
  • How you handled missing event-level data without pretending it existed
  • How cohort definitions and repeat behavior were implemented
  • Why the API-backed dashboard is better than reading raw files directly in Streamlit
  • Where the business recommendations come from and where they stop being causal claims

Deployment Notes

Optional Docker files are included via the lightweight Dockerfile, but they should be treated as manual setup and not as a verified deployment path from this repo pass. For production-style hosting, run the API and dashboard as separate services against the same DuckDB volume or swap DuckDB for a Postgres-backed adapter in v2.

Screenshots

Real browser screenshots captured from the running local app:

Executive Overview

Retention & Cohorts

To refresh them locally, capture the Executive Overview and Retention & Cohorts pages after starting the API and dashboard.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors