Credit Card Fraud Detection Platform

An end-to-end data engineering and data science platform built on Microsoft Fabric, designed to detect fraudulent credit card transactions using a medallion lakehouse architecture and machine learning.

Architecture

                    ┌──────────────┐
                    │  Kaggle API  │
                    └──────┬───────┘
                           │
                           ▼
              ┌────────────────────────┐
              │   Bronze Layer         │
              │   (Raw Transactions)   │
              │   568,630 records      │
              └────────────┬───────────┘
                           │  PySpark
                           ▼
              ┌────────────────────────┐
              │   Silver Layer         │
              │   (Cleaned + Features) │
              │   37 engineered cols   │
              └──────┬─────────┬───────┘
                     │         │
                     ▼         ▼
        ┌────────────────┐  ┌──────────────────┐
        │   Gold Layer   │  │  ML Experiments   │
        │  (Aggregated)  │  │  (MLflow)         │
        │   4 tables     │  │  3 models trained │
        └───────┬────────┘  └────────┬──────────┘
                │                    │
                │         ┌──────────▼──────────┐
                │         │  Model Registry     │
                │         │  (Best model saved) │
                │         └──────────┬──────────┘
                │                    │
                ▼                    ▼
        ┌────────────────────────────────────┐
        │   Gold: Scored Predictions         │
        │   (fraud_probability + risk_cat)   │
        └───────────────┬────────────────────┘
                        │
                        ▼
              ┌────────────────────────┐
              │   Power BI Dashboard   │
              │   (Fraud Analytics)    │
              └────────────────────────┘

Tech Stack

Component	Technology
Platform	Microsoft Fabric (Lakehouse, Data Factory, Power BI)
Compute	Apache Spark (PySpark)
Storage	Delta Lake on OneLake
ML Tracking	MLflow (Experiments + Model Registry)
Models	Logistic Regression, Random Forest, XGBoost
Orchestration	Fabric Data Factory Pipelines
Visualization	Power BI
Version Control	GitHub (Fabric Git Integration)
Language	Python, PySpark, SQL

Dataset

Credit Card Fraud Detection Dataset 2023 from Kaggle

568,630 credit card transactions by European cardholders
31 columns: id, V1–V28 (PCA-transformed), Amount, Class
Class: 0 = Legitimate, 1 = Fraudulent
Balanced dataset: 50/50 split (pre-balanced by dataset creator)
Data ingested programmatically via Kaggle REST API — no manual uploads

Project Structure

Credit-Fraud-Detection/
│
├── 01_bronze_ingestion/          # Raw data ingestion from Kaggle API → Delta table
├── 02_silver_transformation/     # Data cleaning, type casting, feature engineering
├── 03_gold_aggregation/          # Business-ready aggregated tables for Power BI
├── 04_eda_analysis/              # Exploratory data analysis with visualizations
├── 05_model_training/            # Train 3 models with MLflow experiment tracking
├── 06_model_scoring/             # Load best model, score all transactions
├── fraud_lakehouse/              # Fabric Lakehouse metadata
├── fraud-detection-pipeline/     # Data Factory pipeline (orchestration)
└── README.md

Medallion Architecture

Bronze Layer

Raw CSV ingested from Kaggle REST API using KaggleApi Python SDK
Stored as Delta table with ingestion metadata (ingestion_timestamp, data_source, source_endpoint)
No transformations applied — preserves original data for auditability

Silver Layer

Data cleaning: deduplication, null checks, type casting
9 engineered features:
- Amount_log — log-transformed transaction amount to reduce skewness
- Amount_category — binned into Low / Medium / High / Very High
- V1_abs through V5_abs — absolute values of top PCA components
- V1_V2_interaction — interaction between top two features
- V1_Amount_interaction — interaction between V1 and log amount
- Amount_percentile — percentile rank of each transaction
- Amount_quartile — quartile assignment
Data quality assertions: zero nulls, zero duplicates, zero negative amounts

Gold Layer

Four business-ready Delta tables:

Table	Description
`gold_fraud_by_amount`	Fraud rate and counts by amount category
`gold_fraud_by_quartile`	Fraud rate by spending quartile
`gold_kpis`	Single-row KPI table for Power BI cards
`gold_feature_summary`	Average feature values by fraud/non-fraud class

Machine Learning

Experiment Tracking with MLflow

Three classification models trained and compared using MLflow:

Model	Accuracy	Precision	Recall	F1 Score	AUC-ROC
Logistic Regression	0.9712	0.9586	0.9852	0.9717	0.9952
Random Forest	0.9995	0.9992	0.9997	0.9995	1.0000
XGBoost	0.9994	0.9991	0.9996	0.9994	1.0000

Best Model: Random Forest (200 estimators, max depth 15)

What was logged to MLflow for each run:

Parameters: model type, hyperparameters, number of features
Metrics: accuracy, precision, recall, F1, AUC-ROC, average precision
Artifacts: confusion matrix CSV, feature importance CSV
Model: serialized model object

Top Predictive Features (Random Forest)

V14 — 19.48%
V4 — 13.44%
V10 — 13.38%
V17 — 9.34%
V12 — 8.78%

Model Scoring

Best model loaded from MLflow Model Registry
All 568,630 transactions scored with:
- fraud_prediction (0 or 1)
- fraud_probability (0.0 to 1.0)
- risk_category (Low / Medium / High)
Predictions written to gold_fraud_scores Delta table

Pipeline Orchestration

Automated Data Factory pipeline chains all notebooks:

Bronze Ingestion → Silver Transformation → Gold Aggregation → Model Scoring

Scheduled for daily execution
Error handling on each activity
End-to-end runtime: ~10 minutes

Power BI Dashboard

Four-page interactive dashboard connected via Fabric semantic model:

Executive Summary — KPI cards (total transactions, fraud count, fraud rate, fraud amount)
Fraud Analysis — Fraud rate by amount category and quartile
Model Performance — AUC-ROC, precision, recall, F1 score metrics
Transaction Scores — Detailed scored transactions with risk category filtering

Key Design Decisions

Decision	Rationale
API ingestion over CSV upload	Demonstrates production-style data ingestion
Medallion architecture	Industry-standard pattern for data quality progression
No SMOTE applied	Dataset is pre-balanced at 50/50 — verified class distribution before applying any resampling
MLflow over manual tracking	Native Fabric integration, reproducibility, model versioning
Multiple models compared	Shows systematic model selection, not just picking one algorithm
Gold layer for Power BI	Pre-aggregated tables reduce dashboard query time
Delta Lake for all layers	ACID transactions, time travel, schema enforcement

How to Run

Prerequisites

Microsoft Fabric workspace with trial or paid capacity
Kaggle account and API token (kaggle.json)
GitHub account for Git integration

Steps

Create a Fabric workspace and Lakehouse
Create an MLflow Experiment: fraud-detection-experiment
Import notebooks into the workspace
Set your Kaggle credentials in 01_bronze_ingestion
Run notebooks sequentially: 01 → 02 → 03 → 04 → 05 → 06
Create a Data Factory pipeline to chain notebooks
Build Power BI report from the Gold layer semantic model

Skills Demonstrated

Data Engineering: API ingestion, medallion architecture, Delta Lake, PySpark transformations, data quality checks, pipeline orchestration
Data Science: Feature engineering, model training, hyperparameter tuning, class imbalance analysis, model comparison, MLflow experiment tracking, model registry
Business Intelligence: Semantic modeling, Power BI dashboard design, KPI reporting
DevOps: Git integration with Fabric, automated pipelines, documentation

Author

Raj Kumar Manala

Built as a portfolio project targeting Azure Cloud Data Engineering and Microsoft Fabric Engineering roles.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Bronze Ingestion via API.Notebook		Bronze Ingestion via API.Notebook
Fraud Detection Model.SemanticModel		Fraud Detection Model.SemanticModel
Model_Training_MLflow.Notebook		Model_Training_MLflow.Notebook
eda_analysis.Notebook		eda_analysis.Notebook
fraud-detection-experiment.MLExperiment		fraud-detection-experiment.MLExperiment
fraud-detection-model.MLModel		fraud-detection-model.MLModel
fraud_lakehouse.Lakehouse		fraud_lakehouse.Lakehouse
gold_transformation.Notebook		gold_transformation.Notebook
model_scoring.Notebook		model_scoring.Notebook
pipeline_orchestration.DataPipeline		pipeline_orchestration.DataPipeline
silver_transformation.Notebook		silver_transformation.Notebook
Model_Training_MLflow.ipynb		Model_Training_MLflow.ipynb
README.md		README.md
gold_transformation.ipynb		gold_transformation.ipynb
model_scoring.ipynb		model_scoring.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Credit Card Fraud Detection Platform

Architecture

Tech Stack

Dataset

Project Structure

Medallion Architecture

Bronze Layer

Silver Layer

Gold Layer

Machine Learning

Experiment Tracking with MLflow

What was logged to MLflow for each run:

Top Predictive Features (Random Forest)

Model Scoring

Pipeline Orchestration

Power BI Dashboard

Key Design Decisions

How to Run

Prerequisites

Steps

Skills Demonstrated

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Credit Card Fraud Detection Platform

Architecture

Tech Stack

Dataset

Project Structure

Medallion Architecture

Bronze Layer

Silver Layer

Gold Layer

Machine Learning

Experiment Tracking with MLflow

What was logged to MLflow for each run:

Top Predictive Features (Random Forest)

Model Scoring

Pipeline Orchestration

Power BI Dashboard

Key Design Decisions

How to Run

Prerequisites

Steps

Skills Demonstrated

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages