Darshit02/loan-prediction-system

💳 Loan Prediction System

End-to-End ML + MLOps: Predict Loan Approval in Real Time

Random Forest · XGBoost · MLflow · FastAPI · Streamlit · Docker

🔬 Overview

Loan Prediction System is a production-style, end-to-end machine learning application that predicts whether a loan application should be Approved or Rejected based on financial and demographic data.

The project demonstrates the complete ML lifecycle, from raw data ingestion and feature engineering, through model training and experiment tracking with MLflow, to a live REST API and an interactive Streamlit dashboard. It is built around MLOps practices from the start.

⚠️ For educational purposes only. Not intended for real financial decision-making.


✨ Features

| Feature | Description |
| --- | --- |
| 📊 Full ML Pipeline | Data ingestion → preprocessing → training → evaluation → serving |
| 🤖 Model Comparison | Random Forest vs XGBoost with automatic best-model selection |
| 📈 MLflow Tracking | Log parameters, metrics, artifacts; full experiment history & model versioning |
| 🔌 FastAPI Backend | RESTful prediction endpoint with Pydantic schema validation |
| 📊 Streamlit Dashboard | Upload applicant data and get instant predictions via browser UI |
| 🐳 Dockerized | Full Docker + Docker Compose setup for one-command production deployment |
| 🧪 Test-Ready | pytest-compatible structure for unit and integration tests |

🏗️ Project Architecture

loan-prediction-system/
│
├── app/
│   └── streamlit_app.py            # 📊 Streamlit prediction dashboard
│
├── data/
│   ├── raw/                        # 📁 Raw CSV dataset
│   ├── data_loader.py              #    Load & split data
│   └── preprocess.py               #    Feature engineering & encoding
│
├── mlruns/                         # 📈 MLflow experiment tracking data
├── mlflow.db                       #    MLflow SQLite backend
│
├── models/
│   ├── train_model.py              # 🤖 RF & XGBoost training logic
│   ├── predict.py                  #    Load model + run inference
│   ├── preprocessor.pkl            #    Saved preprocessing pipeline
│   └── train_model.pkl             #    Saved best model artifact
│
├── pipelines/
│   └── training_pipeline.py        # ⚙️  Orchestrates full training flow
│
├── notebooks/
│   └── eda.ipynb                   # 📓 Exploratory Data Analysis
│
├── src/                            # 🔌 Core API backend
│   ├── api/                        #    FastAPI route handlers
│   ├── config/                     #    App configuration & constants
│   ├── schema/                     #    Pydantic request/response models
│   ├── utils/                      #    Helper functions & logging
│   └── main.py                     #    API entrypoint
│
├── tests/
│   └── sent_data.py                # 🧪 Test payload & assertions
│
├── artifacts/                      # 💾 Additional saved outputs
├── Dockerfile
├── compose.yaml
├── requirements.txt
├── .gitignore
└── README.md

🔄 Pipeline Flow

                     ┌───────────────────────────────────────────────┐
                     │              RAW DATA INGESTION               │
                     │    CSV → data_loader.py → Train/Test Split    │
                     └────────────────────┬──────────────────────────┘
                                          │
                     ┌────────────────────▼──────────────────────────┐
                     │            FEATURE ENGINEERING                │
                     │  Encoding · Scaling · Imputation · Selection  │
                     └────────────────────┬──────────────────────────┘
                                          │
                     ┌────────────────────▼──────────────────────────┐
                     │           MODEL TRAINING & COMPARISON         │
                     │       Random Forest  ↔  XGBoost               │
                     └──────────┬─────────────────────┬──────────────┘
                                │                     │
                    ┌───────────▼──────────┐ ┌────────▼──────────────┐
                    │   Model Evaluation   │ │   MLflow Experiment   │
                    │ Accuracy · F1 · AUC  │ │  Params · Metrics ·   │
                    │  Best Model Selected │ │  Artifacts · Versions │
                    └───────────┬──────────┘ └───────────────────────┘
                                │
                     ┌──────────▼────────────────────────────────────┐
                     │          SAVED MODEL (.pkl)                   │
                     │     models/train_model.pkl                    │
                     └──────────┬────────────────────┬───────────────┘
                                │                    │
               ┌────────────────▼──────┐  ┌──────────▼──────────────┐
               │    FastAPI REST API   │  │  Streamlit Dashboard    │
               │   POST /predict       │  │  Form → Predict → Show  │
               └───────────────────────┘  └─────────────────────────┘
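
The feature engineering stage names imputation and encoding without showing what they do. The following is a minimal sketch of those two steps using only the Python standard library. The helper names `impute_median` and `one_hot` and the sample rows are hypothetical and are not taken from `data/preprocess.py`, which more likely relies on Pandas and Scikit-learn.

```python
from statistics import median


def impute_median(rows, column):
    """Fill missing (None) values in a numeric column with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(r[column] == cat)
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows; field names follow the API payload in this README.
rows = [
    {"income": 50000, "employment_status": "Salaried"},
    {"income": None, "employment_status": "Self-Employed"},
    {"income": 30000, "employment_status": "Salaried"},
]
clean = one_hot(impute_median(rows, "income"), "employment_status")
```

In a real pipeline the median and the category list would be computed on the training split only and then reused at inference time, which is presumably the role of the saved `preprocessor.pkl`.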

⚙️ Tech Stack

| Layer | Technology |
| --- | --- |
| ML Models | Random Forest · XGBoost |
| Data Processing | Pandas · NumPy · Scikit-learn |
| Experiment Tracking | MLflow (SQLite backend) |
| Backend API | FastAPI · Uvicorn · Pydantic |
| Frontend | Streamlit |
| Containerization | Docker · Docker Compose |
| Testing | pytest |

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • pip / conda
  • Docker (optional, for containerized deployment)

1. Clone the Repository

git clone https://github.com/Darshit02/loan-prediction-system.git
cd loan-prediction-system

2. Create a Virtual Environment

# macOS / Linux
python -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Add Your Dataset

Place your raw dataset in the data/raw/ directory:

data/
└── raw/
    └── loan_data.csv       # Your training dataset

Recommended: Kaggle Loan Prediction Dataset


🧠 Run Training Pipeline

python pipelines/training_pipeline.py

This will:

  1. ✅ Load and clean data from data/raw/
  2. ✅ Run feature engineering and preprocessing
  3. ✅ Train Random Forest and XGBoost models
  4. ✅ Compare performance and select the best model
  5. ✅ Log all experiments, metrics, and artifacts to MLflow
  6. ✅ Save the winning model to models/train_model.pkl
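
Step 4 above, comparing models and picking a winner, can be sketched as a simple maximum over a chosen metric. The function name `select_best_model`, the choice of F1 as the deciding metric, and the numbers (borrowed from the approximate figures in this README) are illustrative assumptions. The actual selection rule in `models/train_model.py` may differ.

```python
def select_best_model(results, metric="f1"):
    """Return the (name, metrics) pair with the highest value of `metric`."""
    if not results:
        raise ValueError("no models to compare")
    return max(results.items(), key=lambda item: item[1][metric])


# Illustrative metrics only; real values come from evaluating on the test split.
results = {
    "RandomForest": {"accuracy": 0.85, "f1": 0.83, "auc": 0.91},
    "XGBoost": {"accuracy": 0.88, "f1": 0.86, "auc": 0.93},
}
best_name, best_metrics = select_best_model(results, metric="f1")
print(best_name)
```

Choosing a single metric up front keeps the selection reproducible. For imbalanced approval data, F1 or AUC is usually a safer deciding metric than raw accuracy.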

📈 Run MLflow UI

mlflow ui

Open your browser at http://127.0.0.1:5000

You'll see:

  • 📋 All experiment runs with parameters and metrics
  • 📊 Side-by-side model comparison charts
  • 📦 Saved model artifacts per run
  • 🏷️ Model versioning history

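🔌 Run FastAPI Server

uvicorn src.main:app --reload --host 0.0.0.0 --port 8000

Interactive API docs are available at the following URLs, assuming FastAPI's default documentation routes are left enabled:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc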

📊 Run Streamlit Dashboard

streamlit run app/streamlit_app.py

Open your browser at http://localhost:8501

Features:

  • 📝 Fill in applicant details via form inputs
  • 🤖 Instant prediction with confidence score
  • 📊 Feature importance visualization
  • 📈 Model performance summary panel

📌 API Usage

Endpoint

POST /predict

Request Payload

{
  "income": 50000,
  "loan_amount": 200000,
  "credit_score": 750,
  "employment_status": "Salaried"
}
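
To illustrate what schema validation on this payload might check, here is a minimal sketch that uses a standard-library dataclass as a stand-in for the project's Pydantic models. The class name `LoanRequest`, the allowed employment values, and the 300 to 850 credit score range are assumptions made for the example. The real rules live in `src/schema/`.

```python
from dataclasses import dataclass

# Assumed set of categories; only "Salaried" appears in this README.
ALLOWED_EMPLOYMENT = {"Salaried", "Self-Employed", "Unemployed"}


@dataclass(frozen=True)
class LoanRequest:
    income: float
    loan_amount: float
    credit_score: int
    employment_status: str

    def __post_init__(self):
        # Reject values that cannot describe a real application.
        if self.income < 0:
            raise ValueError("income must be non-negative")
        if self.loan_amount <= 0:
            raise ValueError("loan_amount must be positive")
        # Assumed score range, chosen for illustration.
        if not 300 <= self.credit_score <= 850:
            raise ValueError("credit_score must be between 300 and 850")
        if self.employment_status not in ALLOWED_EMPLOYMENT:
            raise ValueError(f"unknown employment_status: {self.employment_status!r}")


request = LoanRequest(
    income=50000,
    loan_amount=200000,
    credit_score=750,
    employment_status="Salaried",
)
```

With Pydantic the same constraints would typically be declared on the fields themselves, and FastAPI would return a 422 response automatically when a request fails validation.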

Response

{
  "loan_status": "Approved",
  "confidence": 0.87,
  "model_used": "XGBoost"
}

Example with curl

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"income": 50000, "loan_amount": 200000, "credit_score": 750, "employment_status": "Salaried"}'
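
The same call can be made from Python. This sketch uses only the standard library and assumes the server is running locally on port 8000, as in the curl example. The helper names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(payload, base_url="http://localhost:8000"):
    """Build a POST request for the /predict endpoint with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, base_url="http://localhost:8000", timeout=10):
    """Send the payload to the API and return the decoded JSON response."""
    request = build_predict_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "income": 50000,
    "loan_amount": 200000,
    "credit_score": 750,
    "employment_status": "Salaried",
}
request = build_predict_request(payload)
# result = predict(payload)  # requires the FastAPI server to be running
```

Separating request construction from sending makes the client easy to unit test without a live server.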

📊 Model Performance

Models are evaluated on a held-out test set. The best-performing model is automatically selected by the training pipeline.

| Metric | Random Forest | XGBoost |
| --- | --- | --- |
| Accuracy | ~85% | ~88% |
| Precision | ~84% | ~87% |
| Recall | ~83% | ~86% |
| F1-Score | ~83% | ~86% |
| AUC-ROC | ~0.91 | ~0.93 |

Metrics vary based on dataset, hyperparameters, and train/test split. All runs are logged in MLflow for full reproducibility.
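
For readers unfamiliar with how the threshold-based metrics in the table relate to each other, the sketch below derives accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are made up for illustration and do not come from this project's runs. AUC-ROC is omitted because it needs predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 80 approvals correctly predicted, 10 false approvals,
# 20 missed approvals, 90 correct rejections.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

In practice these values would come from `sklearn.metrics` on the held-out test set, but the arithmetic is the same.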


🐳 Docker Deployment

Build & Run (Single Container)

docker build -t loan-prediction .
docker run -p 8000:8000 loan-prediction

Multi-Service with Docker Compose

docker-compose up --build

This spins up:

  • api service → FastAPI on port 8000
  • dashboard service → Streamlit on port 8501
  • mlflow service → MLflow UI on port 5000

🧪 Testing

# Run all tests
pytest tests/

# With coverage report
pytest tests/ --cov=src --cov-report=term-missing

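Test coverage includes:

  • tests/sent_data.py: API payload validation and response assertions
  • Preprocessing pipeline correctness
  • Model loading and inference checks

Note: by default pytest only collects files named test_*.py or *_test.py. Unless the project overrides that setting, tests/sent_data.py is not picked up by pytest tests/ and has to be renamed or passed explicitly (pytest tests/sent_data.py).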

🔮 Future Roadmap

  • 📊 Feature importance visualization: SHAP values in Streamlit
  • 🧠 AutoML integration: Optuna / FLAML hyperparameter search
  • 🏪 Feature store: Feast or Hopsworks integration
  • ✅ Data validation: Great Expectations for schema & drift checks
  • 🔁 CI/CD pipeline: GitHub Actions for automated test + deploy
  • 🔄 Automated retraining: scheduled model refresh on new data
  • 📦 MLflow Model Registry: staging → production promotion workflow
  • 📡 Drift monitoring: Evidently AI or Whylogs integration
  • ☁️ Cloud deployment: AWS SageMaker / GCP Vertex AI

⚠️ Disclaimer

This project is developed strictly for educational and research purposes.
It is not validated for real financial decision-making and must not be used to approve or deny actual loan applications.
Always consult a licensed financial professional for credit-related decisions.


🤝 Contributing

Contributions are welcome! Here's how:

# 1. Fork the repository
# 2. Create a feature branch
git checkout -b feature/your-feature-name

# 3. Commit your changes
git commit -m "feat: add your feature"

# 4. Push to your fork
git push origin feature/your-feature-name

# 5. Open a Pull Request

Please follow Conventional Commits for commit messages.


📜 License

This project is licensed under the MIT License.
See the LICENSE file for details.


Made with ❤️ for the ML & MLOps community

⭐ Star this repo if you found it useful!
