Commit 22fa790

Merge pull request #31 from mtandrita/feature/diabetes-ml-pipeline
Feature/diabetes ml pipeline
2 parents 02f6615 + c56cb21 commit 22fa790

12 files changed: +338 -0 lines changed
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
__pycache__/
*.pyc
model/*.pkl
.venv/
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
# Diabetes Prediction – Machine Learning Pipeline

> ⚠️ This repository is a **forked project**.
> The work below represents my **independent contribution and extension** to the original codebase.

This project implements a complete **end-to-end machine learning pipeline** for predicting diabetes using the Pima Indians Diabetes dataset.
The pipeline covers **data preprocessing, model training, evaluation, experimentation, and inference via CLI**.

---

## 📁 Project Structure
diabetes_pipeline/

├── dataset/
│   └── kaggle_diabetes.csv

├── model/
│   ├── diabetes_model.pkl
│   └── scaler.pkl

├── experiments/
│   ├── experiment_runner.py
│   └── results.csv

├── config.py
├── data_preprocessing.py
├── train.py
├── predict.py
├── evaluate.py
└── README.md

---

## 🚀 My Contributions

I independently designed and implemented the following components:

### 1. Data Preprocessing Pipeline
- Handled missing values in medical features:
  - `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`
- Replaced invalid zeros in these columns with `NaN`
- Applied **mean imputation** (`Glucose`, `BloodPressure`) and **median imputation** (`SkinThickness`, `Insulin`, `BMI`)
- Standardized features using `StandardScaler` (fitted downstream, at training time)
- Ensured consistent feature names across training and inference (for example, `DiabetesPedigreeFunction` is renamed to `DPF`)

📄 `data_preprocessing.py`
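
The zero-to-`NaN` replacement and imputation steps listed above can be sketched on a toy frame. This is a minimal illustration and not the contents of `data_preprocessing.py`: the helper name `impute_from_train` and the sample values are assumptions. It also computes the fill values from the training rows only, which is a deliberate variation on imputing before the split.

```python
import numpy as np
import pandas as pd

# Columns where 0 is not a valid measurement, and the statistic used to fill each.
FILL_STRATEGY = {
    "Glucose": "mean",
    "BloodPressure": "mean",
    "SkinThickness": "median",
    "Insulin": "median",
    "BMI": "median",
}


def impute_from_train(train: pd.DataFrame, test: pd.DataFrame):
    """Replace invalid zeros with NaN, then fill using training-set statistics."""
    cols = list(FILL_STRATEGY)
    train, test = train.copy(), test.copy()
    train[cols] = train[cols].replace(0, np.nan)
    test[cols] = test[cols].replace(0, np.nan)

    # Fill values come from the training rows only, so the test rows
    # never influence the statistics. mean()/median() skip NaN by default.
    fill_values = {col: getattr(train[col], how)() for col, how in FILL_STRATEGY.items()}
    return train.fillna(fill_values), test.fillna(fill_values), fill_values


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
train = pd.DataFrame({
    "Glucose": [100, 0, 140],
    "BloodPressure": [70, 80, 0],
    "SkinThickness": [20, 0, 30],
    "Insulin": [0, 80, 120],
    "BMI": [25.0, 0.0, 35.0],
})
test = pd.DataFrame({
    "Glucose": [0],
    "BloodPressure": [0],
    "SkinThickness": [0],
    "Insulin": [0],
    "BMI": [0.0],
})

train_filled, test_filled, fills = impute_from_train(train, test)
print(fills)
print(test_filled)
```

Returning `fills` alongside the frames makes it possible to reuse the same values at inference time, where a single patient row has no statistics of its own.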

---

### 2. Model Training
- Implemented a reproducible training pipeline
- Trained and persisted (as `model/diabetes_model.pkl` and `model/scaler.pkl`):
  - Random Forest classifier
  - Feature scaler
- Stored trained artifacts for reuse and deployment

📄 `train.py`
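
As a rough sketch of the train-and-persist step described above (`train.py` itself is not reproduced here), the following fits a scaler and a Random Forest on synthetic data and saves both with `joblib`. The function name `train_and_persist`, the synthetic data, and `n_estimators=50` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


def train_and_persist(X, y, model_dir: Path, random_state: int = 0):
    """Fit a scaler and a Random Forest, then save both artifacts with joblib."""
    # model/*.pkl is git-ignored, so the directory may not exist on a fresh clone.
    model_dir.mkdir(parents=True, exist_ok=True)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # A fixed random_state keeps repeated runs reproducible.
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    model.fit(X_scaled, y)

    joblib.dump(model, model_dir / "diabetes_model.pkl")
    joblib.dump(scaler, model_dir / "scaler.pkl")
    return model, scaler


# Synthetic stand-in for the eight diabetes features (not real patient data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 1] + X[:, 5] > 0).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model, scaler = train_and_persist(X, y, model_dir)

    # Reload both artifacts and check that they reproduce the in-memory predictions.
    reloaded_model = joblib.load(model_dir / "diabetes_model.pkl")
    reloaded_scaler = joblib.load(model_dir / "scaler.pkl")
    same = np.array_equal(
        model.predict(scaler.transform(X)),
        reloaded_model.predict(reloaded_scaler.transform(X)),
    )

print("Reloaded artifacts reproduce predictions:", same)
```

Saving the scaler next to the model matters because any script that loads the model later has to apply the same transformation before calling `predict`.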
56+
57+
---
58+
59+
### 3. Model Evaluation
60+
- Added evaluation logic with:
61+
- Accuracy
62+
- Precision, Recall, F1-score
63+
- Verified generalization on the test set
64+
65+
📄 `evaluate.py`
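
To make the metrics above concrete, here is a small worked example in plain Python. It is an illustration only: `binary_metrics` is a hypothetical helper, the labels are made up, and returning `0.0` when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these labels every metric works out to 0.75: six of eight predictions are correct, three of four predicted positives are real, and three of four real positives are found.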

---

### 4. Experimentation Framework
- Benchmarked multiple ML models:
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - Support Vector Machine (SVM)
- Automatically reports accuracy and F1-score, and writes them to `experiments/results.csv`

📄 `experiments/experiment_runner.py`

#### Sample Results

| Model                | Accuracy | F1 Score |
|----------------------|----------|----------|
| Logistic Regression  | 0.7875   | 0.6320   |
| Decision Tree        | 0.9875   | 0.9805   |
| Random Forest        | 0.9950   | 0.9921   |
| SVM                  | 0.8450   | 0.7328   |

✔️ **Random Forest scores highest on this single 80/20 split (`random_state=0`)**
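
The figures above come from one train/test split. A possible way to see how much such a score varies is k-fold cross-validation; the sketch below uses synthetic data from `make_classification` rather than the diabetes CSV, so its numbers say nothing about the table above. The five-fold setup is an assumption, not part of `experiment_runner.py`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Five stratified folds; the scaler is refit inside each training fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

Passing the whole `Pipeline` to `cross_val_score` keeps the scaling step inside each fold, so no fold's held-out rows contribute to the scaler statistics.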

---

### 5. Command-Line Prediction Interface
- Built a CLI-based inference script
- Ensures:
  - Correct feature order
  - Feature-name alignment with the trained scaler
- Predicts diabetes for a single patient input

📄 `predict.py`
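
One way the feature-order and feature-name guarantees above could be enforced is a single ordered table that drives both the argument parser and the row handed to the model. This is a hedged sketch, not the code in `predict.py`: the names `FLAG_TO_FEATURE`, `build_parser`, and `args_to_row` are hypothetical, and the column order assumes the standard Pima layout with `DiabetesPedigreeFunction` renamed to `DPF`.

```python
import argparse

# (CLI flag, feature column, type), listed in the assumed training column order.
FLAG_TO_FEATURE = [
    ("pregnancies", "Pregnancies", int),
    ("glucose", "Glucose", float),
    ("bp", "BloodPressure", float),
    ("skin", "SkinThickness", float),
    ("insulin", "Insulin", float),
    ("bmi", "BMI", float),
    ("dpf", "DPF", float),
    ("age", "Age", int),
]


def build_parser() -> argparse.ArgumentParser:
    """Create one required flag per feature."""
    parser = argparse.ArgumentParser(description="Predict diabetes for one patient.")
    for flag, feature, cast in FLAG_TO_FEATURE:
        parser.add_argument(f"--{flag}", type=cast, required=True, help=f"value for {feature}")
    return parser


def args_to_row(args: argparse.Namespace) -> dict:
    """Return {feature: value} in training order (dicts preserve insertion order)."""
    return {feature: getattr(args, flag) for flag, feature, _ in FLAG_TO_FEATURE}


# Parse an explicit argument list here; a real script would call parse_args() with no arguments.
argv = "--pregnancies 2 --glucose 120 --bp 70 --skin 20 --insulin 80 --bmi 25 --dpf 0.5 --age 35".split()
row = args_to_row(build_parser().parse_args(argv))
print(row)
```

The resulting dict can be wrapped in a one-row `pandas.DataFrame`, which carries the column names the fitted scaler expects.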

Example:
```bash
python predict.py \
  --pregnancies 2 \
  --glucose 120 \
  --bp 70 \
  --skin 20 \
  --insulin 80 \
  --bmi 25 \
  --dpf 0.5 \
  --age 35
```

---

## 🛠️ Tech Stack

- Python 3.10+
- pandas
- numpy
- scikit-learn
- joblib

---

## 🧩 Notes

- The project is modular (separate scripts for preprocessing, training, evaluation, and inference) and persists its artifacts for deployment
- Structured to support FastAPI / Flask integration
- Generated files (`__pycache__/`, `*.pyc`, `model/*.pkl`, `.venv/`) are excluded via `.gitignore`
- Suitable for internship-level ML engineering evaluation

---

## 👩‍💻 Author Contribution

**Contributor:** Tandrita Mukherjee

**Contribution Scope:**
- ML pipeline design
- Data preprocessing
- Model training & evaluation
- Experimentation framework
- CLI-based inference system

---

## 📌 Disclaimer

This repository is a fork of an existing project.
All enhancements, restructuring, and ML pipeline components listed above were implemented independently as part of my learning and internship preparation.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

MODEL_DIR = BASE_DIR / "model"
MODEL_PATH = MODEL_DIR / "diabetes_model.pkl"
SCALER_PATH = MODEL_DIR / "scaler.pkl"
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# diabetes_pipeline/data_preprocessing.py

import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split

def load_and_preprocess(test_size=0.2, random_state=0):
    BASE_DIR = Path(__file__).resolve().parent
    csv_path = BASE_DIR / "dataset" / "kaggle_diabetes.csv"
    df = pd.read_csv(csv_path)

    df = df.rename(columns={'DiabetesPedigreeFunction': 'DPF'})
    # A zero in these columns marks a missing measurement, not a real value
    cols_with_zero = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
    df[cols_with_zero] = df[cols_with_zero].replace(0, np.nan)
    # Fill values are computed on the full dataset, i.e. before the split
    df['Glucose'] = df['Glucose'].fillna(df['Glucose'].mean())
    df['BloodPressure'] = df['BloodPressure'].fillna(df['BloodPressure'].mean())
    df['SkinThickness'] = df['SkinThickness'].fillna(df['SkinThickness'].median())
    df['Insulin'] = df['Insulin'].fillna(df['Insulin'].median())
    df['BMI'] = df['BMI'].fillna(df['BMI'].median())

    X = df.drop(columns='Outcome')
    y = df['Outcome']

    return train_test_split(  # returns X_train, X_test, y_train, y_test
        X, y, test_size=test_size, random_state=random_state
    )

Diabetes Prediction [END 2 END]/dataset/kaggle_diabetes.csv renamed to Diabetes Prediction [END 2 END]/diabetes_pipeline/dataset/kaggle_diabetes.csv

File renamed without changes.
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
import joblib
from sklearn.metrics import accuracy_score, classification_report
from data_preprocessing import load_and_preprocess
from config import MODEL_PATH

# Load data (load_and_preprocess returns exactly four values)
X_train, X_test, y_train, y_test = load_and_preprocess()

# Load trained model
model = joblib.load(MODEL_PATH)

# Predict (assumes the model accepts unscaled features; apply the saved scaler first if train.py scaled them)
y_pred = model.predict(X_test)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", report)

Diabetes Prediction [END 2 END]/diabetes_pipeline/experiments/__init__.py

Whitespace-only changes.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# diabetes_pipeline/experiments/experiment_runner.py

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

from diabetes_pipeline.data_preprocessing import load_and_preprocess

X_train, X_test, y_train, y_test = load_and_preprocess()

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC()
}

results = []

for name, model in models.items():
    pipeline = Pipeline([
        ("scaler", StandardScaler()),  # refit on the training split for every model
        ("model", model)
    ])

    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, preds),
        "F1 Score": f1_score(y_test, preds)
    })

df = pd.DataFrame(results)
print(df)

df.to_csv("diabetes_pipeline/experiments/results.csv", index=False)  # path is relative to the working directory
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
Model,Accuracy,F1 Score
LogisticRegression,0.7875,0.6320346320346321
DecisionTree,0.9875,0.980544747081712
RandomForest,0.995,0.9921259842519685
SVM,0.845,0.7327586206896551
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
2025-12-28 11:48:56,518 - INFO - Training started
2025-12-28 11:48:56,641 - INFO - Model and scaler saved successfully
2025-12-28 11:49:14,730 - INFO - Training started
2025-12-28 11:49:14,821 - INFO - Model and scaler saved successfully
