An end-to-end machine learning pipeline for predicting student academic performance using structured educational data.
This project explores both classical machine learning and deep learning approaches on the UCI Student Performance dataset, with a strong focus on preprocessing, feature engineering, model comparison, and evaluation.
Live demo: Try it here!
The objective is to predict whether a student will pass or fail based on demographic, academic, and behavioral attributes.
Two modeling scenarios were explored:
- With prior grades (`G1`, `G2`): easier prediction setting with strong academic-history signals.
- Without prior grades (`G1`, `G2` removed): harder and more realistic prediction setting.
This comparison helps evaluate model robustness under different feature availability conditions.
Dataset: UCI Student Performance Dataset
- Source: Kaggle / UCI Repository
- Samples: 395
- Features: student demographics, family background, study habits, academic history, lifestyle attributes
Target variable:
- Binary classification:
- 1 = Pass
- 0 = Fail
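A minimal sketch of how the binary target and the two feature scenarios could be derived. The pass threshold of 10 on the final grade `G3` is an assumption (the cut-off is not stated here), and the tiny DataFrame is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has 395 samples and many more columns.
df = pd.DataFrame({
    "studytime": [2, 1, 3, 2],
    "absences": [4, 10, 0, 6],
    "G1": [12, 7, 15, 9],
    "G2": [11, 8, 16, 9],
    "G3": [12, 6, 17, 10],
})

PASS_THRESHOLD = 10  # assumption: cut-off on the final grade G3

# Binary target: 1 = Pass, 0 = Fail
y = (df["G3"] >= PASS_THRESHOLD).astype(int)

# Scenario 1 keeps the prior grades; scenario 2 drops them.
X_with_grades = df.drop(columns=["G3"])
X_without_grades = df.drop(columns=["G3", "G1", "G2"])
```

`G3` is dropped in both scenarios because the target is derived from it.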
Performed exploratory data analysis to understand feature distributions and dataset characteristics.
Tasks completed:
- Dataset inspection using `head()`, `shape`, `info()`, `describe()`
- Missing value analysis
- Duplicate check
- Class imbalance analysis
- Histograms
- Boxplots
- Correlation heatmap
- Outlier inspection
Key observations:
- No missing values
- No duplicates
- Moderate class imbalance
- Strong correlation of `G1`, `G2` with the final grade
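The checks behind these observations can be sketched in a few pandas calls. The DataFrame below is a small made-up stand-in, and the `passed` column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "school": ["GP", "GP", "MS", "GP"],
    "G1": [12, 7, 15, 9],
    "G2": [11, 8, 16, 9],
    "G3": [12, 6, 17, 10],
    "passed": [1, 0, 1, 1],
})

missing_per_column = df.isnull().sum()                      # missing value analysis
n_duplicates = int(df.duplicated().sum())                   # duplicate check
class_share = df["passed"].value_counts(normalize=True)     # class imbalance
corr_with_g3 = df[["G1", "G2", "G3"]].corr()["G3"]          # grade correlations

print(missing_per_column, n_duplicates, class_share, corr_with_g3, sep="\n")
```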
Preprocessing steps included:
- One-hot encoding for categorical features
- Standardization using `StandardScaler`
- Train-test split (80/20)
- Stratified sampling for label balance preservation
Scaling applied only where required:
- Logistic Regression
- SVM
- MLP
No scaling used for:
- Random Forest
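One way these preprocessing steps could fit together, as a sketch: it assumes scikit-learn's `OneHotEncoder` inside a `ColumnTransformer` (the project may encode differently), and the column names and data are invented. Wrapping the scaler in a pipeline means it is fitted on the training split only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data: one categorical and two numeric features.
df = pd.DataFrame({
    "school": ["GP", "MS"] * 10,
    "studytime": [1, 2, 3, 4, 2] * 4,
    "absences": list(range(20)),
    "passed": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0] * 2,
})
X, y = df.drop(columns=["passed"]), df["passed"]

# 80/20 split, stratified so both splits keep the pass/fail ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["school"]),
    ("num", StandardScaler(), ["studytime", "absences"]),
])

# Scale-sensitive model: encoding and scaling happen inside the pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
```

For Random Forest, the `"num"` step could be replaced with `"passthrough"`, since tree splits do not depend on feature scale.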
Custom engineered features were created to improve predictive performance.
Features engineered:
- `study_efficiency`
- `risk_score`
- `study_discipline`
Purpose:
- Capture behavioral and academic patterns beyond raw features.
Feature engineering was validated through Random Forest feature importance analysis.
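The formulas for the engineered features are not given here, so the definitions below are purely hypothetical placeholders, as are the raw column names and the synthetic data. The sketch only shows the validation pattern: add the features, fit a Random Forest, and rank importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a few raw columns.
df = pd.DataFrame({
    "studytime": rng.integers(1, 5, n),
    "failures": rng.integers(0, 4, n),
    "absences": rng.integers(0, 30, n),
    "goout": rng.integers(1, 6, n),
})
y = (df["studytime"] - df["failures"] + rng.normal(0, 1, n) > 0).astype(int)

# Hypothetical definitions, NOT the project's actual formulas.
df["study_efficiency"] = df["studytime"] / (1 + df["failures"])
df["risk_score"] = df["failures"] + df["absences"] / 10
df["study_discipline"] = df["studytime"] - df["goout"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, y)

# Impurity-based importances, highest first.
importances = pd.Series(rf.feature_importances_, index=df.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

An engineered feature that ranks near the top suggests it carries signal, though impurity-based importances can be split among correlated columns.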
Logistic Regression was used as a linear baseline classifier.
Concepts explored:
- linear decision boundaries
- classification probabilities
- regularization
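A small sketch of the Logistic Regression concepts above on synthetic data. The `C` values are arbitrary illustrations, not the project's settings. In scikit-learn, a smaller `C` means stronger L2 regularization, which shrinks the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

weak_reg = make_pipeline(
    StandardScaler(), LogisticRegression(C=10.0, max_iter=1000)
).fit(X, y)
strong_reg = make_pipeline(
    StandardScaler(), LogisticRegression(C=0.01, max_iter=1000)
).fit(X, y)

# Class probabilities: one row per sample, columns ordered as [P(0), P(1)].
proba = weak_reg.predict_proba(X[:5])

# Coefficient norms shrink as regularization gets stronger.
coef_norm_weak = np.linalg.norm(weak_reg[-1].coef_)
coef_norm_strong = np.linalg.norm(strong_reg[-1].coef_)
print(coef_norm_weak, coef_norm_strong)
```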
SVM served as a kernel-based nonlinear classifier.
Hyperparameter tuning: `C`
Concepts explored:
- margin maximization
- nonlinear decision boundaries
- kernel methods
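Tuning `C` for the SVM could look roughly like this. The RBF kernel, the candidate values, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv_f1_by_C = {}
for C in [0.1, 1.0, 10.0]:  # illustrative grid
    # Scaling lives inside the pipeline so each CV fold is scaled independently.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    cv_f1_by_C[C] = cross_val_score(svm, X, y, cv=5, scoring="f1").mean()

best_C = max(cv_f1_by_C, key=cv_f1_by_C.get)
print(cv_f1_by_C, best_C)
```

A larger `C` penalizes margin violations more heavily, trading a wider margin for fewer training errors.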
Random Forest served as an ensemble tree-based classifier.
Hyperparameter tuning: `n_estimators`
Concepts explored:
- bagging
- ensemble learning
- feature importance
- variance reduction
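A sketch of how `n_estimators` affects a bagged ensemble, using the out-of-bag (OOB) score as a cheap estimate. Using OOB accuracy is an assumption made here for illustration and may not be how the project tuned the model; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

oob_accuracy_by_n = {}
for n_estimators in [50, 100, 200]:  # illustrative values
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        bootstrap=True,   # each tree sees a bootstrap sample (bagging)
        oob_score=True,   # score each sample with trees that never saw it
        random_state=0,
    ).fit(X, y)
    oob_accuracy_by_n[n_estimators] = rf.oob_score_

print(oob_accuracy_by_n)
```

Averaging many decorrelated trees is what reduces variance; adding trees usually stabilizes the score rather than raising it indefinitely.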
An MLP built with TensorFlow/Keras served as the deep learning benchmark.
Architecture:
- 2 hidden layers
- ReLU activation
- Dropout regularization
- Adam optimizer
Training techniques:
- Early stopping
- Validation split
Concepts explored:
- backpropagation
- gradient descent
- neural network regularization
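A minimal Keras sketch matching the architecture and training techniques listed above. Layer widths, dropout rate, patience, epochs, and batch size are all assumptions, and the data is synthetic.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer 1 (width assumed)
    tf.keras.layers.Dropout(0.3),                   # dropout rate assumed
    tf.keras.layers.Dense(16, activation="relu"),   # hidden layer 2 (width assumed)
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"), # pass probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when validation loss stops improving and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
history = model.fit(
    X, y,
    validation_split=0.2,
    epochs=30,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0,
)
predictions = model.predict(X[:3], verbose=0)
```

`history.history["loss"]` and `history.history["val_loss"]` hold the per-epoch curves that can be plotted to inspect early stopping behavior.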
Models were evaluated using:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
Additional validation:
- 5-Fold Cross Validation
- Learning Curves
- Bias-Variance Analysis
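To make the metrics concrete, here is a tiny worked example with hand-written labels (not project results): 6 true positives, 1 false negative, 1 false positive, and 2 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = Pass, 0 = Fail
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted
accuracy = accuracy_score(y_true, y_pred)  # (TP + TN) / total = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 6 / 7
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 6 / 7
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 6 / 7

print(cm, accuracy, precision, recall, f1)
```

When the positive class is the majority, as in this toy example, F1 on that class can exceed accuracy.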
| Model | Accuracy (With Grades) | F1 (With Grades) | Accuracy (Without Grades) | F1 (Without Grades) |
|---|---|---|---|---|
| Logistic Regression | 0.8734 | 0.9020 | 0.6456 | 0.7544 |
| SVM | 0.8228 | 0.8679 | 0.6835 | 0.7934 |
| Random Forest | 0.8861 | 0.9109 | 0.6709 | 0.7797 |
| MLP | 0.8481 | 0.8846 | 0.6709 | 0.7969 |
| Model | CV F1 (With Grades) | CV F1 (Without Grades) |
|---|---|---|
| Logistic Regression | 0.9333 | 0.7818 |
| SVM | 0.8979 | 0.7974 |
| Random Forest | 0.9354 | 0.7890 |
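Cross-validated F1 scores like those in the table above could be computed roughly as follows. The stratified, shuffled folds and the synthetic data are assumptions; the per-fold standard deviation gives a simple view of stability across folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced stand-in data.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.33, 0.67], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="f1",
)

mean_f1, std_f1 = scores.mean(), scores.std()
print(f"CV F1: {mean_f1:.4f} +/- {std_f1:.4f}")
```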
Best model with prior grades: Random Forest
Why:
- highest test F1 score (0.9109)
- highest CV F1 (0.9354)
- lowest variance across folds
Insight:
- prior academic grades are highly predictive.
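Most reliable model without prior grades: SVM
Why:
- best CV F1 under the reduced feature setting (0.7974)
Insight:
- nonlinear models perform better when strong grade predictors are removed (CV F1 of 0.7974 for SVM and 0.7890 for Random Forest vs. 0.7818 for Logistic Regression).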
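With prior grades, the MLP did not outperform Random Forest on this small tabular dataset (test F1 0.8846 vs. 0.9109). Without them, the two tied on accuracy (0.6709), with the MLP slightly ahead on F1 (0.7969 vs. 0.7797).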
Conclusion:
- neural networks are not automatically superior for structured data.
Learning curves were generated for:
- Logistic Regression
- Random Forest
Insights:
- Logistic Regression showed low bias and mild variance
- Random Forest showed near-perfect training performance with controlled overfitting
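Learning curves of this kind can be generated with scikit-learn's `learning_curve`. The sketch below uses synthetic data and an arbitrary grid of training sizes; the gap between training and validation scores is a rough indicator of variance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% to 100% of each training fold
    cv=5,
    scoring="f1",
)

# Average over folds; a large train/validation gap suggests overfitting.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean
print(train_sizes, train_mean, val_mean, gap, sep="\n")
```

Low scores on both curves would point to high bias, while a high training score with a much lower validation score points to high variance.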
MLP was analyzed using:
- training loss curves
- validation loss curves
- early stopping behavior
Limitations:
- Small dataset size (395 samples)
- Moderate class imbalance
- Strong grade-based predictors simplify the task
- Limited hyperparameter tuning
Possible extensions:
- GridSearchCV / RandomizedSearchCV
- SMOTE / class balancing
- Feature selection methods
- XGBoost / LightGBM benchmarking
Tech stack:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- TensorFlow / Keras
- Google Colab
student-performance-predictor/
│
├── data/
├── notebooks/
├── results/
├── README.md
└── requirements.txt
pip install -r requirements.txt
streamlit run app.py
Built as a hands-on machine learning project to strengthen understanding of:
- supervised learning
- preprocessing pipelines
- model training
- evaluation metrics
- cross-validation
- deep learning fundamentals
- bias-variance tradeoff