Binary classification project predicting whether an individual's annual income exceeds $50K based on demographic and employment attributes from the UCI Adult Census dataset.
- Overview
- Dataset
- Project Workflow
- Tech Stack
- Project Structure
- Installation
- Usage
- Exploratory Data Analysis
- Modeling
- Results
- Model Evaluation
- Conclusion
- Future Improvements
- Author
- License
This project applies supervised machine learning to predict whether a person earns more than $50K per year using the well-known UCI Adult Census Income dataset. The pipeline covers data cleaning, exploratory data analysis (EDA), feature engineering, model training, hyperparameter tuning, and evaluation — all packaged in a reproducible workflow suitable for portfolio and internship presentation.
- Source: UCI Machine Learning Repository — Adult Dataset
- Records: 48,842 instances
- Features: 14 demographic and employment attributes
- Target: income (<=50K or >50K)
Key Features
age, workclass, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
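A minimal loading sketch, assuming `data/adult.csv` has a header row using the column names listed above and marks missing values with `?`, as the raw UCI files do. The `load_adult` helper, the `target` column, and the inline sample rows are hypothetical and only illustrate the idea.

```python
import io

import pandas as pd


def load_adult(source):
    """Read Adult-style CSV data, treating '?' as missing."""
    # skipinitialspace drops the blank that follows each comma in the raw files
    df = pd.read_csv(source, na_values="?", skipinitialspace=True)
    # Binary target: 1 for '>50K', 0 for '<=50K'
    df["target"] = (df["income"].str.strip() == ">50K").astype(int)
    return df


# Made-up rows standing in for data/adult.csv
sample = io.StringIO(
    "age,workclass,education,hours-per-week,income\n"
    "39, State-gov, Bachelors, 40, <=50K\n"
    "52, ?, HS-grad, 45, >50K\n"
)
df = load_adult(sample)
print(df.isna().sum())  # missing values per column
```

With the real file, `load_adult("data/adult.csv")` would be the equivalent call, provided the assumptions above hold.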
- Data Loading & Inspection
- Data Cleaning — handling missing values & duplicates
- Exploratory Data Analysis (EDA)
- Feature Engineering & Encoding
- Train / Test Split & Scaling
- Model Training (multiple algorithms)
- Hyperparameter Tuning
- Model Evaluation & Comparison
- Model Persistence (.pkl)
- Conclusion & Insights
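The encoding, scaling, splitting, and training steps above can be sketched as a single scikit-learn pipeline. This is an illustration under assumptions, not the notebook's actual code: the rows are made up, only three feature columns are used, and Logistic Regression stands in for whichever model is being trained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for the cleaned dataset
df = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61, 33, 44],
    "hours-per-week": [40, 50, 45, 60, 35, 40, 38, 55],
    "education": ["HS-grad", "Bachelors", "Masters", "Bachelors",
                  "HS-grad", "Masters", "HS-grad", "Bachelors"],
    "income": ["<=50K", ">50K", ">50K", ">50K",
               "<=50K", ">50K", "<=50K", "<=50K"],
})
X = df.drop(columns="income")
y = (df["income"] == ">50K").astype(int)

# Scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified split keeps the class ratio similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, and the whole object can be saved with joblib as one file.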
| Category | Tools |
|---|---|
| Language | Python 3.9+ |
| Data Handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Machine Learning | scikit-learn |
| Model Persistence | joblib |
| Environment | Jupyter Notebook |
salary-prediction/
│
├── data/
│ └── adult.csv
│
├── notebooks/
│ └── salary_prediction.ipynb
│
├── models/
│ └── best_model.pkl
│
├── images/
│ └── eda_plots/
│ ├── correlation_heatmap.png
│ ├── feature_importance.png
│ ├── confusion_matrix.png
│ └── roc_curve.png
│
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore
1. Clone the repository
git clone https://github.com/ParvathyM155/Salary_Prediction_Using_Machine_Learning.git
cd Salary_Prediction_Using_Machine_Learning
2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # macOS / Linux
venv\Scripts\activate # Windows
3. Install dependencies
pip install -r requirements.txt
Launch the notebook:
jupyter notebook notebooks/salary_prediction.ipynb
Or load the saved model directly in Python:
import joblib
model = joblib.load("models/best_model.pkl")
# new_data: rows with the same features (and preprocessing) the model was trained on
prediction = model.predict(new_data)
Key insights uncovered during EDA:
- Strong correlation between education level and income.
- Hours-per-week and age significantly affect earning probability.
- Marital status and occupation are powerful categorical predictors.
- The dataset is imbalanced (~76% <=50K, ~24% >50K).
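The imbalance sets a floor for accuracy: a classifier that always predicts the majority class would already score about 0.76. A small sketch of that check follows; the labels are synthetic and built only to match the rounded split quoted above, not taken from the real data.

```python
import pandas as pd

# Synthetic labels matching the rounded 76/24 split (not the real dataset)
income = pd.Series(["<=50K"] * 76 + [">50K"] * 24, name="income")

# Share of each class
class_share = income.value_counts(normalize=True)
print(class_share)

# Accuracy of always predicting the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any reported accuracy is best read relative to this baseline rather than relative to 0.50.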
Visualizes pairwise relationships between numerical features and the target variable.
The following classifiers were trained and compared:
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting
- K-Nearest Neighbors
- Support Vector Machine
Hyperparameter tuning was performed using GridSearchCV with cross-validation.
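A hedged sketch of what a GridSearchCV run with cross-validation might look like. The parameter grid, the scoring metric, the number of folds, and the synthetic data are all assumptions, since the values used in the notebook are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the preprocessed features
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.76, 0.24], random_state=42
)

# Hypothetical grid; replace with the values you actually want to search
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # assumed metric; accuracy or F1 are alternatives
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Stratified folds keep the class ratio roughly constant across splits, which matters when one class is only about a quarter of the data.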
Top features driving the Gradient Boosting model's predictions.
| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.85 | 0.74 | 0.60 | 0.66 | 0.90 |
| Random Forest | 0.86 | 0.76 | 0.63 | 0.69 | 0.91 |
| Gradient Boosting | 0.87 | 0.79 | 0.65 | 0.71 | 0.92 |
✅ Gradient Boosting achieved the best overall performance and was selected as the final model.
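As a quick consistency check, the F1 scores in the table can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_score_from` is illustrative.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table
reported = {
    "Logistic Regression": (0.74, 0.60, 0.66),
    "Random Forest": (0.76, 0.63, 0.69),
    "Gradient Boosting": (0.79, 0.65, 0.71),
}

for name, (p, r, f1) in reported.items():
    computed = f1_score_from(p, r)
    print(f"{name}: computed F1 = {computed:.2f}, reported = {f1:.2f}")
```

For all three rows the recomputed value rounds to the reported F1-Score.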
Evaluation techniques applied:
- Confusion Matrix
- Classification Report
- ROC Curve & AUC
- Precision–Recall Curve
- K-Fold Cross-Validation
- Feature Importance Analysis
Breakdown of correct vs incorrect predictions for each income class.
Trade-off between true-positive and false-positive rates across thresholds.
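Several of the evaluation techniques listed above can be sketched with scikit-learn's metrics module. The data here is synthetic and the model uses default settings, so the numbers it prints are unrelated to the results reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the preprocessed dataset
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.76, 0.24], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=["<=50K", ">50K"]))

# ROC-AUC needs scores, not hard predictions
auc = roc_auc_score(y_test, y_score)
print(f"ROC-AUC: {auc:.3f}")

# 5-fold cross-validated ROC-AUC on the training split
cv_scores = cross_val_score(
    GradientBoostingClassifier(random_state=0),
    X_train, y_train, cv=5, scoring="roc_auc",
)
print(cv_scores.mean())
```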
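The final Gradient Boosting model predicts income brackets with ~87% accuracy and a 0.92 ROC-AUC, compared with the ~76% share of the majority class. Its recall of 0.65 shows that a sizeable share of positive cases is still missed, so accuracy alone overstates performance on this imbalanced dataset. Education, age, hours-per-week, and capital gain emerged as the most influential features, which aligns with real-world economic intuition.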
- Address class imbalance using SMOTE or class weighting
- Experiment with XGBoost and LightGBM
- Deploy the model as a Streamlit or Flask web app
- Add MLflow for experiment tracking
- Build a CI pipeline with GitHub Actions
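For the class-weighting idea, one possible starting point is per-sample weights, since scikit-learn's `GradientBoostingClassifier` accepts `sample_weight` in `fit` rather than a `class_weight` argument. This is a sketch on synthetic data, not something the project currently does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced stand-in for the preprocessed dataset
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.76, 0.24], random_state=1
)

# 'balanced' weights are inversely proportional to class frequency
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = GradientBoostingClassifier(random_state=1)
clf.fit(X, y, sample_weight=weights)

# After weighting, each class contributes the same total weight
total_per_class = [weights[y == c].sum() for c in np.unique(y)]
print(total_per_class)
```

Whether this improves recall on the minority class would need to be checked with the same evaluation metrics used for the current model.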
Parvathy M
- 🌐 Portfolio: yourwebsite.com
- 💼 LinkedIn: linkedin.com/in/parvathym155
- 🐙 GitHub: @ParvathyM155
This project is licensed under the MIT License — see the LICENSE file for details.
⭐ If you found this project helpful, please consider giving it a star!



