1 | | -# Student-performance-analysis-using-Big-data |
2 | | -execution of this project is a piece of cake. |
3 | | -first of all save datasets.csv file and student.py in same folder. |
4 | | -then open terminal from same folder and type "python student.py". |
5 | | -Kudos. |
| 1 | +# Student Performance Analysis using Big Data |
| 2 | + |
| 3 | +A machine learning project that analyzes and predicts student academic performance from educational behavior data. It compares several regression and ensemble algorithms to identify the factors that most influence student success. |
| 4 | + |
| 5 | +## 📋 Overview |
| 6 | + |
| 7 | +This project performs comprehensive analysis of student performance data, including: |
| 8 | +- **Exploratory Data Analysis (EDA)**: Visualizes distributions across student demographics and behaviors |
| 9 | +- **Feature Engineering**: Encodes categorical variables and identifies the most important predictors |
| 10 | +- **Model Selection**: Compares multiple machine learning algorithms |
| 11 | +- **Hyperparameter Tuning**: Optimizes the best-performing models |
| 12 | +- **Ensemble Methods**: Implements advanced ensemble techniques for improved predictions |
| 13 | + |
| 14 | +## 🎯 Key Features |
| 15 | + |
| 16 | +- **Data Processing**: Handles categorical encoding and feature scaling |
| 17 | +- **Feature Importance Analysis**: Identifies which student behaviors matter most |
| 18 | +- **Multiple Algorithms**: Tests 6+ regression models including: |
| 19 | + - Linear Regression |
| 20 | + - LASSO & Elastic Net |
| 21 | + - K-Nearest Neighbors |
| 22 | + - Decision Trees |
| 23 | + - Support Vector Regression |
| 24 | + - Ensemble methods (AdaBoost, Gradient Boosting, Random Forest, Extra Trees) |
| 25 | +- **Visualization**: Generates comparison plots and feature importance charts |
| 26 | +- **Cross-Validation**: 10-fold cross-validation for robust model evaluation |
| 27 | + |
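As an illustration of the 10-fold cross-validation comparison described above, here is a minimal sketch using scikit-learn. The synthetic data, the model labels, the `random_state` values, and the scoring metric are assumptions made for this example; they are not taken from `student.py`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the real script reads the CSV instead.
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(200, 4))
y = 0.02 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.3, size=200)

# Baseline regressors to spot-check; the short labels are hypothetical.
models = {
    "LR": LinearRegression(),
    "LASSO": Lasso(),
    "EN": ElasticNet(),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=7),
    "SVR": SVR(),
}

# The same 10 folds are reused for every model so the scores are comparable.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.4f} std={scores.std():.4f}")
```

Scores are negated mean squared errors, so values closer to zero indicate a better fit.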
| 28 | +## 📊 Dataset |
| 29 | + |
| 30 | +The analysis uses the **xAPI-Edu-Data** dataset, saved in this repository as `datasets.csv`, which contains 482 student records with: |
| 31 | +- **Student Information**: Gender, nationality, grade level, section |
| 32 | +- **Academic Engagement**: Raised hands, visited resources, announcements viewed |
| 33 | +- **Participation**: Discussion contributions |
| 34 | +- **Attendance**: Absence patterns |
| 35 | +- **Parent Involvement**: Survey response and school satisfaction |
| 36 | +- **Target Variable**: Student class (performance level) |
| 37 | + |
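Several of these fields are categorical and need numeric codes before a regressor can use them. The sketch below shows one possible encoding with scikit-learn's `LabelEncoder`. The four-row frame, its column names, and its values are made up for illustration and may not match the real CSV headers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical sample; column names and values are assumptions.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "RaisedHands": [15, 70, 40, 5],
    "StudentAbsenceDays": ["Under-7", "Under-7", "Above-7", "Above-7"],
    "Class": ["M", "H", "M", "L"],
})

# Columns treated as categorical in this sketch.
categorical_cols = ["gender", "StudentAbsenceDays", "Class"]

# Keep one fitted encoder per column so codes can be mapped back to labels.
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

`LabelEncoder` assigns codes in sorted label order, so the codes do not necessarily follow the performance ordering; an explicit mapping may be preferable when the order of levels matters.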
| 38 | +## 🚀 Quick Start |
| 39 | + |
| 40 | +### Prerequisites |
| 41 | +- Python 3.x |
| 42 | +- pandas |
| 43 | +- numpy |
| 44 | +- scikit-learn |
| 45 | +- matplotlib |
| 46 | +- seaborn |
| 47 | + |
| 48 | +### Installation |
| 49 | + |
| 50 | +1. Clone the repository: |
| 51 | +```bash |
| 52 | +git clone <repository-url> |
| 53 | +cd Student-performance-analysis-using-Big-data |
| 54 | +``` |
| 55 | + |
| 56 | +2. Install dependencies: |
| 57 | +```bash |
| 58 | +pip install pandas numpy scikit-learn matplotlib seaborn |
| 59 | +``` |
| 60 | + |
| 61 | +3. Run the analysis: |
| 62 | +```bash |
| 63 | +python student.py |
| 64 | +``` |
| 65 | + |
| 66 | +The script will output: |
| 67 | +- Dataset statistics and descriptive analysis |
| 68 | +- Feature importance rankings |
| 69 | +- Algorithm comparison results |
| 70 | +- Cross-validation scores |
| 71 | +- Final model performance metrics |
| 72 | +- Visualization plots |
| 73 | + |
| 74 | +## 📈 Project Workflow |
| 75 | + |
| 76 | +1. **Data Loading & Exploration**: Load CSV and display basic statistics |
| 77 | +2. **Data Preprocessing**: |
| 78 | + - Remove irrelevant features |
| 79 | + - Encode categorical variables |
| 80 | + - Scale features for algorithms |
| 81 | +3. **Feature Selection**: Rank features by importance and retain the six most important |
| 82 | +4. **Model Evaluation**: |
| 83 | + - Spot-check baseline algorithms |
| 84 | + - Test scaled versions with StandardScaler |
| 85 | + - Compare ensemble methods |
| 86 | +5. **Hyperparameter Tuning**: |
| 87 | + - Grid search for optimal LASSO alpha |
| 88 | + - AdaBoost estimator optimization |
| 89 | +6. **Final Model**: Train Gradient Boosting with optimized parameters |
| 90 | + |
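The scaling and tuning steps of this workflow can be sketched as a scikit-learn `Pipeline` wrapped in `GridSearchCV`. This is a minimal sketch under stated assumptions: the synthetic data, the alpha grid, the 80/20 split, and the step names `scaler` and `lasso` are illustrative choices, not values read from `student.py`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six selected features (assumption).
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(200, 6))
y = 0.02 * X[:, 0] - 0.01 * X[:, 3] + rng.normal(0, 0.3, size=200)

# Hold out a test set before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Putting the scaler inside the pipeline refits it on each training fold,
# so validation folds never leak into the scaling statistics.
pipeline = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso())])
param_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]}  # hypothetical grid

grid = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=7),
)
grid.fit(X_train, y_train)

best_alpha = grid.best_params_["lasso__alpha"]
# The refitted best pipeline reports R^2 on the held-out data.
test_r2 = grid.best_estimator_.score(X_test, y_test)
print(f"best alpha: {best_alpha}, test R^2: {test_r2:.3f}")
```

The same pattern should carry over to the AdaBoost step by swapping the estimator and searching over a parameter such as `n_estimators`.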
| 91 | +## 📁 Project Structure |
| 92 | + |
| 93 | +``` |
| 94 | +├── datasets.csv # Student performance dataset |
| 95 | +├── student.py # Main analysis script |
| 96 | +├── README.md # This file |
| 97 | +├── LICENSE # Project license |
| 98 | +└── R-paper.pdf # Research paper reference |
| 99 | +``` |
| 100 | + |
| 101 | +## 🔍 Results & Analysis |
| 102 | + |
| 103 | +The analysis identifies which factors most strongly predict student performance: |
| 104 | +- **Top Predictors**: VisitedResources, RaisedHands, AnnouncementsView, StudentAbsenceDays, Discussion |
| 105 | + |
| 106 | +The best-performing model standardizes the features and then fits a Gradient Boosting regressor. The script prints its test-set metrics to the console. |
| 107 | + |
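The README does not state how the predictor ranking is computed. One common approach is to read impurity-based importances from a tree ensemble, sketched below with `ExtraTreesRegressor`. The synthetic data, the `noise_a` and `noise_b` columns, and the choice of estimator are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data (assumption): only the first two columns drive the target.
rng = np.random.default_rng(7)
features = ["VisitedResources", "RaisedHands", "AnnouncementsView",
            "Discussion", "noise_a", "noise_b"]
X = pd.DataFrame(rng.uniform(0, 100, size=(300, 6)), columns=features)
y = (0.03 * X["VisitedResources"] + 0.02 * X["RaisedHands"]
     + rng.normal(0, 0.2, size=300))

# Fit a tree ensemble and read its normalized impurity-based importances.
model = ExtraTreesRegressor(n_estimators=200, random_state=7).fit(X, y)
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)

# Keep the k highest-ranked features (k=3 here purely for the example).
top_features = importances.head(3).index.tolist()
print(importances)
```

Impurity-based importances can favor high-cardinality features, so a check with permutation importance may be worthwhile on the real data.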
| 108 | +## 📝 Notes |
| 109 | + |
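| 110 | +- The script imports `train_test_split` from `sklearn.cross_validation`, a module that was deprecated in scikit-learn 0.18 and removed in 0.20. On current scikit-learn versions, import it from `sklearn.model_selection` instead |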
| 111 | +- All visualizations are displayed during script execution |
| 112 | +- Results are printed to console output |
| 113 | + |
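For the import change mentioned in the notes, a minimal sketch of the updated import is shown here. The toy arrays and the split parameters are assumptions used only to make the example runnable.

```python
import numpy as np
# Replaces the removed `from sklearn.cross_validation import train_test_split`.
from sklearn.model_selection import train_test_split

# Toy arrays (assumption) standing in for the real features and target.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The call signature is unchanged; only the import path differs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)
print(X_train.shape, X_test.shape)
```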
| 114 | +## 👨‍💻 Author |
| 115 | + |
| 116 | +Created by Dharmendra Choudhary - VIT University, Vellore, Tamil Nadu |
| 117 | + |
| 118 | +## 📄 License |
| 119 | + |
| 120 | +See the LICENSE file for details. |
| 121 | + |
| 122 | +## 🤝 Contributing |
| 123 | + |
| 124 | +Contributions are welcome! Please feel free to submit pull requests or open issues for improvements. |