Commit 93e3534

Improved README.md
1 parent 28cd16d commit 93e3534

1 file changed

Lines changed: 124 additions & 5 deletions

README.md

@@ -1,5 +1,124 @@
1-
# Student-performance-analysis-using-Big-data
2-
execution of this project is a piece of cake.
3-
first of all save datasets.csv file and student.py in same folder.
4-
then open terminal from same folder and type "python student.py".
5-
Kudos.
1+
# Student Performance Analysis using Big Data
2+
3+
A machine learning project that analyzes and predicts student academic performance based on educational behavioral data. This project leverages multiple regression and ensemble algorithms to identify key factors influencing student success.
4+
5+
## 📋 Overview
6+
7+
This project performs comprehensive analysis of student performance data, including:
8+
- **Exploratory Data Analysis (EDA)**: Visualizes distributions across student demographics and behaviors
9+
- **Feature Engineering**: Encodes categorical variables and identifies the most important predictors
10+
- **Model Selection**: Compares multiple machine learning algorithms
11+
- **Hyperparameter Tuning**: Optimizes the best-performing models
12+
- **Ensemble Methods**: Implements advanced ensemble techniques for improved predictions
13+
14+
## 🎯 Key Features
15+
16+
- **Data Processing**: Handles categorical encoding and feature scaling
17+
- **Feature Importance Analysis**: Identifies which student behaviors matter most
18+
- **Multiple Algorithms**: Tests 6+ regression models including:
19+
- Linear Regression
20+
- LASSO & Elastic Net
21+
- K-Nearest Neighbors
22+
- Decision Trees
23+
- Support Vector Regression
24+
- Ensemble methods (AdaBoost, Gradient Boosting, Random Forest, Extra Trees)
25+
- **Visualization**: Generates comparison plots and feature importance charts
26+
- **Cross-Validation**: 10-fold cross-validation for robust model evaluation
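
A minimal sketch of what a 10-fold cross-validated comparison of several regressors can look like with scikit-learn. It is illustrative only: the data is synthetic (`make_regression`), and the model set, seed, and scoring metric are assumptions rather than the exact choices made in `student.py`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded student features (assumption, not the real data).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=7)

# A subset of the algorithms listed above, with default hyperparameters.
models = {
    "LR": LinearRegression(),
    "LASSO": Lasso(),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=7),
}

# 10 folds, shuffled once with a fixed seed so every model sees the same splits.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

cv_results = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that "higher is better" holds for all scorers.
    scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reusing one `KFold` object across models keeps the comparison fair, since each algorithm is scored on identical train/validation splits.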
27+
28+
## 📊 Dataset
29+
30+
The analysis uses the **xAPI-Edu-Data** dataset (included in this repository as `datasets.csv`), which contains 480 student records with:
31+
- **Student Information**: Gender, nationality, grade level, section
32+
- **Academic Engagement**: Raised hands, visited resources, announcements viewed
33+
- **Participation**: Discussion contributions
34+
- **Attendance**: Absence patterns
35+
- **Parent Involvement**: Survey response and school satisfaction
36+
- **Target Variable**: Student class (performance level)
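
Because most of these columns are categorical, they need to be converted to numbers before a regressor can use them. The sketch below shows one common way to do that with `LabelEncoder`. The tiny frame, its column names, and its values are hypothetical placeholders modeled on the fields listed above, not rows from the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame; the real file has many more columns and rows.
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "StudentAbsenceDays": ["Under-7", "Above-7", "Under-7", "Under-7"],
    "RaisedHands": [15, 70, 40, 10],
    "Class": ["L", "H", "M", "L"],
})

# Columns to encode are listed explicitly (an assumption about which are categorical).
categorical_cols = ["Gender", "StudentAbsenceDays", "Class"]

encoded = df.copy()
encoders = {}
for col in categorical_cols:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(encoded[col])
    encoders[col] = enc  # keep the encoder so labels can be decoded later

print(encoded)
print(dict(zip(encoders["Class"].classes_, range(len(encoders["Class"].classes_)))))
```

Note that `LabelEncoder` assigns integers in alphabetical order of the labels, so the codes for an ordered target such as a performance level do not follow the order of performance unless a custom mapping is used instead.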
37+
38+
## 🚀 Quick Start
39+
40+
### Prerequisites
41+
- Python 3.x
42+
- pandas
43+
- numpy
44+
- scikit-learn
45+
- matplotlib
46+
- seaborn
47+
48+
### Installation
49+
50+
1. Clone the repository:
51+
```bash
52+
git clone <repository-url>
53+
cd Student-performance-analysis-using-Big-data
54+
```
55+
56+
2. Install dependencies:
57+
```bash
58+
pip install pandas numpy scikit-learn matplotlib seaborn
59+
```
60+
61+
3. Run the analysis:
62+
```bash
63+
python student.py
64+
```
65+
66+
The script will output:
67+
- Dataset statistics and descriptive analysis
68+
- Feature importance rankings
69+
- Algorithm comparison results
70+
- Cross-validation scores
71+
- Final model performance metrics
72+
- Visualization plots
73+
74+
## 📈 Project Workflow
75+
76+
1. **Data Loading & Exploration**: Load CSV and display basic statistics
77+
2. **Data Preprocessing**:
78+
- Remove irrelevant features
79+
- Encode categorical variables
80+
- Standardize features for scale-sensitive algorithms
81+
3. **Dimensionality Reduction**: Identify and retain top 6 most important features
82+
4. **Model Evaluation**:
83+
- Spot-check baseline algorithms
84+
- Test scaled versions with StandardScaler
85+
- Compare ensemble methods
86+
5. **Hyperparameter Tuning**:
87+
- Grid search for optimal LASSO alpha
88+
- AdaBoost estimator optimization
89+
6. **Final Model**: Train Gradient Boosting with optimized parameters
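
Steps 4 to 6 can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `student.py`: the data is synthetic, and the alpha grid, split ratio, seed, and default Gradient Boosting settings are all placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six retained features (assumption).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Scaling lives inside the pipeline so it is refit on each training fold only.
lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])
param_grid = {"lasso__alpha": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid
grid = GridSearchCV(lasso_pipe, param_grid, cv=kfold,
                    scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)
print("best alpha:", grid.best_params_["lasso__alpha"])

# Final model: scaled inputs feeding Gradient Boosting, scored on held-out data.
final_model = Pipeline([
    ("scaler", StandardScaler()),
    ("gbm", GradientBoostingRegressor(random_state=7)),
])
final_model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"test MSE: {test_mse:.3f}")
```

Wrapping the scaler and estimator in a `Pipeline` avoids leaking statistics from the validation folds into training during the grid search.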
90+
91+
## 📁 Project Structure
92+
93+
```
94+
├── datasets.csv # Student performance dataset
95+
├── student.py # Main analysis script
96+
├── README.md # This file
97+
├── LICENSE # Project license
98+
└── R-paper.pdf # Research paper reference
99+
```
100+
101+
## 🔍 Results & Analysis
102+
103+
The analysis identifies which factors most strongly predict student performance:
104+
- **Top Predictors**: VisitedResources, RaisedHands, AnnouncementsView, StudentAbsenceDays, Discussion
105+
106+
The best-performing model combines feature scaling with Gradient Boosting regression. Its test-set metrics are printed to the console when the script runs.
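
One way to produce a ranking like the one above is to read `feature_importances_` from a fitted tree ensemble and keep the top-scoring columns. The sketch below uses `ExtraTreesRegressor` on synthetic data with generic column names; the estimator choice, the value of `k`, and the data are assumptions, so the printed ranking says nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data with generic, hypothetical column names.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=7)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = ExtraTreesRegressor(n_estimators=100, random_state=7)
forest.fit(X, y)

# Importances are normalized to sum to 1; sort from most to least important.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]

k = 6  # number of features to retain
top_features = [feature_names[i] for i in order[:k]]
X_reduced = X[:, order[:k]]

for i in order[:k]:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```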
107+
108+
## 📝 Notes
109+
110+
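- The script imports `train_test_split` from the `sklearn.cross_validation` module, which was deprecated in scikit-learn 0.18 and removed in 0.20. Update the import to `sklearn.model_selection` to run on current versions.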
111+
- All visualizations are displayed during script execution
112+
- Results are printed to console output
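
For the import issue mentioned in the first note, a small compatibility shim like the following should work on both old and current scikit-learn releases. The toy arrays are placeholders used only to show that the call signature is unchanged.

```python
import numpy as np

try:
    # Current location (scikit-learn 0.18 and later).
    from sklearn.model_selection import train_test_split
except ImportError:
    # Legacy location, removed in scikit-learn 0.20.
    from sklearn.cross_validation import train_test_split

# Placeholder data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)
print(X_train.shape, X_test.shape)
```

If only modern scikit-learn needs to be supported, replacing the old import line with `from sklearn.model_selection import train_test_split` is enough.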
113+
114+
## 👨‍💻 Author
115+
116+
Created by Dharmendra Choudhary - VIT University, Vellore, Tamil Nadu
117+
118+
## 📄 License
119+
120+
See the LICENSE file for details.
121+
122+
## 🤝 Contributing
123+
124+
Contributions are welcome! Please feel free to submit pull requests or open issues for improvements.
