129 changes: 124 additions & 5 deletions README.md
@@ -1,5 +1,124 @@
# Student-performance-analysis-using-Big-data
execution of this project is a piece of cake.
first of all save datasets.csv file and student.py in same folder.
then open terminal from same folder and type "python student.py".
Kudos.
# Student Performance Analysis using Big Data

A machine learning project that analyzes and predicts student academic performance based on educational behavioral data. This project leverages multiple regression and ensemble algorithms to identify key factors influencing student success.

Copilot AI Jan 5, 2026

The description is not entirely accurate. The code actually uses both classification and regression algorithms. A RandomForestClassifier is used for feature importance analysis (line 61 in student.py), while regression algorithms are used for the prediction task. The statement "machine learning project that analyzes and predicts student academic performance based on educational behavioral data" could be clarified to reflect that it's primarily a classification problem (predicting student performance class/level).

## 📋 Overview

This project performs comprehensive analysis of student performance data, including:
- **Exploratory Data Analysis (EDA)**: Visualizes distributions across student demographics and behaviors
- **Feature Engineering**: Encodes categorical variables and identifies the most important predictors
- **Model Selection**: Compares multiple machine learning algorithms
- **Hyperparameter Tuning**: Optimizes the best-performing models
- **Ensemble Methods**: Implements advanced ensemble techniques for improved predictions

## 🎯 Key Features

- **Data Processing**: Handles categorical encoding and feature scaling
- **Feature Importance Analysis**: Identifies which student behaviors matter most
- **Multiple Algorithms**: Tests 6+ regression models including:
  - Linear Regression
  - LASSO & Elastic Net
  - K-Nearest Neighbors
  - Decision Trees
  - Support Vector Regression
  - Ensemble methods (AdaBoost, Gradient Boosting, Random Forest, Extra Trees)
- **Visualization**: Generates comparison plots and feature importance charts
- **Cross-Validation**: 10-fold cross-validation for robust model evaluation
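
As a rough illustration of the spot-check described above, the sketch below scores a few of the listed regressors with 10-fold cross-validation. It is not taken from `student.py`: the data is synthetic (`make_regression`), and the short model names, the scoring metric, and the random seeds are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded student features (assumption, not the real data).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=7)

# A few of the model families named in the list above, with default hyperparameters.
models = {
    "LR": LinearRegression(),
    "LASSO": Lasso(),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=7),
    "GBM": GradientBoostingRegressor(random_state=7),
}

# 10-fold cross-validation; the same shuffled folds are reused for every model.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.2f} ({scores.std():.2f})")
```

Scores are negated mean squared errors, so values closer to zero indicate a better fit.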

## 📊 Dataset

The analysis uses the **xAPI-Edu-Data.csv** dataset containing 482 student records with:
- **Student Information**: Gender, nationality, grade level, section
- **Academic Engagement**: Raised hands, visited resources, announcements viewed
- **Participation**: Discussion contributions
- **Attendance**: Absence patterns
- **Parent Involvement**: Survey response and school satisfaction
- **Target Variable**: Student class (performance level)
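
Several of these columns are categorical, so they need integer codes before most scikit-learn estimators can use them. The sketch below shows one way to do that with `LabelEncoder` on a tiny hand-made table. The rows are illustrative rather than taken from the dataset, and the `gender` and `Class` column names are assumptions; the script's own encoding step may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hand-made sample; the values are illustrative, not real records.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],  # assumed column name
    "raisedhands": [15, 70, 42, 5],
    "VisITedResources": [16, 88, 60, 7],
    "StudentAbsenceDays": ["Under-7", "Under-7", "Above-7", "Above-7"],
    "Class": ["M", "H", "M", "L"],  # assumed target column name
})

# Encode every non-numeric column as integer codes, keeping one encoder per
# column so the original labels can be recovered with inverse_transform.
encoders = {}
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col].to_numpy())

print(df)
print(list(encoders["Class"].classes_))
```

`LabelEncoder` assigns codes in sorted label order, which carries no meaning for the model; for ordered categories such as performance level, an explicit mapping may be preferable.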

## 🚀 Quick Start

### Prerequisites
- Python 3.x
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn

### Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd Student-performance-analysis-using-Big-data
```

2. Install dependencies:
```bash
pip install pandas numpy scikit-learn matplotlib seaborn
```

3. Run the analysis:
```bash
python student.py
```

The script will output:
- Dataset statistics and descriptive analysis
- Feature importance rankings
- Algorithm comparison results
- Cross-validation scores
- Final model performance metrics
- Visualization plots

## 📈 Project Workflow

1. **Data Loading & Exploration**: Load CSV and display basic statistics
2. **Data Preprocessing**:
   - Remove irrelevant features
   - Encode categorical variables
   - Scale features for algorithms
3. **Dimensionality Reduction**: Identify and retain top 6 most important features

Copilot AI Jan 5, 2026

The statement about identifying the "top 6 most important features" is misleading. The code in student.py (line 74) hardcodes a specific list of 6 features to retain, but this is not based on automatically selecting the "top 6" from the feature importance analysis. The features are manually specified in a list, not dynamically selected based on importance rankings.

Suggested change
3. **Dimensionality Reduction**: Identify and retain top 6 most important features
3. **Dimensionality Reduction**: Retain a predefined subset of 6 important features as specified in the script

4. **Model Evaluation**:
   - Spot-check baseline algorithms
   - Test scaled versions with StandardScaler
   - Compare ensemble methods
5. **Hyperparameter Tuning**:
   - Grid search for optimal LASSO alpha
   - AdaBoost estimator optimization
6. **Final Model**: Train Gradient Boosting with optimized parameters
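
The scaling and tuning steps in this workflow can be combined in a single scikit-learn pipeline. The following is a minimal sketch of a grid search over the LASSO `alpha`, assuming synthetic data and a hypothetical grid of values; the grid actually used in `student.py` is not shown in this README.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded student features (assumption).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=7)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which keeps validation-fold statistics out of the scaler.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10_000)),
])

param_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}  # hypothetical grid
search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=7),
)
search.fit(X, y)

print("best alpha:", search.best_params_["lasso__alpha"])
print("best CV score (neg MSE):", round(search.best_score_, 2))
```

The same pattern applies to the AdaBoost step by swapping the final estimator and using a grid over `n_estimators`.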

## 📁 Project Structure

```
├── datasets.csv # Student performance dataset

Copilot AI Jan 5, 2026

The filename listed here is incorrect. The actual dataset file is named "xAPI-Edu-Data.csv" (as correctly mentioned in line 30 and used in student.py), not "datasets.csv". This inconsistency could confuse users about which file to use.

Suggested change
├── datasets.csv # Student performance dataset
├── xAPI-Edu-Data.csv # Student performance dataset

├── student.py # Main analysis script
├── README.md # This file
├── LICENSE # Project license
└── R-paper.pdf # Research paper reference
```

## 🔍 Results & Analysis

The analysis identifies which factors most strongly predict student performance:
- **Top Predictors**: VisitedResources, RaisedHands, AnnouncementsView, StudentAbsenceDays, Discussion

Copilot AI Jan 5, 2026

The feature name capitalization is inconsistent with the actual code. In student.py (line 74), the feature is named 'VisITedResources' (with capital I and T), not 'VisitedResources'. Similarly, 'raisedhands' is all lowercase, not 'RaisedHands'. While this is describing the results, using the exact feature names from the code would improve accuracy.

Suggested change
- **Top Predictors**: VisitedResources, RaisedHands, AnnouncementsView, StudentAbsenceDays, Discussion
- **Top Predictors**: VisITedResources, raisedhands, AnnouncementsView, StudentAbsenceDays, Discussion

The best-performing model combines feature scaling with Gradient Boosting regression and is the one evaluated on the held-out test set.
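
A ranking of predictors like the one listed above can be derived from a fitted random forest. The sketch below ranks features by impurity-based importance and keeps the top `k` dynamically. This is an illustrative alternative, not what `student.py` does (the review notes that the script hardcodes its feature list); the data is synthetic and the `feature_i` column names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data standing in for the encoded student table.
X_arr, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, random_state=7,
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])  # placeholder names

forest = RandomForestClassifier(n_estimators=200, random_state=7)
forest.fit(X, y)

# Rank features by impurity-based importance and keep the k highest.
importances = pd.Series(forest.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
top_k = 6
selected = ranked.head(top_k).index.tolist()
X_reduced = X[selected]

print(ranked.round(3))
print("selected:", selected)
```

Impurity-based importances can favor high-cardinality features, so permutation importance is worth considering as a cross-check.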

## 📝 Notes

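- The script imports `train_test_split` from the `sklearn.cross_validation` module, which was deprecated in scikit-learn 0.18 and removed in 0.20; on current scikit-learn versions, import it from `sklearn.model_selection` instead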
- All visualizations are displayed during script execution
- Results are printed to console output
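
For the import note above, the updated form is a one-line change. The toy data below exists only to make the snippet runnable; the split ratio and seed are assumptions, not values from `student.py`.

```python
# sklearn.cross_validation was removed in scikit-learn 0.20;
# the same helper now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

# Toy data: ten single-feature samples with matching targets.
X = [[i] for i in range(10)]
y = list(range(10))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)
print(len(X_train), len(X_test))  # 8 2
```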

## 👨‍💻 Author

Created by Dharmendra Choudhary - VIT University, Vellore, Tamil Nadu

## 📄 License

See the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for improvements.