Improved README.md #1
base: master
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,5 +1,124 @@ | ||||||
| # Student-performance-analysis-using-Big-data | ||||||
| execution of this project is a piece of cake. | ||||||
| first of all save datasets.csv file and student.py in same folder. | ||||||
| then open terminal from same folder and type "python student.py". | ||||||
| Kudos. | ||||||
| # Student Performance Analysis using Big Data | ||||||
|
|
||||||
| A machine learning project that analyzes and predicts student academic performance based on educational behavioral data. This project leverages multiple regression and ensemble algorithms to identify key factors influencing student success. | ||||||
|
|
||||||
| ## 📋 Overview | ||||||
|
|
||||||
| This project performs comprehensive analysis of student performance data, including: | ||||||
| - **Exploratory Data Analysis (EDA)**: Visualizes distributions across student demographics and behaviors | ||||||
| - **Feature Engineering**: Encodes categorical variables and identifies the most important predictors | ||||||
| - **Model Selection**: Compares multiple machine learning algorithms | ||||||
| - **Hyperparameter Tuning**: Optimizes the best-performing models | ||||||
| - **Ensemble Methods**: Implements advanced ensemble techniques for improved predictions | ||||||
|
|
||||||
| ## 🎯 Key Features | ||||||
|
|
||||||
| - **Data Processing**: Handles categorical encoding and feature scaling | ||||||
| - **Feature Importance Analysis**: Identifies which student behaviors matter most | ||||||
| - **Multiple Algorithms**: Tests 6+ regression models including: | ||||||
| - Linear Regression | ||||||
| - LASSO & Elastic Net | ||||||
| - K-Nearest Neighbors | ||||||
| - Decision Trees | ||||||
| - Support Vector Regression | ||||||
| - Ensemble methods (AdaBoost, Gradient Boosting, Random Forest, Extra Trees) | ||||||
| - **Visualization**: Generates comparison plots and feature importance charts | ||||||
| - **Cross-Validation**: 10-fold cross-validation for robust model evaluation | ||||||
|
|
||||||
| ## 📊 Dataset | ||||||
|
|
||||||
| The analysis uses the **xAPI-Edu-Data.csv** dataset containing 482 student records with: | ||||||
| - **Student Information**: Gender, nationality, grade level, section | ||||||
| - **Academic Engagement**: Raised hands, visited resources, announcements viewed | ||||||
| - **Participation**: Discussion contributions | ||||||
| - **Attendance**: Absence patterns | ||||||
| - **Parent Involvement**: Survey response and school satisfaction | ||||||
| - **Target Variable**: Student class (performance level) | ||||||
|
|
||||||
| ## 🚀 Quick Start | ||||||
|
|
||||||
| ### Prerequisites | ||||||
| - Python 3.x | ||||||
| - pandas | ||||||
| - numpy | ||||||
| - scikit-learn | ||||||
| - matplotlib | ||||||
| - seaborn | ||||||
|
|
||||||
| ### Installation | ||||||
|
|
||||||
| 1. Clone the repository: | ||||||
| ```bash | ||||||
| git clone <repository-url> | ||||||
| cd Student-performance-analysis-using-Big-data | ||||||
| ``` | ||||||
|
|
||||||
| 2. Install dependencies: | ||||||
| ```bash | ||||||
| pip install pandas numpy scikit-learn matplotlib seaborn | ||||||
| ``` | ||||||
|
|
||||||
| 3. Run the analysis: | ||||||
| ```bash | ||||||
| python student.py | ||||||
| ``` | ||||||
|
|
||||||
| The script will output: | ||||||
| - Dataset statistics and descriptive analysis | ||||||
| - Feature importance rankings | ||||||
| - Algorithm comparison results | ||||||
| - Cross-validation scores | ||||||
| - Final model performance metrics | ||||||
| - Visualization plots | ||||||
|
|
||||||
| ## 📈 Project Workflow | ||||||
|
|
||||||
| 1. **Data Loading & Exploration**: Load CSV and display basic statistics | ||||||
| 2. **Data Preprocessing**: | ||||||
| - Remove irrelevant features | ||||||
| - Encode categorical variables | ||||||
| - Scale features for algorithms | ||||||
| 3. **Dimensionality Reduction**: Identify and retain top 6 most important features | ||||||
|
||||||
| 3. **Dimensionality Reduction**: Identify and retain top 6 most important features | |
| 3. **Dimensionality Reduction**: Retain a predefined subset of 6 important features as specified in the script |
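The suggested wording distinguishes keeping a fixed, hand-picked column list from ranking features by computed importance. A minimal sketch of the former, assuming pandas is installed: the helper `keep_features`, the toy frame, and the two-entry list are hypothetical, and the real six-feature list is whatever student.py defines.

```python
import pandas as pd


def keep_features(df: pd.DataFrame, names: list[str]) -> pd.DataFrame:
    """Return only the listed columns, failing loudly on a misspelled name."""
    missing = [n for n in names if n not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[names].copy()


# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "raisedhands": [15, 70, 2],
    "VisITedResources": [16, 88, 7],
    "Discussion": [20, 40, 10],
})

# Hypothetical predefined subset; the script's actual list has six entries.
PREDEFINED = ["raisedhands", "VisITedResources"]
subset = keep_features(toy, PREDEFINED)
print(list(subset.columns))
```

Because the list is fixed in advance, nothing is "identified" at run time, which is why the reworded README line is the more accurate description.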
Copilot
AI
Jan 5, 2026
The filename listed here is incorrect. The actual dataset file is named "xAPI-Edu-Data.csv" (as correctly mentioned in line 30 and used in student.py), not "datasets.csv". This inconsistency could confuse users about which file to use.
| ├── datasets.csv # Student performance dataset | |
| ├── xAPI-Edu-Data.csv # Student performance dataset |
Copilot
AI
Jan 5, 2026
The feature name capitalization is inconsistent with the actual code. In student.py (line 74), the feature is named 'VisITedResources' (with capital I and T), not 'VisitedResources'. Similarly, 'raisedhands' is all lowercase, not 'RaisedHands'. While this is describing the results, using the exact feature names from the code would improve accuracy.
| - **Top Predictors**: VisitedResources, RaisedHands, AnnouncementsView, StudentAbsenceDays, Discussion | |
| - **Top Predictors**: VisITedResources, raisedhands, AnnouncementsView, StudentAbsenceDays, Discussion |
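Casing mismatches like this can be caught mechanically by comparing documented names against the CSV header. A small sketch using only the standard library: the function `find_case_mismatches` is hypothetical, and the header string is a stand-in built from the names quoted in this comment, not the full header of the real file.

```python
import csv
import io


def find_case_mismatches(header, expected):
    """Map each expected name to the header name it matches only when case is ignored."""
    by_lower = {name.lower(): name for name in header}
    mismatches = {}
    for name in expected:
        actual = by_lower.get(name.lower())
        if actual is not None and actual != name:
            mismatches[name] = actual
    return mismatches


# Stand-in for the first line of the CSV (assumed; only columns named in the review).
sample = io.StringIO(
    "raisedhands,VisITedResources,AnnouncementsView,StudentAbsenceDays,Discussion\n"
)
header = next(csv.reader(sample))

# Names as written in the README before the suggested change.
documented = ["VisitedResources", "RaisedHands", "AnnouncementsView", "Discussion"]
print(find_case_mismatches(header, documented))
```

With the real file, the same check would read the header from `open("xAPI-Edu-Data.csv", newline="")` instead of the in-memory string.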
The description is not entirely accurate. The code uses both classification and regression algorithms: a RandomForestClassifier for feature importance analysis (line 61 in student.py) and regression algorithms for the prediction task. The statement "machine learning project that analyzes and predicts student academic performance based on educational behavioral data" should make clear that the underlying task is primarily classification (predicting the student's performance class/level).
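To illustrate the point, here is a rough sketch of how a classifier yields feature importances when the target is a performance class rather than a number. It assumes scikit-learn and NumPy are available; the data is synthetic, and the column names, class labels, and thresholds are invented for the example, not taken from student.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in data: two columns drive the label, the third is pure noise.
n = 200
X = rng.integers(0, 100, size=(n, 3)).astype(float)
score = X[:, 0] + X[:, 1]
labels = np.where(score < 70, "Low", np.where(score < 130, "Middle", "High"))

# The target is categorical, so it is encoded to integer class codes, not treated as a quantity.
y = LabelEncoder().fit_transform(labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Impurity-based importances, one value per input column.
importances = clf.feature_importances_
for name, value in zip(["feat_a", "feat_b", "noise"], importances):
    print(f"{name}: {value:.3f}")
```

Fitting a regressor to the same encoded labels would treat the codes as ordered, evenly spaced numbers, which is one reason the README wording matters.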