This project explores a dataset of Olympic team statistics and builds a linear regression model to predict the number of medals a country will win based on features like number of athletes and prior medals.
It also evaluates the model using mean absolute error and analyzes prediction errors by country.
The project uses a CSV file, which contains the following columns:
- team: Name of the team/country
- country: Country code
- year: Olympic year
- athletes: Number of athletes entered
- age: Average age of athletes
- prev_medals: Number of medals won in previous Olympics
- medals: Number of medals won in the current year
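As a rough sketch of how such a file could be loaded and checked, the snippet below reads a tiny in-memory CSV with the columns listed above. The rows are made up for illustration, and the `expected_columns` check is an assumption, not necessarily something the project does.

```python
import io

import pandas as pd

# Made-up rows standing in for the project's CSV file (illustration only).
csv_text = """team,country,year,athletes,age,prev_medals,medals
Team A,AAA,2012,50,24.5,3,4
Team B,BBB,2012,12,26.1,0,0
"""

expected_columns = ["team", "country", "year", "athletes", "age", "prev_medals", "medals"]

# pd.read_csv accepts a path or any file-like object; StringIO keeps the sketch self-contained.
teams = pd.read_csv(io.StringIO(csv_text))

# Fail early if a column the later steps rely on is absent.
missing = [col for col in expected_columns if col not in teams.columns]
if missing:
    raise ValueError(f"missing columns: {missing}")

print(teams.dtypes)
```

With the real data, the `io.StringIO(...)` argument would be replaced by the path to the CSV file.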
Data Exploration
- Computes correlations between features and medals
- Visualizes relationships with regression plots (commented out; intended for Jupyter)
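The correlation step might look roughly like the following. The data frame here holds made-up values purely for illustration, and restricting the computation to an explicit list of numeric columns is an assumption of this sketch.

```python
import pandas as pd

# Made-up rows for illustration only; the project reads these columns from its CSV file.
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "athletes": [10, 50, 120, 300, 600],
    "age": [24.1, 25.3, 23.8, 26.0, 25.1],
    "prev_medals": [0, 2, 10, 35, 100],
    "medals": [0, 3, 12, 40, 110],
})

numeric_columns = ["athletes", "age", "prev_medals", "medals"]

# Pearson correlation of each numeric feature with the target column "medals".
correlations = teams[numeric_columns].corr()["medals"].drop("medals")
print(correlations.sort_values(ascending=False))
```

Features with a correlation close to 1 or -1 are the most promising predictors for a linear model.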
Data Cleaning
- Removes rows with missing values
- Splits data into training and test sets (pre-2012 for training, 2012 and later for testing)
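A minimal sketch of the cleaning and splitting steps, assuming the data sits in a pandas DataFrame named `teams` (a name chosen here for illustration) with made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; one row has a missing prev_medals value.
teams = pd.DataFrame({
    "team": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "year": [2008, 2012, 2016, 2008, 2012, 2016],
    "athletes": [100, 110, 120, 20, 25, 30],
    "prev_medals": [10.0, 12.0, 15.0, np.nan, 1.0, 2.0],
    "medals": [12, 15, 14, 1, 2, 3],
})

# Remove every row that has at least one missing value.
teams = teams.dropna()

# Split by time: years before 2012 train the model, 2012 and later test it.
train = teams[teams["year"] < 2012].copy()
test = teams[teams["year"] >= 2012].copy()

print(train.shape, test.shape)
```

Splitting by year rather than at random keeps future results out of the training set, which matters because `prev_medals` links consecutive Olympics.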
Model Training
- Trains a Linear Regression model on athletes and prev_medals
- Predicts number of medals for test set countries
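The training and prediction steps could be sketched as below, using scikit-learn's `LinearRegression`. The training rows are made up (and deliberately exactly linear so the result is easy to check), and the `predictions` column name is an assumption of this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

predictors = ["athletes", "prev_medals"]
target = "medals"

# Made-up training rows where medals = 0.05 * athletes + 0.8 * prev_medals.
train = pd.DataFrame({
    "athletes": [20, 100, 300, 500],
    "prev_medals": [0, 10, 40, 90],
    "medals": [1, 13, 47, 97],
})

# Made-up test row for a single team.
test = pd.DataFrame({"team": ["Team A"], "athletes": [200], "prev_medals": [20]})

model = LinearRegression()
model.fit(train[predictors], train[target])

# Raw predictions are floats and may be negative or fractional.
test["predictions"] = model.predict(test[predictors])
print(test)
```

On real data the fit will not be exact, which is why the predictions are cleaned up afterwards.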
Postprocessing
- Ensures no negative predictions (clipped at 0)
- Rounds predictions to nearest whole number
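A linear model can output negative or fractional medal counts, so the two postprocessing steps might be sketched like this. The raw values are made up for illustration.

```python
import pandas as pd

# Made-up raw model outputs, including a negative value.
raw_predictions = pd.Series([-1.7, 0.4, 2.6, 13.2], name="predictions")

# Clip at 0: a team cannot win a negative number of medals.
clipped = raw_predictions.clip(lower=0)

# Round to the nearest whole number (exact halves go to the nearest even integer).
rounded = clipped.round()

print(rounded.tolist())
```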
Evaluation
- Computes Mean Absolute Error (MAE) on test data
- Analyzes errors by country
- Plots a histogram of error ratio across countries
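The MAE and per-country error steps could look roughly like the following. The test rows are made up, and the `abs_error` column name is an assumption of this sketch.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up test rows with actual medals and postprocessed predictions.
test = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B"],
    "year": [2012, 2016, 2012, 2016],
    "medals": [10, 14, 2, 0],
    "predictions": [12.0, 13.0, 1.0, 3.0],
})

# Overall MAE: the mean of |actual - predicted| over all test rows.
mae = mean_absolute_error(test["medals"], test["predictions"])

# Per-row absolute error, then averaged within each team.
test["abs_error"] = (test["medals"] - test["predictions"]).abs()
error_by_team = test.groupby("team")["abs_error"].mean()

print(mae)
print(error_by_team)
```

An MAE is easiest to judge against the spread of the target, for example by comparing it with `test["medals"].std()`.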
The project produces the following output:
- MAE of the model
- Per-country prediction errors
- Per-country error ratio (error divided by average medals won)
- Histogram of error ratio across countries
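The error ratio above, the mean absolute error per team divided by that team's average medal count, can be sketched as follows. The data is made up, and dropping teams whose average medal count is zero is an assumption of this example, made because the ratio is undefined for them.

```python
import numpy as np
import pandas as pd

# Made-up test rows; Team C never wins a medal, so its ratio would divide by zero.
test = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B", "Team C", "Team C"],
    "medals": [10, 14, 2, 0, 0, 0],
    "predictions": [12.0, 13.0, 1.0, 3.0, 1.0, 0.0],
})

test["abs_error"] = (test["medals"] - test["predictions"]).abs()
error_by_team = test.groupby("team")["abs_error"].mean()
medals_by_team = test.groupby("team")["medals"].mean()

# Error relative to how many medals the team typically wins.
error_ratio = error_by_team / medals_by_team

# Division by zero yields inf or NaN; keep only finite ratios (an assumption of this sketch).
error_ratio = error_ratio[np.isfinite(error_ratio)]

print(error_ratio)
```

A histogram of the result could then be drawn with `error_ratio.plot.hist()`, which requires matplotlib. A ratio well below 1 means the prediction error is small relative to the team's typical medal count.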
The project requires:
- Python 3.x
- pandas
- numpy
- scikit-learn
- seaborn
- matplotlib
You can install them with: pip install pandas numpy scikit-learn seaborn matplotlib
Created by dheetya