This file lists machine learning problem statements sorted by difficulty level. Each entry includes extra notes about the problem so the reader gets a better sense of how difficult it is.

Beginner

Titanic: Machine Learning from Disaster

Start here! Predict survival on the Titanic and get familiar with ML basics
Kaggle : https://www.kaggle.com/c/titanic
Problem Statement: The sinking of the Titanic is one of the most infamous disasters in recent human history, resulting in the death of 1502 out of 2224 passengers and crew. Analysis shows that while some amount of luck was involved, some passengers were more likely to survive than others. Train a machine learning model to predict what sort of people were likely to survive.

Playground problem for beginners in data science
Beginner-friendly problem
Binary classification
Evaluation metric is 'Accuracy'
Easy to build intuition from the features
Comparatively small dataset
Missing values present in a few features
'Cabin' feature poses a challenge with 77% missing values
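
The notes above on missing values and the accuracy metric can be sketched in a few lines. This is a minimal illustration on made-up rows, not the Kaggle data; the helper names `missing_fraction` and `accuracy`, and the "predict survival for female passengers" baseline, are assumptions chosen only for the example.

```python
# Toy rows shaped like the Titanic data; all values are made up for illustration.
rows = [
    {"Sex": "female", "Age": 29.0, "Cabin": "C85", "Survived": 1},
    {"Sex": "male", "Age": None, "Cabin": None, "Survived": 0},
    {"Sex": "female", "Age": 35.0, "Cabin": None, "Survived": 1},
    {"Sex": "male", "Age": 54.0, "Cabin": None, "Survived": 0},
    {"Sex": "male", "Age": 22.0, "Cabin": None, "Survived": 1},
]


def missing_fraction(rows, column):
    """Share of rows whose value in `column` is missing (None)."""
    return sum(1 for r in rows if r[column] is None) / len(rows)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical baseline: predict survival (1) for female passengers only.
y_true = [r["Survived"] for r in rows]
y_pred = [1 if r["Sex"] == "female" else 0 for r in rows]

print("Missing Cabin:", missing_fraction(rows, "Cabin"))
print("Missing Age:", missing_fraction(rows, "Age"))
print("Baseline accuracy:", accuracy(y_true, y_pred))
```

A column with a very high missing fraction, like 'Cabin', usually needs a deliberate decision (drop it, or derive a simpler "has cabin" flag) rather than plain imputation.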

House Price Prediction - Regression problem

Start here! Predict the final price of each home and get familiar with ML basics
This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.
Kaggle : https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Problem Statement: Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Playground problem for beginners in data science

Intermediate

Restaurant Revenue Prediction - Regression problem

Predict annual restaurant sales based on objective measurements
Kaggle : https://www.kaggle.com/c/restaurant-revenue-prediction
Problem Statement: With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world's most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

Huge Dataset
Real-world problem; gives domain knowledge as well

U.S. News and World Report’s College Data - Clustering Problem

Statistics for a large number of US Colleges from the 1995 issue of US News and World Report
Kaggle : https://www.kaggle.com/flyingwombat/us-news-and-world-reports-college-data
Problem Statement: You have to select a college for admission, so you need to segregate/rank the colleges based on the parameters provided. Cluster the colleges into segments in order to select the best college for your studies.

Unsupervised Learning
Checking for variable importance
Converting Unsupervised to Supervised Learning
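
The clustering step could look roughly like the sketch below. It is a plain k-means written from scratch under stated assumptions: the two features (acceptance rate, graduation rate), their values, and the choice of two clusters are all made up for illustration, and the function names are hypothetical. A library implementation would normally be used on the real data, with features scaled first.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up (acceptance rate, graduation rate) pairs for six hypothetical colleges.
colleges = [
    (0.90, 0.45), (0.85, 0.50), (0.88, 0.40),
    (0.30, 0.90), (0.25, 0.95), (0.35, 0.88),
]

# Seed the two clusters with the first and fourth college to keep the run deterministic.
labels, centroids = kmeans(colleges, [colleges[0], colleges[3]])
print(labels)
print(centroids)
```

Once every college carries a cluster label, that label can serve as a target for a supervised model, which is one way to read "Converting Unsupervised to Supervised Learning" and to check variable importance.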

Chronic Kidney Disease - Classification Problem

Data has 25 features which may predict a patient with chronic kidney disease
Kaggle : https://www.kaggle.com/colearninglounge/chronic-kidney-disease
Problem Statement: The data was taken over a 2-month period in India with 25 features (e.g., red blood cell count, white blood cell count, etc.). The target is the 'classification', which is either 'ckd' or 'notckd' (ckd = chronic kidney disease). Use machine learning techniques to predict whether a patient is suffering from chronic kidney disease.

Binary Classification
Medical Terminology used
High Multicollinearity in data
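
One way to spot multicollinearity is to look for feature pairs with a high absolute Pearson correlation. The sketch below assumes three illustrative features with made-up values and an arbitrary threshold of 0.8; the names `pearson` and `correlated_pairs` are hypothetical helpers, not part of the dataset or any library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up values for illustration; real columns would come from the dataset.
features = {
    "hemoglobin": [15, 11, 10, 12, 13, 16],
    "packed_cell_volume": [45, 33, 31, 36, 39, 47],
    "age": [50, 60, 40, 55, 45, 50],
}

for a, b, r in correlated_pairs(features):
    print(f"{a} vs {b}: r = {r:.3f}")
```

Strongly correlated pairs are candidates for dropping one feature or combining them before fitting a linear model.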

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine
Kaggle : https://www.kaggle.com/c/1056lab-fraud-detection-in-credit-card/overview/evaluation
Problem Statement: It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, which is the provider of the mobile financial service that currently runs in more than 14 countries around the world.

Highly imbalanced classes
Anonymous features
PCA transformed features
Binary classification
Evaluation metric is 'AUC'
Challenging Exploratory Data Analysis
Hard to build intuition from the features
Time factor involved
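
The combination of highly imbalanced classes and an AUC metric is easier to see with a small example. The sketch below uses made-up labels and scores (2 fraudulent transactions out of 100) and a simple pairwise definition of ROC AUC; the helper names are hypothetical, and the pairwise loop is only practical for small inputs.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up data: 2 fraudulent (1) and 98 genuine (0) transactions.
y_true = [1, 1] + [0] * 98

# A "model" that calls every transaction genuine looks accurate but ranks nothing.
always_genuine = [0] * 100
print("Accuracy, always genuine:", accuracy(y_true, always_genuine))
print("AUC, always genuine:", roc_auc(y_true, [0.0] * 100))

# A model with made-up scores that ranks most fraud above genuine transactions.
scores = [0.9, 0.4] + [0.1] * 95 + [0.5, 0.6, 0.3]
print("AUC, scoring model:", roc_auc(y_true, scores))
```

The always-genuine model reaches 98% accuracy while its AUC stays at 0.5, which is why accuracy is a poor guide on data this imbalanced.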

Google Analytics Customer Revenue Prediction - Regression problem

Predict revenue per customer
Kaggle : https://www.kaggle.com/c/ga-customer-revenue-prediction
Problem Statement: In this competition, you’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. Hopefully, the outcome will be more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GA data.

Huge Dataset

Advanced

Santander Customer Satisfaction

Which customers are happy customers?
Kaggle : https://www.kaggle.com/c/santander-customer-satisfaction
Problem Statement: From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.

Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.

In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.

Too many features
Anonymous features
Binary classification
Evaluation metric is 'AUC'
Challenging Exploratory Data Analysis
Hard to build intuition from the features
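
With hundreds of anonymized features, a common first step is to discard columns that cannot carry any signal. The sketch below is one such filter, offered as an assumption rather than a prescribed approach: it drops constant columns and exact duplicates. The column names and values are made up, and `drop_uninformative` is a hypothetical helper.

```python
def drop_uninformative(columns):
    """Remove columns that are constant or exact duplicates of an earlier column."""
    kept = {}
    seen = set()
    for name, values in columns.items():
        key = tuple(values)
        if len(set(values)) <= 1:
            continue  # constant column: same value in every row
        if key in seen:
            continue  # exact duplicate of a column already kept
        seen.add(key)
        kept[name] = values
    return kept


# Made-up anonymized columns for illustration.
columns = {
    "var1": [0, 1, 0, 2],
    "var2": [5, 5, 5, 5],        # constant
    "var3": [0, 1, 0, 2],        # duplicate of var1
    "var4": [3.5, 0.0, 1.2, 0.0],
}

print(list(drop_uninformative(columns)))
```

This only removes the obvious cases; further reduction would still rely on model-based feature importance or similar techniques.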
