Data-Gathering-And-Preprocessing/cleaning.md at main · iAmKankan/Data-Gathering-And-Preprocessing

Missing Value Handling

Ignoring missing values in a data set is a serious mistake, as many algorithms simply don’t accept them.
Some companies deal with this problem by imputing the missing values based on other observations or dropping the observations with missing values altogether.
But these strategies can lead to loss of information (note that “no value” also tells us something).
If categorical data is missing, it can be labeled as “Missing.” Missing numeric data should be flagged with an indicator and filled with 0, which allows the algorithm to estimate the optimal constant for such a situation.
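
A minimal sketch of this label-and-flag idea with pandas, assuming a small hypothetical DataFrame. The column names `city` and `income` and the `income_was_missing` indicator are illustrative only and do not come from any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical and one numeric column, each with a gap.
df = pd.DataFrame({
    "city": ["Kolkata", None, "Delhi"],
    "income": [52000.0, np.nan, 61000.0],
})

# Categorical: treat "Missing" as its own category.
df["city"] = df["city"].fillna("Missing")

# Numeric: record where the value was absent, then fill the gap with 0.
df["income_was_missing"] = df["income"].isna()
df["income"] = df["income"].fillna(0)
```

Keeping the indicator column means the fact that a value was absent is still available to the model after the gap has been filled.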

Why is data missing?

Data can be missing for many different reasons. Here are just a few examples:
- A value is missing because it was forgotten or lost or not stored properly
- For a certain observation, the value of the variable does not exist
- The value can't be known or identified

One of the most important questions you can ask yourself to help figure this out is this:

Is this value missing because it wasn't recorded or because it doesn't exist?

Measures for missing data

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be.
These values you probably do want to keep as NaN.
On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row.
This is called imputation.

In statistics, imputation is the process of replacing missing data with substituted values.
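
As one possible illustration of imputation, the sketch below fills a gap with the median of the observed values in the same column. The median is only one common choice and is an assumption here, not something the text prescribes; the `age` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column where one value was not recorded.
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0]})

# median() skips NaN by default, so it uses only the observed values.
median_age = df["age"].median()

# Replace the missing entry with that median.
df["age"] = df["age"].fillna(median_age)
```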

Drop missing values

If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values.

Note: Generally this approach is not recommended for important projects! It's usually worth taking the time to go through your data and look at all the columns with missing values one by one to really get to know your dataset.
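
A short sketch of how one might get an overview of which columns contain missing values before deciding what to do with them. The frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two of three columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Number of missing values per column.
missing_counts = df.isna().sum()

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Names of the columns that have at least one missing value.
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
```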

To drop rows with missing values, pandas has a handy function, dropna().
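
A minimal sketch of `dropna()` on a hypothetical frame, showing both the row-wise default and the column-wise variant with `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one missing value, column "b" has none.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# Drop every row that contains at least one NaN (the default behaviour).
rows_dropped = df.dropna()

# Drop every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)
```

Both calls return a new DataFrame and leave `df` unchanged, so it is easy to compare shapes before and after to see how much data was lost.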

Filling in missing values automatically

We can use pandas' fillna() function to fill in missing values in a dataframe for us.
One option we have is to specify what we want the NaN values to be replaced with.
Here, I pass a dictionary to give each column its own replacement value, and then replace all the NaN values in a selected subset of columns with 0.

df = df.fillna({'NameColumn': 8, 'AddressColumn': 0})

df[['col1', 'col2']] = df[['col1', 'col2']].fillna(value=0)
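
A self-contained sketch of per-column and subset filling on a hypothetical frame that reuses the column names above; the values are made up. The results are assigned back because `fillna()` returns a new object by default, and calling it with `inplace=True` on a column selection such as `df[['col1', 'col2']]` acts on a temporary copy rather than on `df` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "NameColumn": [1.0, np.nan],
    "AddressColumn": [np.nan, 2.0],
    "col1": [np.nan, 3.0],
    "col2": [4.0, np.nan],
})

# Per-column replacement values via a dictionary.
df = df.fillna({"NameColumn": 8, "AddressColumn": 0})

# Fill only a subset of columns with 0 and write the result back into df.
df[["col1", "col2"]] = df[["col1", "col2"]].fillna(0)
```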
