68 changes: 35 additions & 33 deletions python_scripts/01_tabular_data_exploration.py

# %% [markdown]
# Let's look at the distribution of individual features, to get some insights
# about the data. We will use `skrub`'s `TableReport` class to generate an
# overview of the dataset.

# %%
from skrub import TableReport

report = TableReport(adult_census)
report
# _ = adult_census.hist(figsize=(20, 14))

# %% [markdown]
# The report shows many useful statistics about each variable. On the first tab
# "Table", we have a representation of the dataframe. Clicking on each column
# name shows a statistical summary of the variable. For a better view of the
# distribution of each variable, we can click on the "Distributions" tab.
#
# Numerical features' distributions are displayed as histograms, while
# categorical values are shown as bar plots. We can already make a few comments
# about some of the variables:
#
# * `"age"`: there are not that many points for `age > 70`. The dataset
# description does indicate that retired people have been filtered out
# (`hours-per-week > 0`);
# * `"education-num"`: peak at 10 and 13, hard to tell what it corresponds to
# without looking much further. We'll do that later in this notebook;
# * `"hours-per-week"` peaks at 40, this was very likely the standard number of
# working hours at the time of the data collection;
# * most values of `"capital-gain"` and `"capital-loss"` are close to zero;
# * `"sex"`: the data collection process resulted in an important imbalance
# between the number of male/female samples.
#
# About the last observation, be aware that training a model with such data
# imbalance can cause disproportionate prediction errors for the
# under-represented sensitive groups (based on gender or ethnicity for
# instance). This is a typical cause of
# [fairness](https://docs.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml#what-is-machine-learning-fairness)
# problems if used naively when deploying a machine learning based system in a
# real life setting.
# related to the deployment of automated decision making systems that rely on
# machine learning components.
#
# Studying why the data collection process of this dataset led to such an
# unexpected gender imbalance is beyond the scope of this MOOC but we should
# keep in mind that this dataset is not representative of the US population
# before drawing any conclusions based on its statistics or the predictions of
# models trained on it.
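# %% [markdown]
# As a side note, the imbalance mentioned above can also be quantified
# directly with pandas. The following is a minimal sketch on a tiny made-up
# sample (only the `"sex"` column name matches the dataset; the rows are
# invented for illustration):

```python
import pandas as pd

# Hypothetical mini-sample: the column name matches the census dataset,
# but these rows are made up for illustration.
sample = pd.DataFrame({"sex": ["Male"] * 7 + ["Female"] * 3})

# Relative frequencies make the imbalance visible at a glance.
proportions = sample["sex"].value_counts(normalize=True)
print(proportions)
```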


# %% [markdown]
# As noted above, `"education-num"` distribution has two clear peaks around 10
# and 13. It would be reasonable to expect that `"education-num"` is the number
# of years of education.
#
# Let's look at the relationship between `"education"` and `"education-num"` by
# going to the "Associations" tab of the report. This tab shows the statistical
# relationship between each pair of variables in the dataset, using [Cramér's
# V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) and [Pearson's
# Correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
# We can see that `"education"` and `"education-num"` are very strongly related.
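# %% [markdown]
# Cramér's V itself is straightforward to compute by hand from a contingency
# table, using the formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))). Here
# is a minimal sketch on a made-up, perfectly associated 2x2 table (not the
# census data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table with a perfect association: every row
# category fully determines the column category.
observed = np.array([[10, 0], [0, 10]])

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
chi2, _, _, _ = chi2_contingency(observed, correction=False)
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(cramers_v)  # 1.0 for a perfect association
```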

# %%
pd.crosstab(
index=adult_census["education"], columns=adult_census["education-num"]
)

# %% [markdown]
# For every entry in `"education"`, there is a single corresponding
# value in `"education-num"`. This shows that `"education"` and
# `"education-num"` give you the same information. For example,
# `"education-num"=2` is equivalent to `"education"="1st-4th"`. In practice that
# means we can remove `"education-num"` without losing information. Note that
# having redundant (or highly correlated) columns can be a problem for machine
# learning algorithms.
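# %% [markdown]
# The one-to-one mapping claimed above can be checked programmatically. This
# is a sketch on a hypothetical mini-sample that reproduces the redundancy
# (the values mirror the dataset's encoding, e.g. `"1st-4th"` maps to 2):

```python
import pandas as pd

# Hypothetical mini-sample reproducing the education redundancy: each
# "education" label always maps to the same "education-num" value.
sample = pd.DataFrame(
    {
        "education": ["1st-4th", "1st-4th", "HS-grad", "Bachelors"],
        "education-num": [2, 2, 9, 13],
    }
)

# If every education level maps to exactly one education-num value,
# the numerical column carries no extra information.
unique_counts = sample.groupby("education")["education-num"].nunique()
assert (unique_counts == 1).all()

# It can therefore be dropped without losing information.
deduplicated = sample.drop(columns=["education-num"])
```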

#
# All of this data exploration can be done manually with `pandas`. See [this
# notebook](lol) for more information.
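# %% [markdown]
# As a rough sketch of what that manual exploration could look like, on a
# tiny made-up frame (not the census data; the column names are invented for
# illustration):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the census data.
df = pd.DataFrame(
    {
        "age": [25, 38, 28, 44, 18],
        "workclass": ["Private", "Private", "Self-emp", "Private", "Private"],
    }
)

# Numerical columns: summary statistics (and histograms via df.hist()).
print(df["age"].describe())
# _ = df.hist(figsize=(8, 4))  # uncomment for the histogram view

# Categorical columns: frequency counts.
print(df["workclass"].value_counts())
```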
# %% [markdown]
# ```{note}
# In the upcoming notebooks, we will only keep the `"education"` variable,