
Commit aed069b

committed
reworking slides
1 parent 731d1ac commit aed069b

3 files changed

Lines changed: 94 additions & 258 deletions

File tree

slides/chapters/00_intro.qmd

Lines changed: 12 additions & 1 deletion
Original file line number · Diff line number · Diff line change
@@ -1,5 +1,5 @@
11
---
2-
title: "Chapter 1: Introduction"
2+
title: "Course introduction"
33
format:
44
html:
55
toc: true
@@ -12,6 +12,17 @@ execute:
1212
echo: true
1313
---
1414

15+
## Structure of the tutorial
16+
- Two ways to follow the course: a book and a set of slide decks
17+
- The material is split into sections that contain two or more chapters
18+
- Each section ends with a quiz and a coding exercise
19+
20+
## Logistics
21+
- If I go too fast, stop me!
22+
- Q&A time at the end of the chapter
23+
- You can ask in chat and I will answer later
24+
- Break after section 3
25+
1526
## A world without skrub {.smaller}
1627

1728
Let's consider a world where skrub does not exist, and all we can do is use

slides/chapters/01_exploring_data.qmd

Lines changed: 27 additions & 168 deletions
@@ -10,190 +10,99 @@ format:
1010
code-tools: true
1111
footer: "[← Back to Slides Index](../../slides/index.html)"
1212
---
13-
## Introduction
14-
In this chapter, we will show how we use the skrub `TableReport` to explore
15-
tabular data. We will use the Adult Census dataset as our example table, and
16-
perform some exploratory analysis to learn about the characteristics of the data.
13+
## Why do we need to explore data?
14+
Before any kind of data processing or usage, we need to know what we are dealing with.
1715

18-
First, let's import the necessary libraries and load the dataset.
16+
Useful information includes:
1917

18+
- The size of the dataset.
19+
- The data types and names of the columns.
20+
- The distribution of values in the columns.
21+
- Whether null values are present, how many, and where.
22+
- Discrete/categorical features, and their cardinality.
23+
- Columns strongly correlated with each other.
24+
25+
26+
## Loading the data
2027
```{python}
2128
import pandas as pd
22-
from sklearn.ensemble import RandomForestClassifier
2329
from sklearn.datasets import fetch_openml
2430
2531
# Load the Adult Census dataset
2632
data = pd.read_csv("../data/adult_census/data.csv")
2733
target = pd.read_csv("../data/adult_census/target.csv")
2834
```
2935

30-
Now that we have a dataframe we can work with, here is a list of features of the
31-
data we would like to find out:
32-
33-
- The size of the dataset.
34-
- The data types and names of the columns.
35-
- The distribution of values in the columns.
36-
- Whether null values are present, in what measure and where.
37-
- Discrete/categorical features, and their cardinality.
38-
- Columns strongly correlated with each other.
39-
40-
## Exploring data with Pandas tools
36+
## Exploring data with Pandas tools 1/3
4137
Let's first explore the data using Pandas only.
4238

43-
We can get an idea of the content of the table by printing the first few lines,
44-
which gives an idea of the datatypes and the columns we are dealing with.
45-
4639
```{python}
4740
data.head(5)
4841
```
4942

43+
## Exploring data with Pandas tools 2/3
5044
If we want a more compact view of the datatypes in the dataframe, we can
5145
use `data.info()`:
5246

5347
```{python}
5448
data.info()
5549
```
5650

57-
With `.info()` we can find out the shape of the dataframe (the number of rows
58-
and columns), the datatype and the number of non-null values for each column.
59-
51+
## Exploring data with Pandas tools 3/3
6052
We can also get a richer summary of the data with the `.describe()` method:
6153

6254
```{python}
6355
data.describe(include="all")
6456
```
6557

66-
This gives us useful information about all the features in the dataset. Among
67-
others, we can find the number of unique values in each column, various statistics
68-
for the numerical columns and the number of null values.
69-
7058
## Exploring data with the `TableReport`
7159
Now, let's create a TableReport to explore the dataset.
7260
```{python}
7361
from skrub import TableReport
74-
TableReport(data, verbose=0)
62+
TableReport(data)
7563
```
7664

77-
::: {.content-hidden when-format="revealjs"}
7865

79-
### Default view of the TableReport
80-
The `TableReport` gives us a comprehensive overview of the dataset. The default
81-
view shows all the columns in the dataset, and allows to select and copy the content
66+
## Default view of the TableReport
67+
The `TableReport` shows all the columns in the dataset, and lets us select and copy the content
8268
of the cells shown in the preview.
8369

84-
The `TableReport` is intended to show a preview of the data, so it does not
85-
contain all the rows in the dataset, rather it shows only the first and last
86-
few rows by default. Similarly, it stores only the top 10 most frequent values
87-
for each column, if column distributions are plotted.
88-
89-
:::
70+
The `TableReport` shows a preview of the data, so it displays only the first and last
71+
few rows by default.
9072

9173
### The "Stats" tab
9274

9375
```{python}
9476
TableReport(data, open_tab="stats")
9577
```
9678

97-
::: {.content-hidden when-format="revealjs"}
98-
99-
The "Stats" tab provides a variety of descriptive statistics for each column in
100-
the dataset.
101-
This includes:
102-
103-
- The column name
104-
- The detected data type of the column
105-
- Whether the column is sorted or not
106-
- The number of null values in the column, as well as the percentage
107-
- The number of unique values in the column
108-
109-
For numerical columns, additional statistics are provided:
110-
111-
- Mean
112-
- Standard deviation
113-
- Minimum and maximum values
114-
- Median
115-
116-
Stat columns can also be sorted, for example to quickly identify which columns
117-
contain the most nulls, or have the largest cardinality (number of unique values).
118-
119-
:::
120-
121-
::: {.callout}
122-
### Filters
79+
## Filters
12380
Pre-made column filters are also available, allowing us to select columns by dtype
12481
or other characteristics. Filters are shared across tabs.
125-
:::
12682

127-
### The "Distributions" tab
83+
## The "Distributions" tab
12884
```{python}
12985
TableReport(data, open_tab="distributions")
13086
```
13187

132-
::: {.content-hidden when-format="revealjs"}
133-
134-
The "Distributions" tab provides visualizations of the distributions of values
135-
in each column. This includes histograms for numerical columns and bar plots for
136-
categorical columns.
137-
138-
The "Distributions" tab helps with detecting potential issues in the data, such as:
139-
140-
- Skewed distributions
141-
- Outliers
142-
- Unexpected value frequencies
143-
144-
For example, in this dataset we can see that some columns are heavily
145-
skewed, such as "workclass", "race", and "native-country": this is important
146-
information to keep track of, because these columns may require special handling
147-
during data preprocessing or modeling.
14888

149-
Additionally, the "Distributions" tab allows to select columns manually, so that
150-
they can be added to a script and selected for further analysis or modeling.
15189

152-
::: {.callout-caution}
153-
#### Outlier detection
90+
## Outlier detection
15491
The `TableReport` detects outliers using a simple interquartile test, marking
15592
as outliers all values that fall outside the IQR-based bounds. This is a simple heuristic, and
15693
should not be treated as perfect. If your problem requires reliable outlier
15794
detection, you should not rely exclusively on what the `TableReport` shows.
158-
:::
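To make the heuristic above concrete, here is a minimal sketch of an interquartile-range outlier test in plain pandas, using the common 1.5 × IQR rule. This is my own illustration, not skrub's actual implementation, whose exact thresholds may differ:

```python
import pandas as pd

# Toy numeric column: most values cluster together, one clear outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 100])

# IQR test: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [100]
```

Rules like this are cheap and dependency-free, which is why they suit a quick report, but they assume a roughly unimodal distribution.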
159-
16095

161-
:::
162-
163-
### The "Associations" tab
96+
## The "Associations" tab
16497

16598
```{python}
16699
TableReport(data, open_tab="associations")
167100
```
168101

169-
::: {.content-hidden when-format="revealjs"}
170-
171-
The "Associations" tab provides insights into the relationships between different
172-
columns in the dataset.
173-
It shows [Pearson's correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
174-
coefficient for numerical columns, as well as
175-
[Cramér's V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) for all columns.
176-
177-
While this is a somewhat rough measure of association, it can help identify potential
178-
relationships worth exploring further during the analysis, and highlights
179-
highly correlated columns: depending on the modeling technique used, these may need
180-
to be handled specially to avoid issues with multicollinearity.
181-
182-
In this example, we can see that "education-num" and "education" have perfect
183-
correlation, which means that one of the two columns can be dropped without losing
184-
information.
185-
186-
187-
:::
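The association measure described above can also be computed by hand. Here is a rough sketch of Cramér's V for two perfectly associated columns, mimicking the "education" / "education-num" case; this is my own illustration (assuming scipy is available), not the `TableReport` internals:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Two perfectly associated categorical columns: one determines the other,
# like "education" and "education-num" in the Adult dataset.
df = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "MSc", "MSc"],
    "education_num": [9, 9, 13, 13, 14, 14],
})

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
contingency = pd.crosstab(df["education"], df["education_num"])
chi2, _, _, _ = chi2_contingency(contingency, correction=False)
n = contingency.to_numpy().sum()
r, c = contingency.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
print(round(cramers_v, 3))  # perfect association -> 1.0
```

A value of 1.0 confirms that one column is redundant given the other, which is exactly why "education" or "education-num" can be dropped without losing information.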
188-
189102
## Exploring the target variable
190103
Besides dataframes, the `TableReport` handles series and one- and two-dimensional
191104
numpy arrays.
192105

193-
So, let's take a closer look at the target variable, which indicates whether an
194-
individual's income exceeds $50K per year. We can create a separate `TableReport`
195-
for the target variable to explore its distribution:
196-
197106
```{python}
198107
TableReport(target)
199108
```
@@ -207,34 +116,6 @@ TableReport(data).write_html("report.html")
207116
Then, the report can be opened in any web browser, with no need to run
208117
a Jupyter notebook or a Python interactive console.
209118

210-
::: {.content-hidden when-format="revealjs"}
211-
212-
213-
It is possible to configure various parameters using the skrub global config.
214-
For example, it is possible to replace the default Pandas or Polars dataframe
215-
display with the TableReport by using `patch_display` (and `unpatch_display`):
216-
217-
```{python}
218-
from skrub import patch_display, unpatch_display
219-
220-
# replace the default pandas repr
221-
patch_display()
222-
data
223-
```
224-
225-
To disable, use `unpatch_display`:
226-
```{python}
227-
unpatch_display()
228-
data
229-
```
230-
231-
232-
More detail on the skrub configuration is reported in the
233-
[User Guide](https://skrub-data.org/dev/modules/configuration_and_utils/customizing_configuration.html).
234-
235-
236-
:::
237-
238119
## Working with big tables
239120

240121
::: {.content-hidden when-format="revealjs"}
@@ -259,33 +140,11 @@ TableReport(
259140
```
260141

261142

262-
## Conclusions
263-
264-
::: {.content-hidden when-format="revealjs"}
265-
266-
267-
In this chapter we have learned how the `TableReport` can be used to speed up
268-
data exploration, allowing us to find possible criticalities in the data.
143+
## What we have seen in this chapter
269144

270-
We covered:
271-
272-
- Creating and configuring a `TableReport` for fast, interactive data exploration
273-
- Exploring column statistics, value distributions, and associations visually
274-
- Detecting nulls, outliers, and highly correlated columns at a glance
275-
- Filtering columns by type or characteristics using built-in filters
276-
- Saving and sharing interactive reports as standalone HTML files
277-
- Adjusting `TableReport` settings for large datasets to optimize performance
278-
279-
In the
280-
next chapter, we will find out how to address some of the possible problems using
281-
the skrub `Cleaner`.
282-
283-
:::
284145

285-
::: {.content-visible when-format="revealjs"}
286-
- The `TableReport` is a powerful EDA tool
287-
- It shows a rich preview of the content of the dataframe
146+
- The `TableReport` shows a rich preview of the content of the dataframe
288147
- It provides precomputed statistics for all the columns
289148
- It prepares distribution plots for each column
290149
- It measures the association between columns
291-
:::
150+
- It can be stored as an HTML file and shared without needing a running kernel
