@@ -10,190 +10,99 @@ format:
code-tools: true
footer: " [← Back to Slides Index](../../slides/index.html)"
---
- ## Introduction
- In this chapter, we will show how we use the skrub `TableReport` to explore
- tabular data. We will use the Adult Census dataset as our example table, and
- perform some exploratory analysis to learn about the characteristics of the data.
+ ## Why do we need to explore data?
+ Before any kind of data processing or usage, we need to know what we are dealing with.

- First, let's import the necessary libraries and load the dataset.
+ Useful information includes:

+ - The size of the dataset.
+ - The data types and names of the columns.
+ - The distribution of values in the columns.
+ - Whether null values are present, to what extent, and where.
+ - Discrete/categorical features and their cardinality.
+ - Columns strongly correlated with each other.
+
+
+ ## Loading the data
``` {python}
import pandas as pd
- from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml

# Load the Adult Census dataset
data = pd.read_csv("../data/adult_census/data.csv")
target = pd.read_csv("../data/adult_census/target.csv")
```

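As a first illustration of the checks we will discuss, the basic questions (size, dtypes, nulls, cardinality) can be answered with plain pandas. The small frame below is made up for the example and only stands in for the real `data`:

``` {python}
import pandas as pd

# A made-up frame standing in for `data` (hypothetical values)
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 38],
    "workclass": ["Private", "Private", "Private", "State-gov", None],
    "hours-per-week": [40, 50, 40, 45, 40],
})

print(df.shape)         # size of the dataset (rows, columns)
print(df.dtypes)        # data types of the columns
print(df.isna().sum())  # null values per column
print(df.nunique())     # cardinality of each column
```

The `TableReport` introduced later gathers all of this in a single interactive view.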
- Now that we have a dataframe we can work with, here is a list of features of the
- data we would like to find out:
-
- - The size of the dataset.
- - The data types and names of the columns.
- - The distribution of values in the columns.
- - Whether null values are present, in what measure and where.
- - Discrete/categorical features, and their cardinality.
- - Columns strongly correlated with each other.
-
- ## Exploring data with Pandas tools
+ ## Exploring data with Pandas tools 1/3
Let's first explore the data using Pandas only.

- We can get an idea of the content of the table by printing the first few lines,
- which gives an idea of the datatypes and the columns we are dealing with.
-
``` {python}
data.head(5)
```

+ ## Exploring data with Pandas tools 2/3
If we want a simpler view of the datatypes in the dataframe, we can
use `data.info()`:

``` {python}
data.info()
```

- With `.info()` we can find out the shape of the dataframe (the number of rows
- and columns), the datatype and the number of non-null values for each column.
-
+ ## Exploring data with Pandas tools 3/3
We can also get a richer summary of the data with the `.describe()` method:

``` {python}
data.describe(include="all")
```

- This gives us useful information about all the features in the dataset. Among
- others, we can find the number of unique values in each column, various statistics
- for the numerical columns and the number of null values.
-
## Exploring data with the `TableReport`
Now, let's create a `TableReport` to explore the dataset.
``` {python}
from skrub import TableReport
- TableReport(data, verbose=0)
+ TableReport(data)
```

- ::: {.content-hidden when-format="revealjs"}
-
- ### Default view of the TableReport
- The `TableReport` gives us a comprehensive overview of the dataset. The default
- view shows all the columns in the dataset, and allows to select and copy the content
+ ## Default view of the TableReport
+ The `TableReport` shows all the columns in the dataset and lets us select and copy the content
of the cells shown in the preview.

- The `TableReport` is intended to show a preview of the data, so it does not
- contain all the rows in the dataset, rather it shows only the first and last
- few rows by default. Similarly, it stores only the top 10 most frequent values
- for each column, if column distributions are plotted.
-
- :::
+ The `TableReport` shows a preview of the data, so it displays only the first and last
+ few rows by default.

### The "Stats" tab

``` {python}
TableReport(data, open_tab="stats")
```

- ::: {.content-hidden when-format="revealjs"}
-
- The "Stats" tab provides a variety of descriptive statistics for each column in
- the dataset.
- This includes:
-
- - The column name
- - The detected data type of the column
- - Whether the column is sorted or not
- - The number of null values in the column, as well as the percentage
- - The number of unique values in the column
-
- For numerical columns, additional statistics are provided:
-
- - Mean
- - Standard deviation
- - Minimum and maximum values
- - Median
-
- Stat columns can also be sorted, for example to quickly identify which columns
- contain the most nulls, or have the largest cardinality (number of unique values).
-
- :::
-
- ::: {.callout}
- ### Filters
+ ## Filters
Pre-made column filters are also available, allowing you to select columns by dtype
or other characteristics. Filters are shared across tabs.
- :::

- ### The "Distributions" tab
+ ## The "Distributions" tab
``` {python}
TableReport(data, open_tab="distributions")
```

- ::: {.content-hidden when-format="revealjs"}
-
- The "Distributions" tab provides visualizations of the distributions of values
- in each column. This includes histograms for numerical columns and bar plots for
- categorical columns.
-
- The "Distributions" tab helps with detecting potential issues in the data, such as:
-
- - Skewed distributions
- - Outliers
- - Unexpected value frequencies
-
- For example, in this dataset we can see that some columns are heavily
- skewed, such as "workclass", "race", and "native-country": this is important
- information to keep track of, because these columns may require special handling
- during data preprocessing or modeling.

- Additionally, the "Distributions" tab allows to select columns manually, so that
- they can be added to a script and selected for further analysis or modeling.

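A heavily skewed categorical column of the kind mentioned above can also be spotted numerically. A minimal sketch on a hypothetical column (the values are made up and only mimic "native-country"):

``` {python}
import pandas as pd

# Hypothetical, heavily skewed categorical column
country = pd.Series(["US"] * 90 + ["MX"] * 6 + ["IN"] * 4, name="native-country")

# Relative frequencies make the skew obvious: one category dominates
print(country.value_counts(normalize=True))
```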
- ::: {.callout-caution}
- #### Outlier detection
+ ## Outlier detection
The `TableReport` detects outliers using a simple interquartile-range (IQR) test, marking
as outliers all values that lie far outside the IQR. This is a simple heuristic and
should not be treated as perfect: if your problem requires reliable outlier
detection, do not rely exclusively on what the `TableReport` shows.
- :::
-

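As a rough sketch of what such a test looks like, here is the textbook Tukey variant (1.5 × IQR fences) on made-up values. This illustrates the idea only and is not necessarily the exact rule skrub implements:

``` {python}
import pandas as pd

# Made-up numerical column with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
# Tukey's rule: flag values outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)
```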
- :::
-
- ### The "Associations" tab
+ ## The "Associations" tab

``` {python}
TableReport(data, open_tab="associations")
```

- ::: {.content-hidden when-format="revealjs"}
-
- The "Associations" tab provides insights into the relationships between different
- columns in the dataset.
- It shows [Pearson's correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
- coefficient for numerical columns, as well as
- [Cramér's V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) for all columns.
-
- While this is a somewhat rough measure of association, it can help identify potential
- relationships worth exploring further during the analysis, and highlights
- highly correlated columns: depending on the modeling technique used, these may need
- to be handled specially to avoid issues with multicollinearity.
-
- In this example, we can see that "education-num" and "education" have perfect
- correlation, which means that one of the two columns can be dropped without losing
- information.
-
-
- :::
-
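The perfect "education"/"education-num" association can be reproduced on a toy pair of columns. Below is a hand-rolled Cramér's V, shown only to illustrate the formula (it is not skrub's own implementation, and the data is made up):

``` {python}
import numpy as np
import pandas as pd

# Two made-up columns that encode the same information
toy = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "PhD", "PhD"],
    "education-num": [9, 9, 13, 13, 16, 16],
})

def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic of the contingency table."""
    obs = pd.crosstab(x, y).to_numpy()
    n = obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(obs.shape) - 1))))

print(cramers_v(toy["education"], toy["education-num"]))  # 1.0: perfect association
```

A value of 1 means one of the two columns carries no extra information.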
## Exploring the target variable
Besides dataframes, the `TableReport` handles series and one- and two-dimensional
NumPy arrays.

- So, let's take a closer look at the target variable, which indicates whether an
- individual's income exceeds $50K per year. We can create a separate `TableReport`
- for the target variable to explore its distribution:
-
``` {python}
TableReport(target)
```
@@ -207,34 +116,6 @@ TableReport(data).write_html("report.html")
Then, the report can be opened in any internet browser, with no need to run
a Jupyter notebook or a Python interactive console.

- ::: {.content-hidden when-format="revealjs"}
-
-
- It is possible to configure various parameters using the skrub global config.
- For example, it is possible to replace the default Pandas or Polars dataframe
- display with the TableReport by using `patch_display` (and `unpatch_display`):
-
- ``` {python}
- from skrub import patch_display, unpatch_display
-
- # replace the default pandas repr
- patch_display()
- data
- ```
-
- To disable, use `unpatch_display`:
- ``` {python}
- unpatch_display()
- data
- ```
-
-
- More detail on the skrub configuration is reported in the
- [User Guide](https://skrub-data.org/dev/modules/configuration_and_utils/customizing_configuration.html).
-
-
- :::
-
## Working with big tables

::: {.content-hidden when-format="revealjs"}
@@ -259,33 +140,11 @@ TableReport(
```


- ## Conclusions
-
- ::: {.content-hidden when-format="revealjs"}
-
-
- In this chapter we have learned how the `TableReport` can be used to speed up
- data exploration, allowing us to find possible criticalities in the data.
+ ## What we have seen in this chapter

- We covered:
-
- - Creating and configuring a `TableReport` for fast, interactive data exploration
- - Exploring column statistics, value distributions, and associations visually
- - Detecting nulls, outliers, and highly correlated columns at a glance
- - Filtering columns by type or characteristics using built-in filters
- - Saving and sharing interactive reports as standalone HTML files
- - Adjusting `TableReport` settings for large datasets to optimize performance
-
- In the
- next chapter, we will find out how to address some of the possible problems using
- the skrub `Cleaner`.
-
- :::

- ::: {.content-visible when-format="revealjs"}
- - The `TableReport` is a powerful EDA tool
- - It shows a rich preview of the content of the dataframe
+ - The `TableReport` shows a rich preview of the content of the dataframe
- It provides precomputed statistics for all the columns
- It prepares distribution plots for each column
- It measures the association between columns
- :::
+ - It can be stored as an HTML file and shared without needing a running kernel