
Commit a05a0d6

committed
_
1 parent 1b4108e commit a05a0d6

7 files changed

Lines changed: 206 additions & 137 deletions

book/chapters/00_intro.qmd

Lines changed: 52 additions & 26 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: "Chapter 1: Introduction"
2+
title: "Data Preparation with skrub"
33
format:
44
html:
55
toc: true
@@ -18,7 +18,7 @@ Let's consider a world where skrub does not exist, and all we can do is use
1818
pandas and scikit-learn to prepare data for a machine learning model.
1919

2020

21-
## Load and explore the data
21+
### Load and explore the data
2222
```{python}
2323
import pandas as pd
2424
import numpy as np
@@ -28,23 +28,23 @@ y = pd.read_csv("../data/employee_salaries/target.csv")["current_annual_salary"]
2828
X.head(5)
2929
```
3030

31-
## Explore the target
31+
### Explore the target
3232
Let's take a look at the target:
3333
```{python}
3434
y.head(5)
3535
```
3636

3737
This is a **regression** task: we want to predict the value of `current_annual_salary`.
3838

39-
## Strategizing
39+
### Strategizing
4040
We can begin by exploring the dataframe with `.describe`, and then think of a
4141
plan for pre-processing our data.
4242

4343
```{python}
4444
X.describe(include="all")
4545
```
4646

47-
## Our plan
47+
### Our plan
4848
We need to:
4949

5050
- Impute some missing values in the `gender` column.
@@ -53,7 +53,8 @@ We need to:
5353
- Scale numerical features.
5454
- Evaluate the performance of the model.
5555

56-
## Step 1: Convert date features to numerical {.smaller}
56+
## Feature engineering
57+
### Step 1: Convert date features to numerical {.smaller}
5758

5859
We extract numerical features from the `date_first_hired` column.
5960

@@ -75,7 +76,7 @@ print("Features after date transformation:")
7576
print("\nShape:", X_processed.shape)
7677
```
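The body of the date-conversion step is not visible in this hunk, so the following is only a minimal sketch of how such a step could look, not the chapter's actual code. The toy frame `X_demo`, the `%m/%d/%Y` date format, and the `hired_*` column names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the employee table; the values are made up for illustration.
X_demo = pd.DataFrame({"date_first_hired": ["09/22/1986", "01/05/2015", None]})

# Parse the strings into datetimes. The format is an assumption;
# missing or unparseable entries become NaT instead of raising.
hired = pd.to_datetime(
    X_demo["date_first_hired"], format="%m/%d/%Y", errors="coerce"
)

# Derive numerical features from the parsed dates (hypothetical column names).
date_features = pd.DataFrame(
    {
        "hired_year": hired.dt.year,
        "hired_month": hired.dt.month,
        "hired_day": hired.dt.day,
    }
)

# Replace the raw date column with the derived numerical columns.
X_demo_processed = pd.concat(
    [X_demo.drop(columns="date_first_hired"), date_features], axis=1
)
print("Shape:", X_demo_processed.shape)
```

Parsing before any encoding step matters here: once the column holds real datetimes, it can no longer be mistaken for a categorical column.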
7778

78-
## Step 2: Encode categorical features {.smaller}
79+
### Step 2: Encode categorical features {.smaller}
7980

8081
We encode the categorical features using one-hot encoding.
8182

@@ -89,7 +90,7 @@ X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
8990
print("\nShape after encoding:", X_encoded.shape)
9091
```
9192

92-
## Step 3: Impute missing values {.smaller}
93+
### Step 3: Impute missing values {.smaller}
9394

9495
We impute the missing values in the `gender` column.
9596

@@ -104,7 +105,7 @@ X_encoded_imputed = pd.DataFrame(
104105
)
105106
```
106107

107-
## Step 4: Scale numerical features {.smaller}
108+
### Step 4: Scale numerical features {.smaller}
108109

109110
Scale numerical features for the Ridge regression model.
110111

@@ -119,18 +120,18 @@ X_scaled = scaler.fit_transform(X_encoded_imputed)
119120
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
120121
```
121122

122-
## Step 5: Train Ridge model with cross-validation {.smaller}
123+
### Step 5: Train Ridge model with cross-validation {.smaller}
123124

124125
Train a `Ridge` regression model and evaluate with cross-validation.
125126

126127
```{python}
127128
#| warning: false
128-
from sklearn.linear_model import Ridge
129+
from sklearn.linear_model import RidgeCV
129130
from sklearn.model_selection import cross_val_score, cross_validate
130131
import numpy as np
131132
132133
# Initialize Ridge model
133-
ridge = Ridge(alpha=1.0)
134+
ridge = RidgeCV()
134135
135136
# Perform cross-validation (5-fold)
136137
cv_results = cross_validate(ridge, X_scaled, y, cv=5, scoring=["r2", "neg_mean_squared_error"])
@@ -146,13 +147,26 @@ print(
146147
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
147148
```
148149

149-
## "Just ask an agent to write the code" {.smaller}
150-
- Operations in the wrong order.
151-
- Trying to impute categorical features without converting them to numeric values.
152-
- The datetime feature was treated like a categorical feature.
153-
- Cells could not be executed in order without proper debugging and re-prompting.
154-
- `pd.get_dummies` was executed on the full dataframe, rather than only on the
155-
training split, leading to data leakage.
150+
## Looking back
151+
### "Just ask an agent to write the code" {.smaller}
152+
That's what I did: I asked an AI agent to turn the "strategizing" section into
153+
code, and the result had all sorts of issues. In fact, the strategy I suggested
154+
is not laid out in the best order, but the agent still followed it, leading to
155+
more problems.
156+
157+
Imputation is performed before converting categorical features to numeric values,
158+
so it fails in various ways.
159+
160+
A wrong order of operations also means that datetimes are not parsed as datetimes
161+
before one-hot encoding is applied, so they are treated as categorical and
162+
encoded suboptimally.
163+
164+
Finally, `pd.get_dummies` was executed on the full dataframe, rather than only on
165+
the training split: this causes **data leakage** by creating an encoding for
166+
categories that appear only in the test set.
167+
168+
The order of our operations is just as important as the operations themselves,
169+
and relying blindly on agents can have bad consequences.
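
The leakage issue can be illustrated with a small sketch. The toy `department` column and the fixed train/test split below are hypothetical; the point is only to contrast `pd.get_dummies` on the full table with scikit-learn's `OneHotEncoder` fitted on the training rows alone.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "FRS" and "LIB" occur only in the test rows.
toy = pd.DataFrame({"department": ["POL", "POL", "HHS", "HHS", "FRS", "LIB"]})
train, test = toy.iloc[:4], toy.iloc[4:]

# Leaky: encoding the full table creates columns for test-only categories.
leaky = pd.get_dummies(toy)
print("Columns with get_dummies on the full table:", list(leaky.columns))

# Leak-free: learn the categories from the training rows only.
# Unknown categories seen at transform time are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
train_encoded = encoder.transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print("Categories learned from train:", list(encoder.categories_[0]))
print("Encoded test rows:\n", test_encoded)
```

With the encoder fitted on the training split, the feature space is fixed before the test rows are seen, which is the behavior a cross-validated pipeline needs.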
156170

157171
## Waking up from a nightmare {.smaller}
158172
Thankfully, we can `import skrub`:
@@ -175,15 +189,27 @@ print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test
175189
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
176190
```
177191

178-
## Roadmap for the course {.smaller}
192+
### What is happening?
193+
The `tabular_pipeline` implements all the steps we have just covered, but
194+
in a more robust and battle-tested fashion. Before encoding, the input table is
195+
sanitized so that features have the proper dtype, null values are marked as such,
196+
and then the appropriate encoder is applied to each column.
179197

180-
1. Data exploration with skrub's `TableReport`
181-
2. Data cleaning and sanitization with the `Cleaner`
182-
3. Intermission: simplifying column operations with skrub
183-
4. Feature engineering with the skrub encoders
184-
5. Putting everything together: `TableVectorizer` and `tabular_pipeline`
198+
The functioning of the `tabular_pipeline` is explored in more detail in a later
199+
chapter.
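
To make the idea concrete, here is a rough scikit-learn-only sketch of the same pattern: per-column preprocessing bundled with the model in a single estimator. This is not skrub's implementation, and it omits the dtype sanitization described above; the toy data, column names, and imputation strategies are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table with one categorical and one numerical column.
X_toy = pd.DataFrame(
    {
        "gender": pd.Series(["F", "M", np.nan, "F", "M", "F"], dtype="object"),
        "year_first_hired": [1986.0, 2015.0, 2001.0, np.nan, 1999.0, 2010.0],
    }
)
y_toy = pd.Series([70000.0, 52000.0, 61000.0, 58000.0, 66000.0, 54000.0])

# Numerical columns: impute, then scale.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical columns: impute, then one-hot encode, ignoring unseen categories.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Route each column to the matching preprocessing branch based on its dtype.
preprocess = ColumnTransformer(
    [
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ]
)

# Preprocessing and model live in one estimator, so every step is fitted together.
model = make_pipeline(preprocess, RidgeCV())
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
```

Because the whole object is a single estimator, passing it to `cross_validate` refits the imputers, scaler, and encoder on each training fold, which avoids the ordering and leakage problems discussed earlier.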
185200

186201
## What we saw in this chapter
187202
- We built a predictive pipeline using traditional tools
188203
- We saw some possible shortcomings
189-
- We tested skrub's `tabular_pipeline`
204+
- We tested skrub's `tabular_pipeline`
205+
206+
## Roadmap for the course {.smaller}
207+
208+
1. Data exploration with skrub's `TableReport`
209+
2. Data cleaning and sanitization with the `Cleaner`
210+
3. Automatic feature engineering with the `TableVectorizer`
211+
4. A robust baseline for machine learning tasks with the `tabular_pipeline`
212+
5. Columnwise operations with `ApplyToCols`
213+
6. Advanced column selectors and how to use them
214+
7. Feature engineering with the skrub encoders
215+
8. Building dynamic pipelines with the skrub Data Ops (short intro)
