11---
2- title : " Chapter 1: Introduction "
2+ title : " Data Preparation with skrub "
33format :
44 html :
55 toc : true
@@ -18,7 +18,7 @@ Let's consider a world where skrub does not exist, and all we can do is use
1818pandas and scikit-learn to prepare data for a machine learning model.
1919
2020
21- ## Load and explore the data
21+ ### Load and explore the data
2222``` {python}
2323import pandas as pd
2424import numpy as np
@@ -28,23 +28,23 @@ y = pd.read_csv("../data/employee_salaries/target.csv")["current_annual_salary"]
2828X.head(5)
2929```
3030
31- ## Explore the target
31+ ### Explore the target
3232Let's take a look at the target:
3333``` {python}
3434y.head(5)
3535```
3636
3737This is a **regression** task: we want to predict the value of `current_annual_salary`.
3838
39- ## Strategizing
39+ ### Strategizing
4040We can begin by exploring the dataframe with ` .describe ` , and then think of a
4141plan for pre-processing our data.
4242
4343``` {python}
4444X.describe(include="all")
4545```
4646
47- ## Our plan
47+ ### Our plan
4848We need to:
4949
5050- Impute some missing values in the ` gender ` column.
@@ -53,7 +53,8 @@ We need to:
5353- Scale numerical features.
5454- Evaluate the performance of the model.
5555
56- ## Step 1: Convert date features to numerical {.smaller}
56+ ## Feature engineering
57+ ### Step 1: Convert date features to numerical {.smaller}
5758
5859We extract numerical features from the ` date_first_hired ` column.
5960
@@ -75,7 +76,7 @@ print("Features after date transformation:")
7576print("\nShape:", X_processed.shape)
7677```
7778
78- ## Step 2: Encode categorical features {.smaller}
79+ ### Step 2: Encode categorical features {.smaller}
7980
8081We encode the categorical features using one-hot encoding.
8182
@@ -89,7 +90,7 @@ X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
8990print("\nShape after encoding:", X_encoded.shape)
9091```
9192
92- ## Step 3: Impute missing values {.smaller}
93+ ### Step 3: Impute missing values {.smaller}
9394
9495We impute the missing values in the `gender` column.
9596
@@ -104,7 +105,7 @@ X_encoded_imputed = pd.DataFrame(
104105)
105106```
106107
107- ## Step 4: Scale numerical features {.smaller}
108+ ### Step 4: Scale numerical features {.smaller}
108109
109110Scale numerical features for the Ridge regression model.
110111
@@ -119,18 +120,18 @@ X_scaled = scaler.fit_transform(X_encoded_imputed)
119120X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
120121```
121122
122- ## Step 5: Train Ridge model with cross-validation {.smaller}
123+ ### Step 5: Train Ridge model with cross-validation {.smaller}
123124
124125Train a ` Ridge ` regression model and evaluate with cross-validation.
125126
126127``` {python}
127128#| warning: false
128- from sklearn.linear_model import Ridge
129+ from sklearn.linear_model import RidgeCV
129130from sklearn.model_selection import cross_val_score, cross_validate
130131import numpy as np
131132
132133# Initialize Ridge model
133- ridge = Ridge(alpha=1.0 )
134+ ridge = RidgeCV( )
134135
135136# Perform cross-validation (5-fold)
136137cv_results = cross_validate(ridge, X_scaled, y, cv=5, scoring=["r2", "neg_mean_squared_error"])
@@ -146,13 +147,26 @@ print(
146147print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
147148```
148149
149- ## "Just ask an agent to write the code" {.smaller}
150- - Operations in the wrong order.
151- - Trying to impute categorical features without converting them to numeric values.
152- - The datetime feature was treated like a categorical feature.
153- - Cells could not be executed in order without proper debugging and re-prompting.
154- - ` pd.get_dummies ` was executed on the full dataframe, rather than only on the
155- training split, leading to data leakage.
150+ ## Looking back
151+ ### "Just ask an agent to write the code" {.smaller}
152+ That's what I did: I asked an AI agent to turn the "strategizing" section into
153+ code, and the result had all sorts of issues. In fact, the strategy I suggested
154+ is not laid out in the best order, but the agent still followed it, leading to
155+ more problems.
156+
157+ Imputation is performed before converting categorical features to numeric values,
158+ so it fails in various ways.
159+
160+ The wrong order of operations also means that datetimes are not parsed as datetimes
161+ before applying `OneHotEncoder`; they are therefore treated as categorical and
162+ encoded suboptimally.
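The datetime issue described above can be reproduced on a toy column. This is only a sketch: the three dates, the `%m/%d/%Y` format, and the output column names `year_first_hired` and `month_first_hired` are assumptions made for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy column (hypothetical values); the date format is an assumption.
dates = pd.DataFrame(
    {"date_first_hired": ["01/15/2001", "03/02/2010", "11/30/2015"]}
)

# Left as strings, every distinct date becomes its own indicator column.
as_categories = pd.get_dummies(dates, columns=["date_first_hired"])

# Parsed first, the same column yields a fixed number of numeric features.
parsed = pd.to_datetime(dates["date_first_hired"], format="%m/%d/%Y")
as_numbers = pd.DataFrame(
    {
        "year_first_hired": parsed.dt.year,
        "month_first_hired": parsed.dt.month,
    }
)

print(as_categories.shape)
print(as_numbers)
```

With one indicator column per distinct date, the encoded table grows with the number of unique dates and loses the ordering between them, which is what "encoded suboptimally" refers to.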
163+
164+ Finally, `pd.get_dummies` was executed on the full dataframe, rather than only on
165+ the training split: this causes **data leakage** by creating encoded columns for
166+ categories that appear only in the test set.
167+
168+ The order of our operations is just as important as the operations themselves,
169+ and relying blindly on agents can have bad consequences.
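One common way to make the order of operations explicit with scikit-learn alone is to wrap the steps in a `Pipeline` and a `ColumnTransformer`, so that every transformer is fitted only on the training folds during cross-validation. The sketch below is not the code from this chapter: the toy table, its values, and the column name `year_first_hired` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the employee table (values are made up for illustration).
rng = np.random.default_rng(0)
n_rows = 40
X_toy = pd.DataFrame(
    {
        "gender": ["F", "M", np.nan, "M"] * (n_rows // 4),
        "year_first_hired": rng.integers(1990, 2020, size=n_rows),
    }
)
y_toy = (
    50_000
    + 1_000 * (2020 - X_toy["year_first_hired"])
    + rng.normal(0, 500, size=n_rows)
)

# Categorical branch: impute first, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Each group of columns gets its own, explicitly ordered treatment.
preprocessing = ColumnTransformer(
    [
        ("categorical", categorical, ["gender"]),
        ("numerical", StandardScaler(), ["year_first_hired"]),
    ]
)

model = make_pipeline(preprocessing, RidgeCV())

# All preprocessing is re-fitted inside each training fold.
cv_results = cross_validate(model, X_toy, y_toy, cv=5, scoring="r2")
print(cv_results["test_score"])
```

Because imputation, encoding, and scaling live inside the estimator passed to `cross_validate`, none of them ever sees the held-out fold.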
156170
157171## Waking up from a nightmare {.smaller}
158172Thankfully, we can ` import skrub ` :
@@ -175,15 +189,27 @@ print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test
175189print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
176190```
177191
178- ## Roadmap for the course {.smaller}
192+ ### What is happening?
193+ The `tabular_pipeline` implements all the steps we have just covered, but in a
194+ more robust and battle-tested fashion. Before encoding, the input table is
195+ sanitized so that features have the proper dtype and null values are marked as
196+ such; then the appropriate encoder is applied to each column.
179197
180- 1. Data exploration with skrub's `TableReport`
181- 2. Data cleaning and sanitization with the `Cleaner`
182- 3. Intermission: simplifying column operations with skrub
183- 4. Feature engineering with the skrub encoders
184- 5. Putting everything together: `TableVectorizer` and `tabular_pipeline`
198+ The functioning of the ` tabular_pipeline ` is explored in more detail in a later
199+ chapter.
185200
186201## What we saw in this chapter
187202- We built a predictive pipeline using traditional tools
188203- We saw some possible shortcomings
189- - We tested skrub's ` tabular_pipeline `
204+ - We tested skrub's `tabular_pipeline`
205+
206+ ## Roadmap for the course {.smaller}
207+
208+ 1. Data exploration with skrub's `TableReport`
209+ 2. Data cleaning and sanitization with the `Cleaner`
210+ 3. Automatic feature engineering with the `TableVectorizer`
211+ 4. A robust baseline for machine learning tasks with the `tabular_pipeline`
212+ 5. Columnwise operations with `ApplyToCols`
213+ 6. Advanced column selectors and how to use them
214+ 7. Feature engineering with the skrub encoders
215+ 8. Building dynamic pipelines with the skrub Data Ops (short intro)