Commit 2a9f14d

committed: replacing links with actual files

1 parent 2def2b7 commit 2a9f14d

15 files changed: +3002 -15 lines changed

slides/chapters/00_intro.qmd

Lines changed: 0 additions & 1 deletion
This file was deleted.

slides/chapters/00_intro.qmd

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
---
title: "Chapter 1: Introduction"
format:
  html:
    toc: true
  revealjs:
    slide-number: true
    toc: false
    code-fold: false
    code-tools: true
execute:
  echo: true
---

## A world without skrub {.smaller}

Let's consider a world where skrub does not exist, and all we can do is use
pandas and scikit-learn to prepare data for a machine learning model.

## Load and explore the data

```{python}
import pandas as pd
import numpy as np

X = pd.read_csv("../data/employee_salaries/data.csv")
y = pd.read_csv("../data/employee_salaries/target.csv")["current_annual_salary"]
X.head(5)
```

## Explore the target

Let's take a look at the target:

```{python}
y.head(5)
```

This is a **regression** task: we want to predict the value of `current_annual_salary`.

## Strategizing

We can begin by exploring the dataframe with `.describe`, and then think of a
plan for pre-processing our data.

```{python}
X.describe(include="all")
```

## Our plan

We need to:

- Impute some missing values in the `gender` column.
- Encode categorical features as numerical features.
- Convert the column `date_first_hired` into numerical features.
- Scale numerical features.
- Evaluate the performance of the model.

## Step 1: Convert date features to numerical {.smaller}

We extract numerical features from the `date_first_hired` column.

```{python}
# Create a copy to work with
X_processed = X.copy()

# Parse the date column
X_processed['date_first_hired'] = pd.to_datetime(X_processed['date_first_hired'])

# Extract numerical features from the date
X_processed['hired_month'] = X_processed['date_first_hired'].dt.month
X_processed['hired_year'] = X_processed['date_first_hired'].dt.year

# Drop the original date column
X_processed = X_processed.drop('date_first_hired', axis=1)

print("Features after date transformation:")
print("\nShape:", X_processed.shape)
```
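
Beyond month and year, the `.dt` accessor exposes other components (day, day of
week, and so on) that can also make useful features; a toy sketch, independent
of the salary data:

```python
import pandas as pd

# Toy dates, not the real hiring dates
dates = pd.to_datetime(pd.Series(["1986-09-22", "2001-03-04"]))
print(dates.dt.year.tolist())       # [1986, 2001]
print(dates.dt.month.tolist())      # [9, 3]
print(dates.dt.dayofweek.tolist())  # Monday is 0
```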

## Step 2: Encode categorical features {.smaller}

We encode the categorical features using one-hot encoding.

```{python}
# Identify only the non-numerical (truly categorical) columns
categorical_cols = X_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Apply one-hot encoding only to categorical columns
X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
print("\nShape after encoding:", X_encoded.shape)
```
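
On a toy frame (not the salary data), `pd.get_dummies` replaces each
categorical column with one 0/1 column per category, which is why the column
count jumps:

```python
import pandas as pd

df_toy = pd.DataFrame({"dept": ["HR", "IT", "HR"], "grade": [1, 2, 3]})
encoded = pd.get_dummies(df_toy, columns=["dept"])
print(encoded.columns.tolist())  # ['grade', 'dept_HR', 'dept_IT']
```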

## Step 3: Impute missing values {.smaller}

We impute the missing values in the `gender` column.

```{python}
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_encoded_imputed = pd.DataFrame(
    imputer.fit_transform(X_encoded),
    columns=X_encoded.columns
)
```
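
The `most_frequent` strategy replaces each missing entry with the column's
mode, and it also works directly on string columns; a toy sketch, unrelated to
the salary data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A string column with one missing value; the mode is "F"
df_toy = pd.DataFrame({"gender": ["F", "F", np.nan, "M"]})
imputed = SimpleImputer(strategy="most_frequent").fit_transform(df_toy)
print(imputed.ravel().tolist())  # ['F', 'F', 'F', 'M']
```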

## Step 4: Scale numerical features {.smaller}

Scale numerical features for the Ridge regression model.

```{python}
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_encoded_imputed)
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
```
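
After standardization each column has (approximately) zero mean and unit
variance, which keeps the Ridge penalty from favouring columns simply because
of their scale; a quick toy check:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # [1. 1.]
```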

## Step 5: Train Ridge model with cross-validation {.smaller}

Train a `Ridge` regression model and evaluate it with cross-validation.

```{python}
#| warning: false
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Initialize the Ridge model
ridge = Ridge(alpha=1.0)

# Perform cross-validation (5-fold)
cv_results = cross_validate(ridge, X_scaled, y, cv=5, scoring=["r2", "neg_mean_squared_error"])

# Convert MSE to RMSE
test_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])

# Display results
print("Cross-Validation Results:")
print(
    f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})"
)
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
```
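
The sign flip when converting MSE to RMSE is there because scikit-learn reports
errors as *negative* scores, so that higher is always better; a self-contained
check on synthetic data (not the salary dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

cv = cross_validate(Ridge(), X_toy, y_toy, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-cv["test_score"])       # negate before taking the root
print((cv["test_score"] <= 0).all())    # True: neg MSE is never positive
```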

## "Just ask an agent to write the code" {.smaller}

- Operations in the wrong order.
- Trying to impute categorical features without converting them to numeric values.
- The datetime feature was treated like a categorical feature.
- Cells could not be executed in order without proper debugging and re-prompting.
- `pd.get_dummies` was executed on the full dataframe, rather than only on the
  training split, leading to data leakage.
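
For contrast, the leakage in the last bullet goes away when the preprocessing
is wrapped in a scikit-learn `Pipeline`, so each cross-validation split fits
the encoder on its training fold only. A minimal sketch on toy data (the column
names are made up, not the salary schema):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the salary dataframe
rng = np.random.default_rng(0)
gender = rng.choice(["F", "M"], size=100).astype(object)
gender[::10] = np.nan  # inject some missing values
X_toy = pd.DataFrame({
    "gender": gender,
    "department": rng.choice(["HR", "IT", "POL"], size=100),
    "hired_year": rng.integers(1990, 2020, size=100),
})
y_toy = rng.normal(60000, 10000, size=100)

# Impute + encode categoricals, scale numerics -- all fit inside each CV fold
preprocess = make_column_transformer(
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore"),
        ),
        ["gender", "department"],
    ),
    (StandardScaler(), ["hired_year"]),
)
model = make_pipeline(preprocess, Ridge(alpha=1.0))
cv_results = cross_validate(model, X_toy, y_toy, cv=5, scoring="r2")
print(cv_results["test_score"].shape)  # (5,)
```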

## Waking up from a nightmare {.smaller}

Thankfully, we can `import skrub`:

```{python}
#| warning: false
from skrub import tabular_pipeline

# Perform cross-validation (5-fold)
cv_results = cross_validate(tabular_pipeline("regression"), X, y, cv=5,
                            scoring=['r2', 'neg_mean_squared_error'],
                            return_train_score=True)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
```

## Roadmap for the course {.smaller}

1. Data exploration with skrub's `TableReport`
2. Data cleaning and sanitization with the `Cleaner`
3. Intermission: simplifying column operations with skrub
4. Feature engineering with the skrub encoders
5. Putting everything together: `TableVectorizer` and `tabular_pipeline`

## What we saw in this chapter

- We built a predictive pipeline using traditional tools
- We saw some possible shortcomings
- We tested skrub's `tabular_pipeline`

slides/chapters/01_exploring_data.qmd

Lines changed: 0 additions & 1 deletion
This file was deleted.
