
Commit 731d1ac

formatting
1 parent 2a9f14d commit 731d1ac

3 files changed: 96 additions & 91 deletions


content/exercises/01_ex_explore_clean.py

Lines changed: 29 additions & 25 deletions
@@ -1,19 +1,21 @@
 # %% [markdown]
 # # Exercise: exploring a new table
-# For this exercise, we will use the `employee_salaries` dataframe to answer some
-# questions.
+# For this exercise, we will use the `employee_salaries` dataframe to answer some
+# questions.
 #
 # Run the following code to import the dataframe:

 # %%
 import pandas as pd
+
 data = pd.read_csv("../data/employee_salaries/data.csv")

 # %% [markdown]
-# Now use the skrub `TableReport` and answer the following questions:
+# Now use the skrub `TableReport` and answer the following questions:

 # %%
 from skrub import TableReport
+
 TableReport(data)

 # %% [markdown]
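Note: the answers listed in the next hunk come from the skrub `TableReport`; most of them can be cross-checked with plain pandas. A minimal, illustrative sketch (not part of the commit), reusing the same CSV path as the exercise:

import pandas as pd

data = pd.read_csv("../data/employee_salaries/data.csv")

print(data.shape)                    # 9228 rows x 8 columns
print(data.dtypes)                   # one integer column, the rest object
print(data.isna().mean())            # fraction of missing values per column
print(data.nunique().sort_values())  # cardinality of each column
# Each department code maps to exactly one department_name, which is why the
# report gives a Cramer's V of 1 for this pair of columns.
print(data.groupby("department")["department_name"].nunique())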
@@ -45,40 +47,42 @@
 # - 9228 rows × 8 columns
 # - How many columns have object/numerical/datetime
 # - No datetime columns, one integer column (`year_first_hired`), all other columns
-# are objects.
+# are objects.
 # - Are there columns with a large number of missing values?
 # - No, only the `gender` column contains a small fraction (0.2%) of missing
 # values.
 # - Are there columns that have a high cardinality?
-# - Yes, `division`, `employee_position_title`, `date_first_hired` have a
-# cardinality larger than 40.
+# - Yes, `division`, `employee_position_title`, `date_first_hired` have a
+# cardinality larger than 40.
 # - Were datetime columns parsed correctly?
-# - No, the `date_first_hired` column has dtype Object.
+# - No, the `date_first_hired` column has dtype Object.
 # - Which columns have outliers?
-# - No columns seem to include outliers.
+# - No columns seem to include outliers.
 # - Which columns have an imbalanced distribution?
-# - `assignment_category` has an unbalanced distribution.
+# - `assignment_category` has an unbalanced distribution.
 # - Which columns are strongly correlated with each other?
-# - `department` and `department_name` have a Cramer's V of 1, so they are
-# very strongly correlated.
+# - `department` and `department_name` have a Cramer's V of 1, so they are
+# very strongly correlated.

 # %% [markdown]
-# # Exercise: clean a dataframe using the `Cleaner`
-# Load the given dataframe.
+# # Exercise: clean a dataframe using the `Cleaner`
+# Load the given dataframe.

 # %%
 import pandas as pd
+
 df = pd.read_csv("../data/cleaner_data.csv")

 # %% [markdown]
-# Use the `TableReport` to answer the following questions:
+# Use the `TableReport` to answer the following questions:
 #
-# - Are there constant columns?
-# - Are there datetime columns? If so, were they parsed correctly?
-# - What is the dtype of the numerical features?
+# - Are there constant columns?
+# - Are there datetime columns? If so, were they parsed correctly?
+# - What is the dtype of the numerical features?

 # %%
 from skrub import TableReport
+
 TableReport(df)

 # %% [markdown]
@@ -92,14 +96,14 @@
 from skrub import Cleaner

 # Write your answer here
-#
-#
-#
-#
-#
-#
-#
-#
+#
+#
+#
+#
+#
+#
+#
+#

 # %%
 # solution
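Note: the solution cell is not part of this diff. For readers following along, the `Cleaner` exercise above can be approached roughly as sketched below; this is an illustrative guess, not the course solution, and only runs `Cleaner` with default settings to inspect what it changes:

import pandas as pd
from skrub import Cleaner

df = pd.read_csv("../data/cleaner_data.csv")

cleaned = Cleaner().fit_transform(df)

print(df.dtypes)       # before: datetimes and numbers may be stored as object
print(cleaned.dtypes)  # after: parsed datetime columns, proper numeric dtypes
# Columns the Cleaner dropped, if any (dropping constant columns may require
# a non-default option; check the skrub docs):
print(set(df.columns) - set(cleaned.columns))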

content/exercises/03_ex_feat_eng.py

Lines changed: 32 additions & 32 deletions
@@ -1,13 +1,13 @@
 # %% [markdown]
 # # Exercise
 # Use one of the methods explained so far (Cleaner/ApplyToCols) to convert the provided
-# dataframe to datetime dtype, then extract the following features:
-# - All parts of the datetime
+# dataframe to datetime dtype, then extract the following features:
+# - All parts of the datetime
 # - The number of seconds from epoch
 # - The day in the week
 # - The day of the year
 #
-# **Hint**: use the format `"%d %B %Y"` for the datetime.
+# **Hint**: use the format `"%d %B %Y"` for the datetime.
 #

 # %%
@@ -29,20 +29,20 @@

 # %%
 # Write your solution here
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#

 # %%
 # Solution with ApplyToCols and ToDatetime
@@ -80,23 +80,23 @@

 # %%
 # Write your solution here
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#

 # %% [markdown]
-# Now modify the script above to add spline features (`periodic_encoding="spline"`).
+# Now modify the script above to add spline features (`periodic_encoding="spline"`).
 #

 # %%
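Note: the dataframe provided by this exercise is not visible in the diff, so the sketch below invents a tiny one (`date_col` is a placeholder name). It only illustrates the kind of solution the exercise asks for, combining `ApplyToCols` and `ToDatetime` (both named in the diff) with plain pandas accessors; the exact `ToDatetime`/`DatetimeEncoder` options, and whether `cols` accepts a plain list of names, may differ between skrub versions:

import pandas as pd
from skrub import ApplyToCols, ToDatetime, DatetimeEncoder

# Placeholder data; the real exercise ships its own dataframe.
df = pd.DataFrame({"date_col": ["03 February 2020", "15 March 2021", "13 February 2022"]})

# Parse the strings using the format given in the hint.
parsed = ApplyToCols(ToDatetime(format="%d %B %Y"), cols=["date_col"]).fit_transform(df)

# Datetime parts plus the number of seconds from epoch via the DatetimeEncoder.
features = ApplyToCols(DatetimeEncoder(add_total_seconds=True), cols=["date_col"]).fit_transform(parsed)

# Day of the week and day of the year read off directly with pandas.
features["weekday"] = parsed["date_col"].dt.dayofweek
features["day_of_year"] = parsed["date_col"].dt.dayofyear
print(features)

The follow-up cell in the diff then switches to periodic spline features via `periodic_encoding="spline"`.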

content/exercises/04_ex_table_vec.py

Lines changed: 35 additions & 34 deletions
@@ -1,7 +1,7 @@
 # %% [markdown]
 # # Exercise: implementing a `TableVectorizer` from its components
-# Replicate the behavior of a `TableVectorizer` using `ApplyToCols`, the skrub
-# selectors, and the given transformers.
+# Replicate the behavior of a `TableVectorizer` using `ApplyToCols`, the skrub
+# selectors, and the given transformers.

 # %%
 from skrub import Cleaner, ApplyToCols, StringEncoder, DatetimeEncoder
@@ -10,24 +10,23 @@
 import skrub.selectors as s

 # %% [markdown]
-# Notes on the implementation:
+# Notes on the implementation:
 #
 # - In the first step, the TableVectorizer cleans the data to parse datetimes and other
 # dtypes.
-# - Numeric features are left untouched, i.e., they use a Passthrough transformer.
-# - String and categorical feature are split into high and low cardinality features.
-# - For this exercise, set the the cardinality `threshold` to 4.
+# - Numeric features are left untouched, i.e., they use a Passthrough transformer.
+# - String and categorical feature are split into high and low cardinality features.
+# - For this exercise, set the the cardinality `threshold` to 4.
 # - High cardinality features are transformed with a `StringEncoder`. In this exercise,
-# set `n_components` to 2.
-# - Low cardinality features are transformed with a `OneHotEncoder`, and the first
-# category in binary features is dropped (hint: check the docs of the `OneHotEncoder`
-# for the `drop` parameter). Set `sparse_output=True`.
-# - Remember `cardinality_below` is one of the skrub selectors.
-# - Datetimes are transformed by a default `DatetimeEncoder`.
-# - Everything should be wrapped in a scikit-learn `Pipeline`.
+# set `n_components` to 2.
+# - Low cardinality features are transformed with a `OneHotEncoder` with
+# `sparse_output=False` and `drop="if_binary"`.
+# - Remember `cardinality_below` is one of the skrub selectors.
+# - Datetimes are transformed by a default `DatetimeEncoder`.
+# - Everything should be wrapped in a scikit-learn `Pipeline`.
+# - Remember that the order of the operations matters!
 #
-#
-# Use the following dataframe to test the result.
+# Use the following dataframe to test the result.

 # %%
 import pandas as pd
@@ -40,21 +39,23 @@
     "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
     "bool": [True, False, True, False, True],
     "datetime-col": [
-        "2020-02-03T12:30:05",
-        "2021-03-15T00:37:15",
-        "2022-02-13T17:03:25",
-        "2023-05-22T08:45:55",
+        "2020-02-03T12:30:05",
+        "2021-03-15T00:37:15",
+        "2022-02-13T17:03:25",
+        "2023-05-22T08:45:55",
     ]
     + [None],
 }
 df = pd.DataFrame(data)
 df

 # %% [markdown]
-# Use the following `PassThrough` transformer where needed.
+# Use the following `PassThrough` transformer where needed.

 # %%
-from skrub._apply_to_cols import SingleColumnTransformer
+from skrub._single_column_transformer import SingleColumnTransformer
+
+
 class PassThrough(SingleColumnTransformer):
     def fit_transform(self, column, y=None):
         return column
@@ -78,17 +79,17 @@ def transform(self, column):
 # %%
 # Write your code here
 #
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#

 # %%
 # Solution
@@ -101,10 +102,10 @@ def transform(self, column):
     cols=s.cardinality_below(4) & s.string(),
 )
 numeric = ApplyToCols(PassThrough(), cols=s.numeric())
-datetime = ApplyToCols(DatetimeEncoder(), cols=s.any_date())
+dt = ApplyToCols(DatetimeEncoder(), cols=s.any_date())

 my_table_vectorizer = make_pipeline(
-    cleaner, numeric, high_cardinality, low_cardinality, datetime
+    cleaner, numeric, high_cardinality, low_cardinality, dt
 )

 my_table_vectorizer.fit_transform(df)
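Note: the commit only shows the tail of the solution cell. Putting the pieces named in the notes together, the full pipeline might look roughly like the sketch below. It is an illustrative reconstruction, not the course solution: `PassThrough` is re-declared as in the hunks above, the small dataframe is a stand-in for the one in the exercise, and selector inversion with `~` is an assumption about the skrub selectors API:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from skrub import Cleaner, ApplyToCols, StringEncoder, DatetimeEncoder
from skrub._single_column_transformer import SingleColumnTransformer
import skrub.selectors as s


class PassThrough(SingleColumnTransformer):
    def fit_transform(self, column, y=None):
        return column

    def transform(self, column):
        return column


# Stand-in dataframe: one numeric, one low-cardinality string, one
# high-cardinality string, and one datetime column.
df = pd.DataFrame(
    {
        "num": [1.0, 2.0, 3.0, 4.0, 5.0],
        "low_card": ["a", "b", "a", "b", "a"],
        "high_card": ["officer", "manager", "lawyer", "chef", "teacher"],
        "date": ["2020-02-03", "2021-03-15", "2022-02-13", "2023-05-22", None],
    }
)

cleaner = Cleaner()  # parse datetimes and fix dtypes first: order matters
numeric = ApplyToCols(PassThrough(), cols=s.numeric())
high_cardinality = ApplyToCols(
    StringEncoder(n_components=2),
    cols=~s.cardinality_below(4) & s.string(),  # assumes ~ inverts a selector
)
low_cardinality = ApplyToCols(
    OneHotEncoder(sparse_output=False, drop="if_binary"),
    cols=s.cardinality_below(4) & s.string(),
)
dt = ApplyToCols(DatetimeEncoder(), cols=s.any_date())

my_table_vectorizer = make_pipeline(
    cleaner, numeric, high_cardinality, low_cardinality, dt
)
print(my_table_vectorizer.fit_transform(df))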

0 commit comments
