 # %% [markdown]
 # # Exercise: implementing a `TableVectorizer` from its components
-# Replicate the behavior of a `TableVectorizer` using `ApplyToCols`, the skrub
-# selectors, and the given transformers.
+# Replicate the behavior of a `TableVectorizer` using `ApplyToCols`, the skrub
+# selectors, and the given transformers.
 
 # %%
 from skrub import Cleaner, ApplyToCols, StringEncoder, DatetimeEncoder
 import skrub.selectors as s
 
 # %% [markdown]
-# Notes on the implementation:
+# Notes on the implementation:
 #
 # - In the first step, the TableVectorizer cleans the data to parse datetimes and other
 # dtypes.
-# - Numeric features are left untouched, i.e., they use a Passthrough transformer.
-# - String and categorical feature are split into high and low cardinality features.
-# - For this exercise, set the the cardinality `threshold` to 4.
+# - Numeric features are left untouched, i.e., they use a `PassThrough` transformer.
+# - String and categorical features are split into high and low cardinality features.
+# - For this exercise, set the cardinality `threshold` to 4.
 # - High cardinality features are transformed with a `StringEncoder`. In this exercise,
-# set `n_components` to 2.
-# - Low cardinality features are transformed with a `OneHotEncoder`, and the first
-# category in binary features is dropped (hint: check the docs of the `OneHotEncoder`
-# for the `drop` parameter). Set `sparse_output=True`.
-# - Remember `cardinality_below` is one of the skrub selectors.
-# - Datetimes are transformed by a default `DatetimeEncoder`.
-# - Everything should be wrapped in a scikit-learn `Pipeline`.
+# set `n_components` to 2.
+# - Low cardinality features are transformed with a `OneHotEncoder` with
+# `sparse_output=False` and `drop="if_binary"`.
+# - Remember `cardinality_below` is one of the skrub selectors.
+# - Datetimes are transformed by a default `DatetimeEncoder`.
+# - Everything should be wrapped in a scikit-learn `Pipeline`.
+# - Remember that the order of the operations matters!
 #
-#
-# Use the following dataframe to test the result.
+# Use the following dataframe to test the result.
 
 # %%
 import pandas as pd
     "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
     "bool": [True, False, True, False, True],
     "datetime-col": [
-        "2020-02-03T12:30:05",
-        "2021-03-15T00:37:15",
-        "2022-02-13T17:03:25",
-        "2023-05-22T08:45:55",
+        "2020-02-03T12:30:05",
+        "2021-03-15T00:37:15",
+        "2022-02-13T17:03:25",
+        "2023-05-22T08:45:55",
     ]
     + [None],
 }
 df = pd.DataFrame(data)
 df
 
 # %% [markdown]
-# Use the following `PassThrough` transformer where needed.
+# Use the following `PassThrough` transformer where needed.
 
 # %%
-from skrub._apply_to_cols import SingleColumnTransformer
+from skrub._single_column_transformer import SingleColumnTransformer
+
+
 class PassThrough(SingleColumnTransformer):
     def fit_transform(self, column, y=None):
         return column
@@ -78,17 +79,17 @@ def transform(self, column):
 # %%
 # Write your code here
 #
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
+#
 
 # %%
 # Solution
@@ -101,10 +102,10 @@ def transform(self, column):
     cols=s.cardinality_below(4) & s.string(),
 )
 numeric = ApplyToCols(PassThrough(), cols=s.numeric())
-datetime = ApplyToCols(DatetimeEncoder(), cols=s.any_date())
+dt = ApplyToCols(DatetimeEncoder(), cols=s.any_date())
 
 my_table_vectorizer = make_pipeline(
-    cleaner, numeric, high_cardinality, low_cardinality, datetime
+    cleaner, numeric, high_cardinality, low_cardinality, dt
 )
 
 my_table_vectorizer.fit_transform(df)