Commit 3b45dce

committed
update
1 parent a05a0d6 commit 3b45dce

26 files changed

Lines changed: 622 additions & 15474 deletions

book/_freeze/chapters/01_exploring_data/execute-results/html.json

Lines changed: 3 additions & 3 deletions
Large diffs are not rendered by default.

content/notebooks/01_ex_explore_clean.ipynb

Lines changed: 28 additions & 15364 deletions
Large diffs are not rendered by default.

content/notebooks/02_ex_selectors.ipynb

Lines changed: 25 additions & 12 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"id": "880408fe",
+"id": "e63ac5a3",
 "metadata": {},
 "source": [
 "# Exercise: using selectors together with `ApplyToCols`\n",
@@ -12,7 +12,19 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "a9a828c7",
+"id": "4d99c628",
+"metadata": {
+"lines_to_next_cell": 0
+},
+"outputs": [],
+"source": [
+"%pip install skrub"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "bf446998",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -34,7 +46,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "99d407aa",
+"id": "71bf9f7e",
 "metadata": {},
 "source": [
 "Using the skrub selectors and `ApplyToCols`:\n",
@@ -47,10 +59,11 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "0e8a98aa",
+"id": "4c82837b",
 "metadata": {},
 "outputs": [],
 "source": [
+"%pip install skrub\n",
 "import skrub.selectors as s\n",
 "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
 "from skrub import ApplyToCols\n",
@@ -71,7 +84,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "4ff2162f",
+"id": "f39b997c",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -90,7 +103,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "c2630fde",
+"id": "766861eb",
 "metadata": {},
 "source": [
 "Given the same dataframe and using selectors, drop only string columns that contain\n",
@@ -100,7 +113,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "1284c5d8",
+"id": "b28f8b74",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -119,7 +132,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "6af8da58",
+"id": "51b5680f",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -130,7 +143,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "f4e2832b",
+"id": "b4105df4",
 "metadata": {},
 "source": [
 "Now write a custom function that selects columns where all values are lower than\n",
@@ -140,7 +153,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "1a133192",
+"id": "419f5e31",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -159,7 +172,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "9640e045",
+"id": "0266efbd",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -176,7 +189,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "1c6ada46",
+"id": "9bd6a9e7",
 "metadata": {},
 "outputs": [],
 "source": []

content/notebooks/03_ex_feat_eng.ipynb

Lines changed: 168 additions & 18 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"id": "55868e5d",
+"id": "07ca67bb",
 "metadata": {},
 "source": [
 "# Exercise\n",
@@ -19,7 +19,17 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "f7c73e92",
+"id": "3fbaa238",
+"metadata": {},
+"outputs": [],
+"source": [
+"%pip install skrub"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "065ddca9",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -43,7 +53,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "4e5634a2",
+"id": "6cde54f4",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -67,7 +77,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "a1116746",
+"id": "99270841",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -90,7 +100,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "fb8a8cc4",
+"id": "685ccab3",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -110,7 +120,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "837491a4",
+"id": "f3e17f09",
 "metadata": {},
 "source": [
 "Modify the script so that the `DatetimeEncoder` adds periodic encoding with sine\n",
@@ -120,7 +130,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "2866891b",
+"id": "acacaaac",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -141,18 +151,10 @@
 "#"
 ]
 },
-{
-"cell_type": "markdown",
-"id": "2816f4b9",
-"metadata": {},
-"source": [
-"Now modify the script above to add spline features (`periodic_encoding=\"spline\"`).\n"
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "0b163b46",
+"id": "85f036ad",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -163,7 +165,7 @@
 "\n",
 "datetime_encoder = ApplyToCols(\n",
 "    DatetimeEncoder(\n",
-"        periodic_encoding=\"spline\",\n",
+"        periodic_encoding=\"circular\",\n",
 "        add_total_seconds=True,\n",
 "        add_weekday=True,\n",
 "        add_day_of_year=True,\n",
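
Note on the hunk above: it switches `periodic_encoding` from `"spline"` to `"circular"`, which corresponds to the sine-based encoding the exercise text asks for. As a rough, library-independent sketch of the idea (this is not skrub's implementation; the helper names `circular_encode` and `distance`, and the hour-of-day period of 24, are assumptions made for illustration), a periodic value can be mapped onto the unit circle so that the end of a cycle lands next to its start:

```python
import math


def circular_encode(value, period):
    """Map a periodic value to (sin, cos) coordinates on the unit circle.

    Hypothetical helper for illustration only; not part of skrub.
    """
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hour of day, assuming a period of 24 (an illustrative choice).
h0 = circular_encode(0, 24)
h12 = circular_encode(12, 24)
h23 = circular_encode(23, 24)

# As raw integers, 23 and 0 are far apart; once encoded, 23:00 is closer
# to midnight than noon is.
print(distance(h23, h0) < distance(h12, h0))
```

Two columns per feature are needed because a sine value alone is ambiguous: each sine value is reached twice per cycle, and the cosine tells the two apart.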
@@ -175,10 +177,158 @@
 "encoder.fit_transform(df)"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "4bc63de9",
+"metadata": {},
+"source": [
+"# Exercise\n",
+"Build a custom `SingleColumnTransformer` that unpacks the combined string column\n",
+"in the provided dataframe into separate columns for `str_id`, `num_id`, and\n",
+"`datetime`. The `datetime` column should be converted to datetime dtype. Then,\n",
+"use this transformer in a pipeline to extract datetime features as shown in\n",
+"the previous exercises.\n",
+"\n",
+"The transformer should reject columns that are not of string type or that cannot \n",
+"be unpacked properly.\n",
+"IDs are in the format `STR-NUM-DATETIME`, where `STR` is a string identifier, \n",
+"`NUM` is a numeric identifier, and `DATETIME` is a Unix timestamp.\n",
+"\n",
+"Hint: you can use the following snippet to extract the components from the string column:\n",
+"```python\n",
+"split_data = X.str.split(\"-\", expand=True)\n",
+"res = pd.DataFrame(\n",
+"    {\n",
+"        \"str_id\": split_data[0],\n",
+"        \"num_id\": split_data[1].astype(\"int64\"),\n",
+"        \"datetime\": pd.to_datetime(split_data[2].astype(\"int64\"), unit=\"s\"),\n",
+"    }\n",
+")\n",
+"```"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "090e3f3c",
+"metadata": {
+"lines_to_next_cell": 0
+},
+"outputs": [],
+"source": [
+"from skrub.core import SingleColumnTransformer, RejectColumn\n",
+"import pandas as pd\n",
+"from skrub import ApplyToCols\n",
+"df_id = pd.DataFrame(\n",
+"    {\n",
+"        \"id\": [\n",
+"            \"BQG-1001-1577836800\",\n",
+"            \"TYW-1002-1577923200\",\n",
+"            \"JAY-1003-1578009600\",\n",
+"        ]\n",
+"    }\n",
+")"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "251f93c6",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Write your solution here\n",
+"#\n",
+"#\n",
+"#\n",
+"#\n",
+"#\n",
+"#\n",
+"#"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "0595b38e",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Solution\n",
+"class Unpacker(SingleColumnTransformer):\n",
+"    \"\"\"Unpacker for pandas DataFrames.\"\"\"\n",
+"\n",
+"    def fit_transform(self, X, y=None):\n",
+"        \"\"\"Unpack combined string column into separate columns.\"\"\"\n",
+"        if X.dtype != object:\n",
+"            raise RejectColumn(\"UnpackerPandas only works on string columns.\")\n",
+"        try:\n",
+"            split_data = X.str.split(\"-\", expand=True)\n",
+"            res = pd.DataFrame(\n",
+"                {\n",
+"                    \"str_id\": split_data[0],\n",
+"                    \"num_id\": split_data[1].astype(\"int64\"),\n",
+"                    \"datetime\": pd.to_datetime(split_data[2].astype(\"int64\"), unit=\"s\"),\n",
+"                }\n",
+"            )\n",
+"            return res\n",
+"        except Exception as exc:\n",
+"            raise RejectColumn(\"UnpackerPandas failed to unpack the column.\") from exc\n",
+"\n",
+"\n",
+"ApplyToCols(Unpacker(), allow_reject=True).fit_transform(df_id)"
+]
+},
+{
+"cell_type": "markdown",
+"id": "144f8f59",
+"metadata": {},
+"source": [
+"Now use this `Unpacker` in a pipeline to extract datetime features as shown in\n",
+"the previous exercises. You can use the default `DatetimeEncoder` settings for\n",
+"this part."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "42d9de34",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Write your solution here\n",
+"#\n",
+"#\n",
+"#\n",
+"#\n",
+"#\n",
+"#\n",
+"#"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "6b727f5f",
+"metadata": {
+"lines_to_next_cell": 0
+},
+"outputs": [],
+"source": [
+"from sklearn.pipeline import make_pipeline\n",
+"from skrub import DatetimeEncoder\n",
+"\n",
+"pipeline = make_pipeline(\n",
+"    ApplyToCols(Unpacker(), allow_reject=True),\n",
+"    ApplyToCols(DatetimeEncoder(), allow_reject=True),\n",
+")\n",
+"pipeline.fit_transform(df_id)"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "d37c2d83",
+"id": "cdada497",
 "metadata": {},
 "outputs": [],
 "source": []
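
Note on the new exercise in this file: the `STR-NUM-DATETIME` format it describes can be checked by hand on a single ID before the logic is wrapped in a transformer. The sketch below uses only the Python standard library rather than pandas or skrub. `parse_id` is a hypothetical helper, and attaching an explicit UTC timezone to the result is an assumption made for illustration (Unix timestamps count seconds from 1970-01-01 UTC).

```python
from datetime import datetime, timezone


def parse_id(raw):
    """Split a 'STR-NUM-DATETIME' ID into its three components.

    Hypothetical helper for illustration. It raises ValueError on malformed
    input, which is the point where the notebook's transformer raises
    RejectColumn instead.
    """
    parts = raw.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '-'-separated fields, got {len(parts)}")
    str_id, num_id, timestamp = parts
    return {
        "str_id": str_id,
        "num_id": int(num_id),  # int() raises ValueError on non-numeric text
        # Unix timestamp in seconds, made timezone-aware as UTC (an assumption).
        "datetime": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }


# First ID from the notebook's sample dataframe.
record = parse_id("BQG-1001-1577836800")
print(record["str_id"], record["num_id"], record["datetime"].isoformat())
```

Parsing one value this way makes it easy to confirm that the sample timestamps fall on consecutive days before relying on the vectorised `X.str.split` snippet given in the hint.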
