Commit c944734

various updates and fixes for the slides
1 parent a3ea532 commit c944734

36 files changed, +334129 -2983 lines changed

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "c8b7476a",
6+
"metadata": {},
7+
"source": [
8+
"# Exercise: exploring a new table\n",
9+
"For this exercise, we will use the `employee_salaries` dataframe to answer some\n",
10+
"questions.\n",
11+
"\n",
12+
"Run the following code to import the dataframe:"
13+
]
14+
},
15+
{
16+
"cell_type": "code",
17+
"execution_count": null,
18+
"id": "10dc3dd9",
19+
"metadata": {},
20+
"outputs": [],
21+
"source": [
22+
"import pandas as pd\n",
23+
"\n",
24+
"data = pd.read_csv(\"../data/employee_salaries/data.csv\")"
25+
]
26+
},
27+
{
28+
"cell_type": "markdown",
29+
"id": "b95a5029",
30+
"metadata": {},
31+
"source": [
32+
"Now use the skrub `TableReport` to answer the questions listed below:"
33+
]
34+
},
35+
{
36+
"cell_type": "code",
37+
"execution_count": null,
38+
"id": "a205456e",
39+
"metadata": {},
40+
"outputs": [],
41+
"source": [
42+
"%pip install skrub\n",
43+
"from skrub import TableReport\n",
44+
"\n",
45+
"TableReport(data)"
46+
]
47+
},
48+
{
49+
"cell_type": "markdown",
50+
"id": "23c471c3",
51+
"metadata": {},
52+
"source": [
53+
"## Questions\n",
54+
"- What is the size of the dataframe (rows and columns)?\n",
55+
"- How many columns have an object, numerical, or datetime dtype?\n",
56+
"- Are there columns with a large number of missing values?\n",
57+
"- Are there columns that have a high cardinality (>40 unique values)?\n",
58+
"- Were datetime columns parsed correctly?\n",
59+
"- Which columns have outliers?\n",
60+
"- Which columns have an imbalanced distribution?\n",
61+
"- Which columns are strongly correlated with each other?\n",
62+
"\n",
63+
"```{.python}\n",
64+
"# PLACEHOLDER\n",
65+
"#\n",
66+
"#\n",
67+
"#\n",
68+
"#\n",
69+
"#\n",
70+
"#\n",
71+
"#\n",
72+
"#\n",
73+
"#\n",
74+
"```\n",
75+
"\n",
76+
"## Answers\n",
77+
"- What is the size of the dataframe (rows and columns)?\n",
78+
"  - 9228 rows × 8 columns\n",
79+
"- How many columns have an object, numerical, or datetime dtype?\n",
80+
"  - No datetime columns, one integer column (`year_first_hired`); all other columns\n",
81+
"    have the `object` dtype.\n",
82+
"- Are there columns with a large number of missing values?\n",
83+
"  - No, only the `gender` column contains a small fraction (0.2%) of missing\n",
84+
"    values.\n",
85+
"- Are there columns that have a high cardinality (>40 unique values)?\n",
86+
"  - Yes, `division`, `employee_position_title`, and `date_first_hired` have a\n",
87+
"    cardinality larger than 40.\n",
88+
"- Were datetime columns parsed correctly?\n",
89+
"  - No, the `date_first_hired` column has the `object` dtype.\n",
90+
"- Which columns have outliers?\n",
91+
"  - No columns seem to include outliers.\n",
92+
"- Which columns have an imbalanced distribution?\n",
93+
"  - `assignment_category` has an imbalanced distribution.\n",
94+
"- Which columns are strongly correlated with each other?\n",
95+
"  - `department` and `department_name` have a Cramér's V of 1, so they are\n",
96+
"    very strongly correlated."
97+
]
98+
},
99+
{
100+
"cell_type": "markdown",
101+
"id": "f20bde70",
102+
"metadata": {},
103+
"source": [
104+
"# Exercise: clean a dataframe using the `Cleaner`\n",
105+
"Load the given dataframe."
106+
]
107+
},
108+
{
109+
"cell_type": "code",
110+
"execution_count": null,
111+
"id": "1a512d31",
112+
"metadata": {},
113+
"outputs": [],
114+
"source": [
115+
"import pandas as pd\n",
116+
"\n",
117+
"df = pd.read_csv(\"../data/cleaner_data.csv\")"
118+
]
119+
},
120+
{
121+
"cell_type": "markdown",
122+
"id": "2d8454f4",
123+
"metadata": {},
124+
"source": [
125+
"Use the `TableReport` to answer the following questions:\n",
126+
"\n",
127+
"- Are there constant columns?\n",
128+
"- Are there datetime columns? If so, were they parsed correctly?\n",
129+
"- What is the dtype of the numerical features?"
130+
]
131+
},
132+
{
133+
"cell_type": "code",
134+
"execution_count": null,
135+
"id": "50244f15",
136+
"metadata": {},
137+
"outputs": [],
138+
"source": [
139+
"from skrub import TableReport\n",
140+
"\n",
141+
"TableReport(df)"
142+
]
143+
},
144+
{
145+
"cell_type": "markdown",
146+
"id": "03dcbdcb",
147+
"metadata": {},
148+
"source": [
149+
"Then, use the `Cleaner` to sanitize the data so that:\n",
150+
"- Constant columns are removed\n",
151+
"- Datetimes are parsed properly (hint: use `\"%d-%b-%Y\"` as the datetime format)\n",
152+
"- All columns with more than 50% missing values are removed\n",
153+
"- Numerical features are converted to `float32`"
154+
]
155+
},
156+
{
157+
"cell_type": "code",
158+
"execution_count": null,
159+
"id": "e78ad1a3",
160+
"metadata": {},
161+
"outputs": [],
162+
"source": [
163+
"from skrub import Cleaner\n",
164+
"\n",
165+
"# Write your answer here\n",
166+
"#\n",
167+
"#\n",
168+
"#\n",
169+
"#\n",
170+
"#\n",
171+
"#\n",
172+
"#\n",
173+
"#"
174+
]
175+
},
176+
{
177+
"cell_type": "code",
178+
"execution_count": null,
179+
"id": "f7370994",
180+
"metadata": {},
181+
"outputs": [],
182+
"source": [
183+
"# solution\n",
184+
"from skrub import Cleaner\n",
185+
"\n",
186+
"cleaner = Cleaner(\n",
187+
"    drop_if_constant=True,\n",
188+
"    drop_null_fraction=0.5,\n",
189+
"    numeric_dtype=\"float32\",\n",
190+
"    datetime_format=\"%d-%b-%Y\",\n",
191+
")\n",
192+
"\n",
193+
"# Apply the cleaner\n",
194+
"df_cleaned = cleaner.fit_transform(df)\n",
195+
"\n",
196+
"# Display the cleaned dataframe\n",
197+
"TableReport(df_cleaned)"
198+
]
199+
},
200+
{
201+
"cell_type": "markdown",
202+
"id": "627265cd",
203+
"metadata": {},
204+
"source": [
205+
"We can compare the shapes before and after cleaning and list the columns that were dropped:"
206+
]
207+
},
208+
{
209+
"cell_type": "code",
210+
"execution_count": null,
211+
"id": "eb157043",
212+
"metadata": {},
213+
"outputs": [],
214+
"source": [
215+
"print(f\"Original shape: {df.shape}\")\n",
216+
"print(f\"Cleaned shape: {df_cleaned.shape}\")\n",
217+
"print(\n",
218+
"    f\"\\nColumns dropped: {[col for col in df.columns if col not in cleaner.all_outputs_]}\"\n",
219+
")"
220+
]
221+
}
222+
],
223+
"metadata": {
224+
"jupytext": {
225+
"cell_metadata_filter": "-all",
226+
"main_language": "python",
227+
"notebook_metadata_filter": "-all"
228+
}
229+
},
230+
"nbformat": 4,
231+
"nbformat_minor": 5
232+
}
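
For readers of this diff who want to see what the four `Cleaner` requirements in the second exercise amount to, here is a minimal plain-pandas sketch. It is not how skrub implements `Cleaner`, and the `toy` dataframe and its column names are hypothetical, since `cleaner_data.csv` is not part of what is shown here. It only illustrates the intended effect of each step on a tiny table.

```python
import pandas as pd

# Hypothetical toy data; the real cleaner_data.csv is not shown in this commit.
toy = pd.DataFrame(
    {
        "hired": ["05-Mar-2021", "17-Nov-2019", "01-Jan-2020", "23-Aug-2018"],
        "salary": [1000, 2000, 3000, 4000],
        "country": ["FR", "FR", "FR", "FR"],  # constant column
        "notes": [None, None, None, "ok"],    # 75% missing values
    }
)

# 1. Drop columns where more than 50% of the values are missing.
null_fraction = toy.isna().mean()
kept = toy.loc[:, null_fraction <= 0.5]

# 2. Drop constant columns (columns with a single distinct value).
kept = kept.loc[:, kept.nunique(dropna=False) > 1]

# 3. Parse the date strings with the format given in the exercise hint.
kept = kept.assign(hired=pd.to_datetime(kept["hired"], format="%d-%b-%Y"))

# 4. Cast the remaining numeric columns to float32.
numeric_cols = kept.select_dtypes(include="number").columns
kept = kept.astype({col: "float32" for col in numeric_cols})

print(kept.dtypes)
```

Doing this by hand requires naming the date column explicitly, whereas the solution cell passes only a `datetime_format` and lets the `Cleaner` work out which columns to parse.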
