skrub-data
diff --git a/content/exercises/01_ex_explore_clean.ipynb b/content/exercises/01_ex_explore_clean.ipynb
@@ -0,0 +1,232 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "c8b7476a",
+   "metadata": {},
+   "source": [
+    "# Exercise: exploring a new table\n",
+    "For this exercise, we will use the `employee_salaries` dataframe to answer some\n",
+    "questions.\n",
+    "\n",
+    "Run the following code to import the dataframe:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10dc3dd9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "data = pd.read_csv(\"../data/employee_salaries/data.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b95a5029",
+   "metadata": {},
+   "source": [
+    "Now use the skrub `TableReport` and answer the following questions:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a205456e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install skrub\n",
+    "from skrub import TableReport\n",
+    "\n",
+    "TableReport(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23c471c3",
+   "metadata": {},
+   "source": [
+    "## Questions\n",
+    "- What is the size of the dataframe (rows and columns)?\n",
+    "- How many columns have an object, numerical, or datetime dtype?\n",
+    "- Are there columns with a large number of missing values?\n",
+    "- Are there columns that have a high cardinality (>40 unique values)?\n",
+    "- Were datetime columns parsed correctly?\n",
+    "- Which columns have outliers?\n",
+    "- Which columns have an imbalanced distribution?\n",
+    "- Which columns are strongly correlated with each other?\n",
+    "\n",
+    "```{.python}\n",
+    "# PLACEHOLDER\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "```\n",
+    "\n",
+    "## Answers\n",
+    "- What is the size of the dataframe (rows and columns)?\n",
+    "    - 9228 rows × 8 columns\n",
+    "- How many columns have an object, numerical, or datetime dtype?\n",
+    "    - There are no datetime columns and one integer column (`year_first_hired`);\n",
+    "    all other columns have dtype `object`.\n",
+    "- Are there columns with a large number of missing values?\n",
+    "    - No, only the `gender` column contains a small fraction (0.2%) of missing\n",
+    "    values.\n",
+    "- Are there columns that have a high cardinality?\n",
+    "    - Yes, `division`, `employee_position_title`, `date_first_hired` have a\n",
+    "    cardinality larger than 40.\n",
+    "- Were datetime columns parsed correctly?\n",
+    "    - No, the `date_first_hired` column was not parsed: it has dtype `object`.\n",
+    "- Which columns have outliers?\n",
+    "    - No columns seem to include outliers.\n",
+    "- Which columns have an imbalanced distribution?\n",
+    "    - `assignment_category` has an imbalanced distribution.\n",
+    "- Which columns are strongly correlated with each other?\n",
+    "    - `department` and `department_name` have a Cramér's V of 1, which indicates\n",
+    "    a perfect association between them."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f20bde70",
+   "metadata": {},
+   "source": [
+    "# Exercise: clean a dataframe using the `Cleaner`\n",
+    "Load the given dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a512d31",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "df = pd.read_csv(\"../data/cleaner_data.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d8454f4",
+   "metadata": {},
+   "source": [
+    "Use the `TableReport` to answer the following questions:\n",
+    "\n",
+    "- Are there constant columns?\n",
+    "- Are there datetime columns? If so, were they parsed correctly?\n",
+    "- What is the dtype of the numerical features?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50244f15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from skrub import TableReport\n",
+    "\n",
+    "TableReport(df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "03dcbdcb",
+   "metadata": {},
+   "source": [
+    "Then, use the `Cleaner` to sanitize the data so that:\n",
+    "- Constant columns are removed\n",
+    "- Datetimes are parsed properly (hint: use `\"%d-%b-%Y\"` as the datetime format)\n",
+    "- All columns with more than 50% missing values are removed\n",
+    "- Numerical features are converted to `float32`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e78ad1a3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from skrub import Cleaner\n",
+    "\n",
+    "# Write your answer here\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#\n",
+    "#"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7370994",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# solution\n",
+    "from skrub import Cleaner\n",
+    "\n",
+    "cleaner = Cleaner(\n",
+    "    drop_if_constant=True,\n",
+    "    drop_null_fraction=0.5,\n",
+    "    numeric_dtype=\"float32\",\n",
+    "    datetime_format=\"%d-%b-%Y\",\n",
+    ")\n",
+    "\n",
+    "# Apply the cleaner\n",
+    "df_cleaned = cleaner.fit_transform(df)\n",
+    "\n",
+    "# Display the cleaned dataframe\n",
+    "TableReport(df_cleaned)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "627265cd",
+   "metadata": {},
+   "source": [
+    "We can compare the shapes before and after cleaning and list the columns that were dropped:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eb157043",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f\"Original shape: {df.shape}\")\n",
+    "print(f\"Cleaned shape: {df_cleaned.shape}\")\n",
+    "print(\n",
+    "    f\"\\nColumns dropped: {[col for col in df.columns if col not in cleaner.all_outputs_]}\"\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}