Added exercises notebook for hands-on histogrammar session

mbaak · mbaak · commit fdec6b9aee64 · 2022-04-04T18:48:09.000+02:00
diff --git a/histogrammar/notebooks/histogrammar_tutorial_exercises.ipynb b/histogrammar/notebooks/histogrammar_tutorial_exercises.ipynb
@@ -0,0 +1,374 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Histogrammar exercises\n",
+    "\n",
+    "Histogrammar is a Python package for making histograms from NumPy arrays and from pandas and Spark dataframes.\n",
+    "\n",
+    "(There is also a Scala backend for Histogrammar, which is used by Spark.)\n",
+    "\n",
+    "You can do the exercises below after the basic tutorial.\n",
+    "\n",
+    "Enjoy!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "# install histogrammar (if not installed yet)\n",
+    "import sys\n",
+    "\n",
+    "!\"{sys.executable}\" -m pip install histogrammar"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import histogrammar as hg"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dataset\n",
+    "Let's first load some data!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# open a pandas dataframe for use below\n",
+    "from histogrammar import resources\n",
+    "df = pd.read_csv(resources.data(\"test.csv.gz\"), parse_dates=[\"date\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head(2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Comparing histogram types"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Histogrammar treats histograms as objects. You will see this has various advantages.\n",
+    "\n",
+    "Let's fill a simple histogram with a numpy array."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# this creates a histogram with 10 equal-width bins between low=0 and high=100\n",
+    "hist1 = hg.Bin(num=10, low=0, high=100)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1.fill.numpy(df['age'].values)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1.plot.matplotlib();"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2 = hg.SparselyBin(binWidth=10, origin=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2.fill.numpy(df['age'].values)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2.plot.matplotlib();"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Have a look at the .values and .bins attributes of hist1 and hist2.\n",
+    "What types are these? (hist1.values is a ...?) \n",
+    "Does that make sense?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: In each bin, what type of object is keeping track of the bin count?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Try filling hist1 with small (negative) values, very large values (> 100), or NaNs.\n",
+    "Find out if and how hist1 keeps track of these."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now fill hist2 with small (negative) values, very large values (> 100), or NaNs. How does hist2 keep track of these?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Categorical variables\n",
+    "\n",
+    "For categorical variables, use the Categorize histogram:\n",
+    "- Categorize histograms accept categorical variables such as strings and booleans.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "histx = hg.Categorize('eyeColor')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "histx.fill.numpy(df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: What is a Categorize histogram fundamentally: a dictionary or a list?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: What else can it keep track of, e.g. numbers, booleans, NaNs? Give it a try: fill it with more entries!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Fill a histogram with a boolean array (isActive), directly from the dataframe.\n",
+    "\n",
+    "Q: What type of histogram do you get?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hists = df.hg_make_histograms(features=['isActive'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Multi-dimensional histograms"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's make a 3-dimensional histogram, with axes: x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is built from nested histograms, starting with the last axis and wrapping it in the ones before it.)\n",
+    "Then fill it with the dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
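+    "# hint: nest the histograms from the last axis (z) outwards via the value argument\n",
+    "# hist1 = hg.Categorize(quantity='isActive')\n",
+    "# hist2 = hg.Categorize(quantity='gender', value=hist1)\n",
+    "# hist3 = hg.Categorize(quantity='favoriteFruit', value=hist2)"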
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: How many data points end up in the bin (banana, male, True)?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Store this histogram as a JSON file. What is the size of the JSON file?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Read back the histogram and then plot it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Make a histogram of the feature 'favoriteFruit' that measures the average value of 'latitude' per fruit bin."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1 = hg.Average(quantity='latitude')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: What is the mean value of latitude for the bin 'strawberry'?"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernel_info": {
+   "name": "python3"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "nteract": {
+   "version": "0.15.0"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/tests/test_notebooks.py b/tests/test_notebooks.py
@@ -24,3 +24,7 @@ def test_notebook_basic(nb_tester):
 
 def test_notebook_advanced(nb_tester):
     nb_tester.check(notebook("histogrammar_tutorial_advanced.ipynb"))
+
+
+def test_notebook_exercises(nb_tester):
+    nb_tester.check(notebook("histogrammar_tutorial_exercises.ipynb"))