Commit fdec6b9

Added exercises notebook for hands-on histogrammar session
1 parent 75eeaf2 commit fdec6b9

2 files changed

Lines changed: 378 additions & 0 deletions

Lines changed: 374 additions & 0 deletions
@@ -0,0 +1,374 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Histogrammar exercises\n",
+    "\n",
+    "Histogrammar is a Python package that allows you to make histograms from NumPy arrays, and from pandas and Spark dataframes. \n",
+    "\n",
+    "(There is also a Scala backend for Histogrammar, which is used by Spark.) \n",
+    "\n",
+    "You can do the exercises below after the basic tutorial.\n",
+    "\n",
+    "Enjoy!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "# install histogrammar (if not installed yet)\n",
+    "import sys\n",
+    "\n",
+    "!\"{sys.executable}\" -m pip install histogrammar"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import histogrammar as hg"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dataset\n",
+    "Let's first load some data!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# open a pandas dataframe for use below\n",
+    "from histogrammar import resources\n",
+    "df = pd.read_csv(resources.data(\"test.csv.gz\"), parse_dates=[\"date\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head(2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Comparing histogram types"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Histogrammar treats histograms as objects. You will see this has various advantages.\n",
+    "\n",
+    "Let's fill a simple histogram with a numpy array."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# this creates a histogram with 10 equal-width bins covering the range 0 to 100\n",
+    "hist1 = hg.Bin(num=10, low=0, high=100)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1.fill.numpy(df['age'].values)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1.plot.matplotlib();"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2 = hg.SparselyBin(binWidth=10, origin=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2.fill.numpy(df['age'].values)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2.plot.matplotlib();"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Have a look at the .values and .bins attributes of hist1 and hist2.\n",
+    "What types are these? (hist1.values is a ...?) \n",
+    "Does that make sense?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: In each bin, what type of object is keeping track of the bin count?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Try filling hist1 with negative values, very large values (> 100), or NaNs. \n",
+    "Find out whether and how hist1 keeps track of these."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now fill hist2 with negative values, very large values (> 100), or NaNs. How does hist2 keep track of these?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Categorical variables\n",
+    "\n",
+    "For categorical variables, use the Categorize histogram.\n",
+    "- Categorize histograms accept categorical variables such as strings and booleans.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "histx = hg.Categorize('eyeColor')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "histx.fill.numpy(df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: What is a Categorize histogram fundamentally, a dictionary or a list?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: What else can it keep track of, e.g. numbers, booleans, NaNs? Give it a try, fill it with more entries!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Fill a histogram with the boolean column isActive, directly from the dataframe.\n",
+    "\n",
+    "Q: What type of histogram do you get?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hists = df.hg_make_histograms(features=['isActive'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Multi-dimensional histograms"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's make a 3-dimensional histogram, with axes: x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is composed of nested histograms, built up starting with the last axis.) \n",
+    "Then fill it with the dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# hist1 = hg.Categorize(quantity='isActive')\n",
+    "# hist2 = hg.Categorize(quantity='gender', value=hist1)\n",
+    "# hist3 = hg.Categorize(quantity='favoriteFruit', value=hist2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: How many data points end up in the bin (banana, male, True)?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Store this histogram as a json file. What is the size of the json file?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Read back the histogram and then plot it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: Make a histogram of the feature 'favoriteFruit' that measures the average value of 'latitude' per fruit bin."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist1 = hg.Average(quantity='latitude')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Q: What is the mean value of latitude for the bin 'strawberry'?"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernel_info": {
+   "name": "python3"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "nteract": {
+   "version": "0.15.0"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
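
The exercises on out-of-range values and NaNs can be reasoned about with a small stand-alone model. The sketch below is plain Python, not Histogrammar's implementation: the class names `DenseBins` and `SparseBins` and their attributes are hypothetical, chosen only to contrast a fixed-range histogram that has dedicated underflow, overflow and NaN counters with a sparse histogram that creates bins on demand. It assumes a half-open range (low <= x < high) for the fixed-range case.

```python
import math
from collections import defaultdict


class DenseBins:
    """Toy fixed-range histogram: a list of counts plus three 'flow' counters."""

    def __init__(self, num, low, high):
        self.num, self.low, self.high = num, low, high
        self.values = [0] * num  # one counter per bin, always allocated
        self.underflow = 0       # x < low
        self.overflow = 0        # x >= high (assumed half-open range)
        self.nanflow = 0         # x is NaN

    def fill(self, x):
        if math.isnan(x):
            self.nanflow += 1
        elif x < self.low:
            self.underflow += 1
        elif x >= self.high:
            self.overflow += 1
        else:
            width = (self.high - self.low) / self.num
            self.values[int((x - self.low) // width)] += 1


class SparseBins:
    """Toy sparse histogram: a bin exists only once a value has landed in it."""

    def __init__(self, bin_width, origin=0.0):
        self.bin_width, self.origin = bin_width, origin
        self.bins = defaultdict(int)  # bin index -> count
        self.nanflow = 0              # x is NaN

    def fill(self, x):
        if math.isnan(x):
            self.nanflow += 1
        else:
            index = math.floor((x - self.origin) / self.bin_width)
            self.bins[index] += 1


# Made-up sample with negative, in-range, too-large and NaN values.
data = [-3.0, 5.0, 25.0, 27.0, 99.9, 100.0, 250.0, float("nan")]
dense = DenseBins(num=10, low=0, high=100)
sparse = SparseBins(bin_width=10, origin=0)
for x in data:
    dense.fill(x)
    sparse.fill(x)

print(dense.values, dense.underflow, dense.overflow, dense.nanflow)
print(dict(sparse.bins), sparse.nanflow)
```

In this model the dense histogram keeps a fixed-length list and routes everything else to the flow counters, while the sparse one grows a dictionary keyed by bin index, so only NaNs need a separate counter. Whether `hist1` and `hist2` behave the same way is what the exercise asks you to find out.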

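The multi-dimensional exercises rest on one idea: each bin of an outer histogram holds another aggregator rather than a plain count. The following is a plain-Python sketch of that idea, not Histogrammar code. The helper names `nested_counts` and `mean_per_category` and the five sample records are made up for illustration; only the column names come from the notebook.

```python
from collections import defaultdict

# Made-up records; only the column names are taken from the notebook.
records = [
    {"favoriteFruit": "banana", "gender": "male", "isActive": True, "latitude": 10.0},
    {"favoriteFruit": "banana", "gender": "male", "isActive": True, "latitude": 20.0},
    {"favoriteFruit": "banana", "gender": "female", "isActive": False, "latitude": -5.0},
    {"favoriteFruit": "strawberry", "gender": "male", "isActive": False, "latitude": 40.0},
    {"favoriteFruit": "strawberry", "gender": "female", "isActive": True, "latitude": 50.0},
]


def nested_counts(rows, keys):
    """Count rows in a dict of dicts, one nesting level per key (outermost first)."""
    root = {}
    for row in rows:
        node = root
        for key in keys[:-1]:
            # Descend one level, creating the inner dict on first use.
            node = node.setdefault(row[key], {})
        leaf = row[keys[-1]]
        node[leaf] = node.get(leaf, 0) + 1
    return root


def mean_per_category(rows, category, quantity):
    """Keep a [sum, count] pair per category and return the mean of each."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = totals[row[category]]
        acc[0] += row[quantity]
        acc[1] += 1
    return {name: s / n for name, (s, n) in totals.items()}


counts = nested_counts(records, ["favoriteFruit", "gender", "isActive"])
means = mean_per_category(records, "favoriteFruit", "latitude")

print(counts["banana"]["male"][True])
print(means["strawberry"])
```

Looking up a bin is then a chain of key lookups, one per axis, and swapping the innermost counter for a running average gives the "average per bin" variant. The real exercise answers depend on the dataset loaded in the notebook, not on these toy records.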
tests/test_notebooks.py

Lines changed: 4 additions & 0 deletions
@@ -24,3 +24,7 @@ def test_notebook_basic(nb_tester):
 
 def test_notebook_advanced(nb_tester):
     nb_tester.check(notebook("histogrammar_tutorial_advanced.ipynb"))
+
+
+def test_notebook_exercises(nb_tester):
+    nb_tester.check(notebook("histogrammar_tutorial_exercises.ipynb"))
