From 9539db79772b5acc40d838198528172de131fd98 Mon Sep 17 00:00:00 2001
From: "github-classroom[bot]"
<66690702+github-classroom[bot]@users.noreply.github.com>
Date: Mon, 28 Jun 2021 13:39:29 +0000
Subject: [PATCH 1/2] Setting up GitHub Classroom Feedback
From 712cb2fd4e68b91b58da4c3a3776cccd4541bd9e Mon Sep 17 00:00:00 2001
From: bitcoder-17 <70048387+bitcoder-17@users.noreply.github.com>
Date: Mon, 28 Jun 2021 15:25:44 -0500
Subject: [PATCH 2/2] Submission
---
colab.ipynb | 2179 ----------------------
colab_orange_binary_classification.ipynb | 2104 +++++++++++++++++++++
2 files changed, 2104 insertions(+), 2179 deletions(-)
delete mode 100644 colab.ipynb
create mode 100644 colab_orange_binary_classification.ipynb
diff --git a/colab.ipynb b/colab.ipynb
deleted file mode 100644
index 8557fcc..0000000
--- a/colab.ipynb
+++ /dev/null
@@ -1,2179 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "view-in-github"
- },
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "wgjCUHZDcoFN"
- },
- "source": [
- "#### Copyright 2020 Google LLC."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "2QYNKEzLcTMN"
- },
- "outputs": [],
- "source": [
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# https://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "w-JgxO6Qcrk9"
- },
- "source": [
- "# Binary Classification\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "jwVCC6XOhhni"
- },
- "source": [
- "In this unit we will explore [binary classification](https://en.wikipedia.org/wiki/Binary_classification) using [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).\n",
- " \n",
- "Some of these terms might be new, so let's explore them a bit more.\n",
- " \n",
- "[Classification](https://en.wikipedia.org/wiki/Statistical_classification) is the process of mapping a set of data points to a finite set of labels. From our [regression](https://en.wikipedia.org/wiki/Regression_analysis) labs, you likely remember that regression models such as [linear regression](https://en.wikipedia.org/wiki/Linear_regression) map input variables to a range of continuous values. In the domain of machine learning, models that predict continuous values are considered regression models. Models that predict a known finite set of values are considered classification models.\n",
- " \n",
- "*So what does binary mean?*\n",
- " \n",
- "Binary means there are only two values to predict. Binary classification is used to predict one of two values. These can be *true*/*false*, *malignant*/*benign*, *yes*/*no*, or any possible this-or-that options. For simplicity, these options are usually encoded as 1 and 0.\n",
- " \n",
- "*And what about logistic regression?*\n",
- " \n",
- "You've already seen linear regression, which attempts to fit a line to a set of data in order to predict continuous values. [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) similarly attempts to fit a line to data. However, the line is typically a [logistic/sigmoid](https://en.wikipedia.org/wiki/Logistic_function) curve. Instead of predicting a continuous value, the model uses the logistic curve to split the data into two classes. One class falls to one side of the line, and the other class falls to the other side of the line."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ODh348gr0ddg"
- },
- "source": [
- "## Framing the Problem"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "XzMnO97Q0kbo"
- },
- "source": [
- "*Cindy's Produce For Good* has a problem. Its business model revolves around collecting unsold fruit and vegetables from local growers and distributing them to families in need so that they can consume them or resell them at local farmers' markets and roadside stands.\n",
- " \n",
- "Quite a few complaints have come in lately from families and customers who have had a bitter surprise. They've peeled what they thought was an orange only to bite in and find out that they are eating a grapefruit!\n",
- " \n",
- "Cindy's growers give her truckloads of mixed citrus: lemons, limes, oranges, and grapefruit. A volunteer crew sorts the fruit. They are really good at sorting lemons and limes, but they falsely identify grapefruit as oranges about 5% of the time.\n",
- " \n",
- "In order to ensure customers get the oranges they expect, Cindy has created a machine that measures the weight, color, and largest diameter of fruit. She wants to create some software that can use this information and tell her workers if the fruit is an orange or not.\n",
- " \n",
- "She put a few thousand pieces of orange-looking fruit from one of her shipments through the sensors and manually labelled them as oranges or grapefruit. Looking at the data, she couldn't see an obvious pattern. Her best performance was about 90% accuracy. Her human sorters can do at least 95%. She's requested our help to see if we can solve the orange vs. grapefruit problem.\n",
- " \n",
- "In this lab we'll examine Cindy's citrus data and try to build a model to help her reliably sort her fruit as well as or better than human sorters."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "BAPcKBq854Z5"
- },
- "source": [
- "### Exercise 1: Thinking About the Data\n",
- "\n",
- "Before we dive into the data, let's think about the problem space and the dataset. Consider the questions below."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "aFx7Ts3b6N6d"
- },
- "source": [
- "#### Question 1\n",
- "\n",
- "Is this problem actually a good fit for machine learning? Why or why not?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "QEECNP0z4Eqz"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "BNcJPqVF6UNi"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "PZsXA8-xXAo5"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "NMfvenfO6ZK9"
- },
- "source": [
- "#### Question 2\n",
- "\n",
- "If we do build Cindy a machine learning model, what biases might exist in the data? Is there anything that might cause her model to have trouble generalizing to other data? If so, how might she make the model more resilient?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "0P_GjcKE4Nil"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "XzqOPKOb6_A2"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "bIBWp2gPXGoo"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "814wZ3BlYqK_"
- },
- "source": [
- "#### Question 3\n",
- "\n",
- "We've been asked to create a system that determines if a piece of fruit is an orange or not an orange. But aside from that, we haven't gotten much information about how the system would work as a whole.\n",
- "\n",
- "Describe how you would design the system from end-to-end. Things to consider:\n",
- "\n",
- "- Would the input fruit be all of the fruit that Cindy receives? Only the fruit suspected of being an orange? Only questionable fruit? Anything suspected of being an orange or a grapefruit?\n",
- "\n",
- "- What happens to fruit classified as \"not orange\"? Is it automatically considered a grapefruit? Is it thrown away? Put in a mixed fruit bag?\n",
- "\n",
- "Justify the inputs and the output actions for the system. What are the trade-offs?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "M6wlYr_74KMT"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nEO2I08VcgfR"
- },
- "source": [
- "*Please Put Your Answer Here*\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "bKItp8vjXKov"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "TO4DpNhgk66g"
- },
- "source": [
- "## Exploratory Data Analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nRLHpnzclbUI"
- },
- "source": [
- "### Acquire the Data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "bCJUYfK3k_Au"
- },
- "source": [
- "We have some idea about the problem that we are trying to solve, so let's take a look at what has been collected. The data is [hosted on Kaggle](https://www.kaggle.com/joshmcadams/oranges-vs-grapefruit). You can download the dataset and then upload it to this lab or use the code blocks below to fetch the data directly.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "vPYUr0Oal8Oa"
- },
- "source": [
- "#### Direct Kaggle Download"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "wM8Cnk0WmLS1"
- },
- "source": [
- "Follow the [API Credentials](https://github.com/Kaggle/kaggle-api#api-credentials) instructions and get a `kaggle.json` file (if you don't already have one), and upload it to this lab.\n",
- "\n",
- "Then run the code block below to download the oranges vs. grapefruit dataset."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "1DCns7DaatrV"
- },
- "outputs": [],
- "source": [
- "!KAGGLE_CONFIG_DIR=`pwd` kaggle datasets download joshmcadams/oranges-vs-grapefruit\n",
- "!ls"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "-kZpRsJVnXjI"
- },
- "source": [
- "There should now be an `oranges-vs-grapefruit.zip` file in the virtual machine for this lab. Let's unzip it so we can access the data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "-lVK5qcenhL6"
- },
- "outputs": [],
- "source": [
- "!unzip -o oranges-vs-grapefruit.zip\n",
- "!ls"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "hl8SHuXhn3MA"
- },
- "source": [
- "There is now a `citrus.csv` file in our virtual machine. Let's start digging into the data next."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Z2UUc8mkleDw"
- },
- "source": [
- "### Basic Analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ouEjcJQIl1K_"
- },
- "source": [
- "\n",
- "First and foremost, we need to load the data. For that we'll rely on [Pandas](https://pandas.pydata.org/) and use the `read_csv` function since the data was provided to us as a CSV file.\n",
- " \n",
- "After we load the data, let's sample it to get an idea of what we are working with."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "E0tP0yHJ7orM"
- },
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "citrus_df = pd.read_csv('citrus.csv', header=0)\n",
- "citrus_df.sample(10, random_state=2020)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "sqg9pDY3mWZe"
- },
- "source": [
- " It looks like we have a mixed bag of fruit containing oranges and grapefruit, just as expected.\n",
- "\n",
- " How many do we have of each?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "P4bIdlfJoVgt"
- },
- "source": [
- "#### Exercise 2: Basic Statistics\n",
- "\n",
- "Let's take a moment to determine the distribution of fruit in our dataset. Use [pyplot](https://matplotlib.org/api/pyplot_api.html) to create a histogram of the values in the `name` column of our `DataFrame`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "kFliytaio2Kk"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "5gS50q4_oQI0"
- },
- "outputs": [],
- "source": [
- "# Your Solution Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "o4TMbDARXPAg"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "YHo70ZE5pxAk"
- },
- "source": [
- "### Interpreting Our Histogram\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Gz_kZJbbpz3g"
- },
- "source": [
- "The histogram shows the data evenly split between the two types of fruit. A balanced dataset like this is a good starting point for building our classifier."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "WlIQ6RartLAU"
- },
- "source": [
- "### Describing Our Dataset"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "mIPBq2G3tPt3"
- },
- "source": [
- "Next let's do a simple `describe` of our dataset to get some more detailed information."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "todQOhCQAayf"
- },
- "outputs": [],
- "source": [
- "citrus_df.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "_X1rOcIju8NG"
- },
- "source": [
- "Since every count is 10,000, we don't seem to have missing values.\n",
- "\n",
- "Also, every `min` value is a positive number. This is good since it would be really odd to have negative diameters, weights, or colors.\n",
- "\n",
- "Do the values themselves look reasonable? The diameter is measured in centimeters. Is a 2 cm piece of fruit believable? What about a 16 cm piece of fruit?\n",
- "\n",
- "Similarly, do the weights seem within ranges that we'd expect?\n",
- "\n",
- "It is actually difficult to tell since we have different kinds of fruit in this bag. It would be easier to inspect summary statistics for each type of fruit."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "VeIOaYrCxfo6"
- },
- "source": [
- "#### Exercise 3: More Focused Description\n",
- "\n",
- "We have used `describe()` to get statistics about the entire dataset, but statistics that mix both kinds of fruit don't tell us much. Write Python code to print `describe()` statistics for each type of fruit in the dataset. Use the `percentiles` argument to the [describe method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) to omit the 25th and 75th percentiles.\n",
- "\n",
- "Your output should look similar to:\n",
- "\n",
- "```\n",
- "orange\n",
- " diameter weight red green blue\n",
- "count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000\n",
- "mean 8.474424 152.804920 156.832800 81.988200 7.115200\n",
- "std 1.260665 18.669021 9.890258 10.090789 6.493779\n",
- "min 2.960000 86.760000 123.000000 49.000000 2.000000\n",
- "50% 8.470000 152.665000 157.000000 82.000000 4.000000\n",
- "max 12.870000 231.090000 192.000000 116.000000 38.000000\n",
- "\n",
- "grapefruit\n",
- " diameter weight red green blue\n",
- "count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000\n",
- "mean 11.476946 197.296664 150.862800 70.033000 15.611200\n",
- "std 1.221148 19.193190 10.103148 10.044924 9.271592\n",
- "min 7.630000 126.790000 115.000000 31.000000 2.000000\n",
- "50% 11.450000 197.430000 151.000000 70.000000 15.000000\n",
- "max 16.450000 261.510000 187.000000 103.000000 56.000000\n",
- "```\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "vJjtTQBVzosh"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "c61DljFmzpY9"
- },
- "outputs": [],
- "source": [
- "# Your Solution Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "quwhDANzXWLe"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "JBYmcbBp0hK0"
- },
- "source": [
- "### Visualizing With Box Plots"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "r3VKH6Um0lVs"
- },
- "source": [
- "Now that we've sanity checked our data, let's visualize it to see if we can gather more insight. Above we gathered the min, max, mean, etc. for each numeric column for each type of fruit in a tabular form. Let's now visualize that data using a box plot and the [Altair](https://altair-viz.github.io/) visualization library.\n",
- "\n",
- "To start using Altair, we simply import it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "DPMKh6R51Z8d"
- },
- "outputs": [],
- "source": [
- "import altair as alt"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Myyagald1fCs"
- },
- "source": [
- "Next we will use the `mark_boxplot` method of the [Chart](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html?highlight=mark_boxplot) class to create our boxplot.\n",
- "\n",
- "Let's start by plotting the diameter by name.\n",
- "\n",
- "To do this we must first sample a subset of our data. We have 10,000 rows of data, and by default Altair refuses to chart a dataset that large. The default row limit is 5,000 rows, so we'll create a 5,000-row sample and then pass that sample to Altair."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "YlBaIjExfPFE"
- },
- "outputs": [],
- "source": [
- "citrus_df_sample = citrus_df.sample(n=5000, random_state=2020)\n",
- "\n",
- "alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
- " x='name',\n",
- " y='diameter'\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "UgEKnpIK3DXy"
- },
- "source": [
- "What insights can we glean from this graphic?\n",
- " \n",
- "As expected, the diameter of a grapefruit trends larger than that of an orange, but there is some overlap.\n",
- " \n",
- "Let's now add in weight to our boxplot."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "X9S23tjUAMXo"
- },
- "outputs": [],
- "source": [
- "alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
- " x='name',\n",
- " y='diameter'\n",
- ") | alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
- " x='name',\n",
- " y='weight'\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "o4Rag3gLEHMk"
- },
- "source": [
- "### Correlation"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "awOjMaT-AIqZ"
- },
- "source": [
- "Notice that relative weight and diameter seem pretty similar. These two columns might be closely correlated enough that we only need to use one of them. Let's check the correlation coefficient."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Ith_gAJeBV4D"
- },
- "source": [
- "#### Exercise 4: Correlation Coefficient\n",
- "\n",
- "Based on our visualization above, we suspect that diameter and weight are highly correlated. Write code to find the correlation coefficient between the diameter and weight columns in our `DataFrame`.\n",
- "\n",
- "*Hint: Check out the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) documentation.*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nZOllTtZBo93"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "kvguHDwYB4_P"
- },
- "outputs": [],
- "source": [
- "# Your Solution Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3PQ1J94LahIw"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "J8WcHxyREKeC"
- },
- "source": [
- "### Understanding the Correlation"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "xzKTnQt_ENr8"
- },
- "source": [
- "The correlation between diameter and weight is over 99%. That is a very high value.\n",
- "\n",
- "This shouldn't come as a big surprise. We should expect that the weight of a piece of fruit grows as its diameter grows.\n",
- "\n",
- "For now we can leave the data as is, but remember this correlation. We might be able to use it to remove a column from our training data without negatively affecting our model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "0nGieXUmG_8T"
- },
- "source": [
- "Let's take another look at diameter and weight. They are definitely correlated, but how do they relate to each other for each fruit type?\n",
- "\n",
- "One way to see this is to use a scatter plot chart to plot the diameter versus the weight, segmented by fruit type.\n",
- "\n",
- "We'll use Altair to do this."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "voAyuNFs7uBy"
- },
- "outputs": [],
- "source": [
- "alt.Chart(citrus_df_sample).mark_circle().encode(\n",
- " x='diameter',\n",
- " y='weight',\n",
- " color='name'\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "748C7ws7I5Bo"
- },
- "source": [
- "We can see that oranges and grapefruit have very similar rates of weight gain as their diameter increases. This shouldn't be too surprising since they are very similar fruits.\n",
- " \n",
- "In this chart we can also see that there are some fruits that are clearly oranges because of their small size and weight, as well as some that are clearly grapefruit due to their large size and weight. However, we have a large number of fruits that will be difficult to classify using diameter and weight alone."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "HfhTyJXjJOh9"
- },
- "source": [
- "### Checking Color Values"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "hjAZwYeIJZmN"
- },
- "source": [
- "We've looked pretty closely at the diameter and weight values, but we haven't done much with the color (RGB) values.\n",
- "\n",
- "Let's first see if boxplots are helpful."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "yCwnzlaqMMPU"
- },
- "outputs": [],
- "source": [
- "alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
- " x='name',\n",
- " y='red'\n",
- ") | alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
- " x='name',\n",
- " y='green'\n",
- ") | alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
- " x='name',\n",
- " y='blue'\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "bhD1jxTQJy91"
- },
- "source": [
- "There doesn't seem to be a lot of value there, at least examining each element of color separately. There is quite a bit of overlap between each color element, with grapefruit displaying a little less red and green typically."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "PKEKDZTZKKcH"
- },
- "source": [
- "It would also be nice to \"sanity check\" the color values, similar to how we checked to make sure that our diameters and weights were within reason. We could see if the values fall within a reasonable range, but then we'd need to know reasonable RGB values for oranges and grapefruit.\n",
- "\n",
- "Since we are dealing with color data, we can just create an image for each piece of fruit that contains a sampling (or all) of the colors that we have and we can see if it looks reasonable."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "j4mL5EK4L04L"
- },
- "source": [
- "First, let's get an exact count of the number of samples of each fruit type."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "PfZyWY4_Lqo7"
- },
- "outputs": [],
- "source": [
- "citrus_df.groupby('name')['name'].count()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "IO7e97CPL5Db"
- },
- "source": [
- "As expected, we have 5000 samples each. We can create a 100x50 image for each fruit type and visualize the data.\n",
- "\n",
- "We'll use [PIL's Image class](https://pillow.readthedocs.io/en/stable/reference/Image.html) to create a white 100x50 image. Then we'll get the editable pixel map from the image and assign the color value for each orange in our data to a different pixel in the image.\n",
- "\n",
- "Once we have the image filled out with color, we'll use PyPlot to display the image."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "SR96sFvDLnCr"
- },
- "outputs": [],
- "source": [
- "from PIL import Image\n",
- "from matplotlib.pyplot import imshow\n",
- "import numpy as np\n",
- "\n",
- "height, width = 50, 100\n",
- "img = Image.new('RGB', (width, height), color=(255, 255, 255))\n",
- "pixels = img.load()\n",
- "\n",
- "row_i, col_i = 0, 0\n",
- "for _, fruit in citrus_df[citrus_df['name'] == 'orange'].iterrows():\n",
- " pixels[col_i, row_i] = (fruit['red'], fruit['green'], fruit['blue'])\n",
- " col_i += 1\n",
- " if col_i >= width:\n",
- " col_i = 0\n",
- " row_i += 1\n",
- "\n",
- "imshow(img)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "KViuNrS8Px5-"
- },
- "source": [
- "That looks like a pretty reasonable orange color. What about the grapefruit?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "B4u-yPUjP3OX"
- },
- "source": [
- "#### Exercise 5: Create a Color Map Image\n",
- "\n",
- "We only visualized data from oranges. We'd really like to see the colors of all of the fruit. Create and show a 100x100 image that contains the colors for all of the oranges in the first 100x50 block. This should be followed with the colors for all of the grapefruit in the next 100x50 block. Visually inspect your image to see if the colors are believable as oranges and grapefruit."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "vEs1H0QHQhZy"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "pWmg9cLXQlWA"
- },
- "outputs": [],
- "source": [
- "# Your Solution Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "pIa1Ndg-cMeA"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "cX2Jwx2wRoxi"
- },
- "source": [
- "### Data Analysis Summary"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "-qWJ3X1aRxk0"
- },
- "source": [
- "We've done a lot of data analysis and have a pretty good feel for our data. We have:\n",
- "\n",
- "* Examined the distribution of our dataset and seen we have an equal distribution of fruit types\n",
- "* Determined that no data is missing\n",
- "* Determined that our weight, diameter, and color values are all within reason\n",
- "* Found a strong correlation between weight and diameter\n",
- "\n",
- "Let's see if we can build a model to classify our oranges!\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "oRd-BEHtTHyL"
- },
- "source": [
- "## Simple Logistic Model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "jUeivRYFTLoi"
- },
- "source": [
- "It is now time to build and iterate on a model. We'll start with a simple logistic regression model using scikit-learn's [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class and the feature columns already in our training data.\n",
- "\n",
- "Let's first remind ourselves of the columns we have at our disposal."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "AV0d1qhhOxml"
- },
- "outputs": [],
- "source": [
- "citrus_df.columns"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "WlmgcGGVWXUI"
- },
- "source": [
- "We'll use 'diameter', 'weight', 'red', 'green', and 'blue' as feature columns. Using 'name' for our target column is tempting, but remember that it contains fruit names for values, and for this exercise, we are only interested in determining if a piece of fruit is an orange or not an orange. Let's create a new column called 'is_orange' that contains the value `True` if the datum is an orange and `False` otherwise."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Gry5FUYVXJCy"
- },
- "source": [
- "### Exercise 6: Is Orange?\n",
- "\n",
- "Create a new column in `citrus_df` called `is_orange`. The column should be a boolean column and should contain the value `True` if a given row is labeled as an orange and `False` otherwise."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "qrMMuSoMXi-3"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "SlRxZuRSXlWx"
- },
- "outputs": [],
- "source": [
- "# Your Solution Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "U_ia4bXudMdJ"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "VBpOC1xuX6Wp"
- },
- "source": [
- "### Examining Our New Target Column"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "8CVIhZJwYj-8"
- },
- "source": [
- "Now that we've created a new target column, we should do some checking to make sure that it was created correctly.\n",
- "\n",
- "First we'll simply see the count per value."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "6dtKjRtAYtsg"
- },
- "outputs": [],
- "source": [
- "citrus_df.groupby('is_orange')['is_orange'].count()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RgVllZcyY2p0"
- },
- "source": [
- "There should be 5,000 `True` values and 5,000 `False` values.\n",
- "\n",
- "Now check to see that all 5,000 of the `True` values have the `name` \"orange\"."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "XDdyNHZgZEUN"
- },
- "outputs": [],
- "source": [
- "citrus_df[citrus_df['is_orange']]['name'].unique()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "cPYJT9NKYszc"
- },
- "source": [
- "We should only see a single value in the unique list ('orange') since all rows with 'is_orange' set to `True` should have a 'name' of 'orange'."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "2Tf5GjMYaHJE"
- },
- "source": [
- "### Train/Test Split"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "r62BzXuyacop"
- },
- "source": [
- "We can now split our data for training and testing. First we will create variables to hold our training and target column names."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "YuE6tX8qO3fF"
- },
- "outputs": [],
- "source": [
- "target_column = 'is_orange'\n",
- "\n",
- "feature_columns = ['diameter', 'weight', 'red', 'green', 'blue']\n",
- "\n",
- "target_column, feature_columns"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "lE_6f7eMsiW0"
- },
- "source": [
- "We need to split the data into a training and testing set. In this case we'll split 20% of the data off for testing and train on the other 80%. We can use scikit-learn's `train_test_split` function to do this. It is also a really good idea to shuffle our data, and `train_test_split` allows us to do this too.\n",
- "\n",
- "After we make the split, we can see how many data points we will train on for each class."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "E9y3IXRaeYPD"
- },
- "outputs": [],
- "source": [
- "from sklearn.model_selection import train_test_split\n",
- "\n",
- "X_train, X_test, y_train, y_test = train_test_split(\n",
- " citrus_df[feature_columns],\n",
- " citrus_df[target_column],\n",
- " test_size=0.2,\n",
- " random_state=180,\n",
- " shuffle=True)\n",
- "\n",
- "y_train.groupby(y_train).count()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Ga6YhfzKaxUW"
- },
- "source": [
- "Hmm. It looks like our training set has become a little uneven. Ideally, we would maintain the same ratio of oranges to non-oranges in our training and testing groups as the ratio in the whole set (50/50). But after splitting the data, we've ended up with a training set that skews towards non-oranges, and a test set that skews the opposite way, towards oranges."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "duQC2BDgnDJt"
- },
- "outputs": [],
- "source": [
- "y_test.groupby(y_test).count()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "s1_Tp1rKnqc6"
- },
- "source": [
- "Luckily, there's a solution for this problem: [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling). Stratifying our data ensures that the ratio of distinct values in the given column remains the same in our training and test sets as it is in the whole set (half orange and half non-orange, in this case)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "x--twmjowI6D"
- },
- "source": [
- "#### Exercise 7: Stratified Train Test Split\n",
- "\n",
- "Look at the [documentation for `train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and find the argument that can be used to stratify the data. Rewrite the split above to create a stratified split. When you are done, there should be 4,000 `True` values and 4,000 `False` values in the training data and 1,000 of each in the testing data. Print the counts to verify."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "tzk1dDO-wrWr"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "cJ1Cf4YMwutU"
- },
- "outputs": [],
- "source": [
- "# Your Code Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "JJg9KuFiwv4o"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ZYWynOgepHE1"
- },
- "source": [
- "### Examining The Split Data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "VMK0XRQYcHER"
- },
- "source": [
- "We can now verify that we have 80% of the data in training..."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "M05Xb00YQce_"
- },
- "outputs": [],
- "source": [
- "X_train.shape, y_train.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "dP8SlC1NcLHJ"
- },
- "source": [
- "And 20% in testing."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "0J5iWTZVQgYr"
- },
- "outputs": [],
- "source": [
- "X_test.shape, y_test.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ZTlWBlvPdHmK"
- },
- "source": [
- "Let's look at the training data and see if it stratified correctly."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "CYqtUfInQw2K"
- },
- "outputs": [],
- "source": [
- "y_train.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "oWTEqjnJdRf4"
- },
- "source": [
- "From this output we can see that there are 8,000 pieces of data with 2 unique values. The top value is `True`, and it occurs 4,000 times. That would leave us with 4,000 other values that are `False`.\n",
- "\n",
- "We can do the same for the `y_test` data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "JgTAtdGRRdzZ"
- },
- "outputs": [],
- "source": [
- "y_test.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "teL0KUBhdx5B"
- },
- "source": [
- "Another alternative is to use `groupby` on the series. Notice that the `by` argument contains the series once again and not a column name."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "pwnrwBbKTLo6"
- },
- "outputs": [],
- "source": [
- "y_test.groupby(by=y_test).count()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "C-czcddqd72v"
- },
- "source": [
- "### Create and Train the Model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "LpAKoPIUd_jW"
- },
- "source": [
- "It is finally time to build and train our model. As a reminder, we are using [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n",
- "\n",
- "First, we'll build a baseline model with default arguments, and see how well it does. To build the model we import `LogisticRegression`, create a class instance, and then fit the model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "jCmKNLvDTgti"
- },
- "outputs": [],
- "source": [
- "from sklearn.linear_model import LogisticRegression\n",
- "\n",
- "model = LogisticRegression(random_state=2020)\n",
- "model.fit(X_train, y_train)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "CpguGiacfHVU"
- },
- "source": [
- "### Measure Model Performance\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "-OHxe_Tle2zI"
- },
- "source": [
- "We now have a model ready to use to make predictions. Let's first make predictions on the test data that we held out of our training set and see how well we did.\n",
- "\n",
- "The first step is to actually make the predictions."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "XZcUbj2dUIJR"
- },
- "outputs": [],
- "source": [
- "predictions = model.predict(X_test)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3UgWpmXYfnUA"
- },
- "source": [
- "Now we can use metrics functions from scikit-learn to see how well our model performed. We'll check the accuracy, precision, recall, and F1 scores."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "aIESA8B8WYgn"
- },
- "outputs": [],
- "source": [
- "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
- "\n",
- "print('Accuracy: ', round(accuracy_score(y_test, predictions), 3))\n",
- "print('Precision: ', round(precision_score(y_test, predictions), 3))\n",
- "print('Recall: ', round(recall_score(y_test, predictions), 3))\n",
- "print('F1: ', round(f1_score(y_test, predictions), 3))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "GD3ZQ7AFhctN"
- },
- "source": [
- "Most of the metrics are above 90%, which is better than Cindy managed by hand!\n",
- "\n",
- "Let's see how this looks in a confusion matrix."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "zqkTlKu_ax3B"
- },
- "outputs": [],
- "source": [
- "from sklearn.metrics import confusion_matrix\n",
- "\n",
- "tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()\n",
- "\n",
- "print(f'True Positive: {tp}\\nTrue Negative: {tn}\\nFalse Positive: {fp}\\nFalse Negative: {fn}')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ZTXiVIjOkBt6"
- },
- "source": [
- "We have just under 100 falsely identified fruit. There are about twice as many false negatives as there are false positives. Let's take a few minutes to think about what this confusion matrix means."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "HALEd_j5vRc3"
- },
- "source": [
- "#### Exercise 8: Interpreting a Confusion Matrix\n",
- "\n",
- "In the text cell below, explain what a false positive and false negative represent in our dataset: which is an orange classified as a grapefruit and which is a grapefruit classified as an orange?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "8g2z3L-v4WXP"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "9yEN7fQNvqK1"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "U-upCwGgktep"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RptgsFxBkUyf"
- },
- "source": [
- "#### ROC Curve"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "o_s0IEeaPRo_"
- },
- "source": [
- "We can visualize this in another way using the [receiver operating characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). This graph plots the true positive rate on the y-axis against the false positive rate on the x-axis as the classification threshold varies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "X4vXHTwvY4aV"
- },
- "outputs": [],
- "source": [
- "from sklearn.metrics import roc_curve\n",
- "from sklearn.metrics import roc_auc_score\n",
- "\n",
- "scores = model.decision_function(X_test)\n",
- "\n",
- "fpr, tpr, _ = roc_curve(y_test, scores, pos_label=True)\n",
- "\n",
- "plt.ylabel('True Positive Rate')\n",
- "plt.xlabel('False Positive Rate')\n",
- "plt.plot(fpr, tpr)\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nEo_TM_ClV8F"
- },
- "source": [
- "We can see that there is a steep increase in false positives as the true positive rate crosses into the 90% range."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "u7qc9WwVlkMf"
- },
- "source": [
- "#### Precision Recall Curve"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "0OFLWPJ0P5hD"
- },
- "source": [
- "We can also get a feel for how precision and recall relate for this model by plotting the [precision-recall curve](https://en.wikipedia.org/wiki/Precision_and_recall)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "YyvmOGaHluCl"
- },
- "outputs": [],
- "source": [
- "from sklearn.metrics import precision_recall_curve\n",
- "\n",
- "scores = model.decision_function(X_test)\n",
- "\n",
- "precision, recall, _ = precision_recall_curve(y_test, scores)\n",
- "\n",
- "plt.xlabel('Recall')\n",
- "plt.ylabel('Precision')\n",
- "plt.plot(recall, precision)\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "KozxQbBMnGqJ"
- },
- "source": [
- "This shows the trade-off between precision and recall as the classification threshold applied to the model's scores is varied."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "xgGfw7FUnMnL"
- },
- "source": [
- "## Improving Our Model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "lCXsEo4znO-s"
- },
- "source": [
- "Our initial model was actually pretty good. But can it be even better?\n",
- " \n",
- "In the next exercise we'll attempt to improve our model by exploring hyperparameters and manipulating features."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "fmOpuYyOooWx"
- },
- "source": [
- "### Exercise 9: Using GridSearchCV\n",
- "\n",
- "We will now experiment with different hyperparameters to see if we can tune the model to increase our scores. To do this we will use the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class to [tune hyperparameters](https://scikit-learn.org/stable/modules/grid_search.html) of the scikit-learn [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model.\n",
- "\n",
- "GridSearchCV is a class used to test different hyperparameters for a model. The search accepts a dictionary whose keys are model parameter names. Each value is a list of the settings you want to try for that hyperparameter; to keep a parameter constant, give it a single-element list or set it directly on the model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "NXG7dvgHrWmQ"
- },
- "source": [
- "#### Question 1: Performing the Search\n",
- " \n",
- "Below is some code that imports the necessary functions and classes and sets up a logistic regression model for grid search. Add code to the grid search to test different hyperparameters such as `tol`, `C`, `solver`, and `max_iter`.\n",
- " \n",
- "The best estimator will be displayed after running the code block."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "PZunsUXx4ZpL"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "VrFSNQv1onp4"
- },
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "from sklearn.linear_model import LogisticRegression\n",
- "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
- "from sklearn.model_selection import train_test_split, GridSearchCV\n",
- "\n",
- "citrus_df = pd.read_csv('citrus.csv', header=0)\n",
- "citrus_df['is_orange'] = citrus_df['name'].apply(lambda name: name == 'orange')\n",
- "\n",
- "target_column = 'is_orange'\n",
- "feature_columns = ['diameter', 'weight', 'red', 'green', 'blue']\n",
- "\n",
- "X_train, X_validate, y_train, y_validate = train_test_split(\n",
- " citrus_df[feature_columns],\n",
- " citrus_df[target_column],\n",
- " test_size=0.2,\n",
- " random_state=42,\n",
- " shuffle=True,\n",
- " stratify=citrus_df[target_column])\n",
- "\n",
- "model = LogisticRegression(\n",
- " random_state=2020,\n",
- ")\n",
- "\n",
- "search = GridSearchCV(model, {\n",
- " # Your Solution Goes Here\n",
- "})\n",
- "\n",
- "search.fit(X_train, y_train)\n",
- "\n",
- "print(search.best_estimator_)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "QfyXfK6qmFHR"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3FI4zQuar69J"
- },
- "source": [
- "#### Question 2: Validate the Model\n",
- "\n",
- "Now that we have found a model that scored the highest in a cross-validation grid search, let's validate the model to see if it generalizes well on our validation data.\n",
- "\n",
- "We held out validation data in the `X_validate` and `y_validate` variables. Use that data to calculate the accuracy, precision, recall, and F1 scores for the model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "vQ49ehXl4chc"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "4RpgwipWsmB8"
- },
- "outputs": [],
- "source": [
- "# Your Solution Goes Here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Gu0xUpHimgTs"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Cdl_tvGZsuVJ"
- },
- "source": [
- "#### Question 3: Relative Model Quality\n",
- "\n",
- "Now that we have the scores for our model on our validation set, is the version found by grid search notably better? Discuss the difference in scores, if any, between our base model and the model selected by grid search."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "YL74G6Jk4d9C"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "SNzRzEB8tHaZ"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "DWcPKf6TmobG"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "_hphJaGFtvtk"
- },
- "source": [
- "## Exercise 10: Final Model Assessment\n",
- "\n",
- "Given our model performance, is this machine learning model a good fit for the problem?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "fwVtY7eb4gr8"
- },
- "source": [
- "##### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "vIR8qHyyuVPy"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "BY3VkqcYm2i3"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "QT8bPMU0xBzV"
- },
- "source": [
- "## Challenge"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "8g2r7I4y0sPv"
- },
- "source": [
- "### Question 1\n",
- "\n",
- "Normalization and standardization of data are not strictly required for performing logistic regression. They are, however, suggested in some cases. Research reasons why you might want (or not want) to normalize or standardize your input data to a logistic regression.\n",
- "\n",
- "Explain your findings and link to any relevant articles."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RKe9gOYY4jQa"
- },
- "source": [
- "#### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "tibOqWcM1UIP"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "4iKsRd7UnfkH"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "v56Y-Ix93JfE"
- },
- "source": [
- "### Question 2\n",
- "\n",
- "Use the [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to scale the feature data before training a logistic model on our oranges dataset. Use [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to iterate through hyperparameters to find an optimal model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Nmv-j6oR4lFK"
- },
- "source": [
- "#### **Student Solution**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "YLTLFHgZyP_r"
- },
- "outputs": [],
- "source": [
- "# Your code goes here"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "BnfmK-w6oAii"
- },
- "source": [
- "---"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "WBRhFJkf3x_t"
- },
- "source": [
- "### Question 3\n",
- "\n",
- "Are the optimal hyperparameters the same for the logistic regression model before and after scaling the data? Why or why not? Did you notice any other differences?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "jS4EWkOS3-MH"
- },
- "source": [
- "#### **Student Solution**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "XBy3g7pz4win"
- },
- "source": [
- "*Please Put Your Answer Here*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "YtGj2vIKoK3b"
- },
- "source": [
- "---"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [
- "wgjCUHZDcoFN",
- "CzT_KlNQz3QJ",
- "ctYb2Kciz5_g",
- "jl6jXALoz81g",
- "fIUMc607o4nx",
- "aPoWWQ_GzwRl",
- "ARQ5UMWCB6-8",
- "rzu1QHN2Qnp8",
- "KgBbJ6wjXooe",
- "It_3UR-j0bp8",
- "Shdk3nhrr2V2",
- "HkMc6Hwrsq3S",
- "XT5DFgkutI8o",
- "mGj9sLw_uYcC",
- "TVw5HbU_07oY",
- "mmX5KyZk42h_"
- ],
- "include_colab_link": true,
- "name": "Binary Classification",
- "private_outputs": true,
- "provenance": [],
- "toc_visible": true
- },
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
diff --git a/colab_orange_binary_classification.ipynb b/colab_orange_binary_classification.ipynb
new file mode 100644
index 0000000..62bcbd5
--- /dev/null
+++ b/colab_orange_binary_classification.ipynb
@@ -0,0 +1,2104 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "colab_orange_binary_classification.ipynb",
+ "private_outputs": true,
+ "provenance": [],
+ "collapsed_sections": [
+ "wgjCUHZDcoFN"
+ ],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wgjCUHZDcoFN"
+ },
+ "source": [
+ "#### Copyright 2020 Google LLC."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "2QYNKEzLcTMN"
+ },
+ "source": [
+ "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "w-JgxO6Qcrk9"
+ },
+ "source": [
+ "# Binary Classification\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jwVCC6XOhhni"
+ },
+ "source": [
+ "In this unit we will explore [binary classification](https://en.wikipedia.org/wiki/Binary_classification) using [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).\n",
+ " \n",
+ "Some of these terms might be new, so let's explore them a bit more.\n",
+ " \n",
+ "[Classification](https://en.wikipedia.org/wiki/Statistical_classification) is the process of mapping a set of data points to a finite set of labels. From our [regression](https://en.wikipedia.org/wiki/Regression_analysis) labs, you likely remember that regression models such as [linear regression](https://en.wikipedia.org/wiki/Linear_regression) map input variables to a range of continuous values. In the domain of machine learning, models that predict continuous values are considered regression models. Models that predict a known finite set of values are considered classification models.\n",
+ " \n",
+ "*So what does binary mean?*\n",
+ " \n",
+ "[Binary]() means there are only two values to predict. Binary classification is used to predict one of two values. These can be *true*/*false*, *malignant*/*benign*, *yes*/*no*, or any possible this-or-that options. For simplicity, these options are usually encoded as 1 and 0.\n",
+ " \n",
+ "*And what about logistic regression?*\n",
+ " \n",
+ "You've already seen linear regression, which attempts to fit a line to a set of data in order to predict continuous values. [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) similarly attempts to fit a line to data. However, the line is typically a [logistic/sigmoid](https://en.wikipedia.org/wiki/Logistic_function) curve. Instead of predicting a continuous value, the model uses the logistic curve to split the data into two classes. One class falls to one side of the line, and the other class falls to the other side of the line."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ODh348gr0ddg"
+ },
+ "source": [
+ "## Framing the Problem"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XzMnO97Q0kbo"
+ },
+ "source": [
+ "*Cindy's Produce For Good* has a problem. Its business model revolves around collecting unsold fruit and vegetables from local growers and distributing them to families in need so that they can consume them or resell them at local farmers' markets and roadside stands.\n",
+ " \n",
+ "Quite a few complaints have come in lately from families and customers who have had a bitter surprise. They've peeled what they thought was an orange only to bite in and find out that they are eating a grapefruit!\n",
+ " \n",
+ "Cindy's growers give her truckloads of mixed citrus: lemons, limes, oranges, and grapefruit. A volunteer crew sorts the fruit. They are really good at sorting lemons and limes, but they falsely identify grapefruit as oranges about 5% of the time.\n",
+ " \n",
+ "In order to ensure customers get the oranges they expect, Cindy has created a machine that measures the weight, color, and largest diameter of fruit. She wants to create some software that can use this information and tell her workers if the fruit is an orange or not.\n",
+ " \n",
+ "She put a few thousand pieces of orange-looking fruit from one of her shipments through the sensors and manually labelled them as oranges or grapefruit. Looking at the data, she couldn't see an obvious pattern. Her best performance was about 90% accuracy. Her human sorters can do at least 95%. She's requested our help to see if we can solve the orange vs. grapefruit problem.\n",
+ " \n",
+ "In this lab we'll examine Cindy's citrus data and try to build a model to help her reliably sort her fruit as well as or better than human sorters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BAPcKBq854Z5"
+ },
+ "source": [
+ "### Exercise 1: Thinking About the Data\n",
+ "\n",
+ "Before we dive into looking closely at the data, let's think about the problem space and the dataset. Consider the questions below."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aFx7Ts3b6N6d"
+ },
+ "source": [
+ "#### Question 1\n",
+ "\n",
+ "Is this problem actually a good fit for machine learning? Why or why not?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QEECNP0z4Eqz"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BNcJPqVF6UNi"
+ },
+ "source": [
+ "*Response: Yes, this problem is a good fit for machine learning. The model only has to distinguish between two classes, and a large amount of training data is available.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PZsXA8-xXAo5"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NMfvenfO6ZK9"
+ },
+ "source": [
+ "#### Question 2\n",
+ "\n",
+ "If we do build Cindy a machine learning model, what biases might exist in the data? Is there anything that might cause her model to have trouble generalizing to other data? If so, how might she make the model more resilient?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0P_GjcKE4Nil"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XzqOPKOb6_A2"
+ },
+ "source": [
+ "*Response: There might be a measurement bias since there will always be bias while measuring sizes.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bIBWp2gPXGoo"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "814wZ3BlYqK_"
+ },
+ "source": [
+ "#### Question 3\n",
+ "\n",
+ "We've been asked to create a system that determines if a piece of fruit is an orange or not an orange. But aside from that, we haven't gotten much information about how the system would work as a whole.\n",
+ "\n",
+ "Describe how you would design the system from end-to-end. Things to consider:\n",
+ "\n",
+ "- Would the input fruit be all of the fruit that Cindy receives? Only the fruit suspected of being an orange? Only questionable fruit? Anything suspected of being an orange or a grapefruit?\n",
+ "\n",
+ "- What happens to fruit classified as \"not orange\"? Is it automatically considered a grapefruit? Is it thrown away? Put in a mixed fruit bag?\n",
+ "\n",
+ "Justify the inputs and the output actions for the system. What are the trade-offs?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "M6wlYr_74KMT"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nEO2I08VcgfR"
+ },
+ "source": [
+ "*Response: The input would be only questionable fruit. If a fruit is classified as \"not orange\", it should be put in a mixed fruit bag with a label saying \"Uncertain Fruits\". The trade-off is the time it takes to build a sufficiently accurate prediction model.*\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bKItp8vjXKov"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TO4DpNhgk66g"
+ },
+ "source": [
+ "## Exploratory Data Analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nRLHpnzclbUI"
+ },
+ "source": [
+ "### Acquire the Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bCJUYfK3k_Au"
+ },
+ "source": [
+ "We have some idea about the problem that we are trying to solve, so let's take a look at what has been collected. The data is [hosted on Kaggle](https://www.kaggle.com/joshmcadams/oranges-vs-grapefruit). You can download the dataset and then upload it to this lab or use the code blocks below to fetch the data directly.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vPYUr0Oal8Oa"
+ },
+ "source": [
+ "#### Direct Kaggle Download"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "wM8Cnk0WmLS1"
+ },
+ "source": [
+ "Follow the [API Credentials](https://github.com/Kaggle/kaggle-api#api-credentials) instructions and get a `kaggle.json` file (if you don't already have one), and upload it to this lab.\n",
+ "\n",
+ "Then run the code block below to download the oranges vs. grapefruit dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "1DCns7DaatrV"
+ },
+ "source": [
+ "!KAGGLE_CONFIG_DIR=`pwd` kaggle datasets download joshmcadams/oranges-vs-grapefruit\n",
+ "!ls"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-kZpRsJVnXjI"
+ },
+ "source": [
+ "There should now be an `oranges-vs-grapefruit.zip` file in the virtual machine for this lab. Let's unzip it so we can access the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "-lVK5qcenhL6"
+ },
+ "source": [
+ "!unzip -o oranges-vs-grapefruit.zip\n",
+ "!ls"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hl8SHuXhn3MA"
+ },
+ "source": [
+ "There is now a `citrus.csv` file in our virtual machine. Let's start digging into the data next."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Z2UUc8mkleDw"
+ },
+ "source": [
+ "### Basic Analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ouEjcJQIl1K_"
+ },
+ "source": [
+ "\n",
+ "First and foremost, we need to load the data. For that we'll rely on [Pandas](https://pandas.pydata.org/) and use the `read_csv` function since the data was provided to us as a CSV file.\n",
+ " \n",
+ "After we load the data, let's sample it to get an idea of what we are working with."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E0tP0yHJ7orM"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "citrus_df = pd.read_csv('citrus.csv', header=0)\n",
+ "citrus_df.sample(10, random_state=2020)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sqg9pDY3mWZe"
+ },
+ "source": [
+ " It looks like we have a mixed bag of fruit containing oranges and grapefruit, just as expected.\n",
+ "\n",
+ " How many do we have of each?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "P4bIdlfJoVgt"
+ },
+ "source": [
+ "#### Exercise 2: Basic Statistics\n",
+ "\n",
+ "Let's take a moment to determine the distribution of fruit in our dataset. Use [pyplot](https://matplotlib.org/api/pyplot_api.html) to create a histogram of the values in the `name` column of our `DataFrame`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kFliytaio2Kk"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "5gS50q4_oQI0"
+ },
+ "source": [
+ "# Your Solution Goes Here\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "x = citrus_df['name']\n",
+ "# Count how many rows carry each fruit name\n",
+ "plt.hist(x)\n",
+ "plt.title(\"Fruit Data Distribution\")\n",
+ "plt.xlabel('Fruits')\n",
+ "plt.ylabel('Quantity')\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o4TMbDARXPAg"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YHo70ZE5pxAk"
+ },
+ "source": [
+ "### Interpreting Our Histogram\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Gz_kZJbbpz3g"
+ },
+ "source": [
+ "The histogram shows that the data is evenly split between the two types of fruit, so the dataset is balanced for training our classifier."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WlIQ6RartLAU"
+ },
+ "source": [
+ "### Describing Our Dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mIPBq2G3tPt3"
+ },
+ "source": [
+ "Next let's do a simple `describe` of our dataset to get some more detailed information."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "todQOhCQAayf"
+ },
+ "source": [
+ "citrus_df.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_X1rOcIju8NG"
+ },
+ "source": [
+ "Since every count is 10,000, we don't seem to have missing values.\n",
+ "\n",
+ "Also, every `min` value is a positive number. This is good since it would be really odd to have negative diameters, weights, or colors.\n",
+ "\n",
+ "Do the values themselves look reasonable? The diameter is measured in centimeters. Is a 2 cm piece of fruit believable? What about a 16 cm piece of fruit?\n",
+ "\n",
+ "Similarly, do the weights seem within ranges that we'd expect?\n",
+ "\n",
+ "It is actually difficult to tell since we have different kinds of fruit in this bag. It would be easier to inspect summary statistics for each type of fruit."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VeIOaYrCxfo6"
+ },
+ "source": [
+ "#### Exercise 3: More Focused Description\n",
+ "\n",
+ "We have used `describe()` to get statistics about the entire dataset, but those combined statistics tell us little about each individual type of fruit. Write Python code to print `describe()` statistics for each type of fruit in the dataset. Use the `percentiles` argument to the [describe method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) to omit the 25th and 75th percentiles.\n",
+ "\n",
+ "Your output should look similar to:\n",
+ "\n",
+ "```\n",
+ "orange\n",
+ " diameter weight red green blue\n",
+ "count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000\n",
+ "mean 8.474424 152.804920 156.832800 81.988200 7.115200\n",
+ "std 1.260665 18.669021 9.890258 10.090789 6.493779\n",
+ "min 2.960000 86.760000 123.000000 49.000000 2.000000\n",
+ "50% 8.470000 152.665000 157.000000 82.000000 4.000000\n",
+ "max 12.870000 231.090000 192.000000 116.000000 38.000000\n",
+ "\n",
+ "grapefruit\n",
+ " diameter weight red green blue\n",
+ "count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000\n",
+ "mean 11.476946 197.296664 150.862800 70.033000 15.611200\n",
+ "std 1.221148 19.193190 10.103148 10.044924 9.271592\n",
+ "min 7.630000 126.790000 115.000000 31.000000 2.000000\n",
+ "50% 11.450000 197.430000 151.000000 70.000000 15.000000\n",
+ "max 16.450000 261.510000 187.000000 103.000000 56.000000\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vJjtTQBVzosh"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "c61DljFmzpY9"
+ },
+ "source": [
+ "# Your Solution Goes Here\n",
+ "\n",
+ "#orange data\n",
+ "orange_df = citrus_df[citrus_df['name'] == 'orange']\n",
+ "orange_description = orange_df.describe(percentiles=[0.50])\n",
+ "\n",
+ "#grapefruit data\n",
+ "grapefruit_df = citrus_df[citrus_df['name'] == 'grapefruit']\n",
+ "grapefruit_description = grapefruit_df.describe(percentiles=[0.50])\n",
+ "\n",
+ "\n",
+ "#output\n",
+ "print('orange')\n",
+ "print(orange_description)\n",
+ "print('\\n')\n",
+ "print('grapefruit')\n",
+ "print(grapefruit_description)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "quwhDANzXWLe"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JBYmcbBp0hK0"
+ },
+ "source": [
+ "### Visualizing With Box Plots"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "r3VKH6Um0lVs"
+ },
+ "source": [
+ "Now that we've sanity checked our data, let's visualize it to see if we can gather more insight. Above we gathered the min, max, mean, etc. for each numeric column for each type of fruit in a tabular form. Let's now visualize that data using a box plot and the [Altair](https://altair-viz.github.io/) visualization library.\n",
+ "\n",
+ "To start using Altair, we simply import it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "DPMKh6R51Z8d"
+ },
+ "source": [
+ "import altair as alt"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Myyagald1fCs"
+ },
+ "source": [
+ "Next we will use the `mark_boxplot` method of the [Chart](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html?highlight=mark_boxplot) class to create our boxplot.\n",
+ "\n",
+ "Let's start by plotting the diameter by name.\n",
+ "\n",
+ "To do this we must first sample a subset of our data. We have 10,000 rows of data, which exceeds Altair's default limit of 5,000 rows, so we'll create a 5,000-row sample and pass that sample to Altair."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YlBaIjExfPFE"
+ },
+ "source": [
+ "citrus_df_sample = citrus_df.sample(n=5000, random_state=2020)\n",
+ "\n",
+ "alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
+ " x='name',\n",
+ " y='diameter'\n",
+ ")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UgEKnpIK3DXy"
+ },
+ "source": [
+ "What insights can we glean from this graphic?\n",
+ " \n",
+ "As expected, the diameter of a grapefruit trends larger than that of an orange, but there is some overlap.\n",
+ " \n",
+ "Let's now add in weight to our boxplot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "X9S23tjUAMXo"
+ },
+ "source": [
+ "alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
+ " x='name',\n",
+ " y='diameter'\n",
+ ") | alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
+ " x='name',\n",
+ " y='weight'\n",
+ ")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o4Rag3gLEHMk"
+ },
+ "source": [
+ "### Correlation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "awOjMaT-AIqZ"
+ },
+ "source": [
+ "Notice that relative weight and diameter seem pretty similar. These two columns might be closely correlated enough that we only need to use one of them. Let's check the correlation coefficient."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ith_gAJeBV4D"
+ },
+ "source": [
+ "#### Exercise 4: Correlation Coefficient\n",
+ "\n",
+ "Based on our visualization above, we suspect that diameter and weight are highly correlated. Write code to find the correlation coefficient between the diameter and weight columns in our `DataFrame`.\n",
+ "\n",
+ "*Hint: Check out the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) documentation.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nZOllTtZBo93"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "kvguHDwYB4_P"
+ },
+ "source": [
+ "# Your Solution Goes Here\n",
+ "corr_df = citrus_df[['diameter', 'weight']].corr(method='pearson')\n",
+ "correlation = corr_df['diameter']['weight']\n",
+ "print(\"The correlation coefficient between diameter and weight is: {0}\".format(correlation))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3PQ1J94LahIw"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "J8WcHxyREKeC"
+ },
+ "source": [
+ "### Understanding the Correlation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xzKTnQt_ENr8"
+ },
+ "source": [
+ "The correlation coefficient between diameter and weight is over 0.99. That is a very high value.\n",
+ "\n",
+ "This shouldn't come as a big surprise. We should expect that the weight of a piece of fruit grows as its diameter grows.\n",
+ "\n",
+ "For now we can leave the data as is, but remember this correlation. We might be able to use it to remove a column from our training data without negatively affecting our model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0nGieXUmG_8T"
+ },
+ "source": [
+ "Let's take another look at diameter and weight. They are definitely correlated, but how do they relate to each other for each fruit type?\n",
+ "\n",
+ "One way to see this is to use a scatter plot chart to plot the diameter versus the weight, segmented by fruit type.\n",
+ "\n",
+ "We'll use Altair to do this."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "voAyuNFs7uBy"
+ },
+ "source": [
+ "alt.Chart(citrus_df_sample).mark_circle().encode(\n",
+ " x='diameter',\n",
+ " y='weight',\n",
+ " color='name'\n",
+ ")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "748C7ws7I5Bo"
+ },
+ "source": [
+ "We can see that oranges and grapefruit have very similar rates of weight gain as their diameter increases. This shouldn't be too surprising since they are very similar fruits.\n",
+ " \n",
+ "In this chart we can also see that there are some fruits that are clearly oranges because of their small size and weight, as well as some that are clearly grapefruit due to their large size and weight. However, we have a large number of fruits that will be difficult to classify using diameter and weight alone."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HfhTyJXjJOh9"
+ },
+ "source": [
+ "### Checking Color Values"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hjAZwYeIJZmN"
+ },
+ "source": [
+ "We've looked pretty closely at the diameter and weight values, but we haven't done much with the color (RGB) values.\n",
+ "\n",
+ "Let's first see if boxplots are helpful."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "yCwnzlaqMMPU"
+ },
+ "source": [
+ "alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
+ " x='name',\n",
+ " y='red'\n",
+ ") | alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
+ " x='name',\n",
+ " y='green'\n",
+ ") | alt.Chart(citrus_df_sample, width=400).mark_boxplot().encode(\n",
+ " x='name',\n",
+ " y='blue'\n",
+ ")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "bhD1jxTQJy91"
+ },
+ "source": [
+ "There doesn't seem to be a lot of value there, at least when examining each element of color separately. The two fruits overlap quite a bit on every color element, with grapefruit typically displaying a little less red and green."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PKEKDZTZKKcH"
+ },
+ "source": [
+ "It would also be nice to \"sanity check\" the color values, similar to how we checked to make sure that our diameters and weights were within reason. We could see if the values fall within a reasonable range, but then we'd need to know reasonable RGB values for oranges and grapefruit.\n",
+ "\n",
+ "Since we are dealing with color data, we can just create an image for each piece of fruit that contains a sampling (or all) of the colors that we have and we can see if it looks reasonable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "j4mL5EK4L04L"
+ },
+ "source": [
+ "First, let's get an exact count of the number of samples of each fruit type."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "PfZyWY4_Lqo7"
+ },
+ "source": [
+ "citrus_df.groupby('name')['name'].count()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IO7e97CPL5Db"
+ },
+ "source": [
+ "As expected, we have 5,000 samples each. We can create a 100x50 image for each fruit type and visualize the data.\n",
+ "\n",
+ "We'll use [PIL's Image class](https://pillow.readthedocs.io/en/stable/reference/Image.html) to create a white 100x50 image. Then we'll get the editable pixel map from the image and assign the color value for each orange in our data to a different pixel in the image.\n",
+ "\n",
+ "Once we have the image filled out with color, we'll use PyPlot to display the image."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SR96sFvDLnCr"
+ },
+ "source": [
+ "from PIL import Image\n",
+ "from matplotlib.pyplot import imshow\n",
+ "import numpy as np\n",
+ "\n",
+ "height, width = 50, 100\n",
+ "img = Image.new('RGB', (width, height), color=(255, 255, 255))\n",
+ "pixels = img.load()\n",
+ "\n",
+ "row_i, col_i = 0, 0\n",
+ "for _, fruit in citrus_df[citrus_df['name'] == 'orange'].iterrows():\n",
+ "  pixels[col_i, row_i] = (fruit['red'], fruit['green'], fruit['blue'])\n",
+ "  col_i += 1\n",
+ "  if col_i >= width:\n",
+ "    col_i = 0\n",
+ "    row_i += 1\n",
+ "\n",
+ "imshow(img)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KViuNrS8Px5-"
+ },
+ "source": [
+ "That looks like a pretty reasonable orange color. What about the grapefruit?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B4u-yPUjP3OX"
+ },
+ "source": [
+ "#### Exercise 5: Create a Color Map Image\n",
+ "\n",
+ "We only visualized data from oranges. We'd really like to see the colors of all of the fruit. Create and show a 100x100 image that contains the colors for all of the oranges in the first 100x50 block. This should be followed with the colors for all of the grapefruit in the next 100x50 block. Visually inspect your image to see if the colors are believable as oranges and grapefruit."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vEs1H0QHQhZy"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pWmg9cLXQlWA"
+ },
+ "source": [
+ "# Your Solution Goes Here\n",
+ "height, width = 100, 100\n",
+ "img = Image.new('RGB', (width, height), color=(255, 255, 255))\n",
+ "pixels = img.load()\n",
+ "\n",
+ "# Oranges fill the left half (columns 0-49), column by column\n",
+ "row_i, col_i = 0, 0\n",
+ "for _, fruit in citrus_df[citrus_df['name'] == 'orange'].iterrows():\n",
+ "  pixels[col_i, row_i] = (fruit['red'], fruit['green'], fruit['blue'])\n",
+ "  row_i += 1\n",
+ "  if row_i >= height:\n",
+ "    col_i += 1\n",
+ "    row_i = 0\n",
+ "\n",
+ "# Grapefruit fill the right half (columns 50-99), column by column\n",
+ "row_i, col_i = 0, 50\n",
+ "for _, fruit in citrus_df[citrus_df['name'] == 'grapefruit'].iterrows():\n",
+ "  pixels[col_i, row_i] = (fruit['red'], fruit['green'], fruit['blue'])\n",
+ "  row_i += 1\n",
+ "  if row_i >= height:\n",
+ "    col_i += 1\n",
+ "    row_i = 0\n",
+ "\n",
+ "plt.title(\"Color Map\")\n",
+ "plt.xlabel('oranges (left half), grapefruit (right half)')\n",
+ "imshow(img)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "pIa1Ndg-cMeA"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cX2Jwx2wRoxi"
+ },
+ "source": [
+ "### Data Analysis Summary"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-qWJ3X1aRxk0"
+ },
+ "source": [
+ "We've done a lot of data analysis and have a pretty good feel for our data. We have:\n",
+ "\n",
+ "* Examined the distribution of our dataset and seen we have an equal distribution of fruit types\n",
+ "* Determined that no data is missing\n",
+ "* Determined that our weight, diameter, and color values are all within reason\n",
+ "* Found a strong correlation between weight and diameter\n",
+ "\n",
+ "Let's see if we can build a model to classify our oranges!\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oRd-BEHtTHyL"
+ },
+ "source": [
+ "## Simple Logistic Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jUeivRYFTLoi"
+ },
+ "source": [
+ "It is now time to build and iterate on a model. We'll start with a simple logistic regression model using scikit-learn's [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class and the feature columns already in our training data.\n",
+ "\n",
+ "Let's first remind ourselves of the columns we have at our disposal."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "AV0d1qhhOxml"
+ },
+ "source": [
+ "citrus_df.columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WlmgcGGVWXUI"
+ },
+ "source": [
+ "We'll use 'diameter', 'weight', 'red', 'green', and 'blue' as feature columns. Using 'name' for our target column is tempting, but remember that it contains fruit names for values, and for this exercise, we are only interested in determining if a piece of fruit is an orange or not an orange. Let's create a new column called 'is_orange' that contains the value `True` if the datum is an orange and `False` otherwise."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Gry5FUYVXJCy"
+ },
+ "source": [
+ "### Exercise 6: Is Orange?\n",
+ "\n",
+ "Create a new column in `citrus_df` called `is_orange`. The column should be a boolean column and should contain the value `True` if a given row is labeled as an orange and `False` otherwise."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qrMMuSoMXi-3"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "SlRxZuRSXlWx"
+ },
+ "source": [
+ "# Your Solution Goes Here\n",
+ "citrus_df['is_orange'] = citrus_df['name'] == 'orange'\n",
+ "citrus_df"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U_ia4bXudMdJ"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VBpOC1xuX6Wp"
+ },
+ "source": [
+ "### Examining Our New Target Column"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8CVIhZJwYj-8"
+ },
+ "source": [
+ "Now that we've created a new target column, we should do some checking to make sure that it was created correctly.\n",
+ "\n",
+ "First we'll simply see the count per value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "6dtKjRtAYtsg"
+ },
+ "source": [
+ "citrus_df.groupby('is_orange')['is_orange'].count()\n",
+ "#True: orange, #False: grapefruit"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RgVllZcyY2p0"
+ },
+ "source": [
+ "There should be 5,000 `True` values and 5,000 `False` values.\n",
+ "\n",
+ "Now check to see that all 5,000 of the `True` values have the `name` \"orange\"."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XDdyNHZgZEUN"
+ },
+ "source": [
+ "citrus_df[citrus_df['is_orange']]['name'].unique()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cPYJT9NKYszc"
+ },
+ "source": [
+ "We should only see a single value in the unique list ('orange') since all rows with 'is_orange' set to `True` should have a 'name' of 'orange'."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2Tf5GjMYaHJE"
+ },
+ "source": [
+ "### Train/Test Split"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "r62BzXuyacop"
+ },
+ "source": [
+ "We can now split our data for training and testing. First we will create variables to hold our training and target column names."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YuE6tX8qO3fF"
+ },
+ "source": [
+ "target_column = 'is_orange'\n",
+ "\n",
+ "feature_columns = ['diameter', 'weight', 'red', 'green', 'blue']\n",
+ "\n",
+ "target_column, feature_columns"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lE_6f7eMsiW0"
+ },
+ "source": [
+ "We need to split the data into a training and testing set. In this case we'll split 20% of the data off for testing and train on the other 80%. We can use scikit-learn's `train_test_split` function to do this. It is also a really good idea to shuffle our data, and `train_test_split` allows us to do this too.\n",
+ "\n",
+ "After we make the split, we can see how many data points we will train off of for each class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "E9y3IXRaeYPD"
+ },
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
+ " citrus_df[feature_columns],\n",
+ " citrus_df[target_column],\n",
+ " test_size=0.2,\n",
+ " random_state=180,\n",
+ " shuffle=True)\n",
+ "\n",
+ "y_train.groupby(y_train).count()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Ga6YhfzKaxUW"
+ },
+ "source": [
+ "Hmm. It looks like our training set has become a little uneven. Ideally, we would maintain the same ratio of oranges to non-oranges in our training and testing groups as the ratio in the whole set (50/50). But after splitting the data, we've ended up with a training set that skews towards non-oranges, and a test set that skews the opposite way, towards oranges."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "duQC2BDgnDJt"
+ },
+ "source": [
+ "y_test.groupby(y_test).count()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "s1_Tp1rKnqc6"
+ },
+ "source": [
+ "Luckily, there's a solution for this problem: [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling). Stratifying our data ensures that the ratio of distinct values in the given column remains the same in our training and test sets as it is in the whole set (half orange and half non-orange, in this case)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "x--twmjowI6D"
+ },
+ "source": [
+ "#### Exercise 7: Stratified Train Test Split\n",
+ "\n",
+ "Look at the [documentation for `train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and find the argument that can be used to stratify the data. Rewrite the split above to create a stratified split. When you are done, there should be 4,000 `True` values and 4,000 `False` values in the training data and 1,000 of each in the testing data. Print the counts to verify."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tzk1dDO-wrWr"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "cJ1Cf4YMwutU"
+ },
+ "source": [
+ "# Your Code Goes Here\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
+ " citrus_df[feature_columns],\n",
+ " citrus_df[target_column],\n",
+ " test_size=0.2,\n",
+ " random_state=180,\n",
+ " shuffle=True,\n",
+ " stratify=citrus_df[target_column]\n",
+ " )\n",
+ "\n",
+ "print(y_train.groupby(y_train).count())\n",
+ "print(y_test.groupby(y_test).count())"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JJg9KuFiwv4o"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZYWynOgepHE1"
+ },
+ "source": [
+ "### Examining The Split Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VMK0XRQYcHER"
+ },
+ "source": [
+ "We can now verify that we have 80% of the data in training..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "M05Xb00YQce_"
+ },
+ "source": [
+ "X_train.shape, y_train.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dP8SlC1NcLHJ"
+ },
+ "source": [
+ "And 20% in testing."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "0J5iWTZVQgYr"
+ },
+ "source": [
+ "X_test.shape, y_test.shape"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZTlWBlvPdHmK"
+ },
+ "source": [
+ "Let's look at the training data and see if it stratified correctly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "CYqtUfInQw2K"
+ },
+ "source": [
+ "#label for training\n",
+ "y_train.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oWTEqjnJdRf4"
+ },
+ "source": [
+ "From this output we can see that there are 8,000 pieces of data with 2 unique values. The top value occurs 4,000 times (with the classes tied, it may be reported as either `True` or `False`). That leaves 4,000 rows with the other value.\n",
+ "\n",
+ "We can do the same for the `y_test` data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "JgTAtdGRRdzZ"
+ },
+ "source": [
+ "#label for testing\n",
+ "y_test.describe()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "teL0KUBhdx5B"
+ },
+ "source": [
+ "An alternative is to use `groupby` on the series. Notice that the `by` argument contains the series once again and not a column name."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "pwnrwBbKTLo6"
+ },
+ "source": [
+ "y_test.groupby(by=y_test).count()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C-czcddqd72v"
+ },
+ "source": [
+ "### Create and Train the Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LpAKoPIUd_jW"
+ },
+ "source": [
+ "It is finally time to build and train our model. As a reminder, we are using [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n",
+ "\n",
+ "First, we'll build a baseline model with default arguments, and see how well it does. To build the model we import `LogisticRegression`, create a class instance, and then fit the model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "jCmKNLvDTgti"
+ },
+ "source": [
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "model = LogisticRegression(random_state=2020)\n",
+ "model.fit(X_train, y_train)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CpguGiacfHVU"
+ },
+ "source": [
+ "### Measure Model Performance\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-OHxe_Tle2zI"
+ },
+ "source": [
+ "We now have a model ready to use to make predictions. Let's first make predictions on the test data that we held out of our training set and see how well we did.\n",
+ "\n",
+ "The first step is to actually make the predictions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "XZcUbj2dUIJR"
+ },
+ "source": [
+ "# predictions holds the model's predicted labels for X_test; y_test holds the true labels\n",
+ "predictions = model.predict(X_test)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3UgWpmXYfnUA"
+ },
+ "source": [
+ "Now we can use metrics functions from scikit-learn to see how well our model performed. We'll check the accuracy, precision, recall, and F1 scores."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "aIESA8B8WYgn"
+ },
+ "source": [
+ "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
+ "\n",
+ "print('Accuracy: ', round(accuracy_score(y_test, predictions), 3))\n",
+ "print('Precision: ', round(precision_score(y_test, predictions), 3))\n",
+ "print('Recall: ', round(recall_score(y_test, predictions), 3))\n",
+ "print('F1: ', round(f1_score(y_test, predictions), 3))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GD3ZQ7AFhctN"
+ },
+ "source": [
+ "Most of the metrics are above 90%, which is better than Cindy's sorting!\n",
+ "\n",
+ "Let's see how this looks in a confusion matrix."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "zqkTlKu_ax3B"
+ },
+ "source": [
+ "from sklearn.metrics import confusion_matrix\n",
+ "\n",
+ "tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()\n",
+ "\n",
+ "print(f'True Positive: {tp}\\nTrue Negative: {tn}\\nFalse Positive: {fp}\\nFalse Negative: {fn}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZTXiVIjOkBt6"
+ },
+ "source": [
+ "We have just under 100 falsely identified fruit. There are about twice as many false negatives as there are false positives. Let's take a few minutes to think about what this confusion matrix means."
+ ]
+ },
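+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch of how the four counts in a confusion matrix relate to the metrics above, the cell below recomputes accuracy, precision, recall, and F1 by hand. The labels are made up for illustration (they are not the citrus data), and `y_true_demo` and `y_pred_demo` are hypothetical names."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.metrics import confusion_matrix, precision_score, recall_score\n",
+ "\n",
+ "# Made-up labels for illustration only (True = orange, False = grapefruit).\n",
+ "y_true_demo = np.array([True, True, True, True, False, False, False, False])\n",
+ "y_pred_demo = np.array([True, True, False, False, True, False, False, False])\n",
+ "\n",
+ "# True labels go first, predictions second.\n",
+ "tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()\n",
+ "\n",
+ "accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
+ "precision = tp / (tp + fp)  # of the fruit predicted orange, how many are orange\n",
+ "recall = tp / (tp + fn)     # of the actual oranges, how many were found\n",
+ "f1 = 2 * precision * recall / (precision + recall)\n",
+ "\n",
+ "print(f'Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },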
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HALEd_j5vRc3"
+ },
+ "source": [
+ "#### Exercise 8: Interpreting a Confusion Matrix\n",
+ "\n",
+ "In the text cell below, explain what a false positive and false negative represent in our dataset: which is an orange classified as a grapefruit and which is a grapefruit classified as an orange?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "d7bs6xm24WMt"
+ },
+ "source": [
+ ""
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8g2z3L-v4WXP"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9yEN7fQNvqK1"
+ },
+ "source": [
+ "*Response: False Positive: The model predicted that the fruit is an orange, but the correct class is grapefruit. False Negative: The model predicted that the fruit is a grapefruit, but the correct class is orange.*\n",
+ "**Orange classified as a grapefruit: False Negative, \n",
+ "Grapefruit classified as an orange: False Positive**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U-upCwGgktep"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RptgsFxBkUyf"
+ },
+ "source": [
+ "#### ROC Curve"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "o_s0IEeaPRo_"
+ },
+ "source": [
+ "We can visualize this in another way using the [receiver operating characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). This graph plots the true positive rate on the y-axis against the false positive rate on the x-axis."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "X4vXHTwvY4aV"
+ },
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.metrics import roc_curve\n",
+ "from sklearn.metrics import roc_auc_score\n",
+ "\n",
+ "scores = model.decision_function(X_test)\n",
+ "\n",
+ "fpr, tpr, _ = roc_curve(y_test, scores, pos_label=True)\n",
+ "\n",
+ "plt.ylabel('True Positive Rate')\n",
+ "plt.xlabel('False Positive Rate')\n",
+ "plt.plot(fpr, tpr)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "nEo_TM_ClV8F"
+ },
+ "source": [
+ "We can see that there is a steep increase in false positives as the true positive rate crosses into the 90% range."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "u7qc9WwVlkMf"
+ },
+ "source": [
+ "#### Precision Recall Curve"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0OFLWPJ0P5hD"
+ },
+ "source": [
+ "We can also get a feel for how precision and recall relate for this model by plotting the [precision recall curve](https://en.wikipedia.org/wiki/Precision_and_recall)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YyvmOGaHluCl"
+ },
+ "source": [
+ "from sklearn.metrics import precision_recall_curve\n",
+ "\n",
+ "scores = model.decision_function(X_test)\n",
+ "\n",
+ "precision, recall, _ = precision_recall_curve(y_test, scores)\n",
+ "\n",
+ "plt.xlabel('Recall')\n",
+ "plt.ylabel('Precision')\n",
+ "plt.plot(recall, precision)\n",
+ "plt.show()"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "KozxQbBMnGqJ"
+ },
+ "source": [
+ "This shows the trade-off between precision and recall as the classification threshold varies."
+ ]
+ },
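+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the threshold idea concrete, here is a small sketch that applies three different cut-offs to the same decision scores and recomputes precision and recall each time. The scores and labels are invented for illustration, and the names `scores_demo`, `y_true_demo`, and `results_demo` are hypothetical. Raising the threshold makes the classifier more conservative about predicting the positive class, which typically trades recall for precision."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.metrics import precision_score, recall_score\n",
+ "\n",
+ "# Made-up labels and decision scores for illustration only.\n",
+ "y_true_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1])\n",
+ "scores_demo = np.array([-2.0, -1.0, -0.5, 0.2, 0.4, 1.0, 1.5, 2.5])\n",
+ "\n",
+ "results_demo = []\n",
+ "for threshold in [-1.5, 0.0, 1.2]:\n",
+ "    # Predict the positive class whenever the score reaches the threshold.\n",
+ "    y_pred_demo = (scores_demo >= threshold).astype(int)\n",
+ "    p = precision_score(y_true_demo, y_pred_demo)\n",
+ "    r = recall_score(y_true_demo, y_pred_demo)\n",
+ "    results_demo.append((threshold, p, r))\n",
+ "    print(f'threshold={threshold:+.1f}  precision={p:.2f}  recall={r:.2f}')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },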
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "xgGfw7FUnMnL"
+ },
+ "source": [
+ "## Improving Our Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lCXsEo4znO-s"
+ },
+ "source": [
+ "Our initial model was actually pretty good. But can it be even better?\n",
+ " \n",
+ "In the next exercise we'll attempt to improve our model by exploring hyperparameters and manipulating features."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fmOpuYyOooWx"
+ },
+ "source": [
+ "### Exercise 9: Using GridSearchCV\n",
+ "\n",
+ "We will now experiment with different hyperparameters to see if we can tune the model to increase our scores. To do this we will use the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class to [tune hyperparameters](https://scikit-learn.org/stable/modules/grid_search.html) of the scikit-learn [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model.\n",
+ "\n",
+ "GridSearchCV is a class used to test different hyperparameters for a model. The search accepts a dictionary whose keys are the names of model parameters. Each value is a list: several candidates for a hyperparameter that you want to experiment with, or a single-element list for a parameter that you want to keep constant."
+ ]
+ },
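+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The cell below is a minimal sketch of what such a parameter grid can look like. It runs on synthetic data from `make_classification` rather than the citrus dataset, and the names `X_demo`, `y_demo`, `param_grid_demo`, and `search_demo` are hypothetical. The candidate values are arbitrary choices for illustration, not recommended settings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
+ "# Synthetic stand-in data; not the citrus dataset.\n",
+ "X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)\n",
+ "\n",
+ "# Each key names a LogisticRegression parameter; each value lists the candidates.\n",
+ "# A one-element list keeps that parameter constant across the search.\n",
+ "param_grid_demo = {\n",
+ "    'C': [0.01, 0.1, 1.0],\n",
+ "    'solver': ['lbfgs', 'liblinear'],\n",
+ "    'max_iter': [1000],\n",
+ "}\n",
+ "\n",
+ "# 3 x 2 x 1 = 6 combinations, each scored with 5-fold cross-validation.\n",
+ "search_demo = GridSearchCV(LogisticRegression(random_state=0), param_grid_demo, cv=5)\n",
+ "search_demo.fit(X_demo, y_demo)\n",
+ "\n",
+ "print(search_demo.best_params_)\n",
+ "print(round(search_demo.best_score_, 3))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },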
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NXG7dvgHrWmQ"
+ },
+ "source": [
+ "#### Question 1: Performing the Search\n",
+ " \n",
+ "Below is some code that imports the necessary functions and classes and sets up a logistic regression model for grid search. Add code to the grid search to test different hyperparameters such as `tol`, `C`, `solver`, and `max_iter`.\n",
+ " \n",
+ "The best estimator will be displayed after running the code block."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PZunsUXx4ZpL"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "VrFSNQv1onp4"
+ },
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
+ "from sklearn.model_selection import train_test_split, GridSearchCV\n",
+ "from sklearn.svm import SVC\n",
+ "\n",
+ "citrus_df = pd.read_csv('citrus.csv', header=0)\n",
+ "citrus_df['is_orange'] = citrus_df['name'].apply(lambda name: name == 'orange')\n",
+ "\n",
+ "target_column = 'is_orange'\n",
+ "feature_columns = ['diameter', 'weight', 'red', 'green', 'blue']\n",
+ "\n",
+ "X_train, X_validate, y_train, y_validate = train_test_split(\n",
+ " citrus_df[feature_columns],\n",
+ " citrus_df[target_column],\n",
+ " test_size=0.2,\n",
+ " random_state=42,\n",
+ " shuffle=True,\n",
+ " stratify=citrus_df[target_column])\n",
+ "\n",
+ "model = LogisticRegression(\n",
+ " random_state=2020,\n",
+ ")\n",
+ "\n",
+ "param_grid = {'tol': [0.1, 1, 10, 100], \n",
+ " 'C': [1, 0.1, 0.01, 0.001, 0.0001], \n",
+ " 'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n",
+ " 'max_iter': [100]}\n",
+ "\n",
+ "search = GridSearchCV(model, param_grid)\n",
+ "\n",
+ "search.fit(X_train, y_train)\n",
+ "\n",
+ "print(search.best_estimator_)"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QfyXfK6qmFHR"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3FI4zQuar69J"
+ },
+ "source": [
+ "#### Question 2: Validate the Model\n",
+ "\n",
+ "Now that we have found a model that scored the highest in a cross-validation grid search, let's validate the model to see if it generalizes well on our validation data.\n",
+ "\n",
+ "We held out validation data in the `X_validate` and `y_validate` variables. Use that data to calculate the accuracy, precision, recall, and F1 scores for the model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vQ49ehXl4chc"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "4RpgwipWsmB8"
+ },
+ "source": [
+ "# Your Solution Goes Here\n",
+ "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
+ "\n",
+ "\n",
+ "predictions = search.predict(X_validate)\n",
+ "\n",
+ "# scikit-learn metric functions take the true labels first, then the predictions.\n",
+ "print('Accuracy: ', round(accuracy_score(y_validate, predictions), 3))\n",
+ "print('Precision: ', round(precision_score(y_validate, predictions), 3))\n",
+ "print('Recall: ', round(recall_score(y_validate, predictions), 3))\n",
+ "print('F1: ', round(f1_score(y_validate, predictions), 3))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Gu0xUpHimgTs"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Cdl_tvGZsuVJ"
+ },
+ "source": [
+ "#### Question 3: Relative Model Quality\n",
+ "\n",
+ "Now that we have the scores for our model on our validation set, is the version found by grid search notably better? Discuss the difference in scores, if any, between our base model and the model selected by grid search."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YL74G6Jk4d9C"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SNzRzEB8tHaZ"
+ },
+ "source": [
+ "* Base model (not found by grid search): \n",
+ " * Accuracy: 0.953\n",
+ " * Precision: 0.941\n",
+ " * Recall: 0.964\n",
+ " * F1: 0.952\n",
+ "* Model found by grid search: \n",
+ " * Accuracy: 0.957\n",
+ " * Precision: 0.940\n",
+ " * Recall: 0.973\n",
+ " * F1: 0.956\n",
+ "\n",
+ "* Conclusion: The model found by grid search is slightly better. Its accuracy, recall, and F1 scores are higher than those of the base model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DWcPKf6TmobG"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_hphJaGFtvtk"
+ },
+ "source": [
+ "## Exercise 10: Final Model Assessment\n",
+ "\n",
+ "Given our model performance, is this machine learning model a good fit for the problem?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fwVtY7eb4gr8"
+ },
+ "source": [
+ "##### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "vIR8qHyyuVPy"
+ },
+ "source": [
+ "*Response: The model's accuracy is only 0.007 (0.7 percentage points) higher than the accuracy of human sorting. Because training a good model requires a lot of data and is time-consuming, the machine learning model is not a good fit for the problem.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BY3VkqcYm2i3"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QT8bPMU0xBzV"
+ },
+ "source": [
+ "## Challenge"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "8g2r7I4y0sPv"
+ },
+ "source": [
+ "### Question 1\n",
+ "\n",
+ "Normalization and standardization of data is not strictly required for performing logistic regression. It is, however, suggested in some cases. Research reasons why you might want (or not want) to normalize or standardize your input data to a logistic regression.\n",
+ "\n",
+ "Explain your findings and link to any relevant articles."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RKe9gOYY4jQa"
+ },
+ "source": [
+ "#### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tibOqWcM1UIP"
+ },
+ "source": [
+ "*Response: Normalization and standardization are not strictly necessary for logistic regression. However, when there is a large amount of data to process, normalizing or standardizing the features helps the optimization. [Link](https://stats.stackexchange.com/questions/48360/is-standardization-needed-before-fitting-logistic-regression)*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4iKsRd7UnfkH"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "v56Y-Ix93JfE"
+ },
+ "source": [
+ "### Question 2\n",
+ "\n",
+ "Use the [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to scale the feature data before training a logistic model on our oranges dataset. Use [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to iterate through hyperparameters to find an optimal model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Nmv-j6oR4lFK"
+ },
+ "source": [
+ "#### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {
+ "id": "YLTLFHgZyP_r"
+ },
+ "source": [
+ "# Your code goes here\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "from sklearn.model_selection import GridSearchCV\n",
+ "\n",
+ "\n",
+ "citrus_df = pd.read_csv('citrus.csv', header=0)\n",
+ "citrus_df['is_orange'] = citrus_df['name'].apply(lambda name: name == 'orange')\n",
+ "\n",
+ "target_column = 'is_orange'\n",
+ "feature_columns = ['diameter', 'weight', 'red', 'green', 'blue']\n",
+ "\n",
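+ "X_train, X_validate, y_train, y_validate = train_test_split(\n",
+ "    citrus_df[feature_columns],\n",
+ "    citrus_df[target_column],\n",
+ "    test_size=0.2,\n",
+ "    random_state=42,\n",
+ "    shuffle=True,\n",
+ "    stratify=citrus_df[target_column])\n",
+ "\n",
+ "# Fit the scaler on the training split only, then apply it to both splits,\n",
+ "# so that no statistics from the validation data leak into training.\n",
+ "scaler = StandardScaler()\n",
+ "X_train = scaler.fit_transform(X_train)\n",
+ "X_validate = scaler.transform(X_validate)\n",
+ "\n",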
+ "model = LogisticRegression(\n",
+ " random_state=2020,\n",
+ ")\n",
+ "\n",
+ "\n",
+ "param_grid = {'tol': [0.1, 1, 10, 100], \n",
+ " 'C': [1, 0.1, 0.01, 0.001, 0.0001], \n",
+ " 'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n",
+ " 'max_iter': [100]}\n",
+ "\n",
+ "search = GridSearchCV(model, param_grid)\n",
+ "\n",
+ "search.fit(X_train, y_train)\n",
+ "\n",
+ "print(search.best_estimator_)\n",
+ "\n",
+ "predictions = search.predict(X_validate)\n",
+ "\n",
+ "# scikit-learn metric functions take the true labels first, then the predictions.\n",
+ "print('Accuracy: ', round(accuracy_score(y_validate, predictions), 3))\n",
+ "print('Precision: ', round(precision_score(y_validate, predictions), 3))\n",
+ "print('Recall: ', round(recall_score(y_validate, predictions), 3))\n",
+ "print('F1: ', round(f1_score(y_validate, predictions), 3))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
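+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One common way to combine scaling with a grid search is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refit on the training portion of every cross-validation fold. The cell below is a sketch of that pattern under stated assumptions: it uses synthetic data instead of `citrus.csv`, and the step names `scale` and `clf`, along with every `*_demo` variable, are hypothetical. It is an alternative to try, not necessarily the solution this exercise expects."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.model_selection import GridSearchCV, train_test_split\n",
+ "from sklearn.pipeline import Pipeline\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "# Synthetic stand-in data; not the citrus dataset.\n",
+ "X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)\n",
+ "X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(\n",
+ "    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)\n",
+ "\n",
+ "# The scaler is a pipeline step, so it only ever sees training folds during the search.\n",
+ "pipeline_demo = Pipeline([\n",
+ "    ('scale', StandardScaler()),\n",
+ "    ('clf', LogisticRegression(random_state=0, max_iter=1000)),\n",
+ "])\n",
+ "\n",
+ "# Pipeline parameters are addressed as '<step name>__<parameter name>'.\n",
+ "param_grid_demo = {'clf__C': [0.01, 0.1, 1.0, 10.0]}\n",
+ "\n",
+ "search_demo = GridSearchCV(pipeline_demo, param_grid_demo, cv=5)\n",
+ "search_demo.fit(X_tr_demo, y_tr_demo)\n",
+ "\n",
+ "print(search_demo.best_params_)\n",
+ "print('Validation accuracy: ', round(search_demo.score(X_val_demo, y_val_demo), 3))"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },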
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "BnfmK-w6oAii"
+ },
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "WBRhFJkf3x_t"
+ },
+ "source": [
+ "### Question 3\n",
+ "\n",
+ "Are the optimal hyperparameters the same for the logistic regression model before and after scaling the data? Why or why not? Did you notice any other differences?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jS4EWkOS3-MH"
+ },
+ "source": [
+ "#### **Student Solution**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XBy3g7pz4win"
+ },
+ "source": [
+ "*Response: The optimal hyperparameters are the same for the logistic regression model before and after scaling the data; however, the accuracy, precision, recall, and F1 scores all decreased.*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "YtGj2vIKoK3b"
+ },
+ "source": [
+ "---"
+ ]
+ }
+ ]
+}
\ No newline at end of file