Commit 5f79dd5

Merge pull request #1002 from microsoft/copilot/add-data-visualization-notebook
[Regression] Add Seaborn data visualization examples to the Data lesson
2 parents a4336c4 + 83e22ba commit 5f79dd5

6 files changed

Lines changed: 139 additions & 3 deletions

2-Regression/2-Data/README.md

Lines changed: 76 additions & 1 deletion
@@ -16,6 +16,7 @@ In this lesson, you will learn:
 
 - How to prepare your data for model-building.
 - How to use Matplotlib for data visualization.
+- How to use Seaborn for more expressive data visualization.
 
 ## Asking the right question of your data
 
@@ -194,11 +195,85 @@ To get charts to display useful data, you usually need to group the data somehow
 
 This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
 
+## Exercise - experiment with Seaborn
+
+Matplotlib is powerful, but it can take a lot of code to produce a polished chart. [Seaborn](https://seaborn.pydata.org/) is a library built _on top of_ Matplotlib and designed for statistical data visualization. It works directly with Pandas dataframes, applies attractive default styles, and lets you create informative plots with far less code. Because Seaborn draws its charts on Matplotlib figures and axes, you can still use everything you already know about Matplotlib to fine-tune the result.
+
+> If you don't already have Seaborn installed, install it with `pip install seaborn`.
+
+1. Import Seaborn at the top of the notebook, under the other imports. It is conventionally imported as `sns`:
+
+```python
+import seaborn as sns
+```
+
+### Scatter plots to show relationships
+
+A big part of exploring data before building a model is looking for _relationships_ between variables. A [scatter plot](https://en.wikipedia.org/wiki/Scatter_plot) is one of the best tools for this: if the points seem to follow a line, the two variables may be correlated, which is a good sign that a linear regression model could work.
+
+1. Recreate the price-to-month scatter plot from before, this time using Seaborn's [`relplot()`](https://seaborn.pydata.org/generated/seaborn.relplot.html) (relational plot), which works directly with your dataframe columns:
+
+```python
+sns.relplot(x="Price", y="Month", data=new_pumpkins)
+```
+
+![A Seaborn scatterplot showing price to month relationship](./images/relplot.png)
+
+Notice how you pass the _column names_ and the dataframe, and Seaborn takes care of the axis labels for you.
+
+2. You can switch to a line plot by passing `kind="line"`. Where several rows share the same x value, Seaborn plots their mean and draws a shaded band showing the 95% confidence interval around it:
+
+```python
+sns.relplot(x="Price", y="Month", kind="line", data=new_pumpkins)
+```
+
+![A Seaborn line plot showing price to month relationship](./images/lineplot.png)
+
+This particular data is quite noisy, so a line plot isn't the clearest choice here, but it shows how easily you can change chart types in Seaborn.
+
+### Bar charts to show distributions
+
+Earlier you grouped the data by hand to create a bar chart with Matplotlib. Seaborn's [`catplot()`](https://seaborn.pydata.org/generated/seaborn.catplot.html) (categorical plot) can do the grouping and aggregation for you. With `kind="bar"`, it shows the mean of each category by default, along with a black line (an error bar) indicating the 95% confidence interval.
+
+1. Create a bar chart of average price per month:
+
+```python
+sns.catplot(x="Month", y="Price", data=new_pumpkins, kind="bar")
+```
+
+![A Seaborn bar chart showing the price distribution per month](./images/catplot.png)
+
+This confirms what you saw with Matplotlib (prices peak around September and October), but Seaborn's error bars also show how _uncertain_ each monthly average is.
+
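The aggregation that `catplot(kind="bar")` performs can be approximated by hand with Pandas. The sketch below uses a small made-up dataframe (an assumption for illustration) and a normal-approximation 95% interval. Seaborn's default interval is bootstrapped, so its error bars will differ slightly from these numbers.

```python
import pandas as pd

# Hypothetical toy data standing in for new_pumpkins.
toy = pd.DataFrame({
    "Month": [9, 9, 9, 9, 10, 10, 10, 10],
    "Price": [14.0, 16.0, 18.0, 20.0, 10.0, 12.0, 14.0, 16.0],
})

# Bar height: the mean price of each month.
# "sem" is the standard error of that mean.
summary = toy.groupby("Month")["Price"].agg(["mean", "sem"])

# Rough 95% interval around each mean (mean +/- 1.96 standard errors).
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```

The interval narrows as a month gains more rows, which is why it describes confidence in the average rather than the spread of individual prices.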
+### Heatmaps to show correlations
+
+Scatter plots compare two variables at a time. When you have several numeric columns, a [heatmap](https://en.wikipedia.org/wiki/Heat_map) lets you view the strength of the relationship between _every_ pair of columns at once. This is a common way to spot which features are most correlated before choosing what to feed into a model (and the same kind of chart is later used to display confusion matrices in classification).
+
+1. Build a correlation matrix with Pandas, then draw it with Seaborn's [`heatmap()`](https://seaborn.pydata.org/generated/seaborn.heatmap.html). The `annot=True` option prints the correlation value in each cell:
+
+```python
+correlations = new_pumpkins[['Month', 'Low Price', 'High Price', 'Price']].corr()
+sns.heatmap(correlations, annot=True, cmap="coolwarm")
+```
+
+![A Seaborn heatmap showing correlations between the numeric columns](./images/heatmap.png)
+
+Values close to `1` (or `-1`) mean the columns are strongly _linearly_ correlated. Notice how `Low Price` and `High Price` are almost perfectly correlated. `Month`, on the other hand, shows only a weak linear correlation with price, even though the bar chart above revealed a clear seasonal peak in September and October. That's an important lesson: the correlation coefficient only measures _straight-line_ relationships, so it can miss seasonal or otherwise non-linear patterns. ✅ Why is it useful to look at both a heatmap *and* charts like the bar chart before deciding which columns to use?
+
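A small made-up example (the numbers are assumptions, not taken from the pumpkin data) shows how a clear seasonal peak can produce a correlation of zero, while a grouped mean still reveals it:

```python
import pandas as pd

# Hypothetical toy data: price rises to a peak in month 9, then falls symmetrically.
toy = pd.DataFrame({
    "Month": [7, 8, 9, 10, 11],
    "Low Price": [10.0, 20.0, 30.0, 20.0, 10.0],
})
toy["High Price"] = toy["Low Price"] + 5.0  # perfectly linear in Low Price

correlations = toy.corr()
month_vs_price = correlations.loc["Month", "Low Price"]
low_vs_high = correlations.loc["Low Price", "High Price"]

# The grouped mean exposes the peak that the correlation coefficient misses.
peak_month = toy.groupby("Month")["Low Price"].mean().idxmax()
print(round(month_vs_price, 6), round(low_vs_high, 6), peak_month)
```

Here the rise before the peak and the fall after it cancel out in the coefficient, so the month-to-price correlation vanishes even though price clearly depends on the month.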
+### Matplotlib or Seaborn?
+
+Both libraries are worth knowing:
+
+- **Matplotlib** gives you fine-grained control over every element of a chart and is the foundation that many other Python plotting libraries, including Seaborn, build on.
+- **Seaborn** provides higher-level functions and attractive defaults for statistical charts, works directly with dataframes, and is often quicker for exploratory data analysis.
+
+A common workflow is to reach for Seaborn to explore your data quickly, then drop down to Matplotlib when you need to customize the details.
+
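As a sketch of that workflow: `relplot()` returns a Seaborn `FacetGrid` whose `ax` attribute is an ordinary Matplotlib `Axes`, so the usual Matplotlib calls apply. The dataframe and the label text below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for new_pumpkins.
toy = pd.DataFrame({
    "Month": [8, 9, 9, 10, 10, 11],
    "Price": [12.0, 15.0, 17.0, 16.0, 18.0, 13.0],
})

# Explore quickly with Seaborn...
grid = sns.relplot(x="Month", y="Price", data=toy)

# ...then fine-tune with plain Matplotlib calls on the underlying Axes.
ax = grid.ax
ax.set_title("Price by month (toy data)")
ax.set_xlabel("Month of year")
ax.set_xticks([8, 9, 10, 11])
ax.grid(True, linestyle=":")
```

In a notebook the customized chart displays as usual. The `Agg` backend line is only there so the sketch also runs as a plain script.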
 ---
 
 ## 🚀Challenge
 
-Explore the different types of visualization that Matplotlib offers. Which types are most appropriate for regression problems?
+Explore the different types of visualization that Matplotlib and Seaborn offer. Which types are most appropriate for regression problems?
 
 ## [Post-lecture quiz](https://ff-quizzes.netlify.app/en/ml/)
 
Four binary files changed (7.92 KB, 20.6 KB, 37.5 KB, 16 KB); contents not shown in the diff.

2-Regression/2-Data/solution/notebook.ipynb

Lines changed: 63 additions & 2 deletions
@@ -179,7 +179,7 @@
 " </tr>\n",
 " </tbody>\n",
 "</table>\n",
-"<p>5 rows × 26 columns</p>\n",
+"<p>5 rows \u00d7 26 columns</p>\n",
 "</div>"
 ],
 "text/plain": [
@@ -222,6 +222,7 @@
 "source": [
 "import pandas as pd\n",
 "import matplotlib.pyplot as plt\n",
+"import seaborn as sns\n",
 "pumpkins = pd.read_csv('../../data/US-pumpkins.csv')\n",
 "\n",
 "pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]\n",
@@ -385,12 +386,72 @@
 "plt.ylabel(\"Pumpkin Price\")"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Visualizing with Seaborn\n",
+"\n",
+"[Seaborn](https://seaborn.pydata.org/) is built on top of Matplotlib and works directly with dataframes, making it quick to create attractive statistical plots with very little code."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Scatter plots to show relationships"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
-"source": []
+"source": [
+"sns.relplot(x=\"Price\", y=\"Month\", data=new_pumpkins)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"sns.relplot(x=\"Price\", y=\"Month\", kind=\"line\", data=new_pumpkins)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Bar charts to show distributions"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"sns.catplot(x=\"Month\", y=\"Price\", data=new_pumpkins, kind=\"bar\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Heatmaps to show correlations"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"correlations = new_pumpkins[['Month', 'Low Price', 'High Price', 'Price']].corr()\n",
+"sns.heatmap(correlations, annot=True, cmap=\"coolwarm\")"
+]
 }
 ],
 "metadata": {
