Data Visualization - 1

After finishing data pivoting, we can present the results visually. Put simply, we turn data into clear statistical charts, because people are more sensitive to color and shape than to raw numbers, and then interpret the business value hidden behind the data. Earlier lessons showed how to use the plot method of Series and DataFrame objects to generate charts. This chapter explains the foundation of that plotting method: the matplotlib library.

Before talking about matplotlib, please first look at the chart below. It shows common chart types and their usage scenarios. When we do not know which chart is the best choice, this picture can help a lot. Simply speaking: use a line chart to see trends, a bar chart to compare data, a scatter plot to determine relationships, a pie chart to check proportions, a histogram to see distribution, and a box plot to find outliers.

Importing and Configuration

In previous lessons, we explained how to install and import the matplotlib library. If you are not sure whether matplotlib is already installed, you can try the following magic command to install or upgrade it.

%pip install -U matplotlib

To solve the problem of displaying Chinese in matplotlib charts, we need to modify the rcParams configuration of the pyplot module.

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'].insert(0, 'SimHei')
plt.rcParams['axes.unicode_minus'] = False

Note: SimHei in the code above is a font name. You can also try other Chinese fonts. After installing a font, if you do not know its name, you can look at the font list file (for example fontlist-v330.json; the version suffix depends on your matplotlib release) in matplotlib's cache directory, which is typically the .matplotlib folder in the user home directory. Also note that after switching to a Chinese font, the minus sign on the axes may not display correctly, so we set axes.unicode_minus to False.
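If reading the cache file is inconvenient, the installed font names can also be listed from Python. The sketch below assumes matplotlib's font_manager module and its fontManager.ttflist attribute, which are not used elsewhere in this lesson. The SimHei check is only an illustration and will print False on systems without that font.

```python
from matplotlib import font_manager

# Collect the family names of all fonts matplotlib has discovered.
font_names = sorted({font.name for font in font_manager.fontManager.ttflist})

print(len(font_names), 'fonts found')
print('SimHei available:', 'SimHei' in font_names)
```

Any name that appears in font_names can be inserted into plt.rcParams['font.sans-serif'] in the same way as SimHei.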

With the following magic command, we can generate vector graphics (SVG - Scalable Vector Graphics) while plotting. The advantage of vector graphics is that they do not become distorted when zoomed, shrunk, or rotated.

%config InlineBackend.figure_format='svg'

Creating a Figure

The figure function in pyplot creates a figure. The figsize parameter specifies the figure size in inches, with a default of [6.4, 4.8]; the dpi parameter sets the drawing resolution in dots (pixels) per inch, so the figure's pixel size is figsize multiplied by dpi. The facecolor parameter sets the background color of the figure. The return value of figure is a Figure object, which represents the canvas used for plotting. On that figure, we can create axes for drawing.

plt.figure(figsize=(8, 4), dpi=120, facecolor='darkgray')
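As a quick check of how figsize and dpi combine, the sketch below works out the pixel dimensions for the call above. The variable names are chosen for illustration only.

```python
import matplotlib.pyplot as plt

figsize = (8, 4)   # width and height in inches
dpi = 120          # dots (pixels) per inch

# Pixel dimensions are simply inches multiplied by dpi.
width_px = int(figsize[0] * dpi)
height_px = int(figsize[1] * dpi)
print(width_px, height_px)  # 960 480

# figure returns the Figure object, so it can be kept for later use.
fig = plt.figure(figsize=figsize, dpi=dpi, facecolor='darkgray')
```

Doubling dpi while keeping figsize fixed gives an image with twice as many pixels along each side, while text and lines keep their size relative to the figure.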

Creating Axes

We can directly use the subplot function of pyplot to create axes. This function returns an Axes object. The first three parameters of subplot specify how many rows and columns the whole figure is split into and the index of the current axes. Their default values are all 1. If we do not explicitly create axes, plotting will use the default and only axes on the figure. If we need multiple axes on one figure, then we can use this function. Of course, we can also create axes through the add_subplot or add_axes methods of the Figure object. The former works like subplot, while the latter can produce nested axes.

plt.subplot(2, 2, 1)
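The add_subplot method mentioned above takes the same three arguments as subplot. A minimal sketch, assuming a 2 x 2 grid in which only the first and last cells are used:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4), dpi=120)

# Cells are numbered row by row starting at 1,
# so index 1 is the top-left cell and index 4 the bottom-right one.
ax_top_left = fig.add_subplot(2, 2, 1)
ax_bottom_right = fig.add_subplot(2, 2, 4)

ax_top_left.set_title('index 1')
ax_bottom_right.set_title('index 4')

print(len(fig.axes))  # 2
plt.show()
```

Cells that are never requested simply stay empty on the figure.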

Drawing Charts

Line charts

If we do not call figure or subplot first, plotting uses the default figure and axes. To draw a line chart, we can use the plot function of pyplot and specify data for the horizontal and vertical axes. Line charts are most suitable for observing trends, especially when the horizontal axis represents time. We can use the color parameter to customize the line color, marker to customize point markers, linestyle to customize the line style, and linewidth to customize line thickness.
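Of the parameters just listed, linestyle is the only one the later examples in this section do not use. The following sketch draws the same curve four times with the four basic style strings; the vertical offsets are there only to keep the lines apart.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

plt.figure(figsize=(8, 4), dpi=120)

# Solid, dashed, dash-dot, and dotted lines.
styles = ['-', '--', '-.', ':']
lines = []
for offset, style in enumerate(styles):
    line, = plt.plot(x, np.sin(x) + offset, linestyle=style, linewidth=2, color='red')
    lines.append(line)

plt.show()
```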

The code below draws a sine curve. Here marker='*' sets the point marker to a star shape, and color='red' draws the line in red.

import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 120)
y = np.sin(x)

plt.figure(figsize=(8, 4), dpi=120)
plt.plot(x, y, linewidth=2, marker='*', color='red')
plt.show()

Output:

If we want to draw both sine and cosine curves on the same axes, we can slightly modify the code above.

x = np.linspace(-2 * np.pi, 2 * np.pi, 120)
y1, y2 = np.sin(x), np.cos(x)

plt.figure(figsize=(8, 4), dpi=120)
plt.plot(x, y1, linewidth=2, marker='*', color='red')
plt.plot(x, y2, linewidth=2, marker='^', color='blue')
plt.annotate('sin(x)', xytext=(0.5, -0.75), xy=(0, -0.25), fontsize=12, arrowprops={
    'arrowstyle': '->', 'color': 'darkgreen', 'connectionstyle': 'angle3, angleA=90, angleB=0'
})
plt.annotate('cos(x)', xytext=(-3, 0.75), xy=(-1.25, 0.5), fontsize=12, arrowprops={
    'arrowstyle': '->', 'color': 'darkgreen', 'connectionstyle': 'arc3, rad=0.35'
})
plt.show()

Output:

If we want to use two axes to draw sine and cosine separately, we can use the subplot function.

plt.figure(figsize=(8, 4), dpi=120)
plt.subplot(2, 1, 1)
plt.plot(x, y1, linewidth=2, marker='*', color='red')
plt.subplot(2, 1, 2)
plt.plot(x, y2, linewidth=2, marker='^', color='blue')
plt.show()

Output:

Of course, we can also do it like this:

plt.figure(figsize=(8, 4), dpi=120)
plt.subplot(1, 2, 1)
plt.plot(x, y1, linewidth=2, marker='*', color='red')
plt.subplot(1, 2, 2)
plt.plot(x, y2, linewidth=2, marker='^', color='blue')
plt.show()
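An alternative that this lesson does not otherwise cover is the subplots function of pyplot, which creates the figure and all of its axes in one call. The sketch below reproduces the side-by-side layout under that assumption.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 120)

# One row, two columns; axes is a NumPy array holding both Axes objects.
fig, axes = plt.subplots(1, 2, figsize=(8, 4), dpi=120)
axes[0].plot(x, np.sin(x), linewidth=2, marker='*', color='red')
axes[1].plot(x, np.cos(x), linewidth=2, marker='^', color='blue')
plt.show()
```

Working with the returned Axes objects avoids relying on which axes is "current", which can be easier to follow once a figure has several panels.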

Then try the following code and see how it works.

fig = plt.figure(figsize=(10, 4), dpi=120)
plt.plot(x, y1, linewidth=2, marker='*', color='red')
ax = fig.add_axes((0.595, 0.6, 0.3, 0.25))
ax.plot(x, y2, marker='^', color='blue')
ax = fig.add_axes((0.155, 0.2, 0.3, 0.25))
ax.plot(x, y2, marker='^', color='green')
plt.show()

Note: The four-tuple passed to add_axes gives the position of the new axes within the figure, with every value expressed as a fraction of the figure's width or height. The first two values are the position of the lower-left corner, and the last two values are the width and height of the axes.
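To see that the values passed to add_axes are fractions of the figure size, the sketch below reads the position back and converts it to inches. It assumes the get_position method of the Axes object, which is not used elsewhere in this lesson.

```python
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(10, 4), dpi=120)

# (left, bottom, width, height), each as a fraction of the figure size.
inset = fig.add_axes((0.595, 0.6, 0.3, 0.25))

bounds = inset.get_position().bounds
print(np.round(bounds, 3))

# On a 10 x 4 inch figure, 0.3 of the width is 3 inches
# and 0.25 of the height is 1 inch.
width_in = bounds[2] * fig.get_figwidth()
height_in = bounds[3] * fig.get_figheight()
print(round(width_in, 2), round(height_in, 2))
```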

Scatter plots

A scatter plot can help us understand the relationship between two variables. If we want to understand the relationship between three variables, we can upgrade the scatter plot into a bubble chart. In the code below, arrays x and y represent monthly income and monthly online-shopping expense. If we want to know whether there is a correlation between them, we can draw a scatter plot like this.

x = np.array([5550, 7500, 10500, 15000, 20000, 25000, 30000, 40000])
y = np.array([800, 1800, 1250, 2000, 1800, 2100, 2500, 3500])

plt.figure(figsize=(6, 4), dpi=120)
plt.scatter(x, y)
plt.show()

Output:
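The bubble chart mentioned above can be sketched by mapping a third variable to the marker size through the s parameter of scatter. The orders array below is made up purely for illustration, and the correlation coefficient from np.corrcoef is added as a numeric companion to the visual impression.

```python
import matplotlib.pyplot as plt
import numpy as np

income = np.array([5550, 7500, 10500, 15000, 20000, 25000, 30000, 40000])
spending = np.array([800, 1800, 1250, 2000, 1800, 2100, 2500, 3500])
# Hypothetical third variable: number of online orders per month.
orders = np.array([3, 8, 5, 9, 7, 10, 12, 15])

# Pearson correlation coefficient between income and spending.
r = np.corrcoef(income, spending)[0, 1]

plt.figure(figsize=(6, 4), dpi=120)
# s sets the marker area, so scaling orders turns the points into bubbles.
plt.scatter(income, spending, s=orders * 30, alpha=0.5)
plt.title(f'r = {r:.2f}')
plt.show()
```

A coefficient close to 1 supports what the scatter plot suggests: spending tends to rise with income.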

Bar charts

When comparing differences in data, bar charts are a very good choice. We can use the bar function in pyplot to generate a bar chart, and the barh function to generate a horizontal bar chart. First let us prepare some data.

x = np.arange(4)
y1 = np.random.randint(20, 50, 4)
y2 = np.random.randint(10, 60, 4)

The code for drawing a bar chart is:

plt.figure(figsize=(6, 4), dpi=120)
plt.bar(x - 0.1, y1, width=0.2, label='Sales Group A')
plt.bar(x + 0.1, y2, width=0.2, label='Sales Group B')
plt.xticks(x, labels=['Q1', 'Q2', 'Q3', 'Q4'])
plt.legend()
plt.show()

Output:

If we want to draw a stacked bar chart, we can slightly modify the code above.

labels = ['Q1', 'Q2', 'Q3', 'Q4']
plt.figure(figsize=(6, 4), dpi=120)
plt.bar(labels, y1, width=0.4, label='Sales Group A')
plt.bar(labels, y2, width=0.4, bottom=y1, label='Sales Group B')
plt.legend(loc='lower right')
plt.show()

Output:
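The barh function mentioned at the start of this section works the same way with the roles of the two axes swapped. A minimal sketch with made-up sales values:

```python
import matplotlib.pyplot as plt
import numpy as np

quarters = ['Q1', 'Q2', 'Q3', 'Q4']
sales = np.array([32, 45, 28, 39])  # made-up values for illustration

plt.figure(figsize=(6, 4), dpi=120)
# barh takes the categories first and the bar lengths second;
# height plays the role that width plays in bar.
bars = plt.barh(quarters, sales, height=0.4, label='Sales Group A')
plt.legend()
plt.show()
```

Horizontal bars are often easier to read when the category labels are long.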

Pie charts

A pie chart is a statistical chart that divides data into several sectors. It is mainly used to describe the relative sizes of quantities or frequencies: the size of each sector represents the proportion of the whole taken up by the quantity it stands for. When we need to show composition, pie charts, treemaps, and waterfall charts are all reasonable choices. We can use the pie function in pyplot to draw a pie chart.

data = np.random.randint(100, 500, 7)
labels = ['Apple', 'Banana', 'Peach', 'Lychee', 'Pomegranate', 'Mangosteen', 'Durian']

plt.figure(figsize=(5, 5), dpi=120)
plt.pie(
    data,
    autopct='%.1f%%',
    radius=1,
    pctdistance=0.8,
    colors=np.random.rand(7, 3),
    textprops=dict(fontsize=8, color='black'),
    wedgeprops=dict(linewidth=1, width=0.35),
    labels=labels
)
plt.title('Share of Fruit Sales')
plt.show()

Output:

Note: Try removing or changing individual parameters in the code above, such as wedgeprops, pctdistance, or textprops, and see what effect each one has on the chart.
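The percentages that autopct prints are each value's share of the total. The sketch below computes those shares by hand for a small made-up data set and also keeps the objects that pie returns.

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.array([120, 80, 200, 100])  # made-up sales figures
labels = ['Apple', 'Banana', 'Peach', 'Lychee']

# Each wedge's share is its value divided by the total.
shares = values / values.sum() * 100
print(shares)  # [24. 16. 40. 20.]

plt.figure(figsize=(5, 5), dpi=120)
# With autopct set, pie returns the wedges, the label texts,
# and the percentage texts.
wedges, texts, autotexts = plt.pie(values, labels=labels, autopct='%.1f%%')
plt.show()
```

Keeping autotexts makes it possible to restyle the percentage labels individually after the chart has been drawn.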

Histograms

In statistics, a histogram is a graph used to show the distribution of data. The data below is the height of 100 male students in a school. If we want to know the distribution, we can use a histogram.

heights = np.array([
    170, 163, 174, 164, 159, 168, 165, 171, 171, 167,
    165, 161, 175, 170, 174, 170, 174, 170, 173, 173,
    167, 169, 173, 153, 165, 169, 158, 166, 164, 173,
    162, 171, 173, 171, 165, 152, 163, 170, 171, 163,
    165, 166, 155, 155, 171, 161, 167, 172, 164, 155,
    168, 171, 173, 169, 165, 162, 168, 177, 174, 178,
    161, 180, 155, 155, 166, 175, 159, 169, 165, 174,
    175, 160, 152, 168, 164, 175, 168, 183, 166, 166,
    182, 174, 167, 168, 176, 170, 169, 173, 177, 168,
    172, 159, 173, 185, 161, 170, 170, 184, 171, 172
])

We can use the hist function of pyplot to draw the histogram. The bins parameter represents the binning scheme we use.

plt.figure(figsize=(6, 4), dpi=120)
plt.hist(heights, bins=np.arange(145, 196, 5), color='darkcyan')
plt.xlabel('Height')
plt.ylabel('Frequency')
plt.show()

Output:
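When bins is given as a sequence, its values are the bin edges. The sketch below uses np.histogram on a small made-up sample to show how the values fall into the bins; it assumes nothing beyond NumPy.

```python
import numpy as np

sample = np.array([152, 158, 161, 163, 165, 167, 168, 170, 171, 174, 177, 183])

# Bin edges 150, 160, 170, 180, 190 define four bins.
edges = np.arange(150, 191, 10)

# Every bin is half-open [left, right) except the last,
# which also includes its right edge.
counts, _ = np.histogram(sample, bins=edges)
print(counts)  # [2 5 4 1]
```

Note that 170 is counted in the third bin rather than the second, because each bin includes its left edge but not its right one.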

When drawing a histogram, setting the density parameter of hist to True makes the vertical axis show probability density instead of counts. If we also set cumulative=True, the chart shows the cumulative distribution: each bar gives the proportion of values at or below its bin, so the last bar reaches 1.

plt.figure(figsize=(6, 4), dpi=120)
plt.hist(heights, bins=np.arange(145, 196, 5), color='darkcyan', density=True, cumulative=True)
plt.xlabel('Height')
plt.ylabel('Probability')
plt.show()

Output:
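The bar heights of a cumulative, normalized histogram can be reproduced as a running total of the bin counts divided by the sample size. The sketch below checks this on a small made-up sample.

```python
import matplotlib.pyplot as plt
import numpy as np

sample = np.array([152, 158, 161, 163, 165, 167, 168, 170, 171, 174, 177, 183])
edges = np.arange(150, 191, 10)

counts, _ = np.histogram(sample, bins=edges)
# Running total of the counts divided by the sample size.
cumulative = np.cumsum(counts) / counts.sum()

plt.figure(figsize=(6, 4), dpi=120)
# hist returns the bar heights, the bin edges, and the drawn patches.
bar_heights, _, _ = plt.hist(sample, bins=edges, density=True,
                             cumulative=True, color='darkcyan')
plt.show()

print(np.allclose(bar_heights, cumulative))
```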

Box plots

A box plot is a statistical chart used to show the spread of a set of data. The upper edge of the box is the upper quartile, the lower edge is the lower quartile, the line in the middle is the median, and the height of the box is the interquartile range (IQR). With whis=1.5, each whisker extends to the most extreme data point that lies within 1.5 times the IQR of the box edge, and points beyond the whiskers are drawn as outliers.

We can use the boxplot function of pyplot to draw a box plot.

data = np.random.randint(0, 100, 47)
data = np.append(data, 160)
data = np.append(data, 200)
data = np.append(data, -50)

plt.figure(figsize=(6, 4), dpi=120)
plt.boxplot(data, whis=1.5, showmeans=True, notch=True)
plt.ylim([-100, 250])
plt.xticks([1], labels=['data'])
plt.show()

Output:

Note: Because the data is generated randomly, the chart you get when you run the code may differ from the one shown here. Rely on the result of your own run.
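To make the outlier rule concrete, the sketch below computes the quartiles and the 1.5 x IQR limits by hand for a small fixed data set and compares the result with the points that boxplot marks as outliers. The data and variable names are chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 200])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values beyond the limits are the ones drawn as separate points.
outliers = data[(data < lower_limit) | (data > upper_limit)]
print(q1, q3, outliers)  # 32.5 77.5 [200]

plt.figure(figsize=(6, 4), dpi=120)
# boxplot returns a dict of the drawn artists; 'fliers' holds the outlier points.
result = plt.boxplot(data, whis=1.5)
plt.show()
```

Here the upper whisker stops at 90, the largest value that is still within the upper limit of 145.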

Displaying and Saving Charts

We can use the show function of pyplot to display a chart, as we already did above. To save a chart, we can use the savefig function. If we want to both save and display a chart, we should call savefig first and then show, because once show returns, the figure may already have been released, and savefig would then write a blank image.

plt.savefig('chart.png')
plt.show()
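A slightly fuller sketch of the save-then-show order follows. The dpi and bbox_inches arguments of savefig are optional extras not discussed in this lesson, and the temporary-directory path is only an assumption to keep the example from writing into the working directory.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

plt.figure(figsize=(6, 4), dpi=120)
plt.plot(x, np.sin(x))

# Save first; the file format is inferred from the extension.
out_path = os.path.join(tempfile.gettempdir(), 'chart.png')
plt.savefig(out_path, dpi=200, bbox_inches='tight')

# Display afterwards.
plt.show()

print(os.path.getsize(out_path) > 0)
```

Changing the extension to .svg or .pdf saves the same figure as a vector graphic instead.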

Other Charts

With matplotlib, we can also draw other statistical charts, such as radar charts, rose charts, and heat maps. But in actual work, the chart types we use most often have already been fully shown above. In addition, matplotlib has many details for deeply customizing charts, such as customizing axes, text, and labels. If you want to know more about drawing and customizing charts with matplotlib, you can directly read the documentation and examples on the official matplotlib website. In the next lesson, we will briefly introduce some of those higher-level charts.