Data Visualization - 3

In the previous lessons, we gained a first understanding of the data-visualization tool matplotlib. You may have noticed that although matplotlib is powerful, it exposes a great many parameters, and deeply customizing a chart means changing a long list of them, which is not very friendly to beginners. In addition, charts drawn with matplotlib are static, which makes them a poor fit for situations that call for interaction. To address these two problems, we introduce two more visualization tools here: seaborn and pyecharts.

Seaborn

Seaborn is a data-visualization tool built on top of matplotlib. It can be understood as a higher-level wrapper over matplotlib. Seaborn also integrates very well with pandas, so we can build better statistical charts with less code and use them to explore and understand data.

Seaborn includes, but is not limited to, the following capabilities:

  1. A dataset-oriented API for examining relationships among multiple variables.
  2. Support for using categorical variables to show observations or summary statistics.
  3. Visualization of univariate or bivariate distributions and comparison across subsets of data.
  4. Automatic estimation and plotting of linear regression models.
  5. Built-in palettes and themes for easily customizing the visual effect of statistical charts.

We can use pip to install seaborn.

pip install seaborn

In Jupyter, we can also directly use a magic command:

%pip install seaborn

Below, we use a built-in seaborn dataset to briefly show its usage and strengths. Readers who want to study seaborn in more depth can read the official tutorial and the official examples. Working from the official examples is a good approach: in most cases you can keep the official code and simply replace the data with your own.

The figure below shows the kinds of chart functions seaborn provides. We can see that these functions mainly help us explore relationships, distributions, and categories through charts.

When using seaborn, we first need to import the library and set a theme.

import seaborn as sns

sns.set_theme()

If we need Chinese text to display on the chart, we also need to modify the matplotlib configuration as we did before.

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'].insert(0, 'SimHei')
plt.rcParams['axes.unicode_minus'] = False

Note: The code above must come after the call to set_theme; otherwise set_theme will overwrite the matplotlib font configuration.

Load the official Tips dataset.

tips_df = sns.load_dataset('tips')
tips_df.info()

The output is shown below. Here total_bill is the total bill amount, tip is the tip amount, sex is the customer's gender, smoker indicates whether the customer smokes, day is the day of the week, time indicates lunch or dinner, and size is the number of diners.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB

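Since the dataset is downloaded over the network, the code above may fail with an SSL certificate error. In that case, we can run the following code first and then load the dataset again. Be aware that this workaround turns off HTTPS certificate verification for the current Python process, so it is best limited to local experiments.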

import ssl

ssl._create_default_https_context = ssl._create_unverified_context
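As an alternative to disabling certificate verification, a locally saved copy of the data could be read with pandas. The following is only a minimal sketch under assumptions: the in-memory CSV text stands in for a local file (with a real file you would pass its path to `read_csv`), and the three rows are sample values for illustration.

```python
import io

import pandas as pd

# A few sample rows standing in for a locally saved copy of the Tips data
# (assumption: a real copy would live in a CSV file read by its path).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

local_df = pd.read_csv(io.StringIO(csv_text))

# The info() output above lists these four columns as category; mirror that here.
for column in ['sex', 'smoker', 'day', 'time']:
    local_df[column] = local_df[column].astype('category')

print(local_df.dtypes)
```

Converting the text columns to category keeps the local copy consistent with the column types listed in the info() output above.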

If we want to understand the distribution of bill amount, we can use the following code to draw a distribution chart.

sns.histplot(data=tips_df, x='total_bill', kde=True)

If we want to understand pairwise relationships among variables, we can draw a pair plot.

sns.pairplot(data=tips_df, hue='sex')

If you are not satisfied with the chart colors above, you can also use the palette parameter to choose a built-in seaborn palette.

The figure below shows some built-in seaborn palettes.

sns.pairplot(data=tips_df, hue='sex', palette='Dark2')

Next, let us draw a joint distribution chart for total_bill and tip.

sns.jointplot(data=tips_df, x='total_bill', y='tip', hue='sex')

The chart clearly shows a positive correlation between total_bill and tip, which we can also confirm numerically with the corr method of a DataFrame. The next step would be to fit a regression model to these data points, and seaborn's regression plot does exactly that for us.

sns.lmplot(data=tips_df, x='total_bill', y='tip', hue='sex')
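The correlation and the fitted line can also be checked numerically. The sketch below uses a small made-up sample rather than the real Tips data, and it assumes an ordinary least-squares straight line, which is what a default linear regression plot draws; `sample_df` and its values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up sample standing in for tips_df (values are illustrative only).
sample_df = pd.DataFrame({
    'total_bill': [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    'tip': [1.5, 2.0, 3.0, 3.5, 4.5, 6.0],
})

# Pearson correlation coefficient between the two columns.
r = sample_df['total_bill'].corr(sample_df['tip'])

# Least-squares line: tip = slope * total_bill + intercept.
slope, intercept = np.polyfit(sample_df['total_bill'], sample_df['tip'], deg=1)

print(f'r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}')
```

With the real dataset, the same two calls can be applied to tips_df['total_bill'] and tips_df['tip'].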

If we want to understand the central tendency and dispersion of bill amount, we can draw a box plot or violin plot. Here we display the data separately by Thursday, Friday, Saturday, and Sunday.

sns.boxplot(data=tips_df, x='day', y='total_bill')

sns.violinplot(data=tips_df, x='day', y='total_bill')

Note: Unlike a box plot, a violin plot does not mark outliers individually; it shows the full range of the data instead. In exchange, the violin plot conveys the shape of the distribution very well.
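To make the box plot's notion of an outlier concrete, here is a rough sketch of the usual rule: points farther than 1.5 times the interquartile range from the box are drawn individually. The numbers are made up, and the 1.5 multiplier is assumed to match the plotting default.

```python
import numpy as np

# Made-up bill amounts for one day; 48.0 is deliberately far from the rest.
bills = np.array([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 48.0])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)].tolist()

print(q1, median, q3, outliers)
```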

Pyecharts

ECharts was originally a frontend chart library developed by Baidu. In 2018, ECharts entered the Apache Incubator, and it is now a top-level project of the Apache Software Foundation. Thanks to its strong interactivity and polished chart design, ECharts has won recognition from many developers. pyecharts is a Python wrapper around ECharts, so Python developers can also use ECharts to draw attractive, highly interactive statistical charts.

We can install pyecharts with pip.

pip install pyecharts

In JupyterLab, we can directly use a magic command:

%pip install pyecharts

If we want to use pyecharts in JupyterLab, we still need to do a little preparation work, mainly by modifying the pyecharts configuration.

from pyecharts.globals import CurrentConfig, NotebookType

CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

Next, let us look at a simple example adapted from the official pyecharts beginner tutorial.

from pyecharts.charts import Bar
from pyecharts import options as opts

bar_chart = Bar(init_opts=opts.InitOpts(width='600px', height='450px'))
bar_chart.add_xaxis(["Shirt", "Sweater", "Chiffon Shirt", "Pants", "High Heels", "Socks"])
bar_chart.add_yaxis("Merchant A", [25, 20, 36, 10, 75, 90])
bar_chart.add_yaxis("Merchant B", [15, 12, 30, 20, 45, 60])
bar_chart.add_yaxis("Merchant C", [12, 32, 40, 52, 35, 26])
bar_chart.set_global_opts(
    xaxis_opts=opts.AxisOpts(
        axislabel_opts=opts.LabelOpts(color='navy')
    ),
    yaxis_opts=opts.AxisOpts(
        axislabel_opts=opts.LabelOpts(color='navy'),
        min_=0,
        max_=100,
        interval=10
    ),
    title_opts=opts.TitleOpts(
        title='2022 Sales Data Display',
        pos_left='2%',
        title_textstyle_opts=opts.TextStyleOpts(
            color='navy',
            font_size=16,
            font_family='PingFang SC',
            font_weight='bold'
        )
    ),
    toolbox_opts=opts.ToolboxOpts(
        orient='vertical',
        pos_left='right'
    )
)
bar_chart.load_javascript()

After running the code above, we can render the chart by calling a method of the bar_chart object. If we call the render method, the chart is saved to an HTML file (render.html by default). If we call render_notebook, the chart is rendered inline in the notebook.

bar_chart.render_notebook()

The result of the code above is shown below. It is worth mentioning that the title, the legend, and the toolbox on the right side of the figure can all be clicked. Try clicking them and see what happens. Interactivity is exactly where the charm of ECharts lies.

Next, let us look at how to draw a pie chart, also adapted from an official example.

import pyecharts.options as opts
from pyecharts.charts import Pie

x_data = ["Direct Access", "Email Marketing", "Affiliate Ads", "Video Ads", "Search Engine"]
y_data = [335, 310, 234, 135, 1548]
data = [(x, y) for x, y in zip(x_data, y_data)]

pie_chart = Pie(init_opts=opts.InitOpts(width="800px", height="400px"))
pie_chart.add(
    '',
    data_pair=data,
    radius=["50%", "75%"],
    label_opts=opts.LabelOpts(is_show=False),
)
pie_chart.set_global_opts(
    legend_opts=opts.LegendOpts(
        pos_left="left",
        orient="vertical"
    )
)
pie_chart.set_series_opts(
    tooltip_opts=opts.TooltipOpts(is_show=False),
    label_opts=opts.LabelOpts(formatter="{b}({c}): {d}%")
)
pie_chart.load_javascript()
pie_chart.render_notebook()

Running the code above gives the effect shown below.
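In the label formatter used above, `{b}` stands for the item name, `{c}` for its value, and `{d}` for its percentage of the total. The following plain-Python sketch reproduces that arithmetic for the same data, only to illustrate what the labels contain; it is not how pyecharts builds them internally.

```python
names = ["Direct Access", "Email Marketing", "Affiliate Ads", "Video Ads", "Search Engine"]
values = [335, 310, 234, 135, 1548]

total = sum(values)

# Mirror the "{b}({c}): {d}%" pattern: name, raw value, share of the total.
labels = [
    f"{name}({value}): {value / total * 100:.2f}%"
    for name, value in zip(names, values)
]

for label in labels:
    print(label)
```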

One thing to keep in mind is that pyecharts cannot directly take a NumPy ndarray or a pandas Series or DataFrame as input data. It needs native Python data types. As you may have noticed, the code above uses lists and tuples.
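When the data already lives in NumPy or pandas, it can be converted to native Python types before it is handed to a chart. A minimal sketch follows, with a made-up Series named `sales` used only for illustration:

```python
import numpy as np
import pandas as pd

# Made-up sales figures held in a pandas Series (illustrative values only).
sales = pd.Series(np.array([25, 20, 36]), index=['Shirt', 'Sweater', 'Pants'])

# tolist() converts NumPy scalars into plain Python values.
x_values = sales.index.tolist()
y_values = sales.tolist()

# (name, value) tuples, the same shape as data_pair in the pie example.
pairs = list(zip(x_values, y_values))
print(pairs)
```

Lists such as x_values and y_values are suitable for add_xaxis and add_yaxis, while pairs matches the list-of-tuples shape used for the pie chart.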

Finally, let us look at how to draw a map. To draw a map, we first need to install extra dependency packages that provide geographic data.

pip install echarts-countries-pypkg echarts-china-provinces-pypkg echarts-china-cities-pypkg echarts-china-counties-pypkg

In Jupyter, we can also use magic commands:

%pip install echarts-countries-pypkg
%pip install echarts-china-provinces-pypkg
%pip install echarts-china-cities-pypkg
%pip install echarts-china-counties-pypkg

Note: These four libraries contain data for countries of the world, provincial-level regions of China, city-level regions of China, and district/county-level regions of China.

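Then we can put the data for each province into a list of (name, value) tuples. Keep in mind that the region names in the data must match the names used in the map's geographic data; otherwise the values will not appear on the map.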

data = [
    ('Guangdong', 594), ('Zhejiang', 438), ('Sichuan', 316), ('Beijing', 269), ('Shandong', 248),
    ('Jiangsu', 234), ('Hunan', 196), ('Fujian', 166), ('Henan', 153), ('Liaoning', 152),
    ('Shanghai', 138), ('Hebei', 86), ('Anhui', 79), ('Hubei', 75), ('Heilongjiang', 70),
    ('Shaanxi', 63), ('Jilin', 59), ('Jiangxi', 56), ('Chongqing', 46), ('Guizhou', 39),
    ('Shanxi', 37), ('Yunnan', 33), ('Guangxi', 24), ('Tianjin', 22), ('Xinjiang', 21),
    ('Hainan', 18), ('Inner Mongolia', 14), ('Taiwan', 11), ('Gansu', 7), ('Guangxi Zhuang Autonomous Region', 4),
    ('Hong Kong', 4), ('Qinghai', 3), ('Xinjiang Uygur Autonomous Region', 3), ('Inner Mongolia Autonomous Region', 3), ('Ningxia', 1)
]
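The list above contains some regions under two different names, for example 'Guangxi' and 'Guangxi Zhuang Autonomous Region'. If these are meant to be the same region, the counts could be merged before plotting. The sketch below assumes that summing is the right way to combine them; the `ALIASES` table is hypothetical and covers only the duplicates visible in the list.

```python
from collections import defaultdict

# Hypothetical alias table: long official names mapped to the short form.
ALIASES = {
    'Guangxi Zhuang Autonomous Region': 'Guangxi',
    'Xinjiang Uygur Autonomous Region': 'Xinjiang',
    'Inner Mongolia Autonomous Region': 'Inner Mongolia',
}

# A subset of the list above, enough to show the duplicates.
rows = [
    ('Guangxi', 24), ('Xinjiang', 21), ('Inner Mongolia', 14),
    ('Guangxi Zhuang Autonomous Region', 4),
    ('Xinjiang Uygur Autonomous Region', 3),
    ('Inner Mongolia Autonomous Region', 3),
    ('Ningxia', 1),
]

# Sum the counts under one canonical name per region.
totals = defaultdict(int)
for name, count in rows:
    totals[ALIASES.get(name, name)] += count

# Back to a list of (name, value) tuples, largest count first.
merged = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(merged)
```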

Next, we use pyecharts to mark the number of top Douyin influencers in each province on a map.

import pyecharts.options as opts
from pyecharts.charts import Map

map_chart = Map(init_opts=opts.InitOpts(width='1000px', height='1000px'))
map_chart.add('', data, 'china', is_roam=False)
map_chart.load_javascript()
map_chart.render_notebook()

The result of the code above is shown below. When you move the mouse over the map, the corresponding province is highlighted and its related information is displayed.

As with seaborn, we recommend learning pyecharts from the official examples. In the left navigation bar of the pyecharts official website, you can find the "Chart Types" option. Each chart type comes with corresponding official examples, and many of those code samples can be used directly. Usually, all we need to do is replace the data with our own.