Data Visualization¶

In this lesson, we'll learn two data visualization libraries: matplotlib and seaborn. By the end of this lesson, students will be able to:

  • Skim library documentation to identify relevant examples and usage information.
  • Apply seaborn and matplotlib to create and customize relational and regression plots.
  • Describe data visualization principles as they relate to the effectiveness of a plot.

Just like how we like to import pandas as pd, we'll import matplotlib.pyplot as plt and seaborn as sns.

Seaborn is a Python data visualization library based on matplotlib. Behind the scenes, seaborn uses matplotlib to draw its plots. After importing seaborn, it is recommended to call sns.set_theme() to apply seaborn's visual style instead of the default matplotlib theme.

In [6]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme()

Let's load this uniquely-formatted pokemon dataset.

In [7]:
pokemon = pd.read_csv("pokemon_viz.csv", index_col="Num")
pokemon
Out[7]:
     Name        Type 1    Type 2  Total  HP   Attack  Defense  Sp. Atk  Sp. Def  Speed  Stage  Legendary
Num
1    Bulbasaur   Grass     Poison  318    45   49      49       65       65       45     1      False
2    Ivysaur     Grass     Poison  405    60   62      63       80       80       60     2      False
3    Venusaur    Grass     Poison  525    80   82      83       100      100      80     3      False
4    Charmander  Fire      NaN     309    39   52      43       60       50       65     1      False
5    Charmeleon  Fire      NaN     405    58   64      58       80       65       80     2      False
...  ...         ...       ...     ...    ...  ...     ...      ...      ...      ...    ...    ...
147  Dratini     Dragon    NaN     300    41   64      45       50       50       50     1      False
148  Dragonair   Dragon    NaN     420    61   84      65       70       70       70     2      False
149  Dragonite   Dragon    Flying  600    91   134     95       100      100      80     3      False
150  Mewtwo      Psychic   NaN     680    106  110     90       154      90       130    1      True
151  Mew         Psychic   NaN     600    100  100     100      100      100      100    1      False

151 rows × 12 columns

Figure-level versus axes-level functions¶

One way to draw a scatter plot comparing every pokemon's Attack and Defense stats is by calling sns.scatterplot. Because this plotting function has so many parameters, it's good practice to specify keyword arguments that tell Python which argument should go to which parameter.

In [9]:
sns.scatterplot(pokemon, x="Attack", y="Defense", hue="Stage")
Out[9]:
<Axes: xlabel='Attack', ylabel='Defense'>
Scatter plot of Attack (x-axis) versus Defense (y-axis), colored by Stage

sns.scatterplot returns a matplotlib Axes object, which represents a single plot and can be used to compose multiple plots into a single visualization. We can show two plots side-by-side by drawing each one on its own axes within the same figure. For example, we could compare the Attack and Defense stats for two different groups of pokemon: not-Legendary and Legendary.

In [12]:
pokemon["Legendary"]
Out[12]:
Num
1      False
2      False
3      False
4      False
5      False
       ...  
147    False
148    False
149    False
150     True
151    False
Name: Legendary, Length: 151, dtype: bool
In [11]:
pokemon[pokemon["Legendary"]]
Out[11]:
     Name        Type 1    Type 2  Total  HP   Attack  Defense  Sp. Atk  Sp. Def  Speed  Stage  Legendary
Num
144  Articuno    Ice       Flying  580    90   85      100      95       125      85     1      True
145  Zapdos      Electric  Flying  580    90   90      85       125      90       100    1      True
146  Moltres     Fire      Flying  580    90   100     90       125      85       90     1      True
150  Mewtwo      Psychic   NaN     680    106  110     90       154      90       130    1      True
In [15]:
pokemon[pokemon["Stage"] == 2]
Out[15]:
     Name        Type 1    Type 2  Total  HP   Attack  Defense  Sp. Atk  Sp. Def  Speed  Stage  Legendary
Num
2    Ivysaur     Grass     Poison  405    60   62      63       80       80       60     2      False
5    Charmeleon  Fire      NaN     405    58   64      58       80       65       80     2      False
8    Wartortle   Water     NaN     405    59   63      80       65       80       58     2      False
11   Metapod     Bug       NaN     205    50   20      55       25       25       30     2      False
14   Kakuna      Bug       Poison  205    45   25      50       25       25       35     2      False
17   Pidgeotto   Normal    Flying  349    63   60      55       50       50       71     2      False
20   Raticate    Normal    NaN     413    55   81      60       50       70       97     2      False
22   Fearow      Normal    Flying  442    65   90      65       61       61       100    2      False
24   Arbok       Poison    NaN     438    60   85      69       65       79       80     2      False
26   Raichu      Electric  NaN     485    60   90      55       90       80       110    2      False
28   Sandslash   Ground    NaN     450    75   100     110      45       55       65     2      False
30   Nidorina    Poison    NaN     365    70   62      67       55       55       56     2      False
33   Nidorino    Poison    NaN     365    61   72      57       55       55       65     2      False
36   Clefable    Fairy     NaN     483    95   70      73       95       90       60     2      False
38   Ninetales   Fire      NaN     505    73   76      75       81       100      100    2      False
40   Wigglytuff  Normal    Fairy   435    140  70      45       85       50       45     2      False
42   Golbat      Poison    Flying  455    75   80      70       65       75       90     2      False
44   Gloom       Grass     Poison  395    60   65      70       85       75       40     2      False
47   Parasect    Bug       Grass   405    60   95      80       60       80       30     2      False
49   Venomoth    Bug       Poison  450    70   65      60       90       75       90     2      False
51   Dugtrio     Ground    NaN     405    35   80      50       50       70       120    2      False
53   Persian     Normal    NaN     440    65   70      60       65       65       115    2      False
55   Golduck     Water     NaN     500    80   82      78       95       80       85     2      False
57   Primeape    Fighting  NaN     455    65   105     60       60       70       95     2      False
59   Arcanine    Fire      NaN     555    90   110     80       100      80       95     2      False
61   Poliwhirl   Water     NaN     385    65   65      65       50       50       90     2      False
64   Kadabra     Psychic   NaN     400    40   35      30       120      70       105    2      False
67   Machoke     Fighting  NaN     405    80   100     70       50       60       45     2      False
70   Weepinbell  Grass     Poison  390    65   90      50       85       45       55     2      False
73   Tentacruel  Water     Poison  515    80   70      65       80       120      100    2      False
75   Graveler    Rock      Ground  390    55   95      115      45       45       35     2      False
78   Rapidash    Fire      NaN     500    65   100     70       80       80       105    2      False
80   Slowbro     Water     Psychic 490    95   75      110      100      80       30     2      False
82   Magneton    Electric  Steel   465    50   60      95       120      70       70     2      False
85   Dodrio      Normal    Flying  460    60   110     70       60       60       100    2      False
87   Dewgong     Water     Ice     475    90   70      80       70       95       70     2      False
89   Muk         Poison    NaN     500    105  105     75       65       100      50     2      False
91   Cloyster    Water     Ice     525    50   95      180      85       45       70     2      False
93   Haunter     Ghost     Poison  405    45   50      45       115      55       95     2      False
97   Hypno       Psychic   NaN     483    85   73      70       73       115      67     2      False
99   Kingler     Water     NaN     475    55   130     115      50       50       75     2      False
101  Electrode   Electric  NaN     480    60   50      70       80       80       140    2      False
103  Exeggutor   Grass     Psychic 520    95   95      85       125      65       55     2      False
105  Marowak     Ground    NaN     425    60   80      110      50       80       45     2      False
110  Weezing     Poison    NaN     490    65   90      120      85       70       60     2      False
112  Rhydon      Ground    Rock    485    105  130     120      45       45       40     2      False
117  Seadra      Water     NaN     440    55   65      95       95       45       85     2      False
119  Seaking     Water     NaN     450    80   92      65       65       80       68     2      False
121  Starmie     Water     Psychic 520    60   75      85       100      85       115    2      False
130  Gyarados    Water     Flying  540    95   125     79       60       100      81     2      False
134  Vaporeon    Water     NaN     525    130  65      60       110      95       65     2      False
135  Jolteon     Electric  NaN     525    65   65      60       110      95       130    2      False
136  Flareon     Fire      NaN     525    65   130     60       95       110      65     2      False
139  Omastar     Rock      Water   495    70   60      125      115      70       55     2      False
141  Kabutops    Rock      Water   495    60   115     105      65       70       80     2      False
148  Dragonair   Dragon    NaN     420    61   84      65       70       70       70     2      False
In [17]:
pokemon[~pokemon["Legendary"]]
Out[17]:
     Name        Type 1    Type 2  Total  HP   Attack  Defense  Sp. Atk  Sp. Def  Speed  Stage  Legendary
Num
1    Bulbasaur   Grass     Poison  318    45   49      49       65       65       45     1      False
2    Ivysaur     Grass     Poison  405    60   62      63       80       80       60     2      False
3    Venusaur    Grass     Poison  525    80   82      83       100      100      80     3      False
4    Charmander  Fire      NaN     309    39   52      43       60       50       65     1      False
5    Charmeleon  Fire      NaN     405    58   64      58       80       65       80     2      False
...  ...         ...       ...     ...    ...  ...     ...      ...      ...      ...    ...    ...
143  Snorlax     Normal    NaN     540    160  110     65       65       110      30     1      False
147  Dratini     Dragon    NaN     300    41   64      45       50       50       50     1      False
148  Dragonair   Dragon    NaN     420    61   84      65       70       70       70     2      False
149  Dragonite   Dragon    Flying  600    91   134     95       100      100      80     3      False
151  Mew         Psychic   NaN     600    100  100     100      100      100      100    1      False

147 rows × 12 columns
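The filtering cells above all rely on boolean indexing. As a minimal sketch, the example below builds a four-row DataFrame (the names and Legendary values are copied from the table above; the variable names `toy` and `mask` are just for illustration) and shows how a boolean Series selects rows and how `~` negates it.

```python
import pandas as pd

# Four rows copied from the pokemon table above.
toy = pd.DataFrame(
    {
        "Name": ["Articuno", "Dratini", "Mewtwo", "Mew"],
        "Legendary": [True, False, True, False],
    },
    index=pd.Index([144, 147, 150, 151], name="Num"),
)

mask = toy["Legendary"]      # boolean Series with one value per row
legendary = toy[mask]        # keeps rows where the mask is True
not_legendary = toy[~mask]   # ~ flips True and False before selecting

print(legendary["Name"].tolist())
print(not_legendary["Name"].tolist())
```

Every row lands in exactly one of the two results, which is why the Legendary (4 rows) and not-Legendary (147 rows) outputs above add up to the full 151 rows.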

In [19]:
# Nested tuple unpacking!
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)

ax1.set_title("Not Legendary")
ax1.set_ylim(top=200, bottom=0)
sns.scatterplot(pokemon[~pokemon["Legendary"]], x="Attack", y="Defense", ax=ax1)

ax2.set_title("Legendary")
ax2.set_ylim(top=200, bottom=0)
sns.scatterplot(pokemon[pokemon["Legendary"]], x="Attack", y="Defense", ax=ax2)


fig.show()
Two side-by-side scatter plots of Attack versus Defense, titled Not Legendary and Legendary

Each problem in the plot above can be fixed manually by repeatedly editing and rerunning the code until the result is satisfactory, but this is a tedious process. Seaborn was designed to make data visualization less tedious. Functions like sns.scatterplot are axes-level functions designed for interoperability with the rest of matplotlib, but that interoperability comes at the cost of tweaking matplotlib details by hand.

Instead, the recommended way to create plots in seaborn is to use figure-level functions like sns.relplot (short for relational plot). Figure-level functions return specialized seaborn objects (such as FacetGrid) that are intended to provide more usable results without tweaking.

In [29]:
sns.relplot(pokemon, x="Attack", y="Defense")
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x7e0f4010ad50>
Scatter plot of Attack (x-axis) versus Defense (y-axis)

By default, relational plots produce scatter plots but they can also produce line plots by specifying the keyword argument kind="line".

Seaborn provides several useful figure-level plotting functions, including:

  • relplot for relational plots, such as scatter plots and line plots.
  • catplot for categorical plots, such as strip plots, box plots, violin plots, and bar plots.
  • lmplot for relational plots with a regression fit, such as the scatter plot with regression fit below.

When reading documentation online, remember that we will only use figure-level functions in this course because they are the recommended approach. The seaborn documentation summarizes their relative merits:

On balance, the figure-level functions add some additional complexity that can make things more confusing for beginners, but their distinct features give them additional power. The tutorial documentation mostly uses the figure-level functions, because they produce slightly cleaner plots, and we generally recommend their use for most applications. The one situation where they are not a good choice is when you need to make a complex, standalone figure that composes multiple different plot kinds. At this point, it’s recommended to set up the figure using matplotlib directly and to fill in the individual components using axes-level functions.

In [25]:
sns.catplot(pokemon, x="Stage", y="Attack")
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x7e0f403ba950>
Categorical plot of Attack (y-axis) for each Stage (x-axis)
In [26]:
sns.lmplot(pokemon, x="Attack", y="Defense", col="Legendary")
Out[26]:
<seaborn.axisgrid.FacetGrid at 0x7e0f40421a90>
Two side-by-side scatter plots of Attack versus Defense with regression fits, one for each value of Legendary

Customizing a FacetGrid plot¶

relplot, displot, catplot, and lmplot all return a FacetGrid, a specialized seaborn object that represents a data visualization canvas. As we've seen above, a FacetGrid can put two plots side-by-side and manage their axes by removing the y-axis labels on the right plot because they are the same as the plot on the left.

However, there are still many instances where we might want to customize a plot by changing labels or adding titles. We might want to create a bar plot to count the number of each type of pokemon.

In [ ]:
sns.catplot(pokemon, x="Type 1", kind="count")

The pokemon types on the x-axis are hardly readable, the y-axis label "count" could use capitalization, and the plot could use a title. To modify the attributes of a plot, we can assign the returned FacetGrid to a variable like grid and then call tick_params or set.

In [ ]:
grid = sns.catplot(pokemon, x="Type 1", kind="count")
grid.tick_params(axis="x", rotation=60)
grid.set(title="Count of each primary pokemon type", xlabel="Primary Type", ylabel="Count")

Practice: Life expectancy versus health expenditure¶

Seaborn includes a repository of example datasets that we can load into a DataFrame by calling sns.load_dataset. Let's examine the Life expectancy vs. health expenditure, 1970 to 2015 dataset that combines two data sources:

  1. The Life expectancy at birth dataset from the UN World Population Prospects (2022): "For a given year, it represents the average lifespan for a hypothetical group of people, if they experienced the same age-specific death rates throughout their lives as the age-specific death rates seen in that particular year."
  2. The Health expenditure (2010 int.-$) dataset from OECD.stat: "Per capita health expenditure and financing in OECD countries, measured in 2010 international dollars."
In [4]:
life_expectancy = sns.load_dataset("healthexp", index_col=["Year", "Country"])
life_expectancy
Out[4]:
                     Spending_USD  Life_Expectancy
Year  Country
1970  Germany        252.311       70.6
      France         192.143       72.2
      Great Britain  123.993       71.9
      Japan          150.437       72.0
      USA            326.961       70.9
...   ...            ...           ...
2020  Germany        6938.983      81.1
      France         5468.418      82.3
      Great Britain  5018.700      80.4
      Japan          4665.641      84.7
      USA            11859.179     77.0

274 rows × 2 columns

Write a seaborn expression to create a line plot comparing the Year (x-axis) to the Life_Expectancy (y-axis) colored with hue="Country".

In [32]:
sns.lineplot(life_expectancy, x="Year", y="Life_Expectancy", hue="Country")
Out[32]:
<Axes: xlabel='Year', ylabel='Life_Expectancy'>
Line plot of Life_Expectancy (y-axis) by Year (x-axis), with one line per Country
In [31]:
sns.relplot(life_expectancy, x="Year", y="Life_Expectancy", hue="Country", kind="line")
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x7e0f3fe381d0>
Line plot of Life_Expectancy (y-axis) by Year (x-axis), with one line per Country

What makes bad figures bad?¶

In chapter 1 of Data Visualization, Kieran Healy explains how data visualization is about communication and rhetoric.

While it is tempting to simply start laying down the law about what works and what doesn't, the process of making a really good or really useful graph cannot be boiled down to a list of simple rules to be followed without exception in all circumstances. The graphs you make are meant to be looked at by someone. The effectiveness of any particular graph is not just a matter of how it looks in the abstract, but also a question of who is looking at it, and why. An image intended for an audience of experts reading a professional journal may not be readily interpretable by the general public. A quick visualization of a dataset you are currently exploring might not be of much use to your peers or your students.

Bad taste¶

Kieran identifies three problems, the first of which is bad taste.

3-d horizontal bar chart comparing life expectancy across continents with Papyrus font and cute visual style

Kieran draws on Edward Tufte's principles (all quoted from Tufte 1983):

  • have a properly chosen format and design
  • use words, numbers, and drawing together
  • display an accessible complexity of detail
  • avoid content-free decoration, including chartjunk

In essence, these principles amount to "an encouragement to maximize the 'data-to-ink' ratio." In practice, our plotting libraries like seaborn do a fairly good job of providing defaults that generally follow these principles.

Bad data¶

The second problem is bad data, which can involve either cherry-picking data or presenting information in a misleading way.

In November of 2016, The New York Times reported on some research on people's confidence in the institutions of democracy. It had been published in an academic journal by the political scientist Yascha Mounk. The headline in the Times ran, "How Stable Are Democracies? 'Warning Signs Are Flashing Red'" (Taub, 2016). The graph accompanying the article [is shown below.]

6-way line plot comparing Percentage of people who say it is 'essential' to live in a democracy (New York Times)

This plot is well-produced, and we could reproduce it by calling sns.relplot as we learned above. The x-axis shows the decade of birth for all people surveyed in the research study.

[But] scholars who knew the World Values Survey data underlying the graph noticed something else. The graph reads as though people were asked to say whether they thought it was essential to live in a democracy, and the results plotted show the percentage of respondents who said "Yes", presumably in contrast to those who said "No". But in fact the survey question asked respondents to rate the importance of living in a democracy on a ten point scale, with 1 being "Not at all Important" and 10 being "Absolutely Important". The graph showed the difference across ages of people who had given a score of "10" only, not changes in the average score on the question. As it turns out, while there is some variation by year of birth, most people in these countries tend to rate the importance of living in a democracy very highly, even if they do not all score it as "Absolutely Important". The political scientist Erik Voeten redrew the figure using the average response.

5-way line plot by Erik Voeten showing average importance of democracy for each decade of birth
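To see how the choice of summary can change the story, here is a small sketch that computes both summaries on the same data. The ratings below are entirely hypothetical and were made up for illustration; they are not taken from the World Values Survey. The column and cohort names are likewise illustrative.

```python
import pandas as pd

# HYPOTHETICAL ratings on the 1-10 scale, made up for illustration only.
responses = pd.DataFrame({
    "cohort": ["1930s"] * 10 + ["1980s"] * 10,
    "rating": [10, 10, 10, 10, 9, 9, 10, 10, 8, 10,
               10, 9, 9, 9, 8, 10, 9, 8, 9, 9],
})

# Two summaries of the same column, computed per birth cohort.
summary = responses.groupby("cohort")["rating"].agg(
    share_of_tens=lambda r: (r == 10).mean(),  # fraction answering exactly 10
    mean_rating="mean",                        # average score on the 1-10 scale
)
print(summary)
```

In this made-up data the share of respondents answering 10 falls from 0.7 to 0.2, while the average rating only moves from 9.6 to 9.0. Plotting the first summary suggests a collapse; plotting the second suggests a modest shift.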

Bad perception¶

The third problem is bad perception, which refers to how humans process the information contained in a visualization. Let's walk through section 1.3 on "Perception and data visualization".