Data Visualization¶

In this lesson, we'll learn two data visualization libraries: matplotlib and seaborn. By the end of this lesson, students will be able to:

Skim library documentation to identify relevant examples and usage information.
Apply seaborn and matplotlib to create and customize relational and regression plots.
Describe data visualization principles as they relate to the effectiveness of a plot.

Just like how we like to import pandas as pd, we'll import matplotlib.pyplot as plt and seaborn as sns.

Seaborn is a Python data visualization library based on matplotlib. Behind the scenes, seaborn uses matplotlib to draw its plots. When importing seaborn, it is recommended to call sns.set_theme() to apply the recommended seaborn visual style instead of the default matplotlib theme.

In [6]:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme()

Let's load this uniquely-formatted pokemon dataset.

In [7]:

pokemon = pd.read_csv("pokemon_viz.csv", index_col="Num")
pokemon

Out[7]:

	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Stage	Legendary
Num
1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	2	False
3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	3	False
4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False
5	Charmeleon	Fire	NaN	405	58	64	58	80	65	80	2	False
...	...	...	...	...	...	...	...	...	...	...	...	...
147	Dratini	Dragon	NaN	300	41	64	45	50	50	50	1	False
148	Dragonair	Dragon	NaN	420	61	84	65	70	70	70	2	False
149	Dragonite	Dragon	Flying	600	91	134	95	100	100	80	3	False
150	Mewtwo	Psychic	NaN	680	106	110	90	154	90	130	1	True
151	Mew	Psychic	NaN	600	100	100	100	100	100	100	1	False

151 rows × 12 columns

Figure-level versus axes-level functions¶

One way to draw a scatter plot comparing every pokemon's Attack and Defense stats is by calling sns.scatterplot. Because this plotting function has so many parameters, it's good practice to specify keyword arguments that tell Python which argument should go to which parameter.

In [9]:

sns.scatterplot(pokemon, x="Attack", y="Defense", hue="Stage")

Out[9]:

<Axes: xlabel='Attack', ylabel='Defense'>

sns.scatterplot returns a matplotlib Axes object, which represents a single plot and can be used to compose multiple plots into a single visualization. We can show two plots side-by-side by placing each one on its own axes within the same figure. For example, we could compare the attack and defense stats for two different groups of pokemon: not-Legendary and Legendary.
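To make the returned object concrete, here is a minimal sketch that stores the result of sns.scatterplot in a variable and customizes it through ordinary matplotlib Axes methods. The small DataFrame named toy and its values are made up for illustration; they are not rows from the pokemon dataset.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (not taken from pokemon_viz.csv).
toy = pd.DataFrame({
    "Attack": [50, 60, 80, 100],
    "Defense": [45, 65, 85, 90],
    "Stage": [1, 2, 3, 1],
})

# Axes-level functions return the matplotlib Axes they drew on.
ax = sns.scatterplot(data=toy, x="Attack", y="Defense", hue="Stage")

# Because ax is a regular matplotlib Axes, its methods are available.
ax.set_title("Attack versus Defense")
ax.set_xlim(0, 150)
```

Passing data= by keyword, rather than by position, keeps the call readable and should also work on older seaborn releases.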

In [12]:

pokemon["Legendary"]

Out[12]:

Num
1      False
2      False
3      False
4      False
5      False
       ...  
147    False
148    False
149    False
150     True
151    False
Name: Legendary, Length: 151, dtype: bool

In [11]:

pokemon[pokemon["Legendary"]]

Out[11]:

	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Stage	Legendary
Num
144	Articuno	Ice	Flying	580	90	85	100	95	125	85	1	True
145	Zapdos	Electric	Flying	580	90	90	85	125	90	100	1	True
146	Moltres	Fire	Flying	580	90	100	90	125	85	90	1	True
150	Mewtwo	Psychic	NaN	680	106	110	90	154	90	130	1	True

In [15]:

pokemon[pokemon["Stage"] == 2]

Out[15]:

	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Stage	Legendary
Num
2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	2	False
5	Charmeleon	Fire	NaN	405	58	64	58	80	65	80	2	False
8	Wartortle	Water	NaN	405	59	63	80	65	80	58	2	False
11	Metapod	Bug	NaN	205	50	20	55	25	25	30	2	False
14	Kakuna	Bug	Poison	205	45	25	50	25	25	35	2	False
17	Pidgeotto	Normal	Flying	349	63	60	55	50	50	71	2	False
20	Raticate	Normal	NaN	413	55	81	60	50	70	97	2	False
22	Fearow	Normal	Flying	442	65	90	65	61	61	100	2	False
24	Arbok	Poison	NaN	438	60	85	69	65	79	80	2	False
26	Raichu	Electric	NaN	485	60	90	55	90	80	110	2	False
28	Sandslash	Ground	NaN	450	75	100	110	45	55	65	2	False
30	Nidorina	Poison	NaN	365	70	62	67	55	55	56	2	False
33	Nidorino	Poison	NaN	365	61	72	57	55	55	65	2	False
36	Clefable	Fairy	NaN	483	95	70	73	95	90	60	2	False
38	Ninetales	Fire	NaN	505	73	76	75	81	100	100	2	False
40	Wigglytuff	Normal	Fairy	435	140	70	45	85	50	45	2	False
42	Golbat	Poison	Flying	455	75	80	70	65	75	90	2	False
44	Gloom	Grass	Poison	395	60	65	70	85	75	40	2	False
47	Parasect	Bug	Grass	405	60	95	80	60	80	30	2	False
49	Venomoth	Bug	Poison	450	70	65	60	90	75	90	2	False
51	Dugtrio	Ground	NaN	405	35	80	50	50	70	120	2	False
53	Persian	Normal	NaN	440	65	70	60	65	65	115	2	False
55	Golduck	Water	NaN	500	80	82	78	95	80	85	2	False
57	Primeape	Fighting	NaN	455	65	105	60	60	70	95	2	False
59	Arcanine	Fire	NaN	555	90	110	80	100	80	95	2	False
61	Poliwhirl	Water	NaN	385	65	65	65	50	50	90	2	False
64	Kadabra	Psychic	NaN	400	40	35	30	120	70	105	2	False
67	Machoke	Fighting	NaN	405	80	100	70	50	60	45	2	False
70	Weepinbell	Grass	Poison	390	65	90	50	85	45	55	2	False
73	Tentacruel	Water	Poison	515	80	70	65	80	120	100	2	False
75	Graveler	Rock	Ground	390	55	95	115	45	45	35	2	False
78	Rapidash	Fire	NaN	500	65	100	70	80	80	105	2	False
80	Slowbro	Water	Psychic	490	95	75	110	100	80	30	2	False
82	Magneton	Electric	Steel	465	50	60	95	120	70	70	2	False
85	Dodrio	Normal	Flying	460	60	110	70	60	60	100	2	False
87	Dewgong	Water	Ice	475	90	70	80	70	95	70	2	False
89	Muk	Poison	NaN	500	105	105	75	65	100	50	2	False
91	Cloyster	Water	Ice	525	50	95	180	85	45	70	2	False
93	Haunter	Ghost	Poison	405	45	50	45	115	55	95	2	False
97	Hypno	Psychic	NaN	483	85	73	70	73	115	67	2	False
99	Kingler	Water	NaN	475	55	130	115	50	50	75	2	False
101	Electrode	Electric	NaN	480	60	50	70	80	80	140	2	False
103	Exeggutor	Grass	Psychic	520	95	95	85	125	65	55	2	False
105	Marowak	Ground	NaN	425	60	80	110	50	80	45	2	False
110	Weezing	Poison	NaN	490	65	90	120	85	70	60	2	False
112	Rhydon	Ground	Rock	485	105	130	120	45	45	40	2	False
117	Seadra	Water	NaN	440	55	65	95	95	45	85	2	False
119	Seaking	Water	NaN	450	80	92	65	65	80	68	2	False
121	Starmie	Water	Psychic	520	60	75	85	100	85	115	2	False
130	Gyarados	Water	Flying	540	95	125	79	60	100	81	2	False
134	Vaporeon	Water	NaN	525	130	65	60	110	95	65	2	False
135	Jolteon	Electric	NaN	525	65	65	60	110	95	130	2	False
136	Flareon	Fire	NaN	525	65	130	60	95	110	65	2	False
139	Omastar	Rock	Water	495	70	60	125	115	70	55	2	False
141	Kabutops	Rock	Water	495	60	115	105	65	70	80	2	False
148	Dragonair	Dragon	NaN	420	61	84	65	70	70	70	2	False

In [17]:

pokemon[~pokemon["Legendary"]]

Out[17]:

	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Stage	Legendary
Num
1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	2	False
3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	3	False
4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False
5	Charmeleon	Fire	NaN	405	58	64	58	80	65	80	2	False
...	...	...	...	...	...	...	...	...	...	...	...	...
143	Snorlax	Normal	NaN	540	160	110	65	65	110	30	1	False
147	Dratini	Dragon	NaN	300	41	64	45	50	50	50	1	False
148	Dragonair	Dragon	NaN	420	61	84	65	70	70	70	2	False
149	Dragonite	Dragon	Flying	600	91	134	95	100	100	80	3	False
151	Mew	Psychic	NaN	600	100	100	100	100	100	100	1	False

147 rows × 12 columns

In [19]:

# Nested tuple unpacking!
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)

ax1.set_title("Not Legendary")
ax1.set_ylim(top=200, bottom=0)
sns.scatterplot(pokemon[~pokemon["Legendary"]], x="Attack", y="Defense", ax=ax1)

ax2.set_title("Legendary")
ax2.set_ylim(top=200, bottom=0)
sns.scatterplot(pokemon[pokemon["Legendary"]], x="Attack", y="Defense", ax=ax2)


fig.show()

Each problem in the plot above can be fixed manually by repeatedly editing and running the code until you get a satisfactory result, but it's a tedious process. Seaborn was designed to make our data visualization experience less tedious. Functions like sns.scatterplot are considered axes-level functions designed for interoperability with the rest of matplotlib, but they come at the cost of forcing you to deal with the tedium of tweaking matplotlib.
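As a rough illustration of that manual tweaking, the sketch below shares the y-axis between two subplots, widens the figure, and adjusts the spacing. The DataFrame named toy is made-up stand-in data, and the choices of sharey=True, figsize, and tight_layout are one possible set of fixes rather than the only ones.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (not taken from pokemon_viz.csv).
toy = pd.DataFrame({
    "Attack": [49, 62, 82, 85, 110],
    "Defense": [49, 63, 83, 100, 90],
    "Legendary": [False, False, False, True, True],
})

# sharey=True keeps both subplots on the same y-axis scale;
# figsize widens the figure so the two plots have more room.
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(10, 4))

sns.scatterplot(data=toy[~toy["Legendary"]], x="Attack", y="Defense", ax=ax1)
ax1.set_title("Not Legendary")

sns.scatterplot(data=toy[toy["Legendary"]], x="Attack", y="Defense", ax=ax2)
ax2.set_title("Legendary")

# The y-axis is shared, so setting the limits once applies to both plots.
ax1.set_ylim(bottom=0, top=200)

# Adjust spacing so titles and labels are less likely to overlap.
fig.tight_layout()
```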

Instead, the recommended way to create plots in seaborn is to use figure-level functions like sns.relplot (short for relational plot). Figure-level functions return specialized seaborn objects (such as FacetGrid) that are intended to provide more usable results without tweaking.

In [29]:

sns.relplot(pokemon, x="Attack", y="Defense")

Out[29]:

<seaborn.axisgrid.FacetGrid at 0x7e0f4010ad50>

By default, relational plots produce scatter plots but they can also produce line plots by specifying the keyword argument kind="line".

Seaborn provides several useful figure-level plotting functions:

relplot for relational plots, such as scatter plots and line plots.
catplot for categorical plots, such as strip plots, box plots, violin plots, and bar plots.
lmplot for relational plots with a regression fit, such as the scatter plot with regression fit below.

When reading documentation online, it is important to remember that we will only use figure-level plots in this course because they are the recommended approach. The seaborn documentation describes the relative merits of figure-level functions:

On balance, the figure-level functions add some additional complexity that can make things more confusing for beginners, but their distinct features give them additional power. The tutorial documentation mostly uses the figure-level functions, because they produce slightly cleaner plots, and we generally recommend their use for most applications. The one situation where they are not a good choice is when you need to make a complex, standalone figure that composes multiple different plot kinds. At this point, it’s recommended to set up the figure using matplotlib directly and to fill in the individual components using axes-level functions.

In [25]:

sns.catplot(pokemon, x="Stage", y="Attack")

Out[25]:

<seaborn.axisgrid.FacetGrid at 0x7e0f403ba950>

In [26]:

sns.lmplot(pokemon, x="Attack", y="Defense", col="Legendary")

Out[26]:

<seaborn.axisgrid.FacetGrid at 0x7e0f40421a90>

Customizing a `FacetGrid` plot¶

relplot, displot, catplot, and lmplot all return a FacetGrid, a specialized seaborn object that represents a data visualization canvas. As we've seen above, a FacetGrid can put two plots side-by-side and manage their axes by removing the y-axis labels on the right plot because they are the same as the plot on the left.
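To see what a FacetGrid holds, the following sketch inspects its axes attribute after faceting by a column. The DataFrame named toy is made-up stand-in data, not the pokemon dataset.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (not taken from pokemon_viz.csv).
toy = pd.DataFrame({
    "Attack": [49, 62, 82, 85, 110],
    "Defense": [49, 63, 83, 100, 90],
    "Legendary": [False, False, False, True, True],
})

# col="Legendary" draws one subplot per unique value of that column.
grid = sns.relplot(data=toy, x="Attack", y="Defense", col="Legendary")

# grid.axes is a 2-D array of matplotlib Axes: one row, two columns here.
print(grid.axes.shape)  # (1, 2)

# Each subplot is an ordinary Axes with its own title and labels.
for ax in grid.axes.flat:
    print(ax.get_title(), "|", ax.get_xlabel())
```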

However, there are still many instances where we might want to customize a plot by changing labels or adding titles. For example, we might want to create a bar plot to count the number of each type of pokemon.

In [ ]:

sns.catplot(pokemon, x="Type 1", kind="count")

The pokemon types on the x-axis are hardly readable, the y-axis label "count" could use capitalization, and the plot could use a title. To modify the attributes of a plot, we can assign the returned FacetGrid to a variable like grid and then call tick_params or set.

In [ ]:

grid = sns.catplot(pokemon, x="Type 1", kind="count")
grid.tick_params(axis="x", rotation=60)
grid.set(title="Count of each primary pokemon type", xlabel="Primary Type", ylabel="Count")
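FacetGrid offers a few other helpers that cover the same ground. The sketch below is one alternative, not a replacement for the cell above: it uses set_axis_labels, set_xticklabels, and a figure-wide title. The DataFrame named toy is made up, and grid.figure assumes a reasonably recent seaborn release (older releases expose the figure as grid.fig).

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (not taken from pokemon_viz.csv).
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water", "Water", "Grass", "Water"],
})

grid = sns.catplot(data=toy, x="Type 1", kind="count")

# set_axis_labels sets the x-axis and y-axis labels in one call.
grid.set_axis_labels("Primary Type", "Count")

# set_xticklabels with rotation angles the existing tick labels.
grid.set_xticklabels(rotation=60)

# suptitle places one title above the whole figure; subplots_adjust
# leaves some room at the top so the title does not cover the plot.
grid.figure.suptitle("Count of each primary pokemon type")
grid.figure.subplots_adjust(top=0.9)
```

A figure-wide title is mostly useful when the grid has several facets, since set(title=...) applies the same title to every subplot.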

Practice: Life expectancy versus health expenditure¶

Seaborn includes a repository of example datasets that we can load into a DataFrame by calling sns.load_dataset. Let's examine the Life expectancy vs. health expenditure, 1970 to 2015 dataset that combines two data sources:

The Life expectancy at birth dataset from the UN World Population Prospects (2022): "For a given year, it represents the average lifespan for a hypothetical group of people, if they experienced the same age-specific death rates throughout their lives as the age-specific death rates seen in that particular year."
The Health expenditure (2010 int.-$) dataset from OECD.stat: "Per capita health expenditure and financing in OECD countries, measured in 2010 international dollars."

In [4]:

life_expectancy = sns.load_dataset("healthexp", index_col=["Year", "Country"])
life_expectancy

Out[4]:

		Spending_USD	Life_Expectancy
Year	Country
1970	Germany	252.311	70.6
	France	192.143	72.2
	Great Britain	123.993	71.9
	Japan	150.437	72.0
	USA	326.961	70.9
...	...	...	...
2020	Germany	6938.983	81.1
	France	5468.418	82.3
	Great Britain	5018.700	80.4
	Japan	4665.641	84.7
	USA	11859.179	77.0

274 rows × 2 columns
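sns.load_dataset forwards extra keyword arguments such as index_col to pandas.read_csv, which is why the result above has a two-level index. A minimal sketch of the same idea, using a handful of rows copied from the output above as inline CSV text (the names csv_text and sample are introduced here for illustration):

```python
import io

import pandas as pd

# A few rows copied from the output above, written as inline CSV text.
csv_text = """Year,Country,Spending_USD,Life_Expectancy
1970,Germany,252.311,70.6
1970,USA,326.961,70.9
2020,Germany,6938.983,81.1
2020,USA,11859.179,77.0
"""

# A list of column names in index_col builds a two-level (Year, Country) index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=["Year", "Country"])

# Select one value by giving a label for each index level.
print(sample.loc[(2020, "USA"), "Life_Expectancy"])  # 77.0

# Select every country for one year by giving only the first level.
print(sample.loc[1970])
```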

Write a seaborn expression to create a line plot comparing the Year (x-axis) to the Life_Expectancy (y-axis) colored with hue="Country".

In [32]:

sns.lineplot(life_expectancy, x="Year", y="Life_Expectancy", hue="Country")

Out[32]:

<Axes: xlabel='Year', ylabel='Life_Expectancy'>

In [31]:

sns.relplot(life_expectancy, x="Year", y="Life_Expectancy", hue="Country", kind="line")

Out[31]:

<seaborn.axisgrid.FacetGrid at 0x7e0f3fe381d0>

What makes bad figures bad?¶

In chapter 1 of Data Visualization, Kieran Healy explains how data visualization is about communication and rhetoric.

While it is tempting to simply start laying down the law about what works and what doesn't, the process of making a really good or really useful graph cannot be boiled down to a list of simple rules to be followed without exception in all circumstances. The graphs you make are meant to be looked at by someone. The effectiveness of any particular graph is not just a matter of how it looks in the abstract, but also a question of who is looking at it, and why. An image intended for an audience of experts reading a professional journal may not be readily interpretable by the general public. A quick visualization of a dataset you are currently exploring might not be of much use to your peers or your students.

Bad taste¶

Kieran identifies three problems, the first of which is bad taste.

3-d horizontal bar chart comparing life expectancy across continents with Papyrus font and cute visual style

Kieran draws on Edward Tufte's principles (all quoted from Tufte 1983):

have a properly chosen format and design
use words, numbers, and drawing together
display an accessible complexity of detail
avoid content-free decoration, including chartjunk

In essence, these principles amount to "an encouragement to maximize the 'data-to-ink' ratio." In practice, our plotting libraries like seaborn do a fairly good job of providing defaults that generally follow these principles.

Bad data¶

The second problem is bad data, which can involve either cherry-picking data or presenting information in a misleading way.

In November of 2016, The New York Times reported on some research on people's confidence in the institutions of democracy. It had been published in an academic journal by the political scientist Yascha Mounk. The headline in the Times ran, “How Stable Are Democracies? ‘Warning Signs Are Flashing Red’” (Taub, 2016). The graph accompanying the article:

6-way line plot comparing Percentage of people who say it is 'essential' to live in a democracy (New York Times)

This plot is well-produced, and we could reproduce it by calling sns.relplot like we learned above. The x-axis shows the decade of birth for all people surveyed in the research study.

[But] scholars who knew the World Values Survey data underlying the graph noticed something else. The graph reads as though people were asked to say whether they thought it was essential to live in a democracy, and the results plotted show the percentage of respondents who said "Yes", presumably in contrast to those who said "No". But in fact the survey question asked respondents to rate the importance of living in a democracy on a ten point scale, with 1 being "Not at all Important" and 10 being "Absolutely Important". The graph showed the difference across ages of people who had given a score of "10" only, not changes in the average score on the question. As it turns out, while there is some variation by year of birth, most people in these countries tend to rate the importance of living in a democracy very highly, even if they do not all score it as "Absolutely Important". The political scientist Erik Voeten redrew the figure using the average response.

5-way line plot by Erik Voeten showing Average importance of democracy for each Decade of birth
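The gap between the two ways of summarizing the same survey question can be sketched with a few made-up responses. The numbers below are invented for illustration and are not the World Values Survey data; the column names Decade and Score are also assumptions.

```python
import pandas as pd

# Made-up responses on a 1-10 scale; NOT the World Values Survey data.
responses = pd.DataFrame({
    "Decade": ["1930s"] * 5 + ["1980s"] * 5,
    "Score": [10, 10, 10, 9, 9, 10, 9, 9, 9, 8],
})

# Summarize each decade two ways: the share of respondents who answered
# exactly 10, and the average score across all respondents.
summary = responses.groupby("Decade")["Score"].agg(
    share_of_tens=lambda scores: (scores == 10).mean(),
    average_score="mean",
)
print(summary)
```

In this invented sample, the share of 10s falls from 0.6 to 0.2 while the average score only moves from 9.6 to 9.0, so a plot of the first summary looks far more alarming than a plot of the second.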

Bad perception¶

The third problem is bad perception, which refers to how humans process the information contained in a visualization. Let's walk through section 1.3 on "Perception and data visualization".