Spreadsheets¶
By the end of this lesson, students will be able to:
- Design spreadsheet data models that enable reproducible data analysis.
- Convert a pivot table operation to
pandas
groupby
and vice versa. - Write spreadsheet formulas that apply a function over many cells.
In [17]:
import pandas as pd
from seaborn import load_dataset
Pivot tables¶
In the next lecture, we'll learn about seaborn
, a data visualization library. It includes several interesting datasets, such as this dataset on life expectancy for several countries from 1970 through 2020.
In [18]:
life_expectancy = load_dataset("healthexp", index_col=["Year", "Country"])
life_expectancy
Out[18]:
Spending_USD | Life_Expectancy | ||
---|---|---|---|
Year | Country | ||
1970 | Germany | 252.311 | 70.6 |
France | 192.143 | 72.2 | |
Great Britain | 123.993 | 71.9 | |
Japan | 150.437 | 72.0 | |
USA | 326.961 | 70.9 | |
... | ... | ... | ... |
2020 | Germany | 6938.983 | 81.1 |
France | 5468.418 | 82.3 | |
Great Britain | 5018.700 | 80.4 | |
Japan | 4665.641 | 84.7 | |
USA | 11859.179 | 77.0 |
274 rows × 2 columns
What's a pivot table? Let's try pivoting the table about the Country
column.
In [19]:
pivoted_table = life_expectancy.pivot_table(index="Year", columns="Country", values="Life_Expectancy")
pivoted_table
# Pivot tables can also let you aggregate (like a groupby)
Out[19]:
Country | Canada | France | Germany | Great Britain | Japan | USA |
---|---|---|---|---|---|---|
Year | ||||||
1970 | NaN | 72.2 | 70.6 | 71.9 | 72.0 | 70.9 |
1971 | 72.8 | NaN | 70.8 | 71.9 | 72.9 | 71.2 |
1972 | NaN | NaN | 71.0 | NaN | 73.2 | 71.2 |
1973 | NaN | NaN | 71.3 | NaN | 73.4 | 71.4 |
1974 | NaN | NaN | 71.5 | NaN | 73.7 | 72.0 |
1975 | NaN | 73.0 | 71.4 | NaN | 74.3 | 72.7 |
1976 | 73.8 | NaN | 71.8 | NaN | 74.8 | 72.9 |
1977 | NaN | NaN | 72.5 | NaN | 75.3 | 73.3 |
1978 | NaN | NaN | 72.4 | NaN | 75.7 | 73.5 |
1979 | 75.1 | NaN | 72.8 | NaN | 76.2 | 73.9 |
1980 | 75.2 | 74.3 | 72.9 | 73.2 | 76.1 | 73.7 |
1981 | 75.5 | NaN | 73.2 | 73.8 | 76.5 | 74.1 |
1982 | 75.6 | NaN | 73.5 | 74.1 | 76.9 | 74.5 |
1983 | 75.9 | NaN | 73.8 | 74.3 | 77.0 | 74.6 |
1984 | 76.2 | NaN | 74.3 | 74.5 | 77.4 | 74.7 |
1985 | 76.3 | 75.4 | 75.0 | 74.7 | 77.6 | 74.7 |
1986 | 76.5 | NaN | 75.2 | 74.8 | 78.1 | 74.7 |
1987 | 76.7 | NaN | 75.6 | 75.2 | 78.5 | 74.9 |
1988 | 76.8 | NaN | 75.9 | 75.3 | 78.4 | 74.9 |
1989 | 77.1 | NaN | 76.0 | 75.4 | 78.8 | 75.1 |
1990 | 77.3 | 77.0 | 77.3 | 75.7 | 78.9 | 75.3 |
1991 | 77.6 | 77.2 | NaN | 75.9 | 79.1 | 75.5 |
1992 | 77.8 | 77.5 | 76.1 | 76.3 | 79.2 | 75.7 |
1993 | 77.8 | 77.5 | 76.1 | 76.2 | 79.4 | 75.5 |
1994 | 77.9 | 78.0 | 76.5 | 76.8 | 79.8 | 75.7 |
1995 | 78.0 | 78.1 | 76.6 | 76.7 | 79.6 | 75.7 |
1996 | 78.2 | 78.2 | 76.9 | 76.9 | 80.3 | 76.1 |
1997 | 78.3 | 78.6 | 77.3 | 77.2 | 80.5 | 76.5 |
1998 | 78.6 | 78.8 | 77.7 | 77.3 | 80.6 | 76.7 |
1999 | 78.8 | 78.9 | 77.9 | 77.5 | 80.5 | 76.7 |
2000 | 79.1 | 79.2 | 78.2 | 77.9 | 81.2 | 76.7 |
2001 | 79.3 | 79.3 | 78.5 | 78.2 | 81.5 | 76.9 |
2002 | 79.5 | 79.4 | 78.5 | 78.3 | 81.8 | 77.0 |
2003 | 79.6 | 79.3 | 78.6 | 78.4 | 81.8 | 77.1 |
2004 | 79.9 | 80.3 | 79.2 | 79.0 | 82.1 | 77.6 |
2005 | 80.0 | 80.4 | 79.4 | 79.2 | 82.0 | 77.6 |
2006 | 80.4 | 81.0 | 79.8 | 79.5 | 82.4 | 77.8 |
2007 | 80.5 | 81.2 | 80.1 | 79.7 | 82.6 | 78.1 |
2008 | 80.7 | 81.4 | 80.2 | 79.8 | 82.7 | 78.1 |
2009 | 80.9 | 81.5 | 80.3 | 80.4 | 83.0 | 78.5 |
2010 | 81.2 | 81.8 | 80.5 | 80.6 | 82.9 | 78.6 |
2011 | 81.4 | 82.3 | 80.5 | 81.0 | 82.7 | 78.7 |
2012 | 81.6 | 82.1 | 80.6 | 81.0 | 83.2 | 78.8 |
2013 | 81.7 | 82.3 | 80.6 | 81.1 | 83.4 | 78.8 |
2014 | 81.8 | 82.8 | 81.2 | 81.4 | 83.7 | 78.9 |
2015 | 81.9 | 82.4 | 80.7 | 81.0 | 83.9 | 78.7 |
2016 | 82.0 | 82.7 | 81.0 | 81.2 | 84.1 | 78.7 |
2017 | 81.9 | 82.7 | 81.1 | 81.3 | 84.2 | 78.6 |
2018 | 82.0 | 82.8 | 81.0 | 81.3 | 84.3 | 78.7 |
2019 | 82.2 | 82.9 | 81.3 | 81.4 | 84.4 | 78.8 |
2020 | 81.7 | 82.3 | 81.1 | 80.4 | 84.7 | 77.0 |
How would we write this as a groupby
operation?
In [23]:
# Show the life expectancy per country across all the years
life_expectancy.groupby(["Country", "Year"])["Life_Expectancy"].sum()
Out[23]:
Country Year Canada 1971 72.8 1976 73.8 1979 75.1 1980 75.2 1981 75.5 ... USA 2016 78.7 2017 78.6 2018 78.7 2019 78.8 2020 77.0 Name: Life_Expectancy, Length: 274, dtype: float64
In [24]:
life_expectancy["Life_Expectancy"]
Out[24]:
Year Country 1970 Germany 70.6 France 72.2 Great Britain 71.9 Japan 72.0 USA 70.9 ... 2020 Germany 81.1 France 82.3 Great Britain 80.4 Japan 84.7 USA 77.0 Name: Life_Expectancy, Length: 274, dtype: float64
It turns out that spreadsheets are a powerful tool for data analysis too, and we'll spend some time now exploring how we might implement data frame operations using spreadsheets and learn more about when we might prefer to use spreadsheets for data analysis instead of Python.