Spreadsheets¶

By the end of this lesson, students will be able to:

  • Design spreadsheet data models that enable reproducible data analysis.
  • Convert a pivot table operation to pandas groupby and vice versa.
  • Write spreadsheet formulas that apply a function over many cells.
In [17]:
import pandas as pd
from seaborn import load_dataset

Pivot tables¶

In the next lecture, we'll learn about seaborn, a data visualization library. It includes several interesting datasets, such as this dataset on life expectancy for several countries from 1970 through 2020.

In [18]:
life_expectancy = load_dataset("healthexp", index_col=["Year", "Country"])
life_expectancy
Out[18]:
Spending_USD Life_Expectancy
Year Country
1970 Germany 252.311 70.6
France 192.143 72.2
Great Britain 123.993 71.9
Japan 150.437 72.0
USA 326.961 70.9
... ... ... ...
2020 Germany 6938.983 81.1
France 5468.418 82.3
Great Britain 5018.700 80.4
Japan 4665.641 84.7
USA 11859.179 77.0

274 rows × 2 columns

What's a pivot table? Let's try pivoting the table about the Country column.

In [19]:
pivoted_table = life_expectancy.pivot_table(index="Year", columns="Country", values="Life_Expectancy")
pivoted_table
# Pivot tables can also let you aggregate (like a groupby)
Out[19]:
Country Canada France Germany Great Britain Japan USA
Year
1970 NaN 72.2 70.6 71.9 72.0 70.9
1971 72.8 NaN 70.8 71.9 72.9 71.2
1972 NaN NaN 71.0 NaN 73.2 71.2
1973 NaN NaN 71.3 NaN 73.4 71.4
1974 NaN NaN 71.5 NaN 73.7 72.0
1975 NaN 73.0 71.4 NaN 74.3 72.7
1976 73.8 NaN 71.8 NaN 74.8 72.9
1977 NaN NaN 72.5 NaN 75.3 73.3
1978 NaN NaN 72.4 NaN 75.7 73.5
1979 75.1 NaN 72.8 NaN 76.2 73.9
1980 75.2 74.3 72.9 73.2 76.1 73.7
1981 75.5 NaN 73.2 73.8 76.5 74.1
1982 75.6 NaN 73.5 74.1 76.9 74.5
1983 75.9 NaN 73.8 74.3 77.0 74.6
1984 76.2 NaN 74.3 74.5 77.4 74.7
1985 76.3 75.4 75.0 74.7 77.6 74.7
1986 76.5 NaN 75.2 74.8 78.1 74.7
1987 76.7 NaN 75.6 75.2 78.5 74.9
1988 76.8 NaN 75.9 75.3 78.4 74.9
1989 77.1 NaN 76.0 75.4 78.8 75.1
1990 77.3 77.0 77.3 75.7 78.9 75.3
1991 77.6 77.2 NaN 75.9 79.1 75.5
1992 77.8 77.5 76.1 76.3 79.2 75.7
1993 77.8 77.5 76.1 76.2 79.4 75.5
1994 77.9 78.0 76.5 76.8 79.8 75.7
1995 78.0 78.1 76.6 76.7 79.6 75.7
1996 78.2 78.2 76.9 76.9 80.3 76.1
1997 78.3 78.6 77.3 77.2 80.5 76.5
1998 78.6 78.8 77.7 77.3 80.6 76.7
1999 78.8 78.9 77.9 77.5 80.5 76.7
2000 79.1 79.2 78.2 77.9 81.2 76.7
2001 79.3 79.3 78.5 78.2 81.5 76.9
2002 79.5 79.4 78.5 78.3 81.8 77.0
2003 79.6 79.3 78.6 78.4 81.8 77.1
2004 79.9 80.3 79.2 79.0 82.1 77.6
2005 80.0 80.4 79.4 79.2 82.0 77.6
2006 80.4 81.0 79.8 79.5 82.4 77.8
2007 80.5 81.2 80.1 79.7 82.6 78.1
2008 80.7 81.4 80.2 79.8 82.7 78.1
2009 80.9 81.5 80.3 80.4 83.0 78.5
2010 81.2 81.8 80.5 80.6 82.9 78.6
2011 81.4 82.3 80.5 81.0 82.7 78.7
2012 81.6 82.1 80.6 81.0 83.2 78.8
2013 81.7 82.3 80.6 81.1 83.4 78.8
2014 81.8 82.8 81.2 81.4 83.7 78.9
2015 81.9 82.4 80.7 81.0 83.9 78.7
2016 82.0 82.7 81.0 81.2 84.1 78.7
2017 81.9 82.7 81.1 81.3 84.2 78.6
2018 82.0 82.8 81.0 81.3 84.3 78.7
2019 82.2 82.9 81.3 81.4 84.4 78.8
2020 81.7 82.3 81.1 80.4 84.7 77.0

How would we write this as a groupby operation?

In [23]:
# Show the life expectancy per country across all the years
life_expectancy.groupby(["Country", "Year"])["Life_Expectancy"].sum()
Out[23]:
Country  Year
Canada   1971    72.8
         1976    73.8
         1979    75.1
         1980    75.2
         1981    75.5
                 ... 
USA      2016    78.7
         2017    78.6
         2018    78.7
         2019    78.8
         2020    77.0
Name: Life_Expectancy, Length: 274, dtype: float64
In [24]:
life_expectancy["Life_Expectancy"]
Out[24]:
Year  Country      
1970  Germany          70.6
      France           72.2
      Great Britain    71.9
      Japan            72.0
      USA              70.9
                       ... 
2020  Germany          81.1
      France           82.3
      Great Britain    80.4
      Japan            84.7
      USA              77.0
Name: Life_Expectancy, Length: 274, dtype: float64

It turns out that spreadsheets are a powerful tool for data analysis too, and we'll spend some time now exploring how we might implement data frame operations using spreadsheets and learn more about when we might prefer to use spreadsheets for data analysis instead of Python.