Spreadsheets¶

By the end of this lesson, students will be able to:

  • Design spreadsheet data models that enable reproducible data analysis.
  • Convert a pivot table operation to pandas groupby and vice versa.
  • Write spreadsheet formulas that apply a function over many cells.

For this lesson, we'll spend most of our time in the preceding notebook on groupby-and-indexing.ipynb.

Later, we'll download the earthquakes.csv file and use it to create a spreadsheet. In lecture, we will visit sheets.new to create a new Google Sheet.

In [1]:
import pandas as pd
import seaborn as sns

sns.set_theme()

What is pivot table?¶

Let's first revisit the life expectancy dataset and use this as an example of showing what it is in pandas.

In [5]:
life_expectancy = sns.load_dataset("healthexp", index_col=["Year", "Country"])
life_expectancy
Out[5]:
Spending_USD Life_Expectancy
Year Country
1970 Germany 252.311 70.6
France 192.143 72.2
Great Britain 123.993 71.9
Japan 150.437 72.0
USA 326.961 70.9
... ... ... ...
2020 Germany 6938.983 81.1
France 5468.418 82.3
Great Britain 5018.700 80.4
Japan 4665.641 84.7
USA 11859.179 77.0

274 rows × 2 columns

Let's try pivoting the table about the "Country" column. We can read the documentation of pivot_table here.

In [6]:
pivoted_table = life_expectancy.pivot_table(index="Year", columns="Country", values="Life_Expectancy")
pivoted_table.head()
Out[6]:
Country Canada France Germany Great Britain Japan USA
Year
1970 NaN 72.2 70.6 71.9 72.0 70.9
1971 72.8 NaN 70.8 71.9 72.9 71.2
1972 NaN NaN 71.0 NaN 73.2 71.2
1973 NaN NaN 71.3 NaN 73.4 71.4
1974 NaN NaN 71.5 NaN 73.7 72.0
In [7]:
life_expectancy.columns, pivoted_table.columns
Out[7]:
(Index(['Spending_USD', 'Life_Expectancy'], dtype='object'),
 Index(['Canada', 'France', 'Germany', 'Great Britain', 'Japan', 'USA'], dtype='object', name='Country'))
In [11]:
sns.relplot(pivoted_table, kind="line", dashes=False)
# pretty much the same as
# sns.relplot(life_expectancy, x="Year", y="Life_Expectancy", hue="Country", kind="line")
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x7fdfad43db40>
No description has been provided for this image

Check the pandas version of earthquakes pivot table¶

In [2]:
earthquakes = pd.read_csv("earthquakes.csv")
In [ ]:
earthquakes.groupby(["year", "month", "day"])["name"].max("magnitude")
In [4]:
earthquakes.groupby(["year", "month", "day", "name"])["magnitude"].max()
Out[4]:
year  month  day  name                     
2016  7      27   Afghanistan                  4.50
                  Alaska                       3.60
                  Argentina                    4.10
                  Arkansas                     2.76
                  British Virgin Islands       2.70
                                               ... 
      8      25   Oregon                       2.83
                  Puerto Rico                  2.50
                  Russia                       4.70
                  South of the Fiji Islands    4.70
                  Washington                   2.19
Name: magnitude, Length: 866, dtype: float64
In [ ]: