Spreadsheets¶
By the end of this lesson, students will be able to:
- Design spreadsheet data models that enable reproducible data analysis.
- Convert a pivot table operation to
pandas
groupby
and vice versa. - Write spreadsheet formulas that apply a function over many cells.
For this lesson, we'll spend most of our time in the preceding notebook on groupby-and-indexing.ipynb.
Later, we'll download the earthquakes.csv
file and use it to create a spreadsheet. In lecture, we will visit sheets.new to create a new Google Sheet.
In [1]:
import pandas as pd
import seaborn as sns
sns.set_theme()
What is pivot table?¶
Let's first revisit the life expectancy dataset and use this as an example of showing what it is in pandas.
In [5]:
life_expectancy = sns.load_dataset("healthexp", index_col=["Year", "Country"])
life_expectancy
Out[5]:
Spending_USD | Life_Expectancy | ||
---|---|---|---|
Year | Country | ||
1970 | Germany | 252.311 | 70.6 |
France | 192.143 | 72.2 | |
Great Britain | 123.993 | 71.9 | |
Japan | 150.437 | 72.0 | |
USA | 326.961 | 70.9 | |
... | ... | ... | ... |
2020 | Germany | 6938.983 | 81.1 |
France | 5468.418 | 82.3 | |
Great Britain | 5018.700 | 80.4 | |
Japan | 4665.641 | 84.7 | |
USA | 11859.179 | 77.0 |
274 rows × 2 columns
Let's try pivoting the table about the "Country" column. We can read the documentation of pivot_table
here.
In [6]:
pivoted_table = life_expectancy.pivot_table(index="Year", columns="Country", values="Life_Expectancy")
pivoted_table.head()
Out[6]:
Country | Canada | France | Germany | Great Britain | Japan | USA |
---|---|---|---|---|---|---|
Year | ||||||
1970 | NaN | 72.2 | 70.6 | 71.9 | 72.0 | 70.9 |
1971 | 72.8 | NaN | 70.8 | 71.9 | 72.9 | 71.2 |
1972 | NaN | NaN | 71.0 | NaN | 73.2 | 71.2 |
1973 | NaN | NaN | 71.3 | NaN | 73.4 | 71.4 |
1974 | NaN | NaN | 71.5 | NaN | 73.7 | 72.0 |
In [7]:
life_expectancy.columns, pivoted_table.columns
Out[7]:
(Index(['Spending_USD', 'Life_Expectancy'], dtype='object'), Index(['Canada', 'France', 'Germany', 'Great Britain', 'Japan', 'USA'], dtype='object', name='Country'))
In [11]:
sns.relplot(pivoted_table, kind="line", dashes=False)
# pretty much the same as
# sns.relplot(life_expectancy, x="Year", y="Life_Expectancy", hue="Country", kind="line")
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x7fdfad43db40>
Check the pandas version of earthquakes pivot table¶
In [2]:
earthquakes = pd.read_csv("earthquakes.csv")
In [ ]:
earthquakes.groupby(["year", "month", "day"])["name"].max("magnitude")
In [4]:
earthquakes.groupby(["year", "month", "day", "name"])["magnitude"].max()
Out[4]:
year month day name 2016 7 27 Afghanistan 4.50 Alaska 3.60 Argentina 4.10 Arkansas 2.76 British Virgin Islands 2.70 ... 8 25 Oregon 2.83 Puerto Rico 2.50 Russia 4.70 South of the Fiji Islands 4.70 Washington 2.19 Name: magnitude, Length: 866, dtype: float64
In [ ]: