Spreadsheets¶
By the end of this lesson, students will be able to:
- Design spreadsheet data models that enable reproducible data analysis.
- Convert a pivot table operation to
pandas
groupby
and vice versa. - Write spreadsheet formulas that apply a function over many cells.
For this lesson, we'll spend most of our time in the preceding notebook on groupby-and-indexing.ipynb.
Later, we'll download the earthquakes.csv
file and use it to create a spreadsheet. In lecture, we will visit sheets.new to create a new Google Sheet.
In [1]:
import pandas as pd
earthquakes = pd.read_csv("earthquakes.csv")
In the spreadsheet, we created a pivot table on the rows year, month, day, and name with the maximum magnitude value. How do we express this computation using Pandas?
In [4]:
# Option 1
earthquakes.groupby(["year", "month", "day"])["name"].max("magnitude")
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[4], line 2 1 # Option 1 ----> 2 earthquakes.groupby(["year", "month", "day"])["name"].max("magnitude") File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:3247, in GroupBy.max(self, numeric_only, min_count, engine, engine_kwargs) 3239 return self._numba_agg_general( 3240 grouped_min_max, 3241 executor.identity_dtype_mapping, (...) 3244 is_max=True, 3245 ) 3246 else: -> 3247 return self._agg_general( 3248 numeric_only=numeric_only, 3249 min_count=min_count, 3250 alias="max", 3251 npfunc=np.max, 3252 ) File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1839, in GroupBy._agg_general(self, numeric_only, min_count, alias, npfunc) 1830 @final 1831 def _agg_general( 1832 self, (...) 1837 npfunc: Callable, 1838 ): -> 1839 result = self._cython_agg_general( 1840 how=alias, 1841 alt=npfunc, 1842 numeric_only=numeric_only, 1843 min_count=min_count, 1844 ) 1845 return result.__finalize__(self.obj, method="groupby") File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1901, in GroupBy._cython_agg_general(self, how, alt, numeric_only, min_count, **kwargs) 1889 @final 1890 def _cython_agg_general( 1891 self, (...) 1898 # Note: we never get here with how="ohlc" for DataFrameGroupBy; 1899 # that goes through SeriesGroupBy -> 1901 data = self._get_data_to_aggregate(numeric_only=numeric_only, name=how) 1903 def array_func(values: ArrayLike) -> ArrayLike: 1904 try: File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/generic.py:166, in SeriesGroupBy._get_data_to_aggregate(self, numeric_only, name) 163 if numeric_only and not is_numeric_dtype(ser.dtype): 164 # GH#41291 match Series behavior 165 kwd_name = "numeric_only" --> 166 raise TypeError( 167 f"Cannot use {kwd_name}=True with " 168 f"{type(self).__name__}.{name} and non-numeric dtypes." 169 ) 170 return single TypeError: Cannot use numeric_only=True with SeriesGroupBy.max and non-numeric dtypes.
In [5]:
# Option 2
earthquakes.groupby(["year", "month", "day", "name"])["magnitude"].max()
Out[5]:
year month day name 2016 7 27 Afghanistan 4.50 Alaska 3.60 Argentina 4.10 Arkansas 2.76 British Virgin Islands 2.70 ... 8 25 Oregon 2.83 Puerto Rico 2.50 Russia 4.70 South of the Fiji Islands 4.70 Washington 2.19 Name: magnitude, Length: 866, dtype: float64