Spreadsheets¶

By the end of this lesson, students will be able to:

  • Design spreadsheet data models that enable reproducible data analysis.
  • Convert a pivot table operation to pandas groupby and vice versa.
  • Write spreadsheet formulas that apply a function over many cells.

For this lesson, we'll spend most of our time in the preceding notebook on groupby-and-indexing.ipynb.

Later, we'll download the earthquakes.csv file and use it to create a spreadsheet. In lecture, we will visit sheets.new to create a new Google Sheet.

In [1]:
import pandas as pd
earthquakes = pd.read_csv("earthquakes.csv")

In the spreadsheet, we created a pivot table on the rows year, month, day, and name with the maximum magnitude value. How do we express this computation using Pandas?

In [4]:
# Option 1
earthquakes.groupby(["year", "month", "day"])["name"].max("magnitude")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 2
      1 # Option 1
----> 2 earthquakes.groupby(["year", "month", "day"])["name"].max("magnitude")

File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:3247, in GroupBy.max(self, numeric_only, min_count, engine, engine_kwargs)
   3239     return self._numba_agg_general(
   3240         grouped_min_max,
   3241         executor.identity_dtype_mapping,
   (...)
   3244         is_max=True,
   3245     )
   3246 else:
-> 3247     return self._agg_general(
   3248         numeric_only=numeric_only,
   3249         min_count=min_count,
   3250         alias="max",
   3251         npfunc=np.max,
   3252     )

File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1839, in GroupBy._agg_general(self, numeric_only, min_count, alias, npfunc)
   1830 @final
   1831 def _agg_general(
   1832     self,
   (...)
   1837     npfunc: Callable,
   1838 ):
-> 1839     result = self._cython_agg_general(
   1840         how=alias,
   1841         alt=npfunc,
   1842         numeric_only=numeric_only,
   1843         min_count=min_count,
   1844     )
   1845     return result.__finalize__(self.obj, method="groupby")

File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1901, in GroupBy._cython_agg_general(self, how, alt, numeric_only, min_count, **kwargs)
   1889 @final
   1890 def _cython_agg_general(
   1891     self,
   (...)
   1898     # Note: we never get here with how="ohlc" for DataFrameGroupBy;
   1899     #  that goes through SeriesGroupBy
-> 1901     data = self._get_data_to_aggregate(numeric_only=numeric_only, name=how)
   1903     def array_func(values: ArrayLike) -> ArrayLike:
   1904         try:

File /opt/conda/lib/python3.10/site-packages/pandas/core/groupby/generic.py:166, in SeriesGroupBy._get_data_to_aggregate(self, numeric_only, name)
    163 if numeric_only and not is_numeric_dtype(ser.dtype):
    164     # GH#41291 match Series behavior
    165     kwd_name = "numeric_only"
--> 166     raise TypeError(
    167         f"Cannot use {kwd_name}=True with "
    168         f"{type(self).__name__}.{name} and non-numeric dtypes."
    169     )
    170 return single

TypeError: Cannot use numeric_only=True with SeriesGroupBy.max and non-numeric dtypes.
In [5]:
# Option 2
earthquakes.groupby(["year", "month", "day", "name"])["magnitude"].max()
Out[5]:
year  month  day  name                     
2016  7      27   Afghanistan                  4.50
                  Alaska                       3.60
                  Argentina                    4.10
                  Arkansas                     2.76
                  British Virgin Islands       2.70
                                               ... 
      8      25   Oregon                       2.83
                  Puerto Rico                  2.50
                  Russia                       4.70
                  South of the Fiji Islands    4.70
                  Washington                   2.19
Name: magnitude, Length: 866, dtype: float64