Groupby and Indexing¶

In this lesson, we'll learn about an important DataFrame operation called groupby. Along the way, we'll also discuss how this groupby operation introduces an extra level of complexity to indexing and slicing values. By the end of this lesson, students will be able to:

  • Apply the groupby operation to a list of dictionaries and to a pandas DataFrame.
  • Select values from a hierarchical index using tuples and the slice object as keys.
  • Apply the apply operation to a pandas Series or DataFrame.

Previously, we learned how to find the largest earthquake in a dataset using both a list of dictionaries and a pandas DataFrame. How about finding the largest earthquake for each place in the dataset?

In [1]:
import doctest
import io
import pandas as pd

To help visualize our work, the following dataset contains the first 12 rows from earthquakes.csv.

In [25]:
csv = """
id,year,month,day,latitude,longitude,name,magnitude
nc72666881,2016,7,27,37.6723333,-121.619,California,1.43
us20006i0y,2016,7,27,21.5146,94.5721,Burma,4.9
nc72666891,2016,7,27,37.5765,-118.85916670000002,California,0.06
nc72666896,2016,7,27,37.5958333,-118.99483329999998,California,0.4
nn00553447,2016,7,27,39.3775,-119.845,Nevada,0.3
ak13805337,2016,7,27,61.2963,-152.46,Alaska,1.8
hv61354276,2016,7,27,19.4235,-155.60983330000005,Hawaii,1.0
ak13805339,2016,7,27,61.3019,-152.4507,Alaska,2.0
ci37640584,2016,7,27,35.503,-118.40583329999998,California,1.2
nc72666901,2016,7,27,37.673,-121.6133333,California,1.67
ci37640592,2016,7,27,33.5888333,-116.8165,California,0.48
nn00553416,2016,7,27,38.2638,-118.7351,Nevada,0.9
"""

earthquakes = pd.read_csv(io.StringIO(csv), index_col="id")
earthquakes
Out[25]:
year month day latitude longitude name magnitude
id
nc72666881 2016 7 27 37.672333 -121.619000 California 1.43
us20006i0y 2016 7 27 21.514600 94.572100 Burma 4.90
nc72666891 2016 7 27 37.576500 -118.859167 California 0.06
nc72666896 2016 7 27 37.595833 -118.994833 California 0.40
nn00553447 2016 7 27 39.377500 -119.845000 Nevada 0.30
ak13805337 2016 7 27 61.296300 -152.460000 Alaska 1.80
hv61354276 2016 7 27 19.423500 -155.609833 Hawaii 1.00
ak13805339 2016 7 27 61.301900 -152.450700 Alaska 2.00
ci37640584 2016 7 27 35.503000 -118.405833 California 1.20
nc72666901 2016 7 27 37.673000 -121.613333 California 1.67
ci37640592 2016 7 27 33.588833 -116.816500 California 0.48
nn00553416 2016 7 27 38.263800 -118.735100 Nevada 0.90

Groupby in plain Python¶

Let's first see how we can solve this problem using the list of dictionaries approach.

In [3]:
# Create a dictionary mapping each place name to the maximum magnitude recorded there
max_mag_by_place = {}
for earthquake in earthquakes.to_dict("records"): # Convert to list of dictionaries
    if earthquake["name"] not in max_mag_by_place or earthquake["magnitude"] > max_mag_by_place[earthquake["name"]]:
        max_mag_by_place[earthquake["name"]] = earthquake["magnitude"]

max_mag_by_place
Out[3]:
{'California': 1.67, 'Burma': 4.9, 'Nevada': 0.9, 'Alaska': 2.0, 'Hawaii': 1.0}

Groupby in Pandas¶

The inventors of pandas defined a DataFrame function called groupby to streamline this operation into a single expression.

In [5]:
# What's the maximum magnitude earthquake across the entire dataset?
earthquakes["magnitude"].max()
Out[5]:
4.9
In [4]:
# When you're writing Pandas code, probably don't write a for loop.
earthquakes.groupby("name")["magnitude"].max()
Out[4]:
name
Alaska        2.00
Burma         4.90
California    1.67
Hawaii        1.00
Nevada        0.90
Name: magnitude, dtype: float64

What's going on here? We can take a closer look at each step of the process in PandasTutor. In summary, this expression:

  1. Calls earthquakes.groupby("name") to split the earthquakes into groups by "name".
  2. For each group, selects the column "magnitude" indicated in square brackets.
  3. Combines (summarizes) each group on the selected column using the max() function.

groupby helps us quickly answer questions involving "grouping by" one or more columns and then summarizing the data in another column.
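
To make these three steps concrete, here's the same expression unpacked into intermediate variables (a sketch reusing the earthquakes DataFrame from above; the variable names are our own):

groups = earthquakes.groupby("name")   # 1. split the rows into groups by "name"
magnitudes = groups["magnitude"]       # 2. select the "magnitude" column
magnitudes.max()                       # 3. combine (summarize) each group using max()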

The best part about pandas groupby is that it allows us to quickly answer many different kinds of questions following the same format. For example, suppose we want to compute descriptive statistics for all the earthquake magnitudes that occurred on each day. Let's read the full dataset and try it out.

In [6]:
earthquakes = pd.read_csv("earthquakes.csv", index_col="id")
earthquakes
Out[6]:
year month day latitude longitude name magnitude
id
nc72666881 2016 7 27 37.672333 -121.619000 California 1.43
us20006i0y 2016 7 27 21.514600 94.572100 Burma 4.90
nc72666891 2016 7 27 37.576500 -118.859167 California 0.06
nc72666896 2016 7 27 37.595833 -118.994833 California 0.40
nn00553447 2016 7 27 39.377500 -119.845000 Nevada 0.30
... ... ... ... ... ... ... ...
nc72685246 2016 8 25 36.515499 -121.099831 California 2.42
ak13879193 2016 8 25 61.498400 -149.862700 Alaska 1.40
nc72685251 2016 8 25 38.805000 -122.821503 California 1.06
ci37672328 2016 8 25 34.308000 -118.635333 California 1.55
ci37672360 2016 8 25 34.119167 -116.933667 California 0.89

8394 rows × 7 columns

In [7]:
magnitudes_per_day = earthquakes.groupby(["year", "month", "day"])["magnitude"].describe()
magnitudes_per_day
Out[7]:
count mean std min 25% 50% 75% max
year month day
2016 7 27 272.0 1.617574 1.095349 0.06 0.9000 1.400 2.0000 5.60
28 308.0 1.448149 0.896851 0.10 0.8775 1.240 1.8000 5.10
29 309.0 1.640129 1.165952 0.01 0.8800 1.450 1.9000 7.70
30 329.0 1.615076 1.262618 0.03 0.7000 1.240 2.0000 5.70
31 278.0 1.750827 1.261577 0.10 0.9000 1.500 2.1475 5.90
8 1 356.0 1.520056 1.157326 0.04 0.8000 1.245 1.8025 6.10
2 292.0 1.539418 1.089946 0.05 0.8000 1.300 1.9000 5.50
3 294.0 1.556327 1.147365 0.01 0.8300 1.200 1.8150 5.10
4 420.0 1.249190 1.034738 0.05 0.6000 1.000 1.5825 6.30
5 256.0 1.428789 1.144244 0.10 0.6200 1.185 1.7150 5.70
6 316.0 1.313228 1.065587 0.09 0.5600 1.100 1.6200 5.40
7 316.0 1.356994 1.078556 0.10 0.6000 1.120 1.7425 5.10
8 335.0 1.484925 1.131495 0.02 0.6300 1.200 1.9900 5.10
9 272.0 1.614779 1.164186 0.10 0.8075 1.300 1.9000 5.09
10 329.0 1.404742 1.038701 0.02 0.7700 1.170 1.7400 5.40
11 356.0 1.390534 1.159147 0.04 0.6775 1.100 1.7000 7.20
12 326.0 1.533282 1.158696 0.04 0.7400 1.200 1.9000 5.40
13 284.0 1.421901 1.080338 0.10 0.7000 1.105 1.7775 5.20
14 231.0 1.692684 1.372191 0.05 0.8200 1.200 1.9000 5.90
15 222.0 1.583964 1.157553 0.07 0.8550 1.300 1.8400 5.40
16 223.0 1.629910 1.223131 0.10 0.7300 1.300 2.0000 5.10
17 220.0 1.583682 1.203617 0.10 0.8000 1.210 1.9675 5.20
18 219.0 1.499772 1.159497 0.04 0.7600 1.200 1.8050 5.90
19 226.0 1.819469 1.416291 0.11 0.8925 1.340 2.1900 7.40
20 237.0 1.553207 1.296262 0.02 0.7000 1.200 2.0000 6.40
21 266.0 1.332368 1.032210 0.03 0.6050 1.140 1.6925 5.10
22 215.0 1.451488 1.185657 0.01 0.7000 1.180 1.8000 5.60
23 233.0 1.643391 1.245661 0.04 0.8200 1.300 2.1000 6.20
24 216.0 1.553194 1.144054 0.03 0.8375 1.290 1.9000 6.80
25 238.0 1.519328 0.926028 0.10 0.8900 1.400 1.9950 5.90
In [12]:
earthquakes.groupby(["day", "month", "year"])["magnitude"].describe()
Out[12]:
count mean std min 25% 50% 75% max
day month year
1 8 2016 356.0 1.520056 1.157326 0.04 0.8000 1.245 1.8025 6.10
2 8 2016 292.0 1.539418 1.089946 0.05 0.8000 1.300 1.9000 5.50
3 8 2016 294.0 1.556327 1.147365 0.01 0.8300 1.200 1.8150 5.10
4 8 2016 420.0 1.249190 1.034738 0.05 0.6000 1.000 1.5825 6.30
5 8 2016 256.0 1.428789 1.144244 0.10 0.6200 1.185 1.7150 5.70
6 8 2016 316.0 1.313228 1.065587 0.09 0.5600 1.100 1.6200 5.40
7 8 2016 316.0 1.356994 1.078556 0.10 0.6000 1.120 1.7425 5.10
8 8 2016 335.0 1.484925 1.131495 0.02 0.6300 1.200 1.9900 5.10
9 8 2016 272.0 1.614779 1.164186 0.10 0.8075 1.300 1.9000 5.09
10 8 2016 329.0 1.404742 1.038701 0.02 0.7700 1.170 1.7400 5.40
11 8 2016 356.0 1.390534 1.159147 0.04 0.6775 1.100 1.7000 7.20
12 8 2016 326.0 1.533282 1.158696 0.04 0.7400 1.200 1.9000 5.40
13 8 2016 284.0 1.421901 1.080338 0.10 0.7000 1.105 1.7775 5.20
14 8 2016 231.0 1.692684 1.372191 0.05 0.8200 1.200 1.9000 5.90
15 8 2016 222.0 1.583964 1.157553 0.07 0.8550 1.300 1.8400 5.40
16 8 2016 223.0 1.629910 1.223131 0.10 0.7300 1.300 2.0000 5.10
17 8 2016 220.0 1.583682 1.203617 0.10 0.8000 1.210 1.9675 5.20
18 8 2016 219.0 1.499772 1.159497 0.04 0.7600 1.200 1.8050 5.90
19 8 2016 226.0 1.819469 1.416291 0.11 0.8925 1.340 2.1900 7.40
20 8 2016 237.0 1.553207 1.296262 0.02 0.7000 1.200 2.0000 6.40
21 8 2016 266.0 1.332368 1.032210 0.03 0.6050 1.140 1.6925 5.10
22 8 2016 215.0 1.451488 1.185657 0.01 0.7000 1.180 1.8000 5.60
23 8 2016 233.0 1.643391 1.245661 0.04 0.8200 1.300 2.1000 6.20
24 8 2016 216.0 1.553194 1.144054 0.03 0.8375 1.290 1.9000 6.80
25 8 2016 238.0 1.519328 0.926028 0.10 0.8900 1.400 1.9950 5.90
27 7 2016 272.0 1.617574 1.095349 0.06 0.9000 1.400 2.0000 5.60
28 7 2016 308.0 1.448149 0.896851 0.10 0.8775 1.240 1.8000 5.10
29 7 2016 309.0 1.640129 1.165952 0.01 0.8800 1.450 1.9000 7.70
30 7 2016 329.0 1.615076 1.262618 0.03 0.7000 1.240 2.0000 5.70
31 7 2016 278.0 1.750827 1.261577 0.10 0.9000 1.500 2.1475 5.90

Explain in your own words the result of the following code snippet.

In [10]:
# For each unique place name, what is the maximum latitude of any earthquake that occurred there?
earthquakes.groupby("name")["latitude"].max()
Out[10]:
name
Afghanistan                       36.634500
Alaska                            70.778700
Anguilla                          18.435800
Argentina                        -22.394200
Arizona                           36.811667
                                    ...    
Washington                        48.965667
West Virginia                     37.863000
Western Indian-Antarctic Ridge   -49.281000
Western Xizang                    34.444600
Wyoming                           44.749000
Name: latitude, Length: 118, dtype: float64
In [11]:
# For each unique place name, what is the label for the earthquake with the max latitude that occurred there?
earthquakes.groupby("name")["latitude"].idxmax()
Out[11]:
name
Afghanistan                       us10006djl
Alaska                            ak13841660
Anguilla                          pr16211003
Argentina                         us10006a1d
Arizona                           uu60159042
                                     ...    
Washington                        uw61189756
West Virginia                     se60029113
Western Indian-Antarctic Ridge    us10006cyi
Western Xizang                    us10006d57
Wyoming                           uu60157532
Name: latitude, Length: 118, dtype: object
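
Since idxmax returns row labels from the original DataFrame, we can pass them back into .loc to retrieve the full row for each group's highest-latitude earthquake. A sketch (reusing the earthquakes DataFrame from above):

# Look up the complete rows for each place's max-latitude earthquake
earthquakes.loc[earthquakes.groupby("name")["latitude"].idxmax()]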

Hierarchical indexing¶

If you look closely at the magnitudes_per_day DataFrame, you'll notice something interesting: there are three index columns in bold on the left to denote each year, month, and day group. In pandas, a DataFrame can have a hierarchical (aka multi-level) index called a MultiIndex.

In [13]:
magnitudes_per_day.index
Out[13]:
MultiIndex([(2016, 7, 27),
            (2016, 7, 28),
            (2016, 7, 29),
            (2016, 7, 30),
            (2016, 7, 31),
            (2016, 8,  1),
            (2016, 8,  2),
            (2016, 8,  3),
            (2016, 8,  4),
            (2016, 8,  5),
            (2016, 8,  6),
            (2016, 8,  7),
            (2016, 8,  8),
            (2016, 8,  9),
            (2016, 8, 10),
            (2016, 8, 11),
            (2016, 8, 12),
            (2016, 8, 13),
            (2016, 8, 14),
            (2016, 8, 15),
            (2016, 8, 16),
            (2016, 8, 17),
            (2016, 8, 18),
            (2016, 8, 19),
            (2016, 8, 20),
            (2016, 8, 21),
            (2016, 8, 22),
            (2016, 8, 23),
            (2016, 8, 24),
            (2016, 8, 25)],
           names=['year', 'month', 'day'])

We can select values from a MultiIndex by passing .loc a Python tuple as the key. However, the syntax is somewhat unusual, particularly when combined with slicing, due to limitations in the Python programming language. For each example below, predict the output type (single value, 1-d Series, or 2-d DataFrame) as well as the contents of the output before running it.

In [14]:
# How many earthquakes occurred on 2016-07-27?
# Is this a single value, a series, or a dataframe?
magnitudes_per_day.loc[(2016, 7, 27), "count"]
Out[14]:
272.0
In [18]:
# Get only the count column (for all the rows) from magnitudes_per_day.
magnitudes_per_day.loc[:, "count"]
Out[18]:
year  month  day
2016  7      27     272.0
             28     308.0
             29     309.0
             30     329.0
             31     278.0
      8      1      356.0
             2      292.0
             3      294.0
             4      420.0
             5      256.0
             6      316.0
             7      316.0
             8      335.0
             9      272.0
             10     329.0
             11     356.0
             12     326.0
             13     284.0
             14     231.0
             15     222.0
             16     223.0
             17     220.0
             18     219.0
             19     226.0
             20     237.0
             21     266.0
             22     215.0
             23     233.0
             24     216.0
             25     238.0
Name: count, dtype: float64
In [19]:
# How many earthquakes occurred between August 10-15, 2016?
# Will this include or exclude the endpoint?
# In pandas, loc is label-based! With label-based operations, it's not always possible
#   to know the next label after your endpoint. Therefore, pandas includes the endpoint
#   even when slicing.
# Will this be a single value, a series, or a dataframe?
magnitudes_per_day.loc[(2016, 8, 10:15), "count"]

# Python does not allow the colon for slicing inside the context of a tuple!
  Cell In[19], line 6
    magnitudes_per_day.loc[(2016, 8, 10:15), "count"]
                                       ^
SyntaxError: invalid syntax
In [20]:
(2016, 8, 10:15)
  Cell In[20], line 1
    (2016, 8, 10:15)
                ^
SyntaxError: invalid syntax
In [21]:
(2016, 8, slice(10, 15))
Out[21]:
(2016, 8, slice(10, 15, None))
In [22]:
# How many earthquakes occurred between August 10-15, 2016?
# Here is the corrected version using the built-in slice function! Python should be happy!
magnitudes_per_day.loc[(2016, 8, slice(10, 15)), "count"]
Out[22]:
year  month  day
2016  8      10     329.0
             11     356.0
             12     326.0
             13     284.0
             14     231.0
             15     222.0
Name: count, dtype: float64
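
As an aside, pandas also provides a helper object called pd.IndexSlice that lets us write the familiar colon syntax when slicing a MultiIndex. A sketch equivalent to the slice(10, 15) version above:

# pd.IndexSlice[2016, 8, 10:15] builds the tuple (2016, 8, slice(10, 15, None)) for us
magnitudes_per_day.loc[pd.IndexSlice[2016, 8, 10:15], "count"]
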
In [23]:
# How many earthquakes occurred in August 2016?
magnitudes_per_day.loc[(2016, 8), "count"]
Out[23]:
day
1     356.0
2     292.0
3     294.0
4     420.0
5     256.0
6     316.0
7     316.0
8     335.0
9     272.0
10    329.0
11    356.0
12    326.0
13    284.0
14    231.0
15    222.0
16    223.0
17    220.0
18    219.0
19    226.0
20    237.0
21    266.0
22    215.0
23    233.0
24    216.0
25    238.0
Name: count, dtype: float64
In [24]:
# How many earthquakes occurred in August 2016? (But keeping the year and month levels!)
magnitudes_per_day.loc[(2016, 8, slice()), "count"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 2
      1 # How many earthquakes occurred in August 2016?
----> 2 magnitudes_per_day.loc[(2016, 8, slice()), "count"]

TypeError: slice expected at least 1 argument, got 0
In [25]:
# How many earthquakes occurred in August 2016? (But keeping the year and month levels!)
magnitudes_per_day.loc[(2016, 8, slice(None)), "count"]
Out[25]:
year  month  day
2016  8      1      356.0
             2      292.0
             3      294.0
             4      420.0
             5      256.0
             6      316.0
             7      316.0
             8      335.0
             9      272.0
             10     329.0
             11     356.0
             12     326.0
             13     284.0
             14     231.0
             15     222.0
             16     223.0
             17     220.0
             18     219.0
             19     226.0
             20     237.0
             21     266.0
             22     215.0
             23     233.0
             24     216.0
             25     238.0
Name: count, dtype: float64
In [26]:
# How many earthquakes occurred on these two dates?
magnitudes_per_day.loc[[(2016, 8, 1), (2016, 8, 15)], "count"]
Out[26]:
year  month  day
2016  8      1      356.0
             15     222.0
Name: count, dtype: float64
In [27]:
# On the days where there were fewer than 220 earthquakes, how many earthquakes were there?
magnitudes_per_day.loc[magnitudes_per_day["count"] < 220, "count"]
Out[27]:
year  month  day
2016  8      18     219.0
             22     215.0
             24     216.0
Name: count, dtype: float64
In [28]:
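# slice(10) means slice(None, 10): every day up to and including day 10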
magnitudes_per_day.loc[(2016, 8, slice(10)), "count"]
Out[28]:
year  month  day
2016  8      1      356.0
             2      292.0
             3      294.0
             4      420.0
             5      256.0
             6      316.0
             7      316.0
             8      335.0
             9      272.0
             10     329.0
Name: count, dtype: float64
In [29]:
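# slice(10, None): every day from day 10 onward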
magnitudes_per_day.loc[(2016, 8, slice(10, None)), "count"]
Out[29]:
year  month  day
2016  8      10     329.0
             11     356.0
             12     326.0
             13     284.0
             14     231.0
             15     222.0
             16     223.0
             17     220.0
             18     219.0
             19     226.0
             20     237.0
             21     266.0
             22     215.0
             23     233.0
             24     216.0
             25     238.0
Name: count, dtype: float64
In [32]:
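# Slices can appear at any level: months 7 through 8, days 10 onward (endpoints included)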
magnitudes_per_day.loc[(2016, slice(7, 8), slice(10, None)), "count"]
Out[32]:
year  month  day
2016  7      27     272.0
             28     308.0
             29     309.0
             30     329.0
             31     278.0
      8      10     329.0
             11     356.0
             12     326.0
             13     284.0
             14     231.0
             15     222.0
             16     223.0
             17     220.0
             18     219.0
             19     226.0
             20     237.0
             21     266.0
             22     215.0
             23     233.0
             24     216.0
             25     238.0
Name: count, dtype: float64
In [30]:
magnitudes_per_day.loc[(2016, 8, 10:), "count"]
  Cell In[30], line 1
    magnitudes_per_day.loc[(2016, 8, 10:), "count"]
                                       ^
SyntaxError: invalid syntax
In [31]:
# Main problem is that we're passing too many arguments to loc[rows, columns]
magnitudes_per_day.loc[2016, 8, 10:, "count"]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[31], line 1
----> 1 magnitudes_per_day.loc[2016, 8, 10:, "count"]

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1184, in _LocationIndexer.__getitem__(self, key)
   1182     if self._is_scalar_access(key):
   1183         return self.obj._get_value(*key, takeable=self._takeable)
-> 1184     return self._getitem_tuple(key)
   1185 else:
   1186     # we by definition only have the 0th axis
   1187     axis = self.axis or 0

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1368, in _LocIndexer._getitem_tuple(self, tup)
   1366 with suppress(IndexingError):
   1367     tup = self._expand_ellipsis(tup)
-> 1368     return self._getitem_lowerdim(tup)
   1370 # no multi-index, so validate all of the indexers
   1371 tup = self._validate_tuple_indexer(tup)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1041, in _LocationIndexer._getitem_lowerdim(self, tup)
   1039 # we may have a nested tuples indexer here
   1040 if self._is_nested_tuple_indexer(tup):
-> 1041     return self._getitem_nested_tuple(tup)
   1043 # we maybe be using a tuple to represent multiple dimensions here
   1044 ax0 = self.obj._get_axis(0)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1140, in _LocationIndexer._getitem_nested_tuple(self, tup)
   1137     # this is a series with a multi-index specified a tuple of
   1138     # selectors
   1139     axis = self.axis or 0
-> 1140     return self._getitem_axis(tup, axis=axis)
   1142 # handle the multi-axis by taking sections and reducing
   1143 # this is iterative
   1144 obj = self.obj

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexing.py:1424, in _LocIndexer._getitem_axis(self, key, axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):
-> 1424     locs = labels.get_locs(key)
   1425     indexer: list[slice | npt.NDArray[np.intp]] = [slice(None)] * self.ndim
   1426     indexer[axis] = locs

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/multi.py:3536, in MultiIndex.get_locs(self, seq)
   3532     continue
   3534 else:
   3535     # a slice or a single label
-> 3536     lvl_indexer = self._get_level_indexer(k, level=i, indexer=indexer)
   3538 # update indexer
   3539 lvl_indexer = _to_bool_indexer(lvl_indexer)

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/multi.py:3316, in MultiIndex._get_level_indexer(self, key, level, indexer)
   3308 def _get_level_indexer(
   3309     self, key, level: int = 0, indexer: npt.NDArray[np.bool_] | None = None
   3310 ):
   (...)
   3313     # in the totality of values
   3314     # if the indexer is provided, then use this
-> 3316     level_index = self.levels[level]
   3317     level_codes = self.codes[level]
   3319     def convert_indexer(start, stop, step, indexer=indexer, codes=level_codes):
   3320         # Compute a bool indexer to identify the positions to take.
   3321         # If we have an existing indexer, we only need to examine the
   3322         # subset of positions where the existing indexer is True.

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/frozen.py:79, in FrozenList.__getitem__(self, n)
     77 if isinstance(n, slice):
     78     return type(self)(super().__getitem__(n))
---> 79 return super().__getitem__(n)

IndexError: list index out of range

Practice: UFO sightings¶

UFO (unidentified flying object) sightings have received attention from the US Congress in the past couple of years. We've collected a public dataset consisting of 1001 reported UFO sightings around the world to help us practice groupby operations.

In [33]:
ufos = pd.read_csv("ufos.csv", index_col="datetime")
ufos
Out[33]:
city state country shape duration (seconds) duration (hours/min) comments date posted latitude longitude
datetime
10/10/1949 20:30 san marcos tx us cylinder 2700.0 45 minutes This event took place in early fall around 194... 4/27/2004 29.883056 -97.941111
10/10/1949 21:00 lackland afb tx NaN light 7200.0 1-2 hrs 1949 Lackland AFB&#44 TX. Lights racing acros... 12/16/2005 29.384210 -98.581082
10/10/1955 17:00 chester (uk/england) NaN gb circle 20.0 20 seconds Green/Orange circular disc over Chester&#44 En... 1/21/2008 53.200000 -2.916667
10/10/1956 21:00 edna tx us circle 20.0 1/2 hour My older brother and twin sister were leaving ... 1/17/2004 28.978333 -96.645833
10/10/1960 20:00 kaneohe hi us light 900.0 15 minutes AS a Marine 1st Lt. flying an FJ4B fighter/att... 1/22/2004 21.418056 -157.803611
... ... ... ... ... ... ... ... ... ... ...
10/12/1982 05:30 pearlington ms us light 29.0 29 seconds Round light which was observed going into spac... 12/12/2013 30.246389 -89.611111
10/12/1985 23:45 swansboro nc us disk 300.0 23:45 - 23:50 My sister and I observed a disk for about 5 mi... 12/2/2000 34.687500 -77.119444
10/12/1988 16:30 melbourne (vic&#44 australia) NaN au cigar 900.0 15 min&#39s Large cigar shaped craft&#44flying sideways&#4... 9/29/2004 -37.813938 144.963425
10/12/1994 11:55 schenectady ny us other 120.0 a few minutes I had just picked up my Daugther (2yrs old) fr... 4/1/2001 42.814167 -73.940000
10/12/1994 15:00 monticello ky us chevron 120.0 1-2 minutes Triangular/chevron small object with fixed lig... 10/30/2006 36.829722 -84.849167

1001 rows × 10 columns

Compute the average (mean) "duration (seconds)" for each UFO "shape".

In [35]:
ufos.groupby("shape")["duration (seconds)"].mean()
Out[35]:
shape
changing     9.265600e+02
chevron      3.111250e+02
cigar        8.217407e+02
circle       6.804353e+02
cone         3.000000e+02
cross        6.000000e+01
cylinder     1.499556e+03
delta        1.440000e+04
diamond      1.140300e+03
disk         1.143329e+03
egg          3.088000e+03
fireball     3.467656e+02
flash        4.639091e+02
formation    1.913088e+03
light        1.122005e+03
other        9.259301e+04
oval         1.425136e+03
rectangle    2.285882e+02
sphere       1.506268e+06
teardrop     1.397143e+02
triangle     7.352900e+02
unknown      1.207963e+03
Name: duration (seconds), dtype: float64

Since we're focusing on the US Congress, identify the UFO sighting with the longest "duration (seconds)" for each "city" in the US ("us"). Do not include any cities outside the US.

In [37]:
"country" == "us"
Out[37]:
False
In [36]:
ufos["country" == "us"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: False

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[36], line 1
----> 1 ufos["country" == "us"]

File /opt/conda/lib/python3.11/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File /opt/conda/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: False
In [38]:
ufos[ufos["country"] == "us"].groupby("city")["duration (seconds)"].max()
Out[38]:
city
acton                                   180.0
addison (i-355 and us 20 (lake st.)     600.0
albany                                  120.0
albuquerque                            3600.0
algona                                 3600.0
                                        ...  
wolfforth                               300.0
worcester                                 4.0
yakima                                  240.0
york                                     15.0
yuma                                    900.0
Name: duration (seconds), Length: 627, dtype: float64

What is the name of the "city" that has the largest count of UFO sightings?

In [39]:
ufos.groupby("city")["duration (seconds)"].count()
Out[39]:
city
acton                                         1
addison (i-355 and us 20 (lake st.)           1
adelaide  (south australia)                   1
adelaide (pt. wakefield) (south australia)    1
albany                                        4
                                             ..
yangtzee river (china)                        1
yerevan (armenia)                             1
york                                          2
yuma                                          1
zlatoust (russia)                             1
Name: duration (seconds), Length: 794, dtype: int64
In [24]:
ufos.groupby("city")["state"].count().idxmax()
Out[24]:
'seattle'
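
Alternatively (a sketch), value_counts tallies the rows for each city directly, so calling idxmax on its result answers the same question; note that it counts every row rather than only non-null "state" values, so the winner could differ if some rows are missing a state:

ufos["city"].value_counts().idxmax()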

String accessor functions¶

In data science, many tasks involve string data. In plain Python, we know that we can call string functions like split() to split a string on whitespace or find(target) to find the index at which a target first appears in a string.

To improve the readability of code, the inventors of pandas provide these functions as element-wise operations, but hide them behind a special .str string accessor, as in s.str.split().

In [3]:
ufos
Out[3]:
city state country shape duration (seconds) duration (hours/min) comments date posted latitude longitude
datetime
10/10/1949 20:30 san marcos tx us cylinder 2700.0 45 minutes This event took place in early fall around 194... 4/27/2004 29.883056 -97.941111
10/10/1949 21:00 lackland afb tx NaN light 7200.0 1-2 hrs 1949 Lackland AFB&#44 TX. Lights racing acros... 12/16/2005 29.384210 -98.581082
10/10/1955 17:00 chester (uk/england) NaN gb circle 20.0 20 seconds Green/Orange circular disc over Chester&#44 En... 1/21/2008 53.200000 -2.916667
10/10/1956 21:00 edna tx us circle 20.0 1/2 hour My older brother and twin sister were leaving ... 1/17/2004 28.978333 -96.645833
10/10/1960 20:00 kaneohe hi us light 900.0 15 minutes AS a Marine 1st Lt. flying an FJ4B fighter/att... 1/22/2004 21.418056 -157.803611
... ... ... ... ... ... ... ... ... ... ...
10/12/1982 05:30 pearlington ms us light 29.0 29 seconds Round light which was observed going into spac... 12/12/2013 30.246389 -89.611111
10/12/1985 23:45 swansboro nc us disk 300.0 23:45 - 23:50 My sister and I observed a disk for about 5 mi... 12/2/2000 34.687500 -77.119444
10/12/1988 16:30 melbourne (vic&#44 australia) NaN au cigar 900.0 15 min&#39s Large cigar shaped craft&#44flying sideways&#4... 9/29/2004 -37.813938 144.963425
10/12/1994 11:55 schenectady ny us other 120.0 a few minutes I had just picked up my Daugther (2yrs old) fr... 4/1/2001 42.814167 -73.940000
10/12/1994 15:00 monticello ky us chevron 120.0 1-2 minutes Triangular/chevron small object with fixed lig... 10/30/2006 36.829722 -84.849167

1001 rows × 10 columns

In [5]:
ufos["comments"].str.split()
Out[5]:
datetime
10/10/1949 20:30    [This, event, took, place, in, early, fall, ar...
10/10/1949 21:00    [1949, Lackland, AFB&#44, TX., Lights, racing,...
10/10/1955 17:00    [Green/Orange, circular, disc, over, Chester&#...
10/10/1956 21:00    [My, older, brother, and, twin, sister, were, ...
10/10/1960 20:00    [AS, a, Marine, 1st, Lt., flying, an, FJ4B, fi...
                                          ...                        
10/12/1982 05:30    [Round, light, which, was, observed, going, in...
10/12/1985 23:45    [My, sister, and, I, observed, a, disk, for, a...
10/12/1988 16:30    [Large, cigar, shaped, craft&#44flying, sidewa...
10/12/1994 11:55    [I, had, just, picked, up, my, Daugther, (2yrs...
10/12/1994 15:00    [Triangular/chevron, small, object, with, fixe...
Name: comments, Length: 1001, dtype: object

The above expression splits each comment by whitespace. This isn't too useful on its own, but we can then compute the length of each list to find the number of words in each comment.

In [7]:
ufos["comments"].str.split().str.len()
Out[7]:
datetime
10/10/1949 20:30    24
10/10/1949 21:00    17
10/10/1955 17:00     6
10/10/1956 21:00    26
10/10/1960 20:00    25
                    ..
10/12/1982 05:30    18
10/12/1985 23:45    16
10/12/1988 16:30     8
10/12/1994 11:55    28
10/12/1994 15:00    10
Name: comments, Length: 1001, dtype: int64

These functions don't modify the original DataFrame. To add the result as a new column in the original DataFrame, use an assignment statement.

In [9]:
ufos["word count"] = ufos["comments"].str.split().str.len()
ufos
Out[9]:
city state country shape duration (seconds) duration (hours/min) comments date posted latitude longitude word count
datetime
10/10/1949 20:30 san marcos tx us cylinder 2700.0 45 minutes This event took place in early fall around 194... 4/27/2004 29.883056 -97.941111 24
10/10/1949 21:00 lackland afb tx NaN light 7200.0 1-2 hrs 1949 Lackland AFB&#44 TX. Lights racing acros... 12/16/2005 29.384210 -98.581082 17
10/10/1955 17:00 chester (uk/england) NaN gb circle 20.0 20 seconds Green/Orange circular disc over Chester&#44 En... 1/21/2008 53.200000 -2.916667 6
10/10/1956 21:00 edna tx us circle 20.0 1/2 hour My older brother and twin sister were leaving ... 1/17/2004 28.978333 -96.645833 26
10/10/1960 20:00 kaneohe hi us light 900.0 15 minutes AS a Marine 1st Lt. flying an FJ4B fighter/att... 1/22/2004 21.418056 -157.803611 25
... ... ... ... ... ... ... ... ... ... ... ...
10/12/1982 05:30 pearlington ms us light 29.0 29 seconds Round light which was observed going into spac... 12/12/2013 30.246389 -89.611111 18
10/12/1985 23:45 swansboro nc us disk 300.0 23:45 - 23:50 My sister and I observed a disk for about 5 mi... 12/2/2000 34.687500 -77.119444 16
10/12/1988 16:30 melbourne (vic&#44 australia) NaN au cigar 900.0 15 min&#39s Large cigar shaped craft&#44flying sideways&#4... 9/29/2004 -37.813938 144.963425 8
10/12/1994 11:55 schenectady ny us other 120.0 a few minutes I had just picked up my Daugther (2yrs old) fr... 4/1/2001 42.814167 -73.940000 28
10/12/1994 15:00 monticello ky us chevron 120.0 1-2 minutes Triangular/chevron small object with fixed lig... 10/30/2006 36.829722 -84.849167 10

1001 rows × 11 columns

Apply your own functions¶

So what if you want to call your own functions on each element? Call the apply(...) function on a Series or DataFrame and pass in another function as an argument. Let's try writing a program that can remove the trailing parentheticals in the city name for the UFO dataset.

In [10]:
def clean_city_name(s):
    """
    Returns all the characters in the given string with trailing parentheticals removed.

    >>> clean_city_name("seattle (ballard area)")
    'seattle'
    >>> clean_city_name("seattle (west)")
    'seattle'
    >>> clean_city_name("melbourne (vic&#44 australia)")
    'melbourne'
    >>> clean_city_name("chester (uk/england)")
    'chester'
    >>> clean_city_name("carrieres sous poissy (france)")
    'carrieres sous poissy'
    >>> clean_city_name("seattle")
    'seattle'
    """
    index = s.find("(")
    if index == -1:
        return s
    return s[:index].rstrip()


doctest.run_docstring_examples(clean_city_name, globals())
In [14]:
ufos["city"].apply(clean_city_name).value_counts()
Out[14]:
city
new york city     12
seattle           12
tinley park        8
oak forest         8
las vegas          7
                  ..
holiday            1
eagan              1
siloam sprngs      1
canyonlands np     1
monticello         1
Name: count, Length: 774, dtype: int64
In [58]:
# Alternatively, we can use groupby on a Series.
# Kind of bizarre, but the argument to groupby is the Series itself, which Pandas uses to form groups
ufos["city"].apply(clean_city_name).groupby(ufos["city"].apply(clean_city_name)).count()
Out[58]:
city
acton             1
addison           1
adelaide          2
albany            4
albuquerque       5
                 ..
yangtzee river    1
yerevan           1
york              2
yuma              1
zlatoust          1
Name: city, Length: 774, dtype: int64
In [49]:
# Alternatively, we can make a DataFrame selection and then map the function to each element in the single-column DataFrame
ufos[["city"]].map(clean_city_name).groupby("city")["city"].count().idxmax()
Out[49]:
'new york city'

In practice, this can be useful for carrying out data cleaning tasks such as removing punctuation or converting special characters. apply lets us write and test a function that achieves our task on a single string, and then apply that function to every string in a dataset.
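
For example, the comments in this dataset contain HTML-encoded commas ("&#44"). Here's a minimal sketch of cleaning them up with apply, assuming no missing comments (the function name decode_commas is our own):

def decode_commas(s):
    """Returns the given string with every HTML entity &#44 replaced by a real comma."""
    return s.replace("&#44", ",")


ufos["comments"].apply(decode_commas)

The equivalent string accessor version, ufos["comments"].str.replace("&#44", ",", regex=False), achieves the same result and automatically skips over missing values.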