In this lesson, we’ll learn about an important DataFrame operation called groupby. Along the way, we’ll also discuss how this groupby operation introduces an extra level of complexity to indexing and slicing values. By the end of this lesson, students will be able to:
- Apply the groupby operation to a list of dictionaries and to a pandas DataFrame.
- Select values from a hierarchical index using tuples and the slice object as keys.
- Apply the apply operation to a list of dictionaries and to a pandas DataFrame.
Previously, we learned how to find the largest earthquake in a dataset using both a list of dictionaries and a pandas DataFrame. How about finding the largest earthquake for each place in the dataset?
import doctest
import io
import pandas as pd

To help visualize our work, the following dataset contains the first 12 rows from earthquakes.csv.
csv = """
id,year,month,day,latitude,longitude,name,magnitude
nc72666881,2016,7,27,37.6723333,-121.619,California,1.43
us20006i0y,2016,7,27,21.5146,94.5721,Burma,4.9
nc72666891,2016,7,27,37.5765,-118.85916670000002,California,0.06
nc72666896,2016,7,27,37.5958333,-118.99483329999998,California,0.4
nn00553447,2016,7,27,39.3775,-119.845,Nevada,0.3
ak13805337,2016,7,27,61.2963,-152.46,Alaska,1.8
hv61354276,2016,7,27,19.4235,-155.60983330000005,Hawaii,1.0
ak13805339,2016,7,27,61.3019,-152.4507,Alaska,2.0
ci37640584,2016,7,27,35.503,-118.40583329999998,California,1.2
nc72666901,2016,7,27,37.673,-121.6133333,California,1.67
ci37640592,2016,7,27,33.5888333,-116.8165,California,0.48
nn00553416,2016,7,27,38.2638,-118.7351,Nevada,0.9
"""
earthquakes = pd.read_csv(io.StringIO(csv), index_col="id")
earthquakes

Groupby in plain Python¶
Let’s first see how we can solve this problem using the list of dictionaries approach.
max_mag_by_place = {}
for earthquake in earthquakes.to_dict("records"): # Convert to list of dictionaries
    if earthquake["name"] not in max_mag_by_place or earthquake["magnitude"] > max_mag_by_place[earthquake["name"]]:
        max_mag_by_place[earthquake["name"]] = earthquake["magnitude"]
max_mag_by_place

Groupby in Pandas¶
The inventors of pandas defined a DataFrame function called groupby to streamline this operation into a single expression.
earthquakes.groupby("name")["magnitude"].max()

What’s going on here? We can take a closer look at each step of the process in PandasTutor. In summary, this expression:

1. Calls earthquakes.groupby("name") to split the earthquakes into groups by "name".
2. For each group, selects the "magnitude" column indicated in square brackets.
3. Combines (summarizes) each group on the selected column using the max() function.
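To make the split step more concrete, here is a small sketch (not part of the original lesson) that loops over the groups produced by groupby and computes the same per-group maximum by hand, assuming the earthquakes DataFrame defined above.

# Each iteration yields the group key (a "name" value) and a smaller DataFrame
# containing only that group's rows.
for name, group in earthquakes.groupby("name"):
    # Select the "magnitude" column and combine it with max(), one group at a time.
    print(name, group["magnitude"].max())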
groupby helps us quickly answer questions involving “grouping by” one or more columns and then summarizing data in another column.
The best part about pandas groupby is that it allows us to quickly answer many different kinds of questions following the same format. For example, suppose we want to compute descriptive statistics for all the earthquake magnitudes that occurred on each day. Let’s read the full dataset and try it out.
earthquakes = pd.read_csv("earthquakes.csv", index_col="id")
earthquakes

magnitudes_per_day = earthquakes.groupby(["year", "month", "day"])["magnitude"].describe()
magnitudes_per_day

Explain in your own words the result of the following code snippet.

earthquakes.groupby("name")["latitude"].max()

Practice: UFO sightings¶
UFO (unidentified flying object) sightings have received attention from the US Congress in the past couple of years. We’ve collected a public dataset consisting of 1001 reported UFO sightings from around the world to help us practice groupby operations.
ufos = pd.read_csv("ufos.csv", index_col="datetime")
ufos

Compute the average (mean) "duration (seconds)" for each UFO "shape".
Since we’re focusing on the US Congress, identify the UFO sighting with the longest "duration (seconds)" for each "city" in the US ("us"). Do not include any cities outside the US.
What is the name of the "city" that has the largest count of UFO sightings?
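If you get stuck, here is one possible approach as a sketch. The prompts above only name the "shape", "city", and "duration (seconds)" columns, so the "country" column and its "us" labels are assumptions about ufos.csv.

# One possible approach (a sketch); the "country" column is an assumption.
ufos.groupby("shape")["duration (seconds)"].mean()

us_ufos = ufos[ufos["country"] == "us"]  # keep only sightings reported in the US (assumed column)
us_ufos.groupby("city")["duration (seconds)"].max()

ufos.groupby("city").size().idxmax()  # city with the largest count of sightings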
Apply your own functions¶
I have a hard time believing it’s really Seattle, so let’s investigate this more. Call the apply(...) function on a Series or DataFrame and pass in another function as an argument. Let’s try writing a program that can remove the trailing parentheticals from the city names in the UFO dataset.
def clean_city_name(s):
"""
Returns all the characters in the given string with trailing parentheticals removed.
>>> clean_city_name("seattle (ballard area)")
'seattle'
>>> clean_city_name("seattle (west)")
'seattle'
>>> clean_city_name("melbourne (vic, australia)")
'melbourne'
>>> clean_city_name("chester (uk/england)")
'chester'
>>> clean_city_name("carrieres sous poissy (france)")
'carrieres sous poissy'
>>> clean_city_name("seattle")
'seattle'
"""
index = s.find("(")
if index == -1: # .find() will return -1 if the given argument isn't found in the string
return s
return s[:index].rstrip()
doctest.run_docstring_examples(clean_city_name, globals())

ufos["city"].apply(clean_city_name).to_frame().groupby("city").size()

In practice, this can be useful for carrying out data cleaning tasks such as removing punctuation or converting special characters. apply lets us write and test a function that achieves our task on a single string, and then apply that function to every string in a dataset.
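As another sketch of this pattern (not from the original lesson), here is a small function that strips punctuation from a single string, applied to the "city" column in the same way.

import string

def remove_punctuation(s):
    """Returns the given string with all punctuation characters removed."""
    return "".join(c for c in s if c not in string.punctuation)

# Test the function on one string, then apply it to every string in the column.
remove_punctuation("seattle (ballard area)")  # 'seattle ballard area'
ufos["city"].apply(remove_punctuation)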
Hierarchical indexing¶
If you look closely at the magnitudes_per_day DataFrame, you’ll notice something interesting: there are three index columns in bold on the left to denote each year, month, and day group. In pandas, a DataFrame can have a hierarchical (aka multi-level) index called a MultiIndex.
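As a rough sketch (not from the original lesson) of what a MultiIndex stores, we can build a small one by hand: each entry is a tuple with one value per index level.

# Construct a tiny MultiIndex from tuples, one value per level (year, month, day).
example_index = pd.MultiIndex.from_tuples(
    [(2016, 7, 27), (2016, 8, 1), (2016, 8, 15)],
    names=["year", "month", "day"],
)
example_index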
magnitudes_per_day.index

A MultiIndex can be accessed with .loc by using Python tuples as keys. However, the syntax is somewhat unusual, particularly when combined with slicing, due to limitations in the Python programming language. For each example below, predict the output type (single value, 1-d Series, or 2-d DataFrame) as well as the contents of the output before running it.
magnitudes_per_day.loc[(2016, 7, 27), "count"]

magnitudes_per_day.loc[:, "count"]

magnitudes_per_day.loc[(2016, 8, slice(10, 15)), "count"]

magnitudes_per_day.loc[[(2016, 8, 1), (2016, 8, 15)], "count"]

magnitudes_per_day.loc[magnitudes_per_day["count"] < 220, "count"]
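As a side note on the slicing limitation mentioned above (a sketch, not from the original lesson): the colon syntax like 10:15 is only valid directly inside square brackets, so it cannot appear inside a tuple key. The slice object expresses the same range explicitly and can be stored in a tuple.

# Writing 10:15 directly inside a tuple is a SyntaxError, so we build the slice
# object explicitly and place it in the tuple key instead.
days = slice(10, 15)
magnitudes_per_day.loc[(2016, 8, days), "count"]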